
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/advanced/plot_full_SCA_analysis.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_advanced_plot_full_SCA_analysis.py>`
        to download the full example code. or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_advanced_plot_full_SCA_analysis.py:


============================================================
Perform full SCA analysis on the S1A serine protease dataset
============================================================

This example shows the full process to perform a complete SCA analysis
and detect protein sectors from data importation, MSA filtering.

Finally, we export a fasta file of the residues contributing to the first
sector.

.. GENERATED FROM PYTHON SOURCE LINES 12-18

.. code-block:: Python


    # Author: Margaux Jullien <margaux.jullien@univ-grenoble-alpes.fr>
    #         Nelle Varoquaux <nelle.varoquaux@univ-grenoble-alpes.fr>
    #         Ivan Junier <ivan.junier@univ-grenoble-alpes.fr>
    # License: TBD








.. GENERATED FROM PYTHON SOURCE LINES 19-20

Import necessary

.. GENERATED FROM PYTHON SOURCE LINES 20-30

.. code-block:: Python

    import cocoatree.datasets as c_data
    import cocoatree.io as c_io
    import cocoatree.msa as c_msa
    import cocoatree.statistics.position as c_pos
    import cocoatree.statistics.pairwise as c_pw
    import cocoatree.deconvolution as c_deconv

    import matplotlib.pyplot as plt
    import numpy as np








.. GENERATED FROM PYTHON SOURCE LINES 31-37

Load the dataset
----------------

We start by importing the dataset. In this case, we can directly load the S1
serine protease dataset provided in :mod:`cocoatree`. To work on your on
dataset, you can use the `cocoatree.io.load_msa` function.

.. GENERATED FROM PYTHON SOURCE LINES 37-46

.. code-block:: Python


    serine_dataset = c_data.load_S1A_serine_proteases('rivoire')
    loaded_seqs = serine_dataset["alignment"]
    loaded_seqs_id = serine_dataset["sequence_ids"]
    n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)

    print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
          positions.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    The loaded MSA has 1390 sequences and 832       positions.




.. GENERATED FROM PYTHON SOURCE LINES 47-52

MSA filtering
-------------

We clean the loaded MSA by filtering both sequences and positions.


.. GENERATED FROM PYTHON SOURCE LINES 52-59

.. code-block:: Python


    sequences, sequences_id, positions = c_msa.filter_sequences(
        loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
    n_pos = len(positions)
    print(f"After filtering, we have {n_pos} remaining positions.")
    print(f"After filtering, we have {len(sequences)} remaining sequences.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    After filtering, we have 243 remaining positions.
    After filtering, we have 1376 remaining sequences.




.. GENERATED FROM PYTHON SOURCE LINES 60-62

Compute the matrix of pairwise sequence identity
------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 62-73

.. code-block:: Python


    identity_matrix = c_msa.compute_seq_identity(sequences)

    fig, ax = plt.subplots()
    m = ax.imshow(identity_matrix, vmin=0, vmax=1, cmap='inferno')
    ax.set_xlabel("sequences", fontsize=10)
    ax.set_ylabel("sequences", fontsize=10)
    ax.set_title('Matrix of pairwise sequence identity', fontweight="bold")
    cb = fig.colorbar(m)
    cb.set_label("Pairwise sequence identity", fontweight="bold")




.. image-sg:: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_001.png
   :alt: Matrix of pairwise sequence identity
   :srcset: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 74-75

Compute sequence weights

.. GENERATED FROM PYTHON SOURCE LINES 75-79

.. code-block:: Python

    seq_weights, m_eff = c_msa.compute_seq_weights(sequences)
    print('Number of effective sequences %d' %
          np.round(m_eff))





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    computing weight of seq 1/1376  
    computing weight of seq 101/1376        
    computing weight of seq 201/1376        
    computing weight of seq 301/1376        
    computing weight of seq 401/1376        
    computing weight of seq 501/1376        
    computing weight of seq 601/1376        
    computing weight of seq 701/1376        
    computing weight of seq 801/1376        
    computing weight of seq 901/1376        
    computing weight of seq 1001/1376       
    computing weight of seq 1101/1376       
    computing weight of seq 1201/1376       
    computing weight of seq 1301/1376       
    Number of effective sequences 1019




.. GENERATED FROM PYTHON SOURCE LINES 80-82

Compute conservation along the MSA
----------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 82-93

.. code-block:: Python

    Di = c_pos.compute_conservation(sequences, seq_weights)

    fig, ax = plt.subplots(figsize=(9, 4))
    xvals = [i+1 for i in range(len(Di))]
    xticks = [0, 50, 100, 150, 200, 250]
    ax.bar(xvals, Di, color='k')
    plt.tick_params(labelsize=11)
    ax.set_xticks(xticks)
    ax.set_xlabel('residues', fontsize=14)
    ax.set_ylabel('Di', fontsize=14)




.. image-sg:: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_002.png
   :alt: plot full SCA analysis
   :srcset: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Text(72.97222222222221, 0.5, 'Di')



.. GENERATED FROM PYTHON SOURCE LINES 94-96

Compute the SCA coevolution matrix
----------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 96-107

.. code-block:: Python


    SCA_matrix = c_pw.compute_sca_matrix(sequences, seq_weights)

    fig, ax = plt.subplots()
    im = ax.imshow(SCA_matrix, vmin=0, vmax=1.4, cmap='inferno')

    ax.set_xlabel('residues', fontsize=10)
    ax.set_ylabel(None)
    ax.set_title('SCA matrix')
    fig.colorbar(im, shrink=0.7)




.. image-sg:: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_003.png
   :alt: SCA matrix
   :srcset: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <matplotlib.colorbar.Colorbar object at 0x7f07597b0950>



.. GENERATED FROM PYTHON SOURCE LINES 108-112

Extraction of principal components (PCA analysis)
and of independent components (ICA analysis)
(this can take some time because of randomization)
-------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 112-121

.. code-block:: Python


    n_components = 9
    principal_components = c_deconv.extract_principal_components(
        SCA_matrix)
    idpt_components = c_deconv.extract_independent_components(
        SCA_matrix,
        n_components=n_components)









.. GENERATED FROM PYTHON SOURCE LINES 122-124

Plot components
---------------

.. GENERATED FROM PYTHON SOURCE LINES 124-146

.. code-block:: Python


    n_components_to_plot = n_components
    if n_components_to_plot % 2:
        print('Odd number of components: the last one is discarded for \
              visualization')
        n_components_to_plot -= 1

    pairs = [[x, x+1] for x in range(0, n_components_to_plot, 2)]
    ncols = len(pairs)
    plt.rcParams['figure.figsize'] = 14, 8
    fig, axes = plt.subplots(nrows=2, ncols=len(pairs), tight_layout=True)
    for k, [k1, k2] in enumerate(pairs):
        ax = axes[0, k]
        ax.plot(principal_components[k1], principal_components[k2], 'ok')
        ax.set_xlabel("PC %i" % (k1+1), fontsize=16)
        ax.set_ylabel("PC %i" % (k2+1), fontsize=16)

        ax = axes[1, k]
        ax.plot(idpt_components[k1], idpt_components[k2], 'ok')
        ax.set_xlabel("IC %i" % (k1+1), fontsize=16)
        ax.set_ylabel("IC %i" % (k2+1), fontsize=16)




.. image-sg:: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_004.png
   :alt: plot full SCA analysis
   :srcset: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_004.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Odd number of components: the last one is discarded for           visualization




.. GENERATED FROM PYTHON SOURCE LINES 147-149

Extract sectors
---------------

.. GENERATED FROM PYTHON SOURCE LINES 149-156

.. code-block:: Python


    sectors = c_deconv.extract_sectors(idpt_components, SCA_matrix)

    print('Sector positions on (filtered) sequences:')
    for isec, sec in enumerate(sectors):
        print('sector %d: %s' % (isec+1, sec))





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Sector positions on (filtered) sequences:
    sector 1: [64, 48, 145, 195, 63, 182, 199, 168, 146, 197, 144, 108, 222, 61, 196, 198, 194, 211, 49, 216, 208, 192, 234, 26, 35, 62, 235, 50, 155, 200, 23, 60, 76, 213, 73, 58, 114, 25]
    sector 2: [140, 202, 52, 110, 78, 87, 81, 129, 75, 124, 33, 32, 128, 82, 207, 226, 120, 83, 69, 31]
    sector 3: [225, 190, 212, 223, 180, 183, 184, 220, 224, 193, 172, 210, 142, 176, 160, 177, 217, 219, 161, 227]
    sector 4: [0, 1, 2, 3, 4, 5, 6, 22, 7, 8, 9, 10, 11, 12, 13, 158, 14]
    sector 5: [24, 228, 51, 215, 188]
    sector 6: [97, 100, 98, 101, 95, 91, 107, 126, 102]
    sector 7: [231, 109, 59, 53, 206, 131, 130]
    sector 8: [57, 36, 111, 204, 34, 117]
    sector 9: [46, 39, 37, 38, 156, 152, 41, 201, 149, 148, 143]




.. GENERATED FROM PYTHON SOURCE LINES 157-161

Plot coevolution within and between the sectors
Each white square corresponds to a sector, with the residues ordered in
decreasing contribution to the independent component associated from top to
bottom and from left to right.

.. GENERATED FROM PYTHON SOURCE LINES 161-183

.. code-block:: Python

    fig, ax = plt.subplots(tight_layout=True)

    sector_sizes = [len(sec) for sec in sectors]
    cumul_sizes = sum(sector_sizes)
    sorted_pos = [s for sec in sectors for s in sec]

    im = ax.imshow(SCA_matrix[np.ix_(sorted_pos, sorted_pos)],
                   vmin=0, vmax=2,
                   interpolation='none', aspect='equal',
                   extent=[0, cumul_sizes, 0, cumul_sizes], cmap='inferno')
    ax.set_title('SCA matrix, sorted according to sectors')
    cb = fig.colorbar(im)
    cb.set_label("coevolution level")

    line_index = 0
    for i in range(n_components):
        ax.plot([line_index + sector_sizes[i], line_index + sector_sizes[i]],
                [0, cumul_sizes], 'w', linewidth=2)
        ax.plot([0, cumul_sizes], [cumul_sizes - line_index,
                                   cumul_sizes - line_index], 'w', linewidth=2)
        line_index += sector_sizes[i]




.. image-sg:: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_005.png
   :alt: SCA matrix, sorted according to sectors
   :srcset: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_005.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 184-187

Do the same but for the SCA matrix where the global correlation mode has
been removed and, hence, such that sectors are better highlighted.
See Rivoire et al., PLOSCB, 2016

.. GENERATED FROM PYTHON SOURCE LINES 187-212

.. code-block:: Python


    # removing global model (ngm = no global mode),
    # i.e., removing first principal component
    SCA_matrix_ngm = c_deconv.substract_first_principal_component(SCA_matrix)

    # plotting
    fig, ax = plt.subplots(tight_layout=True)
    im = ax.imshow(SCA_matrix_ngm[np.ix_(sorted_pos, sorted_pos)], vmin=0, vmax=1,
                   interpolation='none', aspect='equal',
                   extent=[0, sum(sector_sizes), 0, sum(sector_sizes)],
                   cmap='inferno')
    ax.set_title('SCA matrix without global mode, sorted according to sectors')
    cb = fig.colorbar(im)
    cb.set_label("coevolution level")

    line_index = 0
    for i in range(n_components):
        ax.plot([line_index + sector_sizes[i], line_index + sector_sizes[i]],
                [0, sum(sector_sizes)], 'w', linewidth=2)
        ax.plot([0, sum(sector_sizes)], [sum(sector_sizes) - line_index,
                                         sum(sector_sizes) - line_index],
                'w', linewidth=2)
        line_index += sector_sizes[i]





.. image-sg:: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_006.png
   :alt: SCA matrix without global mode, sorted according to sectors
   :srcset: /auto_examples/advanced/images/sphx_glr_plot_full_SCA_analysis_006.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 213-216

Export sector sequences for all sequences as a fasta file
The file can then be used for visualization along a phylogenetic tree
as implemented in the cocoatree.visualization module

.. GENERATED FROM PYTHON SOURCE LINES 216-257

.. code-block:: Python


    sector_1_pos = list(positions[sectors[0]])
    sector_1 = []
    for sequence in range(len(sequences_id)):
        seq = ''
        for pos in sector_1_pos:
            seq += loaded_seqs[sequence][pos]
        sector_1.append(seq)

    c_io.export_fasta(sector_1, sequences_id, 'sector_1.fasta')

    if False:  # need to be revised
        sector_1_pos = list(positions[sectors[0].items])
        sector_1 = []
        for sequence in range(len(sequences_id)):
            seq = ''
            for pos in sector_1_pos:
                seq += sequences[sequence][pos]
            sector_1.append(seq)

        c_io.export_fasta(sector_1, sequences_id, 'sector_1.fasta')

        # %
        # Export files necessary for Pymol visualization
        # Load PDB file of rat's trypsin
        pdb_seq, pdb_pos = c_io.load_pdb('data/3TGI.pdb', pdb_id='TRIPSIN',
                                         chain='E')
        # Map PDB positions on the MSA sequence corresponding to rat's trypsin:
        # seq_id='14719441'
        pdb_mapping = c_msa.map_to_pdb(pdb_seq, pdb_pos, sequences, sequences_id,
                                       ref_seq_id='14719441')
        # Export lists of the first sector positions and each residue's
        # contribution to the independent component to use for visualization on
        # Pymol.
        # The residues are ordered in the list by decreasing contribution score
        # (the first residue in the list is the highest contributing)
        c_io.export_sector_for_pymol(pdb_mapping, idpt_components.T, axis=0,
                                     sector_pos=sector_1_pos,
                                     ics=sectors,
                                     outpath='color_sector_1_pymol.npy')









.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 3.695 seconds)


.. _sphx_glr_download_auto_examples_advanced_plot_full_SCA_analysis.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/tree-timc/cocoatree/gh-pages?urlpath=lab/tree/notebooks/auto_examples/advanced/plot_full_SCA_analysis.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_full_SCA_analysis.ipynb <plot_full_SCA_analysis.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_full_SCA_analysis.py <plot_full_SCA_analysis.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_full_SCA_analysis.zip <plot_full_SCA_analysis.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
