
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/datasets/plot_s1A_serine_proteases.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_datasets_plot_s1A_serine_proteases.py>`
        to download the full example code. or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_datasets_plot_s1A_serine_proteases.py:


S1A serine proteases
====================

Load the dataset

.. GENERATED FROM PYTHON SOURCE LINES 8-38




.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    halabi
    Number of sequences 1470
    The loaded MSA has 1470 sequences and 832         positions.
    After filtering, we have 242 remaining positions.
    After filtering, we have 1456 remaining sequences.
    computing weight of seq 1/1456  
    computing weight of seq 101/1456        
    computing weight of seq 201/1456        
    computing weight of seq 301/1456        
    computing weight of seq 401/1456        
    computing weight of seq 501/1456        
    computing weight of seq 601/1456        
    computing weight of seq 701/1456        
    computing weight of seq 801/1456        
    computing weight of seq 901/1456        
    computing weight of seq 1001/1456       
    computing weight of seq 1101/1456       
    computing weight of seq 1201/1456       
    computing weight of seq 1301/1456       
    computing weight of seq 1401/1456       
    Number of effective sequences 1053

    rivoire
    Number of sequences 1390
    The loaded MSA has 1390 sequences and 832         positions.
    After filtering, we have 243 remaining positions.
    After filtering, we have 1376 remaining sequences.
    computing weight of seq 1/1376  
    computing weight of seq 101/1376        
    computing weight of seq 201/1376        
    computing weight of seq 301/1376        
    computing weight of seq 401/1376        
    computing weight of seq 501/1376        
    computing weight of seq 601/1376        
    computing weight of seq 701/1376        
    computing weight of seq 801/1376        
    computing weight of seq 901/1376        
    computing weight of seq 1001/1376       
    computing weight of seq 1101/1376       
    computing weight of seq 1201/1376       
    computing weight of seq 1301/1376       
    Number of effective sequences 1019







|

.. code-block:: Python


    import numpy as np
    from cocoatree.datasets import load_S1A_serine_proteases

    import cocoatree.msa as c_msa
    import cocoatree.statistics.position as c_pos


    for paper in ["halabi", "rivoire"]:
        print(paper)
        dataset = load_S1A_serine_proteases(paper=paper)

        print("Number of sequences", len(dataset["alignment"]))
        loaded_seqs = dataset["alignment"]
        loaded_seqs_id = dataset["sequence_ids"]
        n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)

        print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
            positions.")

        sequences, sequences_id, positions = c_msa.filter_sequences(
            loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
        n_pos = len(positions)
        print(f"After filtering, we have {n_pos} remaining positions.")
        print(f"After filtering, we have {len(sequences)} remaining sequences.")

        seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
        print('Number of effective sequences %d' %
              np.round(m_eff))
        print()


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.729 seconds)


.. _sphx_glr_download_auto_examples_datasets_plot_s1A_serine_proteases.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/tree-timc/cocoatree/gh-pages?urlpath=lab/tree/notebooks/auto_examples/datasets/plot_s1A_serine_proteases.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_s1A_serine_proteases.ipynb <plot_s1A_serine_proteases.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_s1A_serine_proteases.py <plot_s1A_serine_proteases.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_s1A_serine_proteases.zip <plot_s1A_serine_proteases.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
