Metadata-Version: 2.4
Name: phyca
Version: 0.0.3
Summary: Universal ortholog based Phylogenomic toolkit.
Home-page: https://github.com/DeadlineWasYesterday/phyca
Author: Md Nafis Ul Alam
Author-email: deadlinewasyesterday@gmail.com
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Freely Distributable
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas>=2
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: biopython
Requires-Dist: BioNick
Requires-Dist: scipy
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# phyca: **phy**logeny and **c**ollinearity **a**ware assembly evaluation toolkit.

phyca is built around [Compleasm](https://github.com/huangnengCSU/compleasm/) utilizing the [NCBI Genome](https://www.ncbi.nlm.nih.gov/datasets/genome/) database. For a query assembly, phyca improves the precision of BUSCO/Compleasm annotations by up to 7%, makes syntenic comparisons to public reference genomes and rapidly places the assembly on a broad, precomputed phylogeny.

# Rationale
BUSCOs are the most conserved genes. Gene duplication and deletion in parallel branches can confound evolutionary genomic analyses. In our [article](https://link.springer.com/article/10.1186/s12915-025-02328-2), we explored the extent of BUSCO gene misannotations in major eukaryotic lineages. A misannotated gene is one that annotation software reports even though the original gene copy has been lost in that lineage. From our survey of 20,000 plant, fungal and animal genomes, we found that ~10% of BUSCO genes are significantly more prone to misannotation than others. phyca filters out the misannotation-prone genes and outputs annotations and statistics for curated BUSCO genes, or CUSCOs.

Our original [article](https://link.springer.com/article/10.1186/s12915-025-02328-2) was based on ODB10 orthologs. phyca has since been updated to run on ODB12. The updated ortholog stats are visualized [here]().

# Installation
```
pip install phyca
```

phyca is distributed through PyPI and GitHub. A working installation of Compleasm (including SEPP and pplacer) is required for full functionality. I recommend creating a conda environment, installing Compleasm in it first and then installing phyca in the same environment, e.g.,

```
# create environment
conda create -n phyca python=3.9.25
# activate it so the following installs go into this environment
conda activate phyca
# install compleasm
conda install bioconda::compleasm=0.2.7
# install phyca
pip install phyca
```
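
After installation, it can help to confirm that the required executables are visible in the active environment. The snippet below is a minimal sketch and not part of phyca. The executable names in `DEFAULT_TOOLS` are assumptions and may differ between installations, so adjust them as needed.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
DEFAULT_TOOLS = ("phyca", "compleasm", "miniprot", "hmmsearch")


def find_tools(tools=DEFAULT_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools=DEFAULT_TOOLS):
    """Return the names of executables that could not be found on PATH."""
    return [tool for tool, path in find_tools(tools).items() if path is None]


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Run it inside the activated conda environment. Anything reported as `NOT FOUND` is either not installed there or installed under a different name.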

<span style="color:red">
Note: Since the compleasm update to ODB12 from version 0.2.7, phylogenetic placement features of phyca are difficult to implement. phyca 0.0.3 with compleasm 0.2.7 will only output CUSCO stats. In theory, version 0.0.2 with compleasm 0.2.6 using ODB10 should still be functional, but compleasm often crashes when trying to run on ODB10. Please create an issue if you intend to use any of the phylogenetic features.
</span>

<br>
Note that as of 02/03/2025, there is a known issue with pplacer and SEPP on Debian-based systems. A working solution is provided [here](https://github.com/smirarab/sepp/issues/140).

phyca has the following nonexhaustive dependency structure.
```
Python (tested with 3.9.25)
↓
│───numpy (tested with 2.0.1)
│───pandas (tested with 2.3.3)
│───matplotlib (tested with 3.9.4)
│───seaborn (tested with 0.13.2)
│───SciPy (tested with 1.13.1)
│───BioNick (tested with 0.0.8)
└───Compleasm (tested with 0.2.7)
        │─── hmmer (tested with 3.1b2)
        │─── miniprot (tested with 0.13-r248)
        │      └─── libgcc (tested with 14.2.0 under conda)
        └─── SEPP (tested with 4.4.0)
               └─── pplacer and guppy (v1.1.alpha19-0-g807f6f3)
```

# Usage

phyca supports 10 BUSCO lineages: viridiplantae, liliopsida, eudicots, chlorophyta, fungi, ascomycota, basidiomycota, metazoa, arthropoda and vertebrata.

A simple run on a query assembly would be:
```
phyca -a <assembly_file> -l <lineage>
```
The Compleasm output folder can also be used as input if compleasm output was previously generated:
```
phyca -c <compleasm_directory> -l <lineage>
```
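
For scripted or batch runs, the two invocations above can be wrapped in Python. This is a minimal sketch and not part of phyca: `build_phyca_command` and `run_phyca` are hypothetical helpers, and the sketch assumes that lineage names are passed exactly as spelled in the list above. Only the `-a`, `-c` and `-l` flags shown in this README are used.

```python
import subprocess

# The ten lineages listed in this README (spelling assumed to match the CLI).
SUPPORTED_LINEAGES = (
    "viridiplantae", "liliopsida", "eudicots", "chlorophyta", "fungi",
    "ascomycota", "basidiomycota", "metazoa", "arthropoda", "vertebrata",
)


def build_phyca_command(lineage, assembly=None, compleasm_dir=None):
    """Build the argument list for a single phyca run (hypothetical helper)."""
    if lineage not in SUPPORTED_LINEAGES:
        raise ValueError(f"unsupported lineage: {lineage!r}")
    # Exactly one input source must be given: an assembly or a Compleasm folder.
    if (assembly is None) == (compleasm_dir is None):
        raise ValueError("give exactly one of assembly or compleasm_dir")
    if assembly is not None:
        return ["phyca", "-a", str(assembly), "-l", lineage]
    return ["phyca", "-c", str(compleasm_dir), "-l", lineage]


def run_phyca(lineage, assembly=None, compleasm_dir=None):
    """Run phyca and raise CalledProcessError on a non-zero exit status."""
    cmd = build_phyca_command(lineage, assembly, compleasm_dir)
    return subprocess.run(cmd, check=True)
```

Validating the lineage before launching avoids starting a long Compleasm run with a misspelled name.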

The above run will output BUSCO, CUSCO (Curated USCOs with higher precision) and MUSCO (remaining USCOs) statistics and graphs. It will compare the query to chromosome-level genome assemblies from NCBI Genome and output a table with a measure of synteny against each genome. It will output a Neighbor-Joining tree based on BUSCO synteny. Finally, it will place the assembly on a large precomputed phylogeny for the lineage and graph the observed decay in BUSCO synteny against inferred phylogenetic distance.


# Assembly syntenic comparisons

phyca allows syntenic comparisons between assemblies with compleasm annotations or any set of gene annotations formatted in the same way.

Use the -s flag to compute the syntenic distance between two assemblies:
```
phyca -l <lineage> -s -a <assembly1> -r <assembly2>
```
The same comparison can be done by pointing to the compleasm output directories, if already available.
```
phyca -l <lineage> -s -c <assembly1_compdir> -m <assembly2_compdir>
```
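
When several Compleasm output directories are available, the pairwise command above can be generated for every pair. This is a minimal sketch and not part of phyca: `pairwise_synteny_commands` is a hypothetical helper that only assembles the `-l`, `-s`, `-c` and `-m` arguments shown above. It assumes that each unordered pair needs to be compared once; whether the result depends on which assembly is given first is not stated here.

```python
from itertools import combinations


def pairwise_synteny_commands(lineage, compleasm_dirs):
    """Yield one `phyca -s` argument list per unordered pair of directories."""
    for first, second in combinations(compleasm_dirs, 2):
        yield ["phyca", "-l", lineage, "-s", "-c", str(first), "-m", str(second)]


if __name__ == "__main__":
    # Directory names below are placeholders for illustration only.
    for cmd in pairwise_synteny_commands("eudicots", ["asm1_out", "asm2_out", "asm3_out"]):
        print(" ".join(cmd))
```

Each yielded list can be passed to `subprocess.run(cmd, check=True)` to execute the comparison.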

Comparisons are done as illustrated in the figure below. They adjust for variable query contiguity and produce the best results when one of the assemblies is highly contiguous and accurate:

<img src="https://ava.genome.arizona.edu/UniPhy/web/SFig09.png" width=800>



# UniPhyDB
The bulk data used by phyca is hosted by [AGI](https://www.genome.arizona.edu/)'s [AVA cluster](https://www.genome.arizona.edu/services/instrumentation.html). All alignments, precomputed trees, annotations, metadata and further information are available at [phyca.org](http://www.phyca.org).


# Example Output

USCO graph:

<img src="https://ava.genome.arizona.edu/UniPhy/web/USCO_bars.png" width="400">

Synteny decay plot:

<img src="https://ava.genome.arizona.edu/UniPhy/web/syndecay.png" width="400">


Placement tree snippet: 

<img src="https://ava.genome.arizona.edu/UniPhy/web/placement_snippet.png" width=400>



# Citation

Alam, M.N.U., Román-Palacios, C., Copetti, D. et al. Universal orthologs infer deep phylogenies and improve genome quality assessments. BMC Biol 23, 224 (2025). https://doi.org/10.1186/s12915-025-02328-2
