Metadata-Version: 2.1
Name: panCG
Version: 1.0.2
Summary: An integrative pipeline for family-level super-pangenome analysis across coding and noncoding sequences.
Home-page: https://github.com/rejo27
Author: ltan
Author-email: lei.tan.bio@outlook.com
License: MIT
Keywords: python,panCG,windows,mac,linux
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2
Requires-Dist: pandas>=2.2.2
Requires-Dist: pyBigWig>=0.3.23
Requires-Dist: pyranges>=0.1.2
Requires-Dist: ete3>=3.1.3
Requires-Dist: biopython>=1.84
Requires-Dist: networkx>=3.2.1
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: scipy>=1.13.1
Requires-Dist: jcvi>=1.4.21


# panCG pipeline
[![PyPI version](https://badge.fury.io/py/panCG.svg)](https://badge.fury.io/py/panCG)

<img src="/figures/panCG.png">



**Fig. 1.** Overview of the panCG pipeline, including the panCNS, pangene, and CG modules.

**(a)** Workflow of the panCNS module. (1) Multiple genome alignment: Input multiple genomes are used for reference-free multiple genome alignment via Progressive Cactus. (2) CNS identification: Each genome is individually designated as the reference genome to generate whole-genome alignments. PhastCons is then employed to identify conserved sequences; conserved sequences overlapping with CDSs are filtered out, yielding CNS regions for each genome. (3) Homologous group identification: Homologous CNS groups are identified based on the aforementioned multiple genome alignments and pairwise CNS comparisons. (4) Synteny cluster construction: Undirected CNS networks are constructed based on syntenic relationships of CNSs between species; these are connected networks. Rectangular nodes represent CNSs, with different colors indicating different species. Edges represent homologous relationships: green for CMR, gray for synteny, and red for best-hit relationships. (5) Index assignment: Members of each synteny network are assigned a unique index. (6) CNS retrieval: For CNSs missing from the index, their CMR PhastCons scores are evaluated. Those with scores exceeding the threshold and no overlap with CDSs are added to the CNS index and labeled “recall-CNS”; CNSs with scores above the threshold but overlapping with CDS are labeled “recall-CDS”; and those with scores below the threshold are designated as “recall-nonCE”. (7) Index retrieval and reassignment: CNSs retrieved in the previous step are incorporated into the index, and best-hit information is used to reassign indices to singleton CNSs. Finally, each CNS has a unique index and a reference-free panCNS is obtained. 
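
Step (6) of the panCNS workflow amounts to a two-way decision on each candidate region. The following is a minimal sketch, assuming a single numeric threshold and a precomputed CDS-overlap flag. The function name, its parameters, and the treatment of a score exactly equal to the threshold are illustrative assumptions, not panCG's actual code.

```python
def classify_recalled_region(score, threshold, overlaps_cds):
    """Return the recall label for one candidate region.

    score        -- PhastCons score of the region (hypothetical input)
    threshold    -- cutoff; the real default used by panCG is not stated here
    overlaps_cds -- True if the region overlaps any CDS
    """
    # Assumption: a score equal to the threshold does not "exceed" it.
    if score <= threshold:
        return "recall-nonCE"
    # Above the threshold: the CDS overlap decides between the two labels.
    if overlaps_cds:
        return "recall-CDS"
    return "recall-CNS"


if __name__ == "__main__":
    print(classify_recalled_region(0.92, 0.5, overlaps_cds=False))
```

Only regions labeled "recall-CNS" are added to the CNS index in the workflow described above.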

**(b)** Workflow of the pangene module. (1) Ortholog group identification: OrthoFinder is used to identify homologous groups. Circles represent genes, with colors distinguishing homologous genes from different species. Genes in gray ovals belong to the same gene group. (2) CPM clustering: Synteny networks are constructed for genes in each group, and genes are further clustered using the clique percolation method (CPM). Nodes represent genes, and gray edges represent syntenic relationships between genes; the set of genes enclosed by the dashed line denotes a gene cluster identified via CPM. (3) Network expansion: For genes lacking synteny, best-hit information is used to extend the gene synteny network. Red edges indicate homologous best-hit gene pairs. (4) Index assignment: A unique gene index is assigned to genes in each cluster. (5) Tree-based reassignment: For gene indices containing paralogous genes, the phylogenetic relationships between genes are considered to further refine index assignments. Finally, each gene has a unique index, and a reference-free pangene is obtained.
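
The clique percolation step in (2) can be illustrated on a toy synteny network. This is a minimal pure-Python sketch, not panCG's implementation: the gene names are made up, the clique size `k=3` is an assumption (the value used by panCG is not stated here), and the brute-force clique search only suits small graphs.

```python
from itertools import combinations


def k_clique_communities(edges, k=3):
    """Cluster nodes of an undirected graph by clique percolation."""
    # Build an adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Enumerate every k-clique by brute force (small networks only).
    cliques = [
        frozenset(nodes)
        for nodes in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(nodes, 2))
    ]

    # Union-find over cliques: two k-cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    # A community is the union of all cliques in one connected component.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())


# Hypothetical syntenic gene pairs from several species.
toy_edges = [
    ("A1", "B1"), ("A1", "C1"), ("B1", "C1"),  # triangle
    ("B1", "D1"), ("C1", "D1"),                # second triangle sharing B1-C1
    ("D1", "E1"),                              # E1 is in no triangle
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),        # separate triangle
]

if __name__ == "__main__":
    print(k_clique_communities(toy_edges))
```

In this toy network `E1` falls outside every cluster, which corresponds to the kind of gene that step (3) tries to attach through best-hit information. For real data, NetworkX offers `networkx.algorithms.community.k_clique_communities`; whether panCG uses it internally is not stated here.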

**(c)** Workflow of the CG module. (1) CNS-gene colocalization analysis: CNSs located in the upstream and downstream regions of each gene are extracted to form a CNS set. Based on CNS-gene colocalization patterns across species, we define Conserved Gene and Noncoding sequence Modules (CGNMs) as sets of co-localized CNSs and genes within the same index that are conserved in at least two species. Closely spaced CGNMs are further grouped into Conserved Gene and Noncoding Blocks (CGNBs). (2) Synteny network construction: CNS and gene sets corresponding to each gene index are independently derived from the panCNS and pangene modules, and used to construct CNS synteny networks and gene synteny networks, respectively. (3) Gene-CNS network construction: A unified network for panCNSs and pangenes is generated by merging CNS and gene synteny networks, which captures both collinearity and potential regulatory relationships among all genes and CNSs.

## Dependencies

1. `halLiftover` in [cactus](https://github.com/ComparativeGenomicsToolkit/cactus/blob/v2.9.3/BIN-INSTALL.md)

2. [phast](https://github.com/CshlSiepelLab/phast)

3. [JCVI](https://github.com/tanghaibao/jcvi)

4. [UCSC](https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/): `mafFilter`, `mafSplit`, `wigToBigWig`
``` shell
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafFilter
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafSplit
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
```

5. [orthofinder](https://github.com/davidemms/OrthoFinder)

6. [blast](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/)

7. [diamond](https://github.com/bbuchfink/diamond)

## Install
Make sure the above dependencies are installed and added to PATH.
``` shell
pip install panCG
panCG -h
```
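
Before the first run, it can help to confirm that the external tools are reachable. This is a minimal sketch using only the standard library; the list contains just the executable names given in the Dependencies section, and any other names you add (for example for PhastCons, OrthoFinder, BLAST, or DIAMOND) are assumptions to check against your own installation.

```python
import shutil

# Executable names taken from the Dependencies section; extend as needed.
REQUIRED_TOOLS = ["halLiftover", "mafFilter", "mafSplit", "wigToBigWig"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```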

## Usage
``` shell
usage: panCG [-h] [--version]  ...

    an integrative pipeline for family-level super-pangenome analysis across coding and noncoding sequences.

optional arguments:
  -h, --help     show this help message and exit
  --version      show program's version number and exit

Commands:

    callCns      Identification of CNS
    pangene      build gene index
    pancns       build CNS index
    GenePavAsso  Associating gene-PAVs with phenotypes between species
    GLSS         Identification of Gene lineage-specific Synteny networks
    CLSS         Identification of CNS lineage-specific Synteny networks
    CnsGeneLink  According to the relative position relationship between CNS and gene and the maximum number of species supported by CNS index and gene index, CNS index and gene index are linked.
    CnsSyntenyNet
                 Used to construct SyntenyNet for filtered pan-CNS
```

## Input file format requirements
1. Chromosome IDs may contain only letters, digits, and underscores (`_`). Special characters such as `:`, `-`, and `,` are not allowed.
2. The GFF annotation file should ideally contain only gene, mRNA, exon, CDS, and UTR records. Each gene record must have an `ID` field, and every other record must have a `Parent` field.
3. The gene BED file must be a standard 6-column BED file: `<chrID> <start> <end> <geneID> <score/0> <strand>`.
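
The first and third requirements can be checked before running the pipeline. This is a minimal sketch, not part of panCG: the function names are hypothetical, columns are assumed to be tab-separated, and the sixth column is assumed to hold the strand (`+` or `-`).

```python
import re

# Requirement 1: only letters, digits, and underscores in chromosome IDs.
CHROM_ID = re.compile(r"[A-Za-z0-9_]+")


def is_valid_chrom_id(chrom_id):
    """Return True if the chromosome ID uses only allowed characters."""
    return CHROM_ID.fullmatch(chrom_id) is not None


def check_bed6_line(line):
    """Return a list of problems found in one gene BED line (empty if none)."""
    fields = line.rstrip("\n").split("\t")  # assumption: tab-separated
    if len(fields) != 6:
        return ["expected 6 tab-separated columns, found %d" % len(fields)]

    chrom, start, end, gene_id, score, strand = fields
    problems = []
    if not is_valid_chrom_id(chrom):
        problems.append("chromosome ID %r contains disallowed characters" % chrom)
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be non-negative integers")
    elif int(start) >= int(end):
        problems.append("start must be smaller than end")
    if strand not in ("+", "-"):  # assumption: genes always carry a strand
        problems.append("strand must be '+' or '-'")
    return problems


if __name__ == "__main__":
    for example in ("Chr1\t100\t500\tgene1\t0\t+", "Chr-1\t100\t500\tgene1\t0\t+"):
        print(repr(example), "->", check_bed6_line(example) or "OK")
```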

## Output
### CNS calling
| Directory               | File suffix        | Description                                |
| ----------------------- | ------------------ | ------------------------------------------ |
| {Workdir}/03-phastCons/ | {species}.all.bw   | PhastCons conservation score file (bigWig) |
| {Workdir}/03-phastCons/ | {species}.CNSs.bed | CNS file of {species}                      |

### panCNS
| Directory             | File suffix        | Description                                        |
| --------------------- | ------------------ | -------------------------------------------------- |
| {Workdir}/Ref\_{ref}_ | .panGene.final.csv | Output panCNS file; each line represents one index |

### pangene
| Directory                     | File suffix  | Description            |
| ----------------------------- | ------------ | ---------------------- |
| {Workdir}/Ref\_{ref}_IndexDir | .panGene.csv | Resulting pangene file |



The Group column gives the homology group identified by OrthoFinder.

| Group column        | Description                                                  |
| ------------------- | ------------------------------------------------------------ |
| OGxxxxxxx.x         | Gene index obtained by subdividing the homology group        |
| OGxxxxxxx.x.Un      | Genes that remain confined to a single species after CPM clustering |
| OGxxxxxxx.x.tree_x  | Gene index further subdivided according to the phylogenetic relationships among its genes |
| OGxxxxxxx.x.tree_Un | Genes that could not be classified using phylogenetic relationships |
| UnMapOGXXXXXXX.x    | Genes that OrthoFinder did not assign to any cluster         |
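
The naming scheme in the table above can be decoded programmatically. This is a minimal sketch, assuming that each `x` is an integer and that the orthogroup part is `OG` followed by digits. The function name and the returned keys are hypothetical and not part of panCG.

```python
import re

# Pattern for labels such as OG0000012.3, OG0000012.3.Un, OG0000012.3.tree_2,
# OG0000012.3.tree_Un and UnMapOG0000001.1 (digit counts are an assumption).
GROUP_LABEL = re.compile(
    r"(?P<unmap>UnMap)?(?P<og>OG\d+)\.(?P<sub>\d+)"
    r"(?:\.(?P<suffix>Un|tree_(?P<tree>\d+|Un)))?"
)


def parse_group_label(label):
    """Split a Group label into its components."""
    m = GROUP_LABEL.fullmatch(label)
    if m is None:
        raise ValueError("unrecognized Group label: %r" % label)
    return {
        "orthogroup": m["og"],
        # False for the UnMap prefix (gene not clustered by OrthoFinder).
        "clustered_by_orthofinder": m["unmap"] is None,
        "sub_index": int(m["sub"]),
        # True for the plain .Un suffix (single-species set after CPM).
        "single_species_after_cpm": m["suffix"] == "Un",
        # None, a tree sub-index as a string, or "Un" (not classified by tree).
        "tree_index": m["tree"],
    }


if __name__ == "__main__":
    for label in ("OG0000012.3", "OG0000012.3.tree_Un", "UnMapOG0000001.1"):
        print(label, parse_group_label(label))
```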



## Quick start
We provide example data for testing, which can be downloaded at [figshare](https://doi.org/10.6084/m9.figshare.29662034.v1).

### cactus
``` shell
nohup /usr/bin/time -v cactus jobstore species.22way.info.txt Citrus.7ways.test_data.hal \
   --realTimeLogging True \
   --workDir /home/xxx/cactus_dir \
   --maxCores 16 --maxMemory 100G --maxDisk 200G > Citrus.7ways.cactus.log 2>&1 &
   
nohup /usr/bin/time -v cactus-hal2maf jobstore Citrus.7ways.test_data.hal C_sinensis.7ways.maf \
    --refGenome C_sinensis \
    --chunkSize 10000000 \
    --noAncestors \
    --dupeMode single \
    --workDir /home/xxx/cactus_dir > C_sinensis.hal2maf.single.log 2>&1 &
```

### call CNS
```shell
for i in C_sinensis C_limon ponkan C_australasica C_glauca F_hindsii A_buxifolia
do
    /usr/bin/time -v panCG callCns \
        -c /home/ltan/Tmp/01-PanCNSGene_test_data/panCG/Example/CNScalling.config.yaml \
        -w /home/ltan/Tmp/01-PanCNSGene_test_data/01-callcns/${i} \
        -r ${i} > ${i}.callCns.log 2>&1
done
```

### pangene
```shell
nohup /usr/bin/time -v panCG pangene \
    -c /home/ltan/Tmp/01-PanCNSGene_test_data/panCG/Example/panCG.config.yaml \
    -w /home/ltan/Tmp/01-PanCNSGene_test_data/02-pangene \
    -r C_sinensis > pangene.log 2>&1 &
```

### panCNS
```shell
nohup /usr/bin/time -v panCG pancns \
    -c /home/ltan/Tmp/01-PanCNSGene_test_data/panCG/Example/panCG.config.yaml \
    -w /home/ltan/Tmp/01-PanCNSGene_test_data/03-pancns \
    -r C_sinensis \
    -W /home/ltan/Tmp/01-PanCNSGene_test_data/02-pangene \
    > pancns.log 2>&1 &
```

## Citation

