Metadata-Version: 2.4
Name: astrogenomics-sgtree
Version: 1.0
Summary: Species tree construction from marker gene phylogenies
Author-email: "Juan C. Villada" <jvillada@lbl.gov>
Requires-Python: >=3.12,<3.13
Description-Content-Type: text/markdown
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: POSIX :: Linux
Requires-Dist: pandas>=2.2.3,<2.3
Requires-Dist: biopython>=1.86,<1.87
Requires-Dist: numpy>=1.26,<2
Requires-Dist: ete3>=3.1.3,<3.2
Requires-Dist: pyhmmer>=0.12.0,<0.13

# SGTree

SGTree is an end-to-end workflow for building phylogenetic trees. Use the provided HMM sets, or supply your own HMMs, to find the proteins of interest. SGTree then reconciles each gene tree against an approximate species tree to select the most likely correct copy of a protein when a marker is duplicated (paralogs, contamination).

## Setup

Install the Pixi environment:

```bash
pixi install
```

The environment is managed through `pixi.toml` only.

## Release and Deploy (AGP)

Use this procedure whenever you change code in this `sgtree` repo and want to publish a new release to PyPI and the `astrogenomics` pixi/prefix channel.

Important naming:

- PyPI project: `astrogenomics-sgtree`
- Conda/pixi package: `sgtree`

### 1. Prerequisites

Ensure these environment variables are set in your shell:

```bash
export PYPI_API_TOKEN="..."
export PREFIX_API_KEY="..."
```

AGP config used for this project:

- `/clusterfs/jgi/groups/science/homes/jvillada/my_software/agp/examples/sgtree.toml`

### 2. Run the AGP release

From the AGP repo, run:

```bash
cd /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp
pixi run -q agp \
  --config /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp/examples/sgtree.toml \
  --project /clusterfs/jgi/groups/science/homes/jvillada/my_software/sgtree \
  release <VERSION>
```

Example:

```bash
pixi run -q agp \
  --config /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp/examples/sgtree.toml \
  --project /clusterfs/jgi/groups/science/homes/jvillada/my_software/sgtree \
  release 2.0.1
```

What AGP does automatically:

- updates version fields in `sgtree/__init__.py` and `pixi.toml`
- builds and uploads `astrogenomics-sgtree` to PyPI
- updates `recipe.yaml` version, source URL, and sha256
- builds conda package `sgtree` and uploads it to `prefix.dev/astrogenomics`
- does not create git tags, push commits, or create GitHub releases (disabled for this config)

### 3. Verify the release

Check prefix channel:

```bash
pixi search sgtree -c https://prefix.dev/astrogenomics -c conda-forge
```

Check PyPI:

```bash
python -m pip index versions astrogenomics-sgtree
```

### 4. Install globally with pixi

```bash
pixi global install -c https://prefix.dev/astrogenomics -c conda-forge "sgtree==<VERSION>"
```

### 5. If the prefix upload fails but the build succeeded

Retry only the prefix upload with the built artifact:

```bash
cd /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp
pixi run -q rattler-build upload prefix \
  --channel astrogenomics \
  --skip-existing \
  /clusterfs/jgi/groups/science/homes/jvillada/my_software/sgtree/dist/conda/noarch/sgtree-<VERSION>-*.conda
```

## Run

Primary interface (Nextflow):

```bash
pixi run sgtree --help
```

Basic run:

```bash
pixi run sgtree \
  --genomedir <path to dir with protein faa files, one faa file per genome> \
  --modeldir <path to marker set .hmm>
```

Example run:

```bash
pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm
```

Marker-selection run with references and singleton filtering:

```bash
pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm \
  --outdir runs/nextflow/manual_full \
  --marker_selection true \
  --ref testgenomes/chlorref \
  --singles yes
```

`pixi run sgtree` writes logs automatically to `runs/nextflow/logs/`.
Marker searches and `--aln hmmalign` alignments are run with `pyhmmer`, which produces HMMER-compatible search output.

Example with IQ-TREE and explicit HMM threshold mode:

```bash
pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm \
  --tree_method iqtree \
  --iqtree_fast true \
  --hmmsearch_cutoff cut_ga
```

Alternative interface (Python implementation, without Nextflow):

```bash
pixi run sgtree-python testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8
```

Backward-compatible wrapper:

```bash
pixi run python ./bin/sgtree_wrapper.py testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8
```

## Settings

Core method controls:

- `--aln`: `hmmalign`, `mafft`, or `mafft-linsi` (default `hmmalign`).
- `--tree_method`: `fasttree` or `iqtree` (default `fasttree`) for both species tree and per-marker trees.
- `--iqtree_fast`: apply `-fast` when `--tree_method iqtree` (default `true`).
- `--iqtree_model`: IQ-TREE model string (default `LG+F+I+G4`).

HMM search thresholds:

- `--hmmsearch_cutoff cut_ga`: use model gathering cutoffs (recommended for curated marker sets such as UNI56).
- `--hmmsearch_cutoff cut_tc`: use model trusted cutoffs.
- `--hmmsearch_cutoff cut_nc`: use model noise cutoffs.
- `--hmmsearch_cutoff evalue --hmmsearch_evalue <float>`: use a plain E-value threshold.
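Because searches run through `pyhmmer`, the four threshold modes above can be pictured as a small translation into `pyhmmer` search options. The sketch below is illustrative only: `pyhmmer_threshold_kwargs` is a hypothetical helper, not SGTree's actual code. It assumes the `bit_cutoffs` and `E` keyword options that `pyhmmer` search pipelines accept.

```python
from __future__ import annotations


def pyhmmer_threshold_kwargs(cutoff: str, evalue: float | None = None) -> dict:
    """Translate an SGTree-style cutoff choice into pyhmmer search options.

    Hypothetical helper for illustration; SGTree may wire this differently.
    """
    # Model-defined bit score cutoffs (GA / TC / NC lines in the HMM file).
    bit_cutoffs = {"cut_ga": "gathering", "cut_tc": "trusted", "cut_nc": "noise"}
    if cutoff in bit_cutoffs:
        return {"bit_cutoffs": bit_cutoffs[cutoff]}
    if cutoff == "evalue":
        if evalue is None:
            raise ValueError("cutoff 'evalue' needs an explicit E-value")
        # Plain per-sequence E-value reporting threshold.
        return {"E": float(evalue)}
    raise ValueError(f"unknown cutoff mode: {cutoff!r}")
```

The returned dictionary could then be unpacked into a call such as `pyhmmer.hmmsearch(hmms, sequences, **kwargs)`.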

Genome inclusion/exclusion criteria:

- `--percent_models` (default `10`): minimum percentage of marker models that must be detected in a genome for it to be kept.
- `--max_sdup` (default `-1`): maximum allowed copies of any single marker in one genome; `-1` disables.
- `--max_dupl` (default `-1`): maximum allowed fraction of markers present in multiple copies; `-1` disables.
- `--lflt` (default `0`): optional per-marker length filter (% of median hit length).
- `--num_nei` (default `0`): optional singleton-removal neighbor count override (`0` keeps auto mode).
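To make the first three criteria concrete, here is a minimal sketch of how they might be applied to the marker counts of one genome. `keep_genome` is a hypothetical function, not part of SGTree. Two choices in it are assumptions: thresholds are treated as inclusive, and the duplicated fraction is computed relative to the total number of models in the marker set.

```python
def keep_genome(counts, n_models, percent_models=10.0, max_sdup=-1, max_dupl=-1.0):
    """Decide whether one genome passes the inclusion filters.

    counts: dict mapping marker name -> number of copies found in the genome.
    n_models: total number of marker models in the set.
    Hypothetical illustration; denominators and inclusivity are assumptions.
    """
    if n_models <= 0:
        raise ValueError("n_models must be positive")
    detected = [c for c in counts.values() if c > 0]

    # --percent_models: enough of the marker set must be present.
    if 100.0 * len(detected) / n_models < percent_models:
        return False

    # --max_sdup: no single marker may exceed this copy number (-1 disables).
    if max_sdup >= 0 and detected and max(detected) > max_sdup:
        return False

    # --max_dupl: cap on the fraction of multi-copy markers (-1 disables).
    if max_dupl >= 0:
        duplicated = sum(1 for c in detected if c > 1)
        if duplicated / n_models > max_dupl:
            return False
    return True
```

For example, a genome with counts `{"a": 1, "b": 2, "c": 0, "d": 1}` against four models passes the defaults but fails with `max_sdup=1`, because marker `b` has two copies.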

nsgtree-style mapping:

- `minmarker` -> `--percent_models` (fraction mapped to percent).
- `maxsdup` -> `--max_sdup`.
- `maxdupl` -> `--max_dupl`.
- `hmmsearch_cutoff` -> `--hmmsearch_cutoff` and `--hmmsearch_evalue`.
- `tmethod` -> `--tree_method`.
- `iq_*` model controls -> `--iqtree_model` (and `--iqtree_fast`).
- `mafftv`/`mafft` -> `--aln mafft` or `--aln mafft-linsi` (or `--aln hmmalign`).
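The one-to-one parts of this mapping can be sketched as a small translation function. `nsgtree_to_sgtree_flags` is a hypothetical helper written for illustration. It covers only the settings whose value format is clear from the list above, and leaves out `hmmsearch_cutoff`, `iq_*`, and `mafftv`/`mafft` because their nsgtree value formats are not specified here.

```python
def nsgtree_to_sgtree_flags(settings: dict) -> list[str]:
    """Translate a few nsgtree-style settings into SGTree CLI flags.

    Hypothetical helper; only the direct renames and the fraction-to-percent
    conversion described above are handled.
    """
    flags: list[str] = []
    if "minmarker" in settings:
        # nsgtree uses a fraction (0.1); SGTree expects a percent (10).
        percent = float(settings["minmarker"]) * 100
        flags += ["--percent_models", f"{percent:g}"]
    renames = {
        "maxsdup": "--max_sdup",
        "maxdupl": "--max_dupl",
        "tmethod": "--tree_method",
    }
    for old_name, new_flag in renames.items():
        if old_name in settings:
            flags += [new_flag, str(settings[old_name])]
    return flags
```

Under these assumptions, `{"minmarker": 0.1, "maxsdup": 2}` becomes `["--percent_models", "10", "--max_sdup", "2"]`.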

Practical selection guide:

- Curated marker sets (for example UNI56): start with `--hmmsearch_cutoff cut_ga`.
- Less curated/custom marker sets: start with `--hmmsearch_cutoff evalue --hmmsearch_evalue 1e-5`, then tighten if false positives appear.
- `--aln hmmalign` is the fastest stable default and keeps alignment behavior tied to each profile HMM.
- `--aln mafft-linsi` is slower but can help when marker-specific profile alignment is not desired.
- `--tree_method fasttree` is the quick default; `--tree_method iqtree --iqtree_fast true` is a practical higher-accuracy option.
- Typical inclusion presets:
  - Balanced: `--percent_models 10 --max_sdup 2 --max_dupl 0.25`
  - Strict: `--percent_models 30 --max_sdup 1 --max_dupl 0.10`
  - Relaxed: `--percent_models 5 --max_sdup -1 --max_dupl -1`
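When launching many runs from a script, the presets can be kept in one place and expanded into a command line. The sketch below is only a convenience wrapper around the documented flags. The names `PRESETS` and `sgtree_command` are hypothetical and do not exist in SGTree.

```python
# Inclusion presets copied from the guide above.
PRESETS = {
    "balanced": {"percent_models": 10, "max_sdup": 2, "max_dupl": 0.25},
    "strict": {"percent_models": 30, "max_sdup": 1, "max_dupl": 0.10},
    "relaxed": {"percent_models": 5, "max_sdup": -1, "max_dupl": -1},
}


def sgtree_command(genomedir: str, modeldir: str, preset: str = "balanced") -> list[str]:
    """Build an argv list for `pixi run sgtree` with one inclusion preset."""
    argv = ["pixi", "run", "sgtree", "--genomedir", genomedir, "--modeldir", modeldir]
    for name, value in PRESETS[preset].items():
        argv += [f"--{name}", str(value)]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` or printed with `shlex.join(argv)` for inspection.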

## Input Requirements

Proteomes must be protein FASTA files (`*.faa`). SGTree normalizes all inputs internally to `genome|protein` headers, for example:

```text
>IMG2684622718|2685462912
MLCAFAEEEAKIAETVGKVATELKVKKLLSDFATKEGEEHISTYNKIAMTAKAEGYADIEAMLCAFAEEEAKLQKL
```

Normalization behavior:

- Directory input (`--genomedir <dir>`): one proteome per `*.faa`; genome id is derived from filename stem.
- Single FASTA input (`--genomedir <file>`): if headers already contain `genome|protein`, the genome part is preserved.
- Headers and IDs are sanitized to avoid delimiter collisions.
- Malformed header joins (for example `...*>next_header`) are repaired before parsing.
- Invalid amino-acid characters are replaced with `X`; `*` is removed.
- Header mapping is written as `proteomes_header_map_<input>.tsv` in `--outdir`.

## Output Structure

Nextflow output (`--outdir`):

```text
<outdir>/
  tree.nwk
  tree_final.nwk                 # marker-selection mode
  tree_final.png                 # marker-selection mode
  marker_count_matrix.csv
  marker_count.txt               # basic mode
  marker_counts.txt              # marker-selection mode
  marker_selection_rf_values.txt # marker-selection mode
  color.txt
  log_genomes_removed.txt
  proteomes_header_map_<input>.tsv
```

Python output (`--save_dir`):

```text
<save_dir>/
  tree.nwk or tree_final.nwk
  tree_final.png                  # marker-selection mode
  marker_count_matrix.csv
  marker_selection_rf_values.txt  # marker-selection mode
  log_genomes_removed.txt
  logfile_*.txt
  temp/
    *.zip
    itol/
```
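Since the final tree has a different file name depending on the mode, downstream scripts may want to resolve it explicitly. The following is a small sketch using only the file names listed above. `final_tree_path` is a hypothetical helper, and preferring `tree_final.nwk` when both files exist is an assumption.

```python
from pathlib import Path


def final_tree_path(outdir) -> Path:
    """Return the most refined tree in an SGTree output directory.

    Prefers tree_final.nwk (marker-selection mode) over tree.nwk (basic mode).
    """
    outdir = Path(outdir)
    for name in ("tree_final.nwk", "tree.nwk"):
        candidate = outdir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no tree.nwk or tree_final.nwk in {outdir}")
```

The returned path can then be loaded with any Newick reader, for example `ete3.Tree(str(path))`.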

## Repository Structure

```text
sgtree/
  sgtree/                 # Python package implementation
  bin/sgtree_wrapper.py   # backward-compatible wrapper
  main.nf                 # Nextflow entrypoint
  workflows/              # DSL2 workflow composition
  modules/                # DSL2 process modules
  bin/                    # helper scripts and launch wrappers
  tests/
    regression_parity.py  # cross-engine parity checks
  resources/
    models/               # combined marker-set HMM files
  testgenomes/            # example query/reference data
  runs/                   # runtime outputs/work/logs (.gitkeep tracked)
  pixi.toml               # reproducible environment + tasks
  nextflow.config         # runtime defaults and CPU settings
```

## Workflow

```text
                            +-------------------+
                            |  Input Proteomes  |
                            |  + HMM Models     |
                            +---------+---------+
                                      |
                                      v
                             +--------+--------+
                             |    HMMSEARCH    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | PARSE_HMMSEARCH |
                             | marker matrix   |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | EXTRACT_SEQS    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | ALIGN (hmmalign/|
                             | mafft/linsi)    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | ELIM_DUPLICATES |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |     TRIMAL      |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |BUILD_SUPERMATRIX|
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |  TREE_BUILDER   |
                             |   tree.nwk      |
                             +--------+--------+
                                      |
                          marker_selection?
                           /            \
                        no               yes
                        |                 |
                        v                 v
                  +-----+-----+   +-------+--------+
                  | iTOL TXT  |   | per-marker     |
                  | marker_*  |   | TRIMAL+TREEBLD |
                  +-----------+   +-------+--------+
                                         |
                                         v
                                  +------+------+
                                  | RF_SELECTION|
                                  +------+------+
                                         |
                                 singles?|
                                  /      \
                               no         yes
                               |           |
                               v           v
                      +--------+----+   +--+---------+
                      | WRITE_CLEAN |   |REMOVE_     |
                      | ALIGNMENTS  |   |SINGLES     |
                      +--------+----+   +--+---------+
                               \           /
                                \         /
                                 v       v
                               +--+---------+
                               |TRIMAL_FINAL|
                               +--+---------+
                                  |
                                  v
                             +----+------+
                             |SUPERMATRIX|
                             +----+------+
                                  |
                                  v
                             +----+-------+
                             |TREE_BUILDER|
                             |tree_final  |
                             +----+-------+
                                  |
                                  v
                       +----------+-----------+
                       | tree_final.png       |
                       | marker_counts.txt    |
                       | marker_selection_rf  |
                       +----------------------+
```

## Repository Hygiene

Use this command for a clean runtime workspace between runs:

```bash
pixi run clean-runtime
```

## Authors and Contributors

| Author | Email | Date |
|---|---|---|
| Ewan Whittaker-Walker | ewanww@berkeley.edu | 05/19/2019 |
| Frederik Schulz | fschulz@lbl.gov | Since 2019 |
| Juan C. Villada | jvillada@lbl.gov | Since 2021 |
| Marianne Buscaglia | mbuscaglia@lbl.gov | Since 2022 |

