Metadata-Version: 2.1
Name: lactoscfa
Version: 0.4.1
Summary: Genome-level organic-acid and SCFA pathway profiling for lactobacilli and related bacteria.
Author: LactoSCFA developers
License: MIT
Project-URL: Homepage, https://github.com/Leopluswznn/LactoSCFA
Project-URL: Repository, https://github.com/Leopluswznn/LactoSCFA
Project-URL: Issues, https://github.com/Leopluswznn/LactoSCFA/issues
Keywords: microbiome,lactobacilli,SCFA,genomics,metabolic-pathways,bioinformatics
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE

# LactoSCFA

LactoSCFA is a lightweight command-line tool for genome-level prediction of organic-acid and short-chain fatty acid (SCFA) pathway potential from bacterial genomes or protein FASTA files.

The tool maps protein evidence to curated gene families, scores acid-producing pathway modules, and reports interpretable genome-level calls such as `complete`, `near_complete`, `partial`, and `absent`. It is designed for fast screening, comparative genomics, and thesis-level analysis where transparent pathway evidence is more useful than a black-box phenotype label.

LactoSCFA reports genomic potential. It does not directly measure metabolite concentration, growth-condition-dependent flux, or in vivo acid output.

## Validation

LactoSCFA was tested against an external phenotype-plus-genome dataset from a human gut Bacteroidales culture collection:

- Article: Zhang et al., 2024, *Cell Host & Microbe*, "Comprehensive analyses of a large human gut Bacteroidales culture collection reveal species- and strain-level diversity and evolution"
- DOI: [10.1016/j.chom.2024.08.016](https://doi.org/10.1016/j.chom.2024.08.016)
- Public data/code repository: [DFI-Bioinformatics/DFI_Bacteroidales](https://github.com/DFI-Bioinformatics/DFI_Bacteroidales)
- Phenotype source used here: [`data/metab.quant.matrix.csv`](https://github.com/DFI-Bioinformatics/DFI_Bacteroidales/blob/main/data/metab.quant.matrix.csv), described by the data repository as SCFA production or consumption in mM for 111 isolates measured by quantitative metabolomics.
- Dataset reconstructed for LactoSCFA validation: 111 genomes and 444 measured acid records covering acetate, propionate, butyrate, and succinate.
- Main validation results: strict butyrate prediction gave 6 TP and 105 TN, with no FP or FN across the 111 genomes (balanced accuracy 1.00); propionate-potential prediction gave 103 TP, 1 FP, 7 TN, and 0 FN (accuracy 0.991).
- Interpretation: the benchmark supports LactoSCFA for strict butyrate prediction and propionate-potential screening. Succinate underperformance is treated as a database/module improvement target rather than a negative biological conclusion.
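
The reported metrics can be recomputed from the confusion counts above. The sketch below assumes the usual definitions of accuracy and balanced accuracy (the README does not spell them out), and the function names are illustrative rather than part of LactoSCFA:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all calls that agree with the measured phenotype."""
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Counts taken from the validation summary. For butyrate, FP = FN = 0
# follows from 6 TP + 105 TN = 111 genomes.
butyrate = dict(tp=6, fp=0, tn=105, fn=0)
propionate = dict(tp=103, fp=1, tn=7, fn=0)

print(f"butyrate balanced accuracy: {balanced_accuracy(**butyrate):.2f}")  # 1.00
print(f"propionate accuracy: {accuracy(**propionate):.3f}")  # 0.991
```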

The article presents this phenotype information mainly as a heatmap. For the LactoSCFA benchmark, the machine-readable validation table was therefore reconstructed from the public repository matrix rather than copied from a supplementary phenotype table. The construction steps were:

1. Read `metab.quant.matrix.csv`, whose rows are isolate IDs and whose acid columns are `Acetate`, `Propionate`, `Butyrate`, and `Succinate`.
2. Convert the wide matrix to a long table: `111 isolates x 4 acids = 444 records`.
3. Classify measured phenotypes as `producer` when delta mM `> 0.1`, `consumer` when delta mM `< -0.1`, and `neutral` otherwise.
4. Match isolate IDs to public genome assemblies from BioProjects `PRJNA737800` and `PRJNA792599`, download protein FASTA files, run LactoSCFA `db_v2` profile mode, and compare acid-level calls with the reconstructed phenotype classes.
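
Steps 1 to 3 can be sketched with the standard library. This is a minimal illustration, not the script used for the benchmark: the ID column name, the demo rows, and the function names are assumptions, and every acid cell is assumed to hold a number.

```python
import csv
import io

ACIDS = ("Acetate", "Propionate", "Butyrate", "Succinate")


def classify(delta_mm, threshold=0.1):
    """Label a measured change in mM using the thresholds from step 3."""
    if delta_mm > threshold:
        return "producer"
    if delta_mm < -threshold:
        return "consumer"
    return "neutral"


def wide_to_long(handle, id_column):
    """Turn one row per isolate into one record per (isolate, acid) pair."""
    records = []
    for row in csv.DictReader(handle):
        for acid in ACIDS:
            delta = float(row[acid])
            records.append(
                {
                    "isolate": row[id_column],
                    "acid": acid.lower(),
                    "delta_mM": delta,
                    "phenotype": classify(delta),
                }
            )
    return records


# Made-up demo rows; the real matrix has 111 isolates and its own ID column.
demo = io.StringIO(
    "isolate,Acetate,Propionate,Butyrate,Succinate\n"
    "ISO_A,12.5,3.2,0.05,-0.4\n"
    "ISO_B,8.1,0.0,-0.1,6.7\n"
)
records = wide_to_long(demo, id_column="isolate")
print(len(records))  # 2 isolates x 4 acids = 8 records
```

Values exactly at `0.1` or `-0.1` fall into `neutral`, because both comparisons in step 3 are strict.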

## Installation

Python 3.11 or later is required.

Recommended Linux server installation with conda, DIAMOND, and Prodigal:

```bash
bash install_lactoscfa_linux.sh
./lactoscfa_lab.sh check-env
```

The installer creates or updates a conda environment named `lactoscfa`, installs `diamond` and `prodigal` from Bioconda, installs LactoSCFA, and writes a wrapper script that activates the environment before running the command.

Install from PyPI:

```bash
python -m pip install lactoscfa
lactoscfa --help
```

For offline Linux servers, install the wheel:

```bash
python3 -m pip install --user lactoscfa-0.4.1-py3-none-any.whl --no-deps
python3 -m lactoscfa.cli validate-db
```

If user-site installation is blocked, install into a local target directory and point `PYTHONPATH` at it:

```bash
python3 -m pip install --target ./lactoscfa_py --no-index --no-deps lactoscfa-0.4.1-py3-none-any.whl
PYTHONPATH="$PWD/lactoscfa_py" python3 -m lactoscfa.cli validate-db
```

For source installation:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install .
lactoscfa validate-db
```

If source or `.tar.gz` installation fails with `BackendUnavailable: Cannot import 'setuptools.build_meta'`, install the wheel instead or install `setuptools` in that Python environment.

## Use

DIAMOND is required for protein FASTA input; genome FASTA input requires both DIAMOND and Prodigal. Check the environment and the database before profiling:

```bash
lactoscfa check-env
lactoscfa validate-db
```
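
When runs are launched from a Python pipeline, the same tool requirement can be checked up front. The sketch below is not a replacement for `lactoscfa check-env`; it only looks for executables on `PATH`, and it assumes the binaries are named `diamond` and `prodigal` as installed from Bioconda. The helper names are hypothetical.

```python
import shutil

# Tools needed per input type, as stated in the text above.
REQUIRED_TOOLS = {
    "proteins": ("diamond",),
    "genomes": ("diamond", "prodigal"),
}


def find_tools(names=("diamond", "prodigal")):
    """Map each tool name to its path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


def missing_for(input_kind, found):
    """List the required tools that were not found for this input type."""
    return [tool for tool in REQUIRED_TOOLS[input_kind] if not found.get(tool)]


found = find_tools()
for kind in REQUIRED_TOOLS:
    print(kind, "missing:", missing_for(kind, found))
```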

Run protein FASTA profiling:

```bash
lactoscfa profile \
  --proteins path/to/protein_faa_dir \
  --mode strict \
  --threads 8 \
  --out results/lactoscfa_profile
```

Run genome FASTA profiling:

```bash
lactoscfa profile \
  --genomes path/to/genome_fasta_dir \
  --mode strict \
  --threads 8 \
  --out results/lactoscfa_genomes
```
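
For batch work it can help to build the `profile` command programmatically. The sketch below uses only the flags shown in the two examples above; the helper name `build_profile_command` is hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_profile_command(input_dir, out_dir, input_kind="proteins",
                          mode="strict", threads=8):
    """Assemble the argument list for `lactoscfa profile`."""
    if input_kind not in ("proteins", "genomes"):
        raise ValueError("input_kind must be 'proteins' or 'genomes'")
    return [
        "lactoscfa", "profile",
        f"--{input_kind}", str(input_dir),
        "--mode", mode,
        "--threads", str(threads),
        "--out", str(out_dir),
    ]


cmd = build_profile_command(
    "path/to/genome_fasta_dir", "results/lactoscfa_genomes", input_kind="genomes"
)
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.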

Explain one genome:

```bash
lactoscfa explain \
  --strain STRAIN_ID \
  --result results/lactoscfa_profile \
  --acid-set core,crossfeed \
  --text-summary \
  --out results/STRAIN_ID_explain
```

The command-line help provides the full option list:

```bash
lactoscfa --help
lactoscfa profile --help
lactoscfa explain --help
```

## Example `pathway_summary.txt`

`lactoscfa explain --text-summary` writes a compact, readable pathway-evidence summary for one genome. Example excerpt:

```text
Genome-level SCFA potential
ATCC_334 | db_v2 | profile mode

acid        status
acetate     complete
propionate  absent
butyrate    absent
lactate     complete
succinate   absent
formate     complete

Butyrate pathway evidence

P1 acetyl-CoA route            absent
metabolites: Acetyl-CoA -> acetoacetyl-CoA -> 3-hydroxybutyryl-CoA -> crotonyl-CoA -> butyryl-CoA -> butyrate
reaction genes:
  1. Acetyl-CoA --[thl detected]--> acetoacetyl-CoA
  2. acetoacetyl-CoA --[hbd missing]--> 3-hydroxybutyryl-CoA
  3. 3-hydroxybutyryl-CoA --[crt missing]--> crotonyl-CoA
  4. crotonyl-CoA --[bcd/etfAB missing]--> butyryl-CoA
  5. butyryl-CoA --[but/ptb-buk/ctfAB detected]--> butyrate
```
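
The numbered reaction lines in this excerpt are regular enough to parse when a quick tally of detected and missing steps is needed. The sketch below assumes the line layout shown in the excerpt and nothing more; the full file may contain other line types, and the structured `tables/pathway_details.tsv` output is likely the better source for downstream analysis.

```python
import re

# Matches lines such as: "  1. Acetyl-CoA --[thl detected]--> acetoacetyl-CoA"
STEP_RE = re.compile(
    r"^[ \t]*(?P<step>\d+)\.[ \t]+(?P<substrate>.+?)[ \t]+"
    r"--\[(?P<genes>.+?)[ \t]+(?P<status>detected|missing)\]-->"
    r"[ \t]+(?P<product>.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_steps(text):
    """Return one dict per reaction step found in a text summary."""
    return [match.groupdict() for match in STEP_RE.finditer(text)]


EXCERPT = """\
reaction genes:
  1. Acetyl-CoA --[thl detected]--> acetoacetyl-CoA
  2. acetoacetyl-CoA --[hbd missing]--> 3-hydroxybutyryl-CoA
  3. 3-hydroxybutyryl-CoA --[crt missing]--> crotonyl-CoA
  4. crotonyl-CoA --[bcd/etfAB missing]--> butyryl-CoA
  5. butyryl-CoA --[but/ptb-buk/ctfAB detected]--> butyrate
"""

steps = parse_steps(EXCERPT)
detected = [s["genes"] for s in steps if s["status"] == "detected"]
print(len(steps), "steps,", len(detected), "detected:", detected)
```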

## Output Files

Default `profile` and `score` runs write a compact result set:

```text
db_manifest.json
report.md
run_manifest.json
tables/acid_details.tsv
tables/acid_summary.tsv
tables/gene_hits.filtered.tsv
tables/pathway_details.tsv
tables/pathway_summary.tsv
tables/strain_summary.tsv
```
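
A pipeline can confirm that a run finished by checking for these files. The sketch below is an illustrative helper, not part of LactoSCFA; it assumes the paths listed above are relative to the `--out` directory.

```python
from pathlib import Path

# Compact result set listed above, relative to the output directory.
EXPECTED_OUTPUTS = (
    "db_manifest.json",
    "report.md",
    "run_manifest.json",
    "tables/acid_details.tsv",
    "tables/acid_summary.tsv",
    "tables/gene_hits.filtered.tsv",
    "tables/pathway_details.tsv",
    "tables/pathway_summary.tsv",
    "tables/strain_summary.tsv",
)


def missing_outputs(result_dir):
    """Return the expected files that are absent from result_dir."""
    root = Path(result_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


print(missing_outputs("results/lactoscfa_profile"))
```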

Use `--full-output` only when matrix tables and SVG summary figures are needed.

## Repository Notes

- `lactoscfa/`: Python package and command-line implementation.
- `lactoscfa/data/db_v2/`: packaged curated database used by default.
- `db_v2/`: source-tree copy of the curated database.
- `examples/`: minimal example files.
- `tests/`: regression tests.
- `scripts/`: development and analysis utilities.

For publication or thesis use, report the LactoSCFA version, database version, search mode, thresholds, and validation scope.
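
The installed package version can be read with the standard library when preparing a methods statement. This is a small sketch: it assumes the distribution is installed under the name `lactoscfa`, the helper names are hypothetical, and the database version and search mode are supplied by the caller rather than read from any LactoSCFA file.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="lactoscfa"):
    """Return the installed distribution version, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def methods_line(tool_version, db_version, mode):
    """Format the items to report; thresholds and validation scope go alongside."""
    return f"LactoSCFA {tool_version}, database {db_version}, search mode {mode}"


tool_version = installed_version() or "unknown"
print(methods_line(tool_version, db_version="db_v2", mode="strict"))
```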
