Metadata-Version: 2.4
Name: flavotyper
Version: 0.5.0
Summary: In silico serotyping for Flavobacterium psychrophilum
Author: FlavoTyper contributors
License-Expression: Apache-2.0
Project-URL: Homepage, https://forge.inrae.fr/eric.duchaud/flavotyper
Project-URL: Repository, https://forge.inrae.fr/eric.duchaud/flavotyper
Project-URL: Documentation, https://forge.inrae.fr/eric.duchaud/flavotyper/-/blob/main/README.md
Keywords: bioinformatics,genomics,microbiology,serotyping,Flavobacterium psychrophilum
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Pillow>=9.0
Requires-Dist: PyYAML>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=9.0; extra == "dev"
Provides-Extra: release
Requires-Dist: build>=1.2; extra == "release"
Requires-Dist: twine>=5.0; extra == "release"
Dynamic: license-file

# FlavoTyper


[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-lightgrey.svg)](https://forge.inrae.fr/eric.duchaud/flavotyper/-/blob/main/LICENSE) [![PyPI](https://img.shields.io/badge/PyPI-0.4.0-blue.svg)](https://pypi.org/project/flavotyper/0.4.0/) [![Bioconda](https://img.shields.io/conda/vn/bioconda/flavotyper.svg?label=Bioconda&color=green)](https://anaconda.org/bioconda/flavotyper) [![Platforms](https://img.shields.io/badge/platforms-Linux%20%7C%20macOS-blue.svg)](https://anaconda.org/bioconda/flavotyper) [![Anaconda-Server Badge](https://anaconda.org/bioconda/flavotyper/badges/downloads.svg?label=Downloads)](https://anaconda.org/bioconda/flavotyper)

FlavoTyper is a command-line bioinformatics tool that performs in silico serotyping of *Flavobacterium psychrophilum* genome assemblies.

---

## Introduction

Flavobacteriosis is a bacterial disease with significant impact on the global aquaculture industry, particularly affecting salmonids such as rainbow trout and Atlantic salmon. It causes substantial economic losses in fish farms worldwide.

The causative agent is *Flavobacterium psychrophilum*, a Gram-negative, rod-shaped psychrotrophic bacterium belonging to the family Flavobacteriaceae of the phylum Bacteroidota.

Phenotypic characterization of this pathogen (including serotyping based on the structural variations in the O-polysaccharide moiety of cell surface lipopolysaccharide) provides critical information for epidemiological surveillance, outbreak investigation, and the design of effective vaccines. FlavoTyper enables this characterization directly from genome assemblies, making serotyping scalable, reproducible, and independent of wet-lab assays.

FlavoTyper builds on previously published work: the multiplex PCR serotyping scheme of Rochat et al., 2017 (https://doi.org/10.3389/fmicb.2017.01752) and the functional characterization of the O-polysaccharide-encoding locus in a subset of strains by Cisar et al., 2019 (https://doi.org/10.3389/fmicb.2019.01041).

---

## Installation

Multiple installation options are available depending on the user context and needs. We recommend **Bioconda**, which installs FlavoTyper and its external tools (BLAST+, fastANI) in a single step. PyPI and from-source installs will require you to install those external tools yourself.

### Option 1 — Bioconda (recommended)

> We recommend installing a conda package manager, creating a separate environment, and activating it before installing and running FlavoTyper. New to conda? Follow the [First time with conda?](Troubleshooting.md#first-time-with-conda) walkthrough for detailed steps.

```bash
conda create -n flavotyper -c conda-forge -c bioconda flavotyper
conda activate flavotyper
```

### Option 2 — PyPI

> This option requires a working Python installation, then creating a virtual environment and activating it before installing FlavoTyper. First time with Python/pip? See the [step-by-step setup guide](Troubleshooting.md#first-time-with-pypi).

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install flavotyper
```

### Option 3 — From source

```bash
git clone https://forge.inrae.fr/eric.duchaud/flavotyper.git
cd flavotyper
python3 -m venv .venv
source .venv/bin/activate
pip install .
```

### External dependencies

Required only for the **PyPI** and **from-source** installs:

| Dependency | Minimum version | Purpose |
|---|---|---|
| BLAST+ (`blastn`, `makeblastdb`) | 2.12 | Marker alignment and locus comparison |
| fastANI | 1.3 | Species validation (ANI-based QC) |

The simplest way to get them is conda:

```bash
conda install -c conda-forge -c bioconda blast fastani
```

### Verify the installation

```bash
flavotyper --version
flavotyper data-dir
blastn -version
fastANI --version
```
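If you prefer to check for the external tools from Python (for example inside a pipeline wrapper), the following is a minimal sketch. It only tests whether the executables named in the dependency table are on `PATH`; it does not check their versions. The helper name `missing_tools` is hypothetical and not part of FlavoTyper.

```python
import shutil

# Executables listed as external dependencies in this README.
REQUIRED_TOOLS = ("blastn", "makeblastdb", "fastANI")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```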

---

## Quickstart

1. Place the genome assembly FASTA file(s) you want to type in one directory.
2. Run FlavoTyper:

```bash
flavotyper type --genomes path/to/genomes/ --outdir results/
```

3. View results in the output directory — the main output is `results/typing_results.tsv`.

---

## Data input

FlavoTyper accepts genome assemblies for *F. psychrophilum* in FASTA format (`.fa`, `.fna`, `.fasta`, `.fas`, optionally gzip-compressed). Both single-genome and multi-genome runs are supported:

```bash
# Single genome
flavotyper type --genomes genome.fasta --outdir results/

# Multiple genomes from a directory (all supported extensions are discovered automatically)
flavotyper type --genomes genomes/ --outdir results/ --threads 4
```

Sample identifiers are derived automatically from input filename stems.
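To anticipate which identifiers a run might produce, and to spot stem collisions before starting, you could use something like the sketch below. It assumes the stem is the filename without an optional `.gz` suffix and without one of the supported FASTA extensions; FlavoTyper's exact derivation may differ, and the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# FASTA extensions listed in this README.
FASTA_SUFFIXES = {".fa", ".fna", ".fasta", ".fas"}


def sample_id(path):
    """Strip an optional .gz suffix, then a supported FASTA extension."""
    name = Path(path).name
    if name.lower().endswith(".gz"):
        name = name[:-3]
    candidate = Path(name)
    if candidate.suffix.lower() in FASTA_SUFFIXES:
        return candidate.stem
    return candidate.name


def duplicate_ids(paths):
    """Return sample IDs that would be produced by more than one input file."""
    counts = Counter(sample_id(p) for p in paths)
    return sorted(sid for sid, n in counts.items() if n > 1)
```

Files such as `run1/s1.fa` and `run2/s1.fasta` would collide under this assumption, which is the situation `--allow-duplicate-sample-names` addresses.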

---

## Data output

All output files are written to the directory specified with `--outdir`.

### 1. Tabular format (TSV)

**`typing_results.tsv`** — the main output table, one row per sample.

Key columns include the assigned serotype, call state (Resolved / Partial / Ambiguous / NotTyped), detected markers, QC metrics, typing warnings, and a reference sentence for known serotypes. For the full column reference see [Results_Dictionary.md](https://forge.inrae.fr/eric.duchaud/flavotyper/-/blob/main/Results_Dictionary.md).
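As a quick way to summarize a run, the table can be read with Python's standard `csv` module. The sketch below counts samples per call state; it relies only on the `Call_state` column named in this README, and the function name is hypothetical.

```python
import csv
from collections import Counter


def count_call_states(tsv_path):
    """Count rows of typing_results.tsv per value of the Call_state column."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["Call_state"] for row in reader)
```

For example, `count_call_states("results/typing_results.tsv")` returns a `Counter` keyed by `Resolved`, `Partial`, `Ambiguous`, and `NotTyped` for whichever states occur in the run.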

### 2. JSON format

**`typing_results.jsonl`** — one complete JSON record per sample (same data as the TSV, machine-readable).
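A JSON Lines file holds one JSON object per line, so it can be loaded without any third-party library. The sketch below assumes the record keys match the TSV column names (for example `Call_state`); check [Results_Dictionary.md](https://forge.inrae.fr/eric.duchaud/flavotyper/-/blob/main/Results_Dictionary.md) for the actual keys. Both function names are hypothetical.

```python
import json


def load_records(jsonl_path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def with_call_state(records, state):
    """Keep records whose Call_state equals `state` (key name assumed)."""
    return [rec for rec in records if rec.get("Call_state") == state]
```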

**`run_metadata.json`** — run-level provenance: tool version, database name and checksums, parameters, run ID, and timestamp.

**`input_manifest.json`** — per-input manifest: source path, file size, and SHA-256 checksum.
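To confirm later that an input file is unchanged, you can recompute its SHA-256 digest and compare it with the value recorded in the manifest. The sketch below only computes the digest; the manifest's key names are not documented here, so the comparison itself is left to the reader. The function name is hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```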


*Optionally, when a call is "Resolved" and locus analysis is enabled, the tool produces the following outputs:*

### 3. FASTA format

**`<sample>_locus_sequence.fasta`** — the O-antigen biosynthesis locus sequence extracted from the input genome, generated when locus analysis is enabled and the call is Resolved.

### 4. PNG format

**`<sample>_locus_map.png`** — a two-track locus map showing the reference locus alongside the aligned sample region, with annotated marker positions.

### 5. Text format

**`<sample>_locus_alignment.txt`** — pairwise BLASTN alignment of the sample genome against the reference locus.

Locus analysis outputs are written to a per-sample subdirectory: `<outdir>/<sample>_locus_analysis/`.

Typically, the output directory layout is as follows:

```text
results/
├── typing_results.tsv
├── typing_results.jsonl
├── run_metadata.json
├── input_manifest.json
├── sample1_locus_analysis/          # only when --locus-analysis is enabled
│   ├── sample1_locus_map.png
│   ├── sample1_locus_alignment.txt
│   └── sample1_locus_sequence.fasta
└── sample2_locus_analysis/
    ├── sample2_locus_map.png
    ├── sample2_locus_alignment.txt
    └── sample2_locus_sequence.fasta
```

---

## FlavoTyper Modules

### QC module

The purpose of this first module is to ensure that:

1. The input genome corresponds to the species *Flavobacterium psychrophilum*.
2. The genome assembly quality allows a reliable assignment of the serotype.

Samples that fail QC are recorded as `NotTyped` in the output and skip the typing step.

1. Species check *(enabled by default)*

An ANI-based species validation step using fastANI is run before typing. The input genome is compared against the *F. psychrophilum* type-strain reference genome (NCIMB 1947T). Genomes below the ANI threshold (default: 95 %) are blocked from further typing steps.

This step can be disabled with `--no-species-check` when species identity has been confirmed independently.
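For readers who run fastANI separately, the gate can be sketched as follows. fastANI writes one tab-delimited line per query/reference pair with the ANI estimate in the third column; the sketch assumes a genome passes when its ANI is at or above the threshold. How FlavoTyper handles this internally (including genomes for which fastANI reports no line at all) is not shown here, and the function name is hypothetical.

```python
def passes_species_check(fastani_line, threshold=95.0):
    """Return True if the ANI in a fastANI output line meets the threshold.

    Expected columns: query, reference, ANI, mapped fragments, total fragments.
    """
    fields = fastani_line.rstrip("\n").split("\t")
    ani = float(fields[2])
    return ani >= threshold
```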

2. Assembly quality check

Before typing, FlavoTyper evaluates assembly quality using the following metrics:

- **Genome size**: flagged if outside the expected interval [2,619,202 – 3,122,663 bp] derived from a curated reference set.
- **Contig count**: advisory warning issued above 300 contigs; high-severity warning above 500 contigs.
- **GC percent**: the GC content is calculated for each provided genome.

The assembly quality check is advisory only, and provides informative warnings to the user about metrics that might affect the reliability of serotype assignment.
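The checks above can be approximated with a few lines of Python. This is a sketch under stated assumptions, not FlavoTyper's implementation: it takes uncompressed FASTA text, computes GC percent over all sequence characters (including any `N`), and uses hypothetical function names and warning strings. The thresholds are the ones quoted in this section.

```python
# Thresholds quoted in this README.
GENOME_SIZE_RANGE = (2_619_202, 3_122_663)
CONTIG_ADVISORY = 300
CONTIG_HIGH = 500


def assembly_metrics(fasta_text):
    """Compute contig count, total length, and GC percent from FASTA text."""
    contigs = 0
    length = 0
    gc = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            contigs += 1
        else:
            seq = line.upper()
            length += len(seq)
            gc += seq.count("G") + seq.count("C")
    gc_percent = 100.0 * gc / length if length else 0.0
    return {"contigs": contigs, "genome_size": length, "gc_percent": gc_percent}


def assembly_warnings(metrics):
    """Return advisory warnings; none of them blocks typing."""
    warnings = []
    low, high = GENOME_SIZE_RANGE
    if not low <= metrics["genome_size"] <= high:
        warnings.append("genome size outside expected interval")
    if metrics["contigs"] > CONTIG_HIGH:
        warnings.append("contig count above 500 (high severity)")
    elif metrics["contigs"] > CONTIG_ADVISORY:
        warnings.append("contig count above 300 (advisory)")
    return warnings
```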

### Typing module

The core module detects serotype-associated marker genes with BLASTN against the built-in marker database, then applies a declarative rule engine to assign serotype components independently:

- **O-type** — assigned from the exclusive detection of one O-antigen marker (a *wzy* gene or a putative *wzy* gene).
- **R-type** — assigned from base-group marker presence (R1, R2, R3 and R4) and optional inter-marker distance rules for variant confirmation (R1V1, R1V2 and R1V3).
- **S-type** — assigned independently from the S1 marker; S0 when absent, S1 when present.

The combined serotype is reported as `O:X-Sy-Rz` (e.g. `O:1-S0-R1V1`).
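When post-processing results, it can be handy to split a combined label back into its components. The sketch below handles labels of the `O:X-Sy-Rz` form shown above; how partially typed or `Undefined` components are rendered is not covered here, so such labels may raise an error. The function name is hypothetical.

```python
import re

# Matches labels such as "O:1-S0-R1V1".
SEROTYPE_PATTERN = re.compile(r"^O:(?P<o>[^-]+)-S(?P<s>[^-]+)-R(?P<r>.+)$")


def parse_serotype(label):
    """Split a combined serotype label into its O, S, and R components."""
    match = SEROTYPE_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not an O:X-Sy-Rz label: {label!r}")
    return {
        "O": "O:" + match["o"],
        "S": "S" + match["s"],
        "R": "R" + match["r"],
    }
```

With the example above, `parse_serotype("O:1-S0-R1V1")` gives `{"O": "O:1", "S": "S0", "R": "R1V1"}`.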

### Locus analysis module *(optional, `--locus-analysis`)*

When a call is Resolved and this module is enabled, a second BLASTN search aligns the genome against the full O-antigen biosynthesis locus of a reference strain. This produces:

- a pairwise alignment text file,
- the extracted locus FASTA sequence,
- a two-track PNG locus map.

Enable with `--locus-analysis`. Novel serotypes (not yet in the reference locus database) are flagged with a warning in `Typing_warnings` but are not blocked from receiving a type call.

---

## FlavoTyper Databases

All reference data is integrated in the FlavoTyper package. The built-in data directory can be retrieved with:

```bash
flavotyper data-dir
```

### `Flavotyper_markers.fasta`

This file includes nucleotide sequences for all marker genes used by the typing module. The BLAST database is built from this file at runtime.

A marker is considered present when a BLASTN hit meets both thresholds: percent identity ≥ 97 % and marker coverage ≥ 94 % (adjustable via `--min-identity` and `--min-coverage`).
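The two-threshold test can be written down as a small predicate. This sketch assumes marker coverage means the aligned length as a percentage of the full marker length; FlavoTyper's exact coverage calculation (for example, how gaps or multiple hits are treated) may differ. The function and parameter names are hypothetical.

```python
def marker_present(percent_identity, alignment_length, marker_length,
                   min_identity=97.0, min_coverage=94.0):
    """Apply the identity and coverage thresholds to a single BLASTN hit."""
    # Assumption: coverage is aligned length relative to marker length.
    coverage = 100.0 * alignment_length / marker_length
    return percent_identity >= min_identity and coverage >= min_coverage
```

Under this assumption, a hit at 98.5 % identity covering 950 of 1,000 marker bases passes, while the same hit covering only 930 bases does not.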

Markers currently covered:

**O-type** — each type (O:0–O:7) is detected by a unique wzy gene (`wzy0`–`wzy7`).

**R-type** — R0 is the default assignment when no R markers are detected (so far observed only in O:0). R1 variants share a common `r1_core` marker and are further distinguished by: `wfpF` (R1V1); `Rieske` + `wfpF_p` within a distance of −6 to +6 bp of each other (R1V2); `wfpF_pp` (R1V3). R2, R3, and R4 are each assigned from a single marker: `wfpH`, `wfpI`, and `r4_core` respectively.

**S-type** — S1 is assigned when `s1_core` is detected; S0 when it is absent.
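The R-type and S-type rules described above can be illustrated with a simplified sketch. It is not the rule engine shipped with FlavoTyper, which is driven by a YAML rules file. Assumptions made here: the set of detected marker names is already known, the `Rieske`/`wfpF_p` distance is supplied by the caller, a plain `R1` candidate is reported when `r1_core` is found without a confirmed variant, and several candidates are simply all returned. The function names are hypothetical.

```python
def r_type_candidates(markers, rieske_wfpf_p_distance=None):
    """List R-types consistent with the detected markers (simplified)."""
    candidates = []
    if "r1_core" in markers:
        variants = []
        if "wfpF" in markers:
            variants.append("R1V1")
        if ({"Rieske", "wfpF_p"} <= markers
                and rieske_wfpf_p_distance is not None
                and -6 <= rieske_wfpf_p_distance <= 6):
            variants.append("R1V2")
        if "wfpF_pp" in markers:
            variants.append("R1V3")
        # Assumption: fall back to the base group without a confirmed variant.
        candidates.extend(variants or ["R1"])
    for marker, r_type in (("wfpH", "R2"), ("wfpI", "R3"), ("r4_core", "R4")):
        if marker in markers:
            candidates.append(r_type)
    # R0 is the default when no R marker is detected.
    return candidates or ["R0"]


def s_type(markers):
    """S1 when the s1_core marker is detected, otherwise S0."""
    return "S1" if "s1_core" in markers else "S0"
```

A single candidate corresponds to a unique R-type assignment; more than one candidate is the kind of situation reported as `Ambiguous` in the results.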

### `Flavotyper_reference_loci.fasta`

This file includes full nucleotide sequences of reference O-antigen biosynthesis loci for each known serotype, with embedded metadata (reference strain, genome coordinates, GenBank accession, PMID, and per-marker positions). Used by the locus analysis module.

---

## Command reference

Run `flavotyper type --help` for the full CLI reference.

| Option | Default | Description |
|---|---|---|
| `--genomes` | required | One or more genome FASTA files, or a directory (`.fa`, `.fna`, `.fasta`, `.fas`, optionally `.gz` — discovered automatically) |
| `--outdir` | required | Output directory |
| `--db` | built-in | Path to the serotyping rules YAML |
| `--species-refs` | built-in | Reference FASTA for fastANI species check |
| `--no-species-check` | off | Disable *F. psychrophilum* species validation |
| `--ani-threshold` | 95.0 | Minimum ANI to pass the species gate |
| `--min-identity` | 97.0 | Minimum BLASTN percent identity for marker hits |
| `--min-coverage` | 94.0 | Minimum marker coverage (%) for marker hits |
| `--threads` | 1 | Threads passed to BLASTN and fastANI |
| `--locus-analysis` | off | Enable locus comparison and PNG map generation |
| `--locus-db` | built-in | Override the built-in reference-locus FASTA |
| `--allow-duplicate-sample-names` | off | Allow duplicate IDs from filename stems |

---

## Interpreting results

| `Call_state` | Meaning |
|---|---|
| `Resolved` | O-type and R-type were both uniquely assigned |
| `Partial` | One of O or R is `Undefined` — check `Typing_warnings` and assembly quality |
| `Ambiguous` | One of O or R matched multiple valid interpretations — check `Alternative_serotypes` |
| `NotTyped` | QC blocked typing — check `QC_warnings` and species fields |
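The table can be turned into a small triage helper for scripted review. The sketch below maps each call state to the follow-up fields named in the table and pulls them from one result record; it assumes records are dicts keyed by the TSV column names, and the function name is hypothetical.

```python
# Follow-up fields per call state, taken from the table above.
FOLLOW_UP_FIELDS = {
    "Resolved": [],
    "Partial": ["Typing_warnings"],
    "Ambiguous": ["Alternative_serotypes"],
    "NotTyped": ["QC_warnings"],
}


def follow_up(record):
    """Return the fields worth inspecting for one result record."""
    fields = FOLLOW_UP_FIELDS.get(record.get("Call_state"), [])
    return {field: record.get(field, "") for field in fields}
```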

---

## Troubleshooting

For common errors and questions — installation failures, QC warnings, partial or ambiguous calls, locus analysis not running — see [Troubleshooting.md](https://forge.inrae.fr/eric.duchaud/flavotyper/-/blob/main/Troubleshooting.md).

---

## Citation

If you use FlavoTyper in a publication or report, please cite the software metadata in [CITATION.cff](https://forge.inrae.fr/eric.duchaud/flavotyper/-/blob/main/CITATION.cff).

---

## License

Apache-2.0. See [LICENSE](https://forge.inrae.fr/eric.duchaud/flavotyper/-/blob/main/LICENSE).
