Metadata-Version: 2.3
Name: folitools
Version: 0.4.0
Summary: Foli-seq data processing and analysis tools
Requires-Dist: biopython>=1.85
Requires-Dist: cutadapt-folitools==5.3.1.post0
Requires-Dist: cyclopts==3.22.5
Requires-Dist: joblib>=1.5.1
Requires-Dist: matplotlib>=3.10.3
Requires-Dist: matplotlib-venn>=1.1.2
Requires-Dist: natsort>=8.4.0
Requires-Dist: networkx>=3.5
Requires-Dist: numpy>=2.3.1
Requires-Dist: pandas>=2.3.0
Requires-Dist: polars>=1.31.0
Requires-Dist: pyarrow>=21.0.0
Requires-Dist: pysam>=0.23.3
Requires-Dist: scikit-learn>=1.7.0
Requires-Dist: scipy>=1.16.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: sgad>=1.0.1
Requires-Dist: tqdm>=4.67.1
Requires-Dist: umi-tools>=1.1.6
Requires-Dist: wcmatch>=10.1
Requires-Dist: xopen>=2.0.2
Requires-Dist: openpyxl
Requires-Dist: primer3-py>=2.2.0
Requires-Dist: pydeseq2>=0.5.3
Requires-Dist: ipykernel>=6.29.5 ; extra == 'notebook'
Requires-Dist: ipython>=9.3.0 ; extra == 'notebook'
Requires-Dist: ipywidgets>=8.1.7 ; extra == 'notebook'
Requires-Dist: pytest>=8.4.1 ; extra == 'test'
Requires-Dist: pytest-cov>=6.2.1 ; extra == 'test'
Requires-Python: >=3.12
Provides-Extra: notebook
Provides-Extra: test
Description-Content-Type: text/markdown

# folitools

**folitools** is a modular CLI toolkit and python library for Foli-seq data processing.

Foli-seq is a amplicon-based high-throughput sequencing method for profiling host gene expression from feces. Foli-seq is developed to probe amplicons of 320-380bp and sequence on PE150 Illumina sequencers.

---

## Installation

Install from PyPI:

```bash
pip install folitools
```

The dependencies from conda are also required (note: `cutadapt` is intentionally
**not** in this list — see below):

```bash
conda install -c bioconda -c conda-forge \
  fastp samtools bwa-mem2 star seqkit fastqc subread sambamba pigz gcc
```

### Cutadapt fork

Starting with folitools 0.4.0, `cutadapt` is installed from our fork,
published on PyPI as [`cutadapt-folitools`][pypi] (source:
[whatever60/cutadapt@folitools-perf][fork]). The fork installs the same
`cutadapt` Python module and `cutadapt` console script as upstream, so
nothing in your own tooling needs to change. It adds performance
improvements that matter for the `foli assign-probes` workload (hundreds
of i5/i7 primers, paired-end gzipped FASTQ inputs):

- `SeedMultiAdapterFilter` — a seed-and-extend pre-filter for large
  non-anchored multi-adapter groups. Match objects are **bit-identical**
  to upstream cutadapt's per-adapter iteration; only the speed changes.
  Auto-activates above 8 eligible adapters in a group.
- `--input-compression-threads N` — opt-in parallel decompression of
  gzipped inputs via `pigz`.
- `NPrefixIndexedAdapters` — a dormant class that offers a fast path for
  `^N{k}<body>` anchored adapters. Not wired into the auto path; reserved
  for future opt-in use.

End-to-end on a representative 3.1 M read-pair sample with 542×2 primers
(`-j 16`), the combined effect of the fork plus the `xopen`/`pigz`-backed
writer in `add_umi` cuts one sample from ~7:30 wall time to ~2:45
(~2.7× faster).

`cutadapt-folitools` is published as an sdist, so `pip install folitools`
will compile `cutadapt` from source. A C compiler is required — the
`gcc` entry in the conda command above covers it on Linux. On macOS the
Xcode command-line tools suffice.

If you already have the upstream `cutadapt` wheel installed from PyPI
it will be replaced by `cutadapt-folitools` when you install folitools,
because the two distributions install the same `cutadapt` module and
console script. Don't install both side-by-side.

If you would prefer the upstream cutadapt wheel on PyPI, you can skip
the fork with:

```bash
pip install --no-deps folitools
pip install "cutadapt>=5.1" <other-deps>
```

but the `foli assign-probes` stage will be slower.

[pypi]: https://pypi.org/project/cutadapt-folitools/
[fork]: https://github.com/whatever60/cutadapt/tree/folitools-perf

## Usage

Each stage of the pipeline is exposed as a command via `foli <subcommand>`.

**A note on input paths**: All commands that accept input paths through the `--input` parameter allow files locations of absolute paths, relative paths, or glob patterns. Multiple paths/patterns are also allowed.

**A note on paired-end files**: Foli-seq uses paired-end sequencing. When inputing FASTQ files, only R1 path is given to the command and the corresponding R2 files are automatically derived by replacing the R1 pattern with R2 pattern (e.g., `_R1_` → `_R2_`, `_1` → `_2`). Both R1 and R2 files are required for the pipeline to work.

### Expected Inputs

- Raw paired-end FASTQ files (following the naming pattern `*_R1_*.fastq.gz` and `*_R2_*.fastq.gz`)
- Adapter FASTA files
- STAR genome index directory
- GTF annotation file

Output directories are automatically created if they don't exist.

### Step 1. Preprocessing

```bash
foli qc \
    --input "/path/to/fastq/*_R1_*.fastq.gz" \
    --output-dir "/path/to/trimmed_reads"
```

This command performs read trimming using `fastp` and runs quality check on output files with `fastqc` and `seqkit`.

**What happens under the hood:**
1. `fastp` performs quality-based trimming with tail trimming and error correction
2. `fastqc` generates quality control reports for the trimmed reads
3. `seqkit stats` compiles summary statistics for all output files

You can specify input FASTQ files from any location using absolute paths, relative paths, or glob patterns. The paired-end R2 files are automatically derived from R1 files by replacing `_R1_` with `_R2_` (or `_1` with `_2`).

Output files (in the specified output directory):
- `{sample_name}_1.fq.gz` and `{sample_name}_2.fq.gz`: Trimmed paired-end reads
- `{sample_name}.fastp.html` and `{sample_name}.fastp.json`: fastp QC reports
- FastQC reports in `{output_dir}_fastqc/`
- `{output_dir}.stats`: Summary statistics table generated by seqkit

### Step 2. Probe assignment

```bash
foli assign-probes \
    --input "/path/to/trimmed_reads/*_1.fq.gz" \
    --output-dir "/path/to/umi_tagged" \
    --i5 "/path/to/i5_short.fasta" \
    --i7 "/path/to/i7_short.fasta"
```

This command performs gene probe primer assignment and trimming using `cutadapt`. This step extracts UMI sequences from adapter matches and embeds them in read names for downstream processing.

**What happens under the hood:**
1. `cutadapt` identifies i5 and i7 adapters in the reads (minimum 20bp overlap, 10% error rate)
2. Adapter sequences and primer names are added to read headers
3. A custom Python script (`folitools.add_umi`) parses the adapter information and embeds UMI sequences into read names in a structured format
4. `seqkit stats` generates summary statistics for the output files

After probe assignment, you can get read statistics:

```bash
foli get-read-stats \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-dir "/path/to/read_stats" \
    --overwrite
```

This analyzes the UMI-tagged reads to generate detailed statistics about primer assignment and read characteristics.

Output files:
- `{sample_name}_1.fq.gz` and `{sample_name}_2.fq.gz`: UMI-tagged, adapter-trimmed reads with UMI information embedded in read names
- `{output_dir}.stats`: Summary statistics table generated by seqkit
- Read statistics: `{sample_name}.parquet` files in the stats output directory (when using `foli get-read-stats`)

### Step 3. Mapping and Feature Counting

```bash
foli map \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-bam "/path/to/mapped_reads" \
    --output-star "/path/to/star" \
    --star-index "/path/to/star_index" \
    --gtf "/path/to/annotation.gtf"
```

This command performs streamlined alignment and feature counting using `STAR` and `featureCounts`.

**What happens under the hood:**
1. **STAR index preloading**: The genome index is loaded into shared memory once for all samples to improve processing speed
2. **Read filtering**: Short reads (< 60bp, considered primer dimers) are removed using `cutadapt`
3. **Alignment**: Filtered reads are aligned to the genome using `STAR` with:
   - Support for up to 1000 multi-mapping locations
   - Unmapped reads included in output (`--outSAMunmapped Within`)
   - Chimeric alignments detected and marked (`--chimOutType WithinBAM`)
4. **Feature assignment**: `featureCounts` assigns aligned reads to genomic features from the GTF file
5. **BAM tagging and sorting**: 
   - UMI sequences are added as BAM tags (`US` for raw UMI, `UC` for filtered UMI)
   - Cell barcodes added as `CB` tag
   - Gene assignments added as `XF` tag
   - BAM files are coordinate-sorted using `sambamba`

**Core allocation**: The `--cores` parameter is automatically split between STAR (70%) and featureCounts (30%) to optimize pipeline throughput.

For fractional counting of reads that overlap multiple features, you can use the `--allow-overlap` and `--allow-multimapping` flags:
```bash
foli map \
    --input "/path/to/umi_tagged/*_1.fq.gz" \
    --output-bam "/path/to/mapped_reads" \
    --output-star "/path/to/star" \
    --star-index "/path/to/star_index" \
    --gtf "/path/to/annotation.gtf" \
    --allow-overlap \
    --allow-multimapping
```

These flags enable fractional counting where reads overlapping multiple features or mapping to multiple locations contribute fractionally to each feature.

Output files:
- STAR alignment files in `{output_star}/{sample_name}/`:
  - `Aligned.out.bam`: Unsorted alignment file
  - `Log.final.out`, `Log.out`, `Log.progress.out`: STAR log files
  - `ReadsPerGene.out.tab`: Gene counts from STAR
- Sorted, tagged BAM files in `{output_bam}/`:
  - `{sample_name}.sorted.bam`: Final sorted BAM with UMI and gene tags
  - `{sample_name}.sorted.bam.bai`: BAM index file
- `featureCounts` results in `{output_bam}/`:
  - `{sample_name}.featureCounts`: Read-feature assignments
  - `{sample_name}.featureCounts.summary`: Assignment statistics

### Step 4. UMI-based Gene Counting

```bash
foli count \
    --input "/path/to/mapped_reads/*.sorted.bam" \
    --output-dir "/path/to/count_results"
```

This command processes BAM files using `umi_tools group` to generate UMI-deduplicated count data.

**What happens under the hood:**
1. **UMI grouping**: `umi_tools group` identifies and groups reads with:
   - The same UMI sequence (extracted from `UC` tag)
   - The same cell barcode (`CB` tag)
   - The same gene assignment (`XF` tag)
2. **Read handling**:
   - Paired-end reads are processed together
   - Unmapped reads, chimeric pairs, and unpaired reads are all retained in the output
   - Only primary alignments are used for grouping (secondary/supplementary alignments are ignored)
3. **Deduplication method**: Uses the "unique" method where only reads with identical UMI sequences are grouped together
4. **Output generation**: Produces detailed grouping information showing which reads were collapsed into each UMI group

Input BAM files should contain UMI tags (`UC`), cell tags (`CB`), and gene assignment tags (`XF`) from the previous mapping step.

After counting, generate the final count matrix:

```bash
foli get-count-mtx \
    --input "/path/to/count_results/*.group.tsv.gz" \
    --output "/path/to/final_counts.tsv" \
    --gtf "/path/to/annotation.gtf"
```

**What happens under the hood:**
1. Reads all UMI grouping tables from `umi_tools group` output
2. Counts unique UMI groups per gene per sample
3. If GTF is provided, maps gene IDs to gene symbols
4. Generates a gene × sample count matrix

You can also use the Python function directly:

```python
from folitools.get_matrix import read_counts
df = read_counts("/path/to/input_files", "/path/to/annotation.gtf")
df.to_csv("/path/to/output.tsv", sep="\t", index_label="gene")
```

Output files:
- `{sample_name}.group.tsv.gz`: UMI grouping tables with detailed information about which reads were grouped together
- `{sample_name}.group.log`: Processing logs showing statistics for each sample
- Final count matrix: Gene × sample matrix in TSV format with gene symbols (if GTF provided) or gene IDs

## Primer Selection Functionality

The primer selection module provides tools for designing and recovering PCR primer sets for targeted amplicon sequencing experiments.

### Workflow

The primer selection workflow provides a complete pipeline for primer design:

```bash
# Run complete primer design workflow using built-in reference
foli-primer workflow \
    --genes "genes.tsv" \
    --species "mouse" \
    --output-dir "primer_design/"

# Or use custom transcriptome FASTA for better coverage
foli-primer workflow \
    --genes "genes.tsv" \
    --txome-fasta "/path/to/transcriptome.fasta" \
    --output-dir "primer_design/"
```

**Input**: Gene table TSV with columns `gene` and `group`  
**Output**: Complete primer design including optimized primer sets, amplicon sequences, and IDT-compatible ordering files

**Note**: You can use either `--species` (mouse/human) for built-in references or `--txome-fasta` for custom transcriptome files. Custom FASTA files often provide better gene coverage than the built-in  references.

### Recover

The recover functionality helps validate and analyze primer sets from IDT order files:

```bash
# Recover primer information using built-in reference
foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --species "human"

# Or use custom transcriptome FASTA (recommended for better coverage)
foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --txome-fasta "/path/to/transcriptome.fasta"
```

This command analyzes primer sequences from an IDT order file, validates them against a reference transcriptome, and generates output files for downstream analysis. The output FASTA files (`i5_short.fasta` and `i7_short.fasta`) will contain the same number of unique primer sequences as found in the input Excel file, ensuring all original primers are represented.

**Note**: Using `--txome-fasta` with a custom transcriptome file is recommended over the built-in `--species` references as it typically provides better gene coverage than the packaged Gencode references.

You can also specify the amplicon length range, and whether the primers have linkers:
```bash
foli-primer recover \
    --order-excel "idt_order.xlsx" \
    --output-dir "recovered_output/" \
    --txome-fasta "/path/to/transcriptome.fasta" \
    --has-linker \
    --amplicon-length-range 300 400
```

**Input**: IDT order Excel file with primer sequences  
**Output**: 
- `summary_primer_to_order.xlsx`: Validated primer summary with amplicon information
- `primer_diagnose.pdf`: PDF report with analysis plots and statistics
- `i5_short.fasta` and `i7_short.fasta`: FASTA files for sequencing adapters

### Dimer Evaluation (through Python)

A primer set can be evaluated by calculating the thermodynamic properties of potential primer dimers for each pair of primers.

```python
from folitools.primer_selection.eval_dimer import dimer_thermo_property

# Evaluate primer-dimer interactions
result_matrix = dimer_thermo_property(
    primer_fwd_fasta="forward_primers.fasta",
    primer_rev_fasta="reverse_primers.fasta", 
    output_dir="dimer_analysis/",
    output_suffix="_analysis"
)
```

**Input**: FASTA files containing forward and reverse primer sequences  
**Output**: 
- Pairwise interaction matrix with thermodynamic properties
- Detailed analysis files in the output directory

## Testing

Run all tests:
```bash
pytest
```

Run only shell script tests:
```bash
pytest tests/test_shell_scripts.py -v
```

Run tests with coverage:
```bash
pytest --cov=src/folitools --cov-report=html
```

## In silico amplification for primer pool evaluation

Given a list of primer sequences used in foli-seq (with transcript-specific sequences, TruSeq R1/R2 overhang and with or without linker sequences), we validate the primer pool with de novo in silico PCR, without prior knowledge of the target genes, so that the upstream gene selection & panel design is decoupled from the evaluation here.

This functionality helps you understand which genes are actually targeted by your primer pool and identify potential issues like off-target binding or missing amplicons.

The workflow consists of the following steps:

1.  **Primer parsing and classification**: The input IDT order Excel file (first two columns: pool name and primer sequence) is parsed to extract primer sequences. Each primer is classified as forward or reverse based on matching against known universal prefixes:
    *   Forward: `ACACTCTTTCCCTACACGACGCTCTTCCGATCT` (TruSeq R1 adapter) + `NNNNNN` (UMI) + optional `ACATCA` linker
    *   Reverse: `GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT` (TruSeq R2 adapter) + `NNNNNN` (UMI) + optional `ATAGTT` linker
    
    After classification, the universal prefix and UMI are stripped to extract the gene-specific targeting sequence for each primer.

2.  **Transcriptome mapping**: Using `seqkit locate`, all gene-specific primer sequences are mapped against the provided transcriptome FASTA to identify where each primer can bind. This uses exact matching (no mismatches) to find all potential binding sites across all transcripts.

3.  **Amplicon enumeration**: Forward and reverse primer hits are systematically paired on each transcript to enumerate all possible amplicons. For a valid amplicon:
    *   Forward primer must match the transcript sequence (sense strand)
    *   Reverse primer must match the reverse complement (antisense strand)
    *   Primers must be on opposite strands with proper orientation
    *   Forward primer position must be upstream of reverse primer position

4.  **Amplicon filtering**: The enumerated amplicons are filtered based on:
    *   Length constraints (default 320-380 bp, configurable via `--amplicon-length-range`)
    *   Primer type validation (must be complementary R1 + R2 pair)
    
    Amplicons outside the target length range or with incorrect primer type combinations are excluded.

5.  **Grouping and summarization**: Valid amplicons are grouped by unique primer pairs. For each pair, the function:
    *   Aggregates all transcripts and genes that can be amplified
    *   Detects multi-mapping primers (primers binding to multiple genes)
    *   Checks for cross-pool amplicons (primer pair from different pools)
    *   Creates a semicolon-separated list of all transcript mappings

6.  **Gene name simplification** (optional, enabled by default): Gene symbols are processed to remove redundant information and improve readability. For example, gene families like "Gm12345" or immunoglobulin segments are consolidated to avoid cluttered names.

### Output Files

The command generates several output files in the specified output directory, saved in the order they are produced during the workflow:

1.  **`locate_df.csv`**: Raw output from `seqkit locate`. Contains all matches of gene-specific primer sequences in the transcriptome. This is the unprocessed mapping result before any processing. Columns:
    *   `seqID`: Transcript identifier from FASTA header (pipe-separated format)
    *   `pattern`: The gene-specific primer sequence that was searched
    *   `strand`: Strand orientation (`+` or `-`)
    *   `start`, `end`: Starting and ending positions of match on transcript (1-based, inclusive)
    *   `matched`: The actual matched sequence from the transcript

2.  **`locate_df_final.csv`**: Processed version of `locate_df` with additional metadata joined from the primer info and parsed transcript/gene information. Columns are:
    *   `primer_seq`, `primer_seq_full`: Gene-specific targeting sequence and complete primer (with universal prefix and UMI)
    *   `primer_type`: Forward (`fwd`) or reverse (`rev`) classification
    *   `transcript_id`, `gene_id`, `gene_symbol`: Transcript/gene identifiers parsed from seqID (indices 0, 1, and 5)
    *   `start`, `end`: Starting and ending positions of match on transcript (1-based, inclusive)
    *   `strand`: Strand orientation (`+` or `-`)
    *   `pool`: Pool assignment from IDT order file
    
    This file shows where each primer can bind across the transcriptome with full biological context.

3.  **`amplicon_all.csv`**: Complete enumeration of all possible amplicons formed by pairing forward and reverse primers on the same transcript. Each row represents one potential amplicon. Columns are:
    *   `primer_seq_fwd`, `primer_seq_rev`: Forward and reverse primer gene-specific sequences
    *   `primer_seq_fwd_type`, `primer_seq_rev_type`: Primer type classifications (should be `fwd` and `rev`)
    *   `transcript_id`, `gene_id`, `gene_symbol`: Transcript and gene identifiers
    *   `start_up`, `end_up`: Upstream (forward) primer binding site positions on transcript (1-based, inclusive)
    *   `start_down`, `end_down`: Downstream (reverse) primer binding site positions on transcript (1-based, inclusive)
    *   `pool_fwd`, `pool_rev`: Pool assignments for forward and reverse primers
    *   `flipped`: Boolean indicating if primer pair orientation was flipped during enumeration
    *   `amplicon_length`: Length of amplicon in base pairs (calculated as `end_down - start_up + 1`)
    *   `pass`: Boolean indicating whether the amplicon passes filtering criteria (length range and complementary primer types)
    
    This includes all enumerated amplicons. Filter by `pass == True` to get only valid amplicons that would be expected in the actual experiment.

4.  **`grouped.csv`**: Summary table where amplicons are grouped by unique primer pairs. Each row represents one primer pair with aggregated information across all its amplicons. Columns are:
    *   `primer_seq_fwd`, `primer_seq_rev`: Forward and reverse primer gene-specific sequences
    *   `transcript_id`, `gene_id`, `gene_symbol`: Representative transcript/gene identifiers (from first amplicon in group sorted by gene_id and transcript_id)
    *   `start_up`, `end_up`: Forward primer binding site positions on representative transcript (1-based, inclusive)
    *   `start_down`, `end_down`: Reverse primer binding site positions on representative transcript (1-based, inclusive)
    *   `pool_fwd`, `pool_rev`: Pool assignments for forward and reverse primers
    *   `num_transcripts`, `num_genes`: Number of unique transcripts and genes amplified by this primer pair
    *   `transcript_id_all`: Semicolon-separated list of all transcript-gene mappings for this pair (format: `transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length`)
    
    This shows the specificity of each primer pair, reveals multi-gene amplification, and provides detailed binding site information for each transcript target.

5.  **`primer_diagnose.pdf`**: A diagnostic PDF report containing a log-log histogram of amplicon lengths across all enumerated amplicons, with vertical dashed lines indicating the target length range. This visualization helps identify the distribution of amplicon sizes and assess whether the specified range captures the intended products.

6.  **`summary_primer_to_order.xlsx`**: Final Excel summary compatible with the primer design workflow format. Contains columns:
    *   `Group`, `geneSymbol`, `geneID`: Default group assignment, gene symbol(s), and gene ID (without version)
    *   `Chosen Index`, `amplicon_index`: Design index (default 0) and unique identifier (transcript ID without version + chosen index)
    *   `L_seq`, `R_seq`: Forward/reverse primer display sequences (with linker, excluding universal prefix and UMI)
    *   `primer_sequence_to_order_forward`, `primer_sequence_to_order_reverse`: Complete primer sequences as they appear in the order file (with universal prefix, UMI, and linker)
    *   `L_pool`, `R_pool`: Pool assignments for forward and reverse primers
    *   `multi_mapping`: Semicolon-separated list of all transcript-gene mappings (format: `transcript_id|gene_id|gene_symbol|start_up|end_up|start_down|end_down|amplicon_length`)

7.  **`i5_short.fasta` & `i7_short.fasta`**: FASTA files containing gene-specific sequences for forward (i5) and reverse (i7) primers respectively. These sequences include any linker sequences but exclude the universal TruSeq adapters and UMIs. The number of records matches the number of unique primer sequences in the input order file. These files are suitable for use in downstream SADDLE analysis or dimer prediction.

8.  **`recover.log`**: Detailed log file capturing all processing steps, sanity checks, and warnings. The log includes both truncated output (for readability) and full details for comprehensive debugging.

### Primer Naming Convention

In the output FASTA files (`i5_short.fasta` and `i7_short.fasta`), primer names (record IDs) are constructed based on the genes they amplify:

*   **Single Gene Mapping**: If a primer sequence maps to only one gene (e.g., `GAPDH`), the record ID is simply the gene symbol: `GAPDH`.

*   **Multi-Gene Mapping**: If a primer sequence maps to multiple genes, the record ID is a pipe-separated list of all unique gene symbols in sorted order. For example, a primer targeting both `GENE_A` and `GENE_B` would have ID: `GENE_A|GENE_B`.

*   **Gene Name Simplification**: When `--simplify-gene-name` is enabled (default), redundant tokens in gene symbols are removed. For instance, if a primer maps to both `Ighv1-1` and `Ighv1-2`, these might be collapsed to a family representative like `Ighv1` to avoid overly long names.

*   **ID Collision Resolution**: If multiple unique primer sequences map to the exact same set of genes (resulting in identical record IDs), a numeric suffix is automatically appended to distinguish them. For example: `GAPDH`, `GAPDH.1`, `GAPDH.2`. This ensures all primers are represented even when they target the same genes.

*   **Missing Mappings**: Primers from the input excel file that don't appear in any valid amplicon after filtering (e.g., no transcriptome match, or all amplicons fail length/type filters) are still included in the output FASTA files with ID `_NA` to ensure completeness. This guarantees that the output FASTA files contain exactly the same number of unique primer sequences as in the input order file.

### Gene Name List Consolidation

When a primer maps to multiple genes, the gene name simplification feature (enabled by default with `--simplify-gene-name`) applies intelligent filtering rules to produce clean, informative primer names. The consolidation follows a two-stage process, controlled by the `collapse_families` argument (default: `True`). When `collapse_families=False`, only Stage 1 (pseudo-like filtering) is applied; otherwise, both stages are applied.

#### Stage 1: Pseudo-like Gene Filtering

The first stage identifies and removes "pseudo-like" gene symbols—names that are typically uninformative placeholders or annotation artifacts. A gene symbol is considered pseudo-like if it matches any of these patterns:

*   **Database identifiers**: Ensembl gene IDs (e.g., `ENSG00000173366`, `ENSMUSG00000026193`)
*   **Mouse gene models**: Placeholder names like `Gm10358`, `Gm3839`
*   **RIKEN cDNA clones**: Mouse placeholders ending in `Rik` (e.g., `1700003D09Rik`)
*   **Long non-coding RNA placeholders**: Patterns like `MMP24OS`, `FOXP3-AS1`, `LINC01234`, `AC123456.1`, `AL123456.1`, `RP11-...`
*   **Explicit pseudogene markers**: Suffixes like `-ps`, `-rs`, or `RNAxxSPx` (e.g., `Clca4c-ps`, `Rn18s-rs5`, `RNA18SP5`)
*   **Pseudogene patterns**: Genes ending in digits followed by `P` (e.g., `DEFA10P`, `CLCA3P`, `KRT8P1`)
    *   Exception list protects real genes: `FOXP1-4`, `GBP1`, `ZBP1`, `GTPBP1`, `SPP1`, `DUSP1`, `AKAP1`, `TRAP1`, `RAMP1`
*   **Paralog-like variants**: Patterns like `Klk1b1` where a base ending in a digit is followed by lowercase letters and more digits

**Filtering rule**: If at least one non-pseudo-like gene symbol exists in the list, all pseudo-like symbols are dropped. If all symbols are pseudo-like, they are all retained to avoid losing all information.

**Special cases**:
*   **Mitochondrial genes** (prefixed with `MT-` or `mt-`) are always considered informative and kept
*   **Readthrough genes** (format `GENE_A-GENE_B`) are dropped only if both component genes (`GENE_A` and `GENE_B`) also appear separately in the list; otherwise the readthrough symbol is kept as informative

#### Stage 2: Family Collapsing

After pseudo-like filtering, the second stage collapses gene family members with trivial suffix variants—but only when a clear base gene exists. This prevents overly verbose names while avoiding arbitrary choices:

*   **Paralog families**: Patterns like `Klk1b1`, `Klk1b3`, `Klk1b11`, `Klk1b14-ps` are recognized as family members of base `Klk1`
    *   Trivial tails must have form: `<base><lowercase_letters><digits>` or `<base><digits><lowercase_letters>`, optionally with `-ps` suffix
    *   Base must be at least 4 characters long
*   **Pseudogene families**: Patterns like `KRT8P1`, `KRT8P2` are recognized as pseudogene variants of base `KRT8`
    *   Pattern: `<base_ending_in_digit>P<digits>`

**Collapsing rule**: Family members are collapsed **only if the base gene is present** in the list. For example:
*   `["Klk1", "Klk1b1", "Klk1b3"]` → `["Klk1"]` (base present, collapse all variants)
*   `["Klk1b1", "Klk1b3"]` → `["Klk1b1", "Klk1b3"]` (base absent, keep both variants separately)
*   `["Selenbp1", "Selenbp2"]` → `["Selenbp1", "Selenbp2"]` (these are distinct paralogs, not trivial variants)

This conservative approach ensures that family members are only merged when there's clear evidence (presence of the base gene) that they represent variants of the same functional unit, avoiding silent loss of information.

#### Examples

*   `["GAPDH", "ENSG00000111640"]` → `["GAPDH"]` (Ensembl ID dropped)
*   `["Ppp1r1b", "1700003D09Rik"]` → `["Ppp1r1b"]` (RIKEN clone dropped)
*   `["DEFA1", "DEFA10P"]` → `["DEFA1"]` (pseudogene dropped)
*   `["Klk1", "Klk1b1", "Klk1b11", "Klk1b3"]` → `["Klk1"]` (family collapsed to base)
*   `["KRT8P1", "KRT8P2"]` → `["KRT8P1", "KRT8P2"]` (no base `KRT8`, so keep both)
*   `["IFNAR2-IL10RB"]` → `["IFNAR2-IL10RB"]` (readthrough kept when components absent)
*   `["KLRC4-KLRK1", "KLRK1"]` → `["KLRK1"]` (readthrough dropped when component present)

This simplification is applied when constructing FASTA record IDs from multi-gene mappings, making primer names more concise and biologically meaningful while preserving essential information.

## Understanding Amplicon Counting Output

Foli-seq's amplicon-based design produces count matrices with feature names that differ from traditional bulk RNA-seq or scRNA-seq outputs. Because the protocol uses targeted primers that can amplify multiple amplicons (due to gene family similarity) and multiple primers can target the same transcript (or transcripts of the same gene), the feature naming reflects both gene identity and multi-mapping relationships.

During the pipeline, feature names evolve through several stages. When `cutadapt` identifies primer adapters in step 2, primer names are embedded into read names (e.g., `@READ_ID_ACGTAC_TGCATG_FGR+FGR`), creating amplicon-specific identifiers. In step 3, `featureCounts` assigns reads to gene features and stores results in the `XT` tag—either single genes (`XT:Z:ENSG00000010030`) or comma-separated lists for multi-mapping reads (`XT:Z:ENSG00000010030,ENSG00000285441`). The pipeline then combines these gene assignments with primer information to create composite features in the `XF` tag, such as `ENSG00000010030,FGR+FGR` (single gene amplified by FGR primers) or `ENSG00000010030,ENSG00000285441,FGR+FGR` (multi-mapping).

`umi_tools group` deduplicates reads by grouping on UMI sequence, cell barcode, and the gene-with-primer assignment from the `XF` tag. Finally, the `read_counts` function aggregates these amplicon-level counts to gene-level by stripping primer suffixes from feature names. When multiple primer pairs target the same transcript or transcripts of the same gene, their counts are averaged. Multi-mapping reads remain as pipe-separated features (e.g., `GENE_A|GENE_B`). If a GTF file is provided, gene IDs are converted to symbols while preserving multi-mapping relationships (e.g., `ENSG00000010030` → `FGR`, or `ENSG00000111|ENSG00000222` → `Klk1b1|Klk1b3`). For multi-gene features, the same gene name simplification logic described in the "Gene Name List Consolidation" section above is applied to remove pseudo-like genes and collapse family variants, ensuring consistent naming between primer pool validation and count matrix generation.

In summary, the resulting count matrix has samples as rows and "enhanced" gene features as columns. Single-mapping genes appear as simple gene symbols (`FGR`, `GAPDH`), while multi-mapping gene families are indicated with pipe-separated names (`BCL9L|CXCR5`). Thus, the values represent UMI-deduplicated read counts where each UMI corresponds to one original mRNA molecule, and multiple amplicons targeting the same gene are averaged. This design handles amplicon multiplicity through count aggregation and preserves biological ambiguity due to sequence similarity, providing gene-level quantification suitable for differential expression analysis.

## TODOs

- Add QC report code that summarizes all metrics into a table (need refactoring)
