Metadata-Version: 2.4
Name: uht-tooling
Version: 0.5.5
Summary: Tooling for ultra-high throughput screening workflows.
Author: Matt115A
License-Expression: MIT
Requires-Python: ==3.10.*
Description-Content-Type: text/markdown
Requires-Dist: biopython==1.85
Requires-Dist: fuzzywuzzy==0.18.0
Requires-Dist: matplotlib==3.10.7
Requires-Dist: pandas==2.3.3
Requires-Dist: python-Levenshtein==0.27.3
Requires-Dist: pyyaml==6.0.3
Requires-Dist: pysam==0.23.3
Requires-Dist: scipy==1.15.3
Requires-Dist: seaborn==0.13.2
Requires-Dist: tabulate==0.9.0
Requires-Dist: tqdm==4.67.1
Requires-Dist: typer==0.20.0
Requires-Dist: mappy==2.30
Requires-Dist: nicegui>=2.0
Provides-Extra: gui
Requires-Dist: nicegui>=2.0; extra == "gui"
Provides-Extra: legacy-gui
Requires-Dist: gradio==5.49.1; extra == "legacy-gui"
Provides-Extra: dev
Requires-Dist: pytest==9.0.0; extra == "dev"
Requires-Dist: black==25.9.0; extra == "dev"
Requires-Dist: ruff==0.14.4; extra == "dev"

# uht-tooling

Automation helpers for ultra-high-throughput molecular biology workflows. The package ships both a CLI and an optional GUI that wrap the same workflow code paths.

---

## Installation

**Python version:** 3.10

### Quick install (recommended, easiest file maintenance)
```bash
pip install "uht-tooling[gui]"
```

This installs the core workflows together with the NiceGUI frontend. NiceGUI is already a core dependency, so `[gui]` is only a convenience alias and the plain install below pulls in the same packages:

```bash
pip install uht-tooling
```

> **Legacy Gradio interface:** The old Gradio GUI is still available via `pip install "uht-tooling[legacy-gui]"` and launched with `uht-tooling gui --legacy`.

### External Tools

Some workflows require external bioinformatics tools:

| Workflow | Required Tools |
|----------|---------------|
| mutation-caller | mafft |
| umi-hunter | mafft |
| ep-library-profile | minimap2, NanoFilt |

Install via conda:
```bash
conda install -c bioconda mafft minimap2 nanofilt
```

The CLI and GUI will validate tool availability before running and provide clear error messages if tools are missing.
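
If you want to check your environment before launching a workflow, the sketch below shows one way to do it from Python with `shutil.which`. The `REQUIRED_TOOLS` mapping and the `missing_tools` helper are illustrative names, not part of the package API; the executable names are copied from the table above.

```python
import shutil
from typing import Callable, Optional

# Executable names copied from the table above; this mapping is illustrative only.
REQUIRED_TOOLS = {
    "mutation-caller": ["mafft"],
    "umi-hunter": ["mafft"],
    "ep-library-profile": ["minimap2", "NanoFilt"],
}


def missing_tools(
    workflow: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS.get(workflow, []) if which(tool) is None]


if __name__ == "__main__":
    for name in REQUIRED_TOOLS:
        absent = missing_tools(name)
        status = "ok" if not absent else "missing: " + ", ".join(absent)
        print(f"{name}: {status}")
```

Workflows that are not listed in the table (the primer designers, for example) return an empty list, since they need no external tools.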

### Development install
```bash
git clone https://github.com/Matt115A/uht-tooling-packaged.git
cd uht-tooling-packaged
python -m pip install -e ".[gui,dev]"
```

The editable install exposes the latest sources, while the `dev` extras add linting and test tooling.

---

## Directory layout

- Reference inputs can live anywhere (you pass their paths on the CLI), but we recommend keeping them under `data/<workflow>/`.
- Outputs (CSV, FASTA, plots, logs) are written to `results/<workflow>/`.
- All workflows log to `results/<workflow>/run.log` for reproducibility and debugging.

---

## Command-line interface

The CLI is exposed as the `uht-tooling` executable. List the available commands:

```bash
uht-tooling --help
```

Each command mirrors a workflow module. Common entry points:

| Command | Purpose |
| --- | --- |
| `uht-tooling nextera-primers` | Generate Nextera XT primer pairs from a binding-region CSV. |
| `uht-tooling design-slim` | Design SLIM mutagenesis primers from FASTA/CSV inputs. |
| `uht-tooling design-kld` | Design KLD (inverse PCR) mutagenesis primers. |
| `uht-tooling design-gibson` | Produce Gibson mutagenesis primers and assembly plans. |
| `uht-tooling mutation-caller` | Summarise amino-acid substitutions from long-read FASTQ files. |
| `uht-tooling umi-hunter` | Cluster UMIs and call consensus genes. |
| `uht-tooling ep-library-profile` | Measure mutation rates in plasmid libraries without UMIs. |
| `uht-tooling profile-inserts` | Extract and analyse inserts defined by flanking probe pairs. |

Each command provides detailed help, including option descriptions and expected file formats:

```bash
uht-tooling mutation-caller --help
```

### Short Flags

All commands support short flags for common options:

```bash
# Long form
uht-tooling design-slim --gene-fasta gene.fa --context-fasta ctx.fa --mutations-csv mut.csv --output-dir out/

# Short form
uht-tooling design-slim -g gene.fa -c ctx.fa -m mut.csv -o out/
```

| Long Flag | Short | Commands |
|-----------|-------|----------|
| `--gene-fasta` | `-g` | design-slim, design-kld, design-gibson |
| `--context-fasta` | `-c` | design-slim, design-kld, design-gibson |
| `--mutations-csv` | `-m` | design-slim, design-kld, design-gibson |
| `--output-dir` | `-o` | 7 commands |
| `--log-path` | `-l` | 7 commands |
| `--template-fasta` | `-t` | mutation-caller, umi-hunter |
| `--fastq` | `-q` | 4 commands |
| `--threshold` | `-T` | mutation-caller |
| `--config-csv` | `-C` | umi-hunter |
| `--binding-csv` | `-b` | nextera-primers |
| `--probes-csv` | `-P` | profile-inserts |
| `--region-fasta` | `-R` | ep-library-profile |
| `--plasmid-fasta` | `-p` | ep-library-profile |
| `--work-dir` | `-w` | ep-library-profile |
| `--config` | `-K` | global (all commands) |

You can pass multiple FASTQ paths using repeated `--fastq` options or glob patterns. Optional `--log-path` flags redirect logs if you prefer a location outside the default results directory.

---

## Configuration File

uht-tooling supports a YAML configuration file for default options.

**Auto-discovery locations** (in order):
1. `$UHT_TOOLING_CONFIG` environment variable
2. `~/.uht-tooling.yaml`
3. `~/.config/uht-tooling/config.yaml`
4. `.uht-tooling.yaml` (current directory)

Or specify explicitly: `uht-tooling --config my-config.yaml ...`

**Example ~/.uht-tooling.yaml:**
```yaml
paths:
  output_dir: ~/results/uht-tooling

defaults:
  mutation_caller:
    threshold: 15
  umi_hunter:
    umi_identity_threshold: 0.85
    min_cluster_size: 5
```

CLI options always take precedence over config values.
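
The lookup order and precedence rule above can be sketched roughly as follows. This is not the package's implementation: `discover_config` and `resolve_option` are hypothetical helpers, and the behaviour when `$UHT_TOOLING_CONFIG` points at a missing file (here: fall through to the next location) is an assumption.

```python
from pathlib import Path
from typing import Any, Mapping, Optional


def discover_config(env: Mapping[str, str], home: Path, cwd: Path) -> Optional[Path]:
    """Return the first existing config file, following the documented order."""
    candidates = []
    if env.get("UHT_TOOLING_CONFIG"):
        candidates.append(Path(env["UHT_TOOLING_CONFIG"]))
    candidates += [
        home / ".uht-tooling.yaml",
        home / ".config" / "uht-tooling" / "config.yaml",
        cwd / ".uht-tooling.yaml",
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None


def resolve_option(cli_value: Any, config_value: Any, default: Any) -> Any:
    """CLI value beats config value, which beats the built-in default."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return default
```

With the example file above, `resolve_option(None, 15, 10)` yields the configured threshold of 15, while passing `--threshold 20` on the command line would correspond to `resolve_option(20, 15, 10)` and win.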

---

## Workflow reference

### Nextera XT primer design

**Inputs:**
- `--binding-csv` — CSV with a `binding_region` column (row 1 = i7 forward region, row 2 = i5 reverse region, both 5'→3')
- `--output-csv` — path for the output primer CSV

**Outputs:**
- Single CSV with columns `[primer_name, sequence]`

1. Prepare `data/nextera_designer/nextera_designer.csv` with a `binding_region` column. Row 1 should contain the forward region, row 2 the reverse region, both in 5'→3' orientation.
2. Optional: supply a YAML overrides file for index lists/prefixes via `--config`.
3. Run:
   ```bash
   uht-tooling nextera-primers \
     --binding-csv data/nextera_designer/nextera_designer.csv \
     --output-csv results/nextera_designer/nextera_xt_primers.csv
   ```
4. Primer CSVs will be written to `results/nextera_designer/`, accompanied by a log file.

The helper is preloaded with twelve i5 and twelve i7 indices, enabling up to 144 unique amplicons.
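
The 144 figure is simply every i7 index paired with every i5 index. A minimal illustration, using placeholder index names rather than the real sequences shipped with the package:

```python
from itertools import product

# Placeholder names; the actual i5/i7 index sequences ship with the package.
i5_indices = [f"i5_{n:02d}" for n in range(1, 13)]
i7_indices = [f"i7_{n:02d}" for n in range(1, 13)]

# Every (i7, i5) combination labels one amplicon uniquely.
pairs = list(product(i7_indices, i5_indices))
print(len(pairs))  # 144
```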

#### Wet-lab workflow notes

- Perform the initial amplification with an i5/i7 primer pair and monitor a small aliquot by qPCR. Cap thermocycling early so you only generate ~10% of the theoretical yield—this minimizes amplification bias.
- Purify the product with SPRIselect beads at approximately a 0.65:1 bead:DNA volume ratio to remove residual primers and short fragments.
- Confirm primer removal and quantify DNA using electrophoresis (e.g., BioAnalyzer DNA chip) before moving to the flow cell.

### SLIM primer design

- Inputs:
  - `data/design_slim/slim_template_gene.fasta`
  - `data/design_slim/slim_context.fasta`
  - `data/design_slim/slim_target_mutations.csv` (single `mutations` column)
- Run:
  ```bash
  uht-tooling design-slim \
    --gene-fasta data/design_slim/slim_template_gene.fasta \
    --context-fasta data/design_slim/slim_context.fasta \
    --mutations-csv data/design_slim/slim_target_mutations.csv \
    --output-dir results/design_slim/
  ```
**Outputs:**
- `SLIM_primers.csv` — columns `[Primer Name, Sequence]`, 4 primers per mutation (`_Lf`, `_Sr`, `_Lr`, `_Sf`)

Mutation nomenclature examples:
- `A123G` (substitution)
- `T241Del` (deletion)
- `T241TS` (insert Ser after Thr241)
- `L46GP` (replace Leu46 with Gly-Pro)
- `A123:NNK` (library mutation with degenerate codon)
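
As a rough guide to how these strings can be told apart, here is a small parsing sketch. It is an assumption about the grammar inferred from the five examples above, not the package's parser; `Mutation` and `parse_mutation` are hypothetical names. In particular, it treats a multi-residue replacement as an insertion when its first residue repeats the wild-type residue, and as an indel otherwise.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    kind: str      # substitution | deletion | insertion | indel | library
    wt: str        # wild-type amino acid, one-letter code
    position: int  # 1-based residue number
    payload: str   # new residue(s), inserted residue(s), or degenerate codon


_LIBRARY = re.compile(r"^([A-Z])(\d+):([ACGTRYSWKMBDHVN]{3})$")
_DELETION = re.compile(r"^([A-Z])(\d+)Del$")
_RESIDUES = re.compile(r"^([A-Z])(\d+)([A-Z]+)$")


def parse_mutation(text: str) -> Mutation:
    """Classify one entry of the `mutations` column (inferred grammar)."""
    text = text.strip()
    if m := _LIBRARY.match(text):
        return Mutation("library", m[1], int(m[2]), m[3])
    if m := _DELETION.match(text):
        return Mutation("deletion", m[1], int(m[2]), "")
    if m := _RESIDUES.match(text):
        wt, pos, new = m[1], int(m[2]), m[3]
        if len(new) == 1:
            return Mutation("substitution", wt, pos, new)
        if new[0] == wt:
            # e.g. T241TS: keep Thr241, insert Ser after it
            return Mutation("insertion", wt, pos, new[1:])
        # e.g. L46GP: replace Leu46 with Gly-Pro
        return Mutation("indel", wt, pos, new)
    raise ValueError(f"Unrecognised mutation: {text!r}")
```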

#### Library mutations with degenerate codons

For saturation mutagenesis and library generation, SLIM supports degenerate (IUPAC ambiguity) codons using the format `<WT_AA><position>:<codon>`. The codon must be exactly 3 characters using valid IUPAC nucleotide codes:

| Code | Bases | Mnemonic |
|------|-------|----------|
| A, C, G, T | Single base | Standard |
| R | A, G | puRine |
| Y | C, T | pYrimidine |
| S | G, C | Strong |
| W | A, T | Weak |
| K | G, T | Keto |
| M | A, C | aMino |
| B | C, G, T | not A |
| D | A, G, T | not C |
| H | A, C, T | not G |
| V | A, C, G | not T |
| N | A, C, G, T | aNy |

Common degenerate codon schemes for library construction:

| Scheme | Codons | Amino acids | Stop codons | Notes |
|--------|--------|-------------|-------------|-------|
| NNK | 32 | 20 | 1 (TAG) | Reduced stop codon frequency |
| NNS | 32 | 20 | 1 (TAG) | Equivalent to NNK |
| NNN | 64 | 20 | 3 | All codons, higher stop frequency |
| NDT | 12 | 12 | 0 | F, L, I, V, Y, H, N, D, C, R, S, G only |

Example CSV with mixed mutation types:
```csv
mutations
A123G
T50:NNK
S100:NNS
T241Del
```

The workflow validates that the wild-type amino acid matches the template sequence and logs library coverage information (number of possible codons and amino acids) for each degenerate mutation. Primers are generated with the degenerate bases embedded; reverse primers contain the correct IUPAC reverse complements (e.g., K↔M, R↔Y, S↔S).
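
The coverage numbers in the scheme table, and the reverse-complement rule for ambiguity codes, can be reproduced with a few lines of standard-library Python. This is an independent sketch using the standard genetic code; the function names are illustrative and do not mirror the package internals.

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Each code maps to the code that covers the complementary set of bases.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def expand(codon: str) -> list[str]:
    """List every concrete codon encoded by a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in codon.upper()))]


def coverage(codon: str) -> dict:
    """Count codons, distinct amino acids, and stop codons for a scheme."""
    codons = expand(codon)
    amino_acids = {CODON_TABLE[c] for c in codons} - {"*"}
    stops = sorted(c for c in codons if CODON_TABLE[c] == "*")
    return {"codons": len(codons), "amino_acids": len(amino_acids), "stops": stops}


def reverse_complement(seq: str) -> str:
    """Reverse complement that also handles IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    for scheme in ("NNK", "NNS", "NNN", "NDT"):
        print(scheme, coverage(scheme))
    print(reverse_complement("NNK"))  # MNN
```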

#### Experimental blueprint

- Hands-on time is approximately three hours (excluding protein purification), with mutant protein obtainable in roughly three days.
- Conduct two PCRs per mutant set: (A) long forward with short reverse and (B) long reverse with short forward.
- Combine 10 µL from each PCR with 10 µL H-buffer (150 mM Tris pH 8, 400 mM NaCl, 60 mM EDTA) for a 30 µL annealing reaction: 99 °C for 3 min, then two cycles of 65 °C for 5 min followed by 30 °C for 15 min, hold at 4 °C.
- Transform directly into NEB 5-alpha or BL21 (DE3) cells without additional cleanup. The protocol has been validated for simultaneous introduction of dozens of mutations.

### KLD primer design

KLD (Kinase-Ligase-DpnI) is an alternative mutagenesis method that uses inverse PCR to amplify the entire plasmid, with the mutation incorporated at the primer junction.

- Inputs: Same as SLIM design
  - `data/design_kld/kld_template_gene.fasta`
  - `data/design_kld/kld_context.fasta`
  - `data/design_kld/kld_target_mutations.csv` (single `mutations` column)
- Run:
  ```bash
  uht-tooling design-kld \
    --gene-fasta data/design_kld/kld_template_gene.fasta \
    --context-fasta data/design_kld/kld_context.fasta \
    --mutations-csv data/design_kld/kld_target_mutations.csv \
    --output-dir results/design_kld/
  ```
**Outputs:**
- `KLD_primers.csv` — columns `[Primer Name, Sequence, Tm (binding), GC%, Length, Notes]`, 2 primers per mutation (`_F`, `_R`)

Mutation nomenclature: Same as SLIM (substitution, deletion, insertion, indel, library).

#### KLD vs SLIM

| Method | Primers | Mechanism | Best for |
|--------|---------|-----------|----------|
| SLIM | 4 per mutation | Overlap assembly | Multiple simultaneous mutations |
| KLD | 2 per mutation | Inverse PCR + ligation | Single mutations, simpler workflow |

#### KLD primer design rules

- Forward primer: Mutation codon at 5' end + downstream template-binding region
- Reverse primer: Reverse complement of upstream region, 5' end adjacent to forward
- Tm calculated on template-binding regions only (50-65°C target)
- Tm difference between primers kept within 5°C
- GC content 40-60%
- Binding region 18-24 bp
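
If you want to sanity-check a primer pair by hand against these rules, something like the sketch below works. It is not the designer's own code: the helper names are hypothetical, and the Tm values are taken as inputs (for example from the `Tm (binding)` column of `KLD_primers.csv`) rather than recomputed, because the Tm model used by the tool is not described here.

```python
def gc_percent(seq: str) -> float:
    """GC content of a sequence as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def check_binding_region(seq: str, tm_celsius: float) -> list[str]:
    """Return human-readable rule violations for one template-binding region."""
    issues = []
    if not 18 <= len(seq) <= 24:
        issues.append(f"length {len(seq)} bp outside 18-24 bp")
    gc = gc_percent(seq)
    if not 40.0 <= gc <= 60.0:
        issues.append(f"GC content {gc:.1f}% outside 40-60%")
    if not 50.0 <= tm_celsius <= 65.0:
        issues.append(f"Tm {tm_celsius:.1f} C outside 50-65 C")
    return issues


def tm_matched(forward_tm: float, reverse_tm: float, max_diff: float = 5.0) -> bool:
    """True when the two primers' Tm values are within the allowed difference."""
    return abs(forward_tm - reverse_tm) <= max_diff
```

An empty list from `check_binding_region` means the region satisfies the length, GC, and Tm rules listed above.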

#### Experimental workflow

1. PCR amplify entire plasmid with KLD primer pair
2. DpnI digest to remove methylated template
3. T4 PNK phosphorylation of 5' ends
4. T4 DNA ligase to circularize
5. Transform into competent cells

NEB sells a KLD Enzyme Mix (M0554) that combines the DpnI digest, phosphorylation, and ligation (steps 2-4) in a single reaction.

### Gibson assembly primers

- Inputs mirror the SLIM workflow but use `data/design_gibson/`.
- Link sub-mutations with `+` to specify multi-mutation assemblies (e.g., `A123G+T150A`).
- Run:
  ```bash
  uht-tooling design-gibson \
    --gene-fasta data/design_gibson/gibson_template_gene.fasta \
    --context-fasta data/design_gibson/gibson_context.fasta \
    --mutations-csv data/design_gibson/gibson_target_mutations.csv \
    --output-dir results/design_gibson/
  ```
**Outputs:**
- `Gibson_primers.csv` — columns `[Group, Submutation, Primer Name, Sequence]`
- `Gibson_assembly_plan.csv` — columns `[Group, Submutation, PCR_Primer_Forward, PCR_Primer_Reverse, Tm (celsius), Amplicon Size (bp)]`

If mutations fall within overlapping primer windows, design sequential reactions. 

### Mutation caller (no UMIs)

1. Supply:
   - `data/mutation_caller/mutation_caller_template.fasta`
   - `data/mutation_caller/mutation_caller.csv` with `gene_flanks` and `gene_min_max` columns (two rows each).
   - One or more FASTQ files via `--fastq`.
2. Run:
   ```bash
   uht-tooling mutation-caller \
     --template-fasta data/mutation_caller/mutation_caller_template.fasta \
     --flanks-csv data/mutation_caller/mutation_caller.csv \
     --fastq data/mutation_caller/*.fastq.gz \
     --output-dir results/mutation_caller/ \
     --threshold 10
   ```
**Outputs:** per-sample subdirectory containing:
- `{sample}_aa_substitution_frequency.png` — substitution frequency plot with KDE
- `{sample}_frequent_aa_counts.csv` — columns `[AA, Count]` (filtered by `--threshold`)
- `{sample}_cooccurring_AA_baseline.csv` — columns `[AA1, AA2, Both_Count, AA1_Count, AA2_Count]`
- `{sample}_cooccurring_AA_fisher.csv` — columns `[AA1, AA2, p-value]`
- `{sample}_report.txt` — summary report

The co-occurrence outputs are experimental and should not yet be relied on.

### UMI Hunter

- Inputs: `data/umi_hunter/template.fasta`, `data/umi_hunter/umi_hunter.csv`, and FASTQ reads.
- Command:
  ```bash
  uht-tooling umi-hunter \
    --template-fasta data/umi_hunter/template.fasta \
    --config-csv data/umi_hunter/umi_hunter.csv \
    --fastq data/umi_hunter/*.fastq.gz \
    --output-dir results/umi_hunter/
  ```
- Tunable parameters include `--umi-identity-threshold`, `--consensus-mutation-threshold`, and `--min-cluster-size`.
- `--umi-identity-threshold` (0–1) controls how similar two UMIs must be to fall into the same cluster.
- `--consensus-mutation-threshold` (0–1) is the fraction of reads within a cluster that must agree on a base before it is written into the consensus sequence.
- `--min-cluster-size` sets the minimum number of reads required in a cluster before a consensus is generated (smaller clusters remain listed in the raw UMI CSV but no consensus FASTA is produced).
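
To make the two thresholds concrete, here is a deliberately simplified sketch of greedy UMI clustering and per-column consensus calling. Several things are assumptions rather than the tool's actual behaviour: identity is computed as the fraction of matching positions between equal-length UMIs, clustering is greedy from the most abundant UMI downwards, reads are taken to be pre-aligned to the reference, and positions that miss the consensus threshold fall back to the reference base.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; UMIs of different length score 0."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_umis(umis: list[str], threshold: float) -> dict[str, list[str]]:
    """Greedy clustering: each UMI joins the first representative it matches."""
    clusters: dict[str, list[str]] = {}
    for umi, _count in Counter(umis).most_common():
        for representative in clusters:
            if identity(umi, representative) >= threshold:
                clusters[representative].append(umi)
                break
        else:
            clusters[umi] = [umi]
    return clusters


def consensus(reads: list[str], reference: str, threshold: float) -> str:
    """Write the majority base where enough reads agree, else the reference base."""
    called = []
    for i, ref_base in enumerate(reference):
        base, n = Counter(read[i] for read in reads).most_common(1)[0]
        called.append(base if n / len(reads) >= threshold else ref_base)
    return "".join(called)
```

Raising `--umi-identity-threshold` splits near-identical UMIs into separate clusters, while raising `--consensus-mutation-threshold` makes the consensus more conservative about calling a mutation.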

**Outputs:** per-sample subdirectory containing:
- `{sample}_UMI_clusters.csv` — columns `[Cluster Representative, Total Count, Members]`
- `{sample}_gene_consensus.csv` — columns `[Cluster Representative, Total Count, Consensus Gene, Length Difference, Members]`
- `{sample}_consensuses.fasta` — FASTA with consensus sequences (only for clusters ≥ `--min-cluster-size`)

Note that this workflow does not scale well beyond roughly 50k reads per sample. For larger datasets, see UMIC-seq pipelines for efficient UMI-gene dictionary generation.

### Profile inserts

- Prepare `data/profile_inserts/sample_probes.csv` with `upstream` and `downstream` columns.
- Run:
  ```bash
  uht-tooling profile-inserts \
    --probes-csv data/profile_inserts/sample_probes.csv \
    --fastq data/profile_inserts/*.fastq.gz \
    --output-dir results/profile_inserts/
  ```
**Outputs:**
- `extracted_inserts.fasta` — all extracted insert sequences
- `qc_report.txt` — summary statistics (lengths, GC, duplicates, probe performance)
- `qc_plots.png` — multi-panel QC figure

Adjust fuzzy matching strictness via `--min-ratio`.
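
The effect of the ratio cut-off can be illustrated with a simplified stand-in for the fuzzy matcher. The sketch below scores each window by positional identity on a 0-100 scale, which (unlike an edit-distance ratio) does not tolerate indels inside the probe; the function names and the default of 80 are illustrative assumptions, not the tool's implementation or defaults.

```python
from typing import Optional


def best_match(read: str, probe: str, min_ratio: float) -> Optional[tuple[int, float]]:
    """Return (start, score) of the best-scoring window at or above min_ratio."""
    best = None
    for start in range(len(read) - len(probe) + 1):
        window = read[start:start + len(probe)]
        score = 100.0 * sum(a == b for a, b in zip(window, probe)) / len(probe)
        if score >= min_ratio and (best is None or score > best[1]):
            best = (start, score)
    return best


def extract_insert(
    read: str, upstream: str, downstream: str, min_ratio: float = 80.0
) -> Optional[str]:
    """Return the sequence between the two probes, or None if either is missing."""
    up = best_match(read, upstream, min_ratio)
    if up is None:
        return None
    insert_start = up[0] + len(upstream)
    remainder = read[insert_start:]
    down = best_match(remainder, downstream, min_ratio)
    if down is None:
        return None
    return remainder[:down[0]]
```

A stricter ratio rejects reads whose probe sites carry sequencing errors; a looser one recovers more inserts at the cost of occasional spurious matches.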

### EP library profiler (no UMIs)

- Inputs:
  - `data/ep-library-profile/region_of_interest.fasta`
  - `data/ep-library-profile/plasmid.fasta`
  - FASTQ inputs (`--fastq` accepts multiple files)
- Run:
  ```bash
  uht-tooling ep-library-profile \
    --region-fasta data/ep-library-profile/region_of_interest.fasta \
    --plasmid-fasta data/ep-library-profile/plasmid.fasta \
    --fastq data/ep-library-profile/*.fastq.gz \
    --output-dir results/ep-library-profile/
  ```
- Safety note: `--output-dir` (and `--work-dir` if used) must live inside a dedicated workspace
  containing a `.uht_tooling_workspace` file. This prevents accidental deletion of unrelated folders.
  Example:
  ```bash
  mkdir -p ~/uht_tooling_workspace
  touch ~/uht_tooling_workspace/.uht_tooling_workspace
  # then use --output-dir ~/uht_tooling_workspace/ep-library-profile/
  ```
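
To confirm a directory qualifies before starting a long run, you can look for the marker file yourself. The sketch below assumes the marker may sit in the output directory or in any of its parents; `inside_workspace` is a hypothetical helper, and the tool's own check may be stricter.

```python
from pathlib import Path

MARKER = ".uht_tooling_workspace"


def inside_workspace(path: Path) -> bool:
    """True if `path` or any of its parent directories contains the marker file."""
    resolved = path.expanduser().resolve()
    return any((parent / MARKER).is_file() for parent in (resolved, *resolved.parents))


if __name__ == "__main__":
    target = Path("~/uht_tooling_workspace/ep-library-profile")
    print(target, "->", "ok" if inside_workspace(target) else "no workspace marker found")
```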

**Output structure**

Each sample produces an organized output directory:

```
sample_name/
├── KEY_FINDINGS.txt                    # Lay-user executive summary
├── summary_panels.png                  # Main visualization (PNG)
├── summary_panels.pdf                  # Main visualization (PDF)
├── run.log                             # Analysis log
└── detailed/                           # Technical outputs
    ├── gene_mismatch_rates.csv
    ├── base_distribution.csv
    ├── aa_substitutions.csv            # Protein-coding regions only
    ├── plasmid_coverage.csv
    ├── aa_mutation_distribution.csv
    ├── summary.txt
    └── {sample}_mutation_spectrum.pdf
```

A top-level `master_summary.txt` aggregates findings across all samples when multiple FASTQs are processed.

**Lambda estimate**

The profiler reports a single lambda (mutations per gene copy) derived from the net mismatch rate:

- **Formula**: `(hit_rate - bg_rate) × seq_len`
- **Where it appears**: panel 4 of `summary_panels.png` and the Poisson lambda line in `KEY_FINDINGS.txt`.

The `KEY_FINDINGS.txt` file provides a plain-language summary including:
- Expected AA mutations per gene copy
- Poisson-based interpretation (% wild-type, % 1 mutation, % 2+ mutations)
- Quality assessment (GOOD/ACCEPTABLE/LOW COVERAGE)
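
For reference, the lambda formula and the Poisson breakdown amount to the following arithmetic. The function names are illustrative, and the sketch does not handle the case where the background rate exceeds the hit rate (which would give a negative lambda).

```python
import math


def lambda_estimate(hit_rate: float, bg_rate: float, seq_len: int) -> float:
    """Mutations per gene copy: (hit_rate - bg_rate) x seq_len."""
    return (hit_rate - bg_rate) * seq_len


def poisson_breakdown(lam: float) -> dict[str, float]:
    """Expected fractions of copies with 0, exactly 1, and 2 or more mutations."""
    p0 = math.exp(-lam)
    p1 = lam * p0
    return {"wild_type": p0, "one_mutation": p1, "two_plus": 1.0 - p0 - p1}


if __name__ == "__main__":
    # Made-up rates for illustration only.
    lam = lambda_estimate(hit_rate=0.004, bg_rate=0.001, seq_len=1000)
    for label, fraction in poisson_breakdown(lam).items():
        print(f"{label}: {100 * fraction:.1f}%")
```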

**How the mutation rate and AA expectations are derived**

1. Reads are aligned to both the region of interest and the full plasmid. Mismatches in the region define the "target" rate; mismatches elsewhere provide the background.
2. The per-base background rate is subtracted from the target rate to yield a net nucleotide mutation rate, and the standard deviation reflects binomial sampling and quality-score uncertainty.
3. The net rate is multiplied by the CDS length to estimate λ_bp (mutations per copy). Monte Carlo simulations then flip random bases, translate the mutated CDS, and count amino-acid differences across 1,000 trials—these drive the AA mutation mean/variance that appear in the panel plot and summary.
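
Step 3 can be pictured with a small Monte Carlo sketch. How the tool chooses which bases to flip is not specified here, so this version assumes each base mutates independently with the net per-base rate and is replaced by one of the other three bases at random; it also expects an uppercase `ACGT` coding sequence. `simulate_aa_mutations` is a hypothetical name.

```python
import random
from itertools import product
from statistics import mean, pvariance

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate complete codons; a trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def simulate_aa_mutations(cds: str, per_base_rate: float, trials: int = 1000, seed: int = 0):
    """Return (mean, variance) of amino-acid differences per simulated copy."""
    rng = random.Random(seed)
    wild_type = translate(cds)
    counts = []
    for _ in range(trials):
        mutated = []
        for base in cds:
            if rng.random() < per_base_rate:
                # Flip to one of the three other bases (assumed uniform).
                base = rng.choice([b for b in "ACGT" if b != base])
            mutated.append(base)
        protein = translate("".join(mutated))
        counts.append(sum(a != b for a, b in zip(wild_type, protein)))
    return mean(counts), pvariance(counts)
```

Because some flips are synonymous, the mean amino-acid count from such a simulation is typically lower than the nucleotide-level λ_bp.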

---

## GUI quick start (optional)

The NiceGUI web frontend wraps the same workflows with an Apple-inspired design and sidebar navigation. Launch it with:

```bash
uht-tooling gui
```

The server binds to `http://127.0.0.1:7860` by default. Open that URL in your browser to access the interface.

### Navigation

The sidebar organises workflows into two groups:

**Primer Design**
- Nextera XT (`/`)
- SLIM (`/slim`)
- KLD (`/kld`)
- Gibson (`/gibson`)

**Sequencing Analysis**
- Mutation Caller (`/mutation-caller`)
- UMI Hunter (`/umi-hunter`)
- Profile Inserts (`/profile-inserts`)
- EP Library (`/ep-library`)

### Features

- **Dark mode toggle** — persists across sessions via browser storage.
- **FASTA paste support** — Mutation Caller, UMI Hunter, and EP Library pages accept raw sequence paste in addition to file upload.
- **Slider controls with live value display** — UMI Hunter thresholds, Profile Inserts min-ratio.
- **Download results as ZIP** — output archives mirror the directory structure produced by the CLI.

### Legacy Gradio interface

The old Gradio GUI is still available:

```bash
pip install "uht-tooling[legacy-gui]"
uht-tooling gui --legacy
```

### Workflow tips

- For large FASTQ datasets, the CLI remains the most efficient option (especially for automation or batch processing).

### Troubleshooting

- **Port already bound:** the launcher automatically selects the next free port and logs the chosen URL.
- **Missing dependency:** ensure you installed with `pip install "uht-tooling[gui]"` (or the core package, which already includes NiceGUI).
- **Stopping the server:** press `Ctrl+C` in the terminal session running `uht-tooling gui`.

---

## Logging

Every workflow configures logging to the destination output directory. Inspect `run.log` for command echoes, parameter choices, and any warnings produced during execution. When providing bug reports, include this log file along with input metadata to streamline triage.

---

## Roadmap

- Expand CLI coverage to any remaining legacy scripts that are still invoked via `make`.
- Add documentation for automation pipelines and integrate continuous integration tests.

Contributions in the form of bug reports, pull requests, or feature suggestions are welcome. File issues on GitHub with clear reproduction steps and sample data when possible.
