Metadata-Version: 2.4
Name: dotmatch
Version: 0.1.7
Summary: Deterministic known-target short-DNA assignment for CRISPR guide counting, barcode demultiplexing, and FASTQ workflows
Author: Donncha O'Toole
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/dnncha/dotmatch
Project-URL: Repository, https://github.com/dnncha/dotmatch
Project-URL: Issues, https://github.com/dnncha/dotmatch/issues
Project-URL: Documentation, https://github.com/dnncha/dotmatch#readme
Keywords: bioinformatics,computational biology,CRISPR,FASTQ,known-target assignment,barcode demultiplexing,edit distance
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: C
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tomli; python_version < "3.11"
Dynamic: license-file

# DotMatch

[![CI](https://github.com/dnncha/dotmatch/actions/workflows/ci.yml/badge.svg)](https://github.com/dnncha/dotmatch/actions/workflows/ci.yml)
[![Bioconda](https://img.shields.io/conda/vn/bioconda/dotmatch?label=bioconda)](https://anaconda.org/bioconda/dotmatch)
[![Bioconda downloads](https://img.shields.io/conda/dn/bioconda/dotmatch?label=downloads)](https://anaconda.org/bioconda/dotmatch)
[![Bioconda platforms](https://img.shields.io/conda/pn/bioconda/dotmatch?label=platforms)](https://anaconda.org/bioconda/dotmatch)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Citation](https://img.shields.io/badge/cite-CITATION.cff-green.svg)](CITATION.cff)

DotMatch is a command-line tool for a common sequencing job: you already know
the short DNA sequences you expect, and you need to count or split reads by
those sequences without hiding ambiguous cases.

It is built for CRISPR guide counting, inline barcode demultiplexing,
fixed-window feature/barcode assignment, primer or adapter-prefix checks,
amplicon-panel starts, whitelist-style assays, and barcode panel design for
known-target assignment. It is not a genome aligner, a basecaller, a UMI entropy
generator, or a replacement for downstream screen statistics.

Package scope: the published Bioconda package installs the `dotmatch` command,
Python imports/workflow namespaces, and C header/library artifacts. The
Workbench desktop app is a separate local application and is not part of the
Bioconda recipe. New release features are only described as available from a
public package after the matching package version has passed the install smoke
tests in [Packaging Notes](docs/packaging.md).

![DotMatch workflow: FASTQ reads and a known target table are sliced at the same read position, assigned to known short DNA targets, and written to counts, split FASTQs, QC tables, and reports.](public/dotmatch-read-assignment.svg)

## Assignment Model

DotMatch assigns short read windows against a known target table. Typical
windows are a CRISPR guide segment, an inline sample barcode, a feature barcode,
a primer prefix, or another fixed-position assay sequence. DotMatch extracts the
configured slice, compares it with the target table under the selected edit
model, and records the assignment state.

For each read, DotMatch reports one outcome:

| Outcome | Meaning | Why it matters |
| --- | --- | --- |
| `unique` | exactly one target is compatible | counted or written to the matching FASTQ |
| `ambiguous` | more than one target is compatible | kept out of forced assignments |
| `none` | no target is close enough | available for unmatched-read review |
| `invalid` | the requested read window cannot be extracted | visible in QC instead of disappearing |

Ambiguity is part of the output contract. If a read is compatible with multiple
targets under the configured radius, DotMatch reports the ambiguous assignment
instead of assigning it to an arbitrary target.
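
The four outcomes can be illustrated with a small pure-Python sketch. It does not call DotMatch: the `classify` and `hamming` helpers, the example barcodes, and the choice of a Hamming comparison are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions; both strings must have equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def classify(read: str, targets: Dict[str, str], start: int, length: int,
             k: int) -> Tuple[str, Tuple[str, ...]]:
    """Return (outcome, compatible target ids) for one read (hypothetical helper)."""
    window = read[start:start + length]
    if start < 0 or len(window) != length:
        return "invalid", ()          # the requested window cannot be extracted
    hits = tuple(tid for tid, seq in targets.items()
                 if len(seq) == length and hamming(window, seq) <= k)
    if not hits:
        return "none", ()             # no target within the radius
    if len(hits) == 1:
        return "unique", hits         # exactly one compatible target
    return "ambiguous", hits          # several compatible targets, no forced pick


targets = {"bc0": "ACGTAC", "bc1": "ACGTTT", "bc2": "GGGCCC"}
for read in ["ACGTACGGA", "ACGTATGGA", "TTTTTTGGA", "ACG"]:
    print(read, classify(read, targets, start=0, length=6, k=1))
```

The second read is one mismatch away from both `bc0` and `bc1`, so the sketch reports it as `ambiguous` rather than picking either target.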

Typical outputs include count matrices or demultiplexed FASTQs, `sample_qc.tsv`,
top-unmatched tables, target-library audit files, `summary.json`, and
self-contained HTML reports.

## Barcode Troubleshooting

For barcode runs, DotMatch can inspect the common reasons reads fail assignment:
wrong barcode position, wrong barcode length, duplicate barcodes, unsafe
one-mismatch correction, ambiguous rescue, low-quality correction candidates,
invalid read windows, and high-count unmatched sequences.

```bash
dotmatch barcode autopsy \
  --barcodes barcodes.tsv \
  --reads pooled.fastq.gz \
  --scan-starts 0:12 \
  --k-values 0,1 \
  --out-dir autopsy
```

Open `autopsy/report.html` first. The TSV and JSON files beside it are there for
pipelines and lab handoff: `findings.tsv`, `offset_scan.tsv`,
`correction_safety.tsv`, `top_unmatched.tsv`, and `provenance.json`.

Speed is useful only after the assignment rules are clear. The checked barcode
example documents the exact comparator settings in
[docs/benchmarks/barcode_demux](docs/benchmarks/barcode_demux/README.md).

## Barcode Panel Design

DotMatch can design barcode panels and check assignment-collision risk under the
same semantics used later by demux and counting. A designed panel includes a
machine-checkable assignment report, per-target collision-risk rows, collision
tables, ambiguous-variant examples, plate layout, lab exports, and a report.

```bash
dotmatch panel design \
  --n 96 \
  --length 16 \
  --preset illumina-inline-strict \
  --min-hamming-distance 5 \
  --min-levenshtein-distance 4 \
  --gc-min 0.35 \
  --gc-max 0.65 \
  --max-homopolymer 3 \
  --avoid-rc \
  --seed 42 \
  --out-dir dotmatch_96x16/
```
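
To make the design constraints concrete, here is a minimal pure-Python sketch of a candidate filter. It is not DotMatch's design algorithm. The `passes_filters` helper is hypothetical, and the reading of `--avoid-rc` as "the reverse complement must also keep its distance from the panel, and must differ from the candidate itself" is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases; assumes a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base; assumes non-empty input."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Mismatch count; equal lengths are assumed."""
    return sum(x != y for x, y in zip(a, b))


def passes_filters(candidate, panel, gc_min=0.35, gc_max=0.65,
                   max_homopolymer=3, min_hamming=5, avoid_rc=True):
    """Hypothetical filter mirroring the flags in the command above."""
    if not gc_min <= gc_fraction(candidate) <= gc_max:
        return False
    if longest_homopolymer(candidate) > max_homopolymer:
        return False
    forms = [candidate]
    if avoid_rc:
        rc = reverse_complement(candidate)
        if rc == candidate:           # self-complementary barcode (assumed unsafe)
            return False
        forms.append(rc)
    return all(hamming(form, member) >= min_hamming
               for form in forms for member in panel)


# Greedy accumulation over a few 8-base candidates with a relaxed distance.
panel = []
for cand in ["ACGGTCAA", "ACGGTCAT", "TGCATGTC", "ACGTACGT"]:
    if passes_filters(cand, panel, min_hamming=3):
        panel.append(cand)
print(panel)
```

A real design run also has to satisfy the Levenshtein constraint and the assignment report described below; this sketch covers only the simpler per-sequence filters.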

Important commands:

```bash
dotmatch panel check barcodes.tsv --k 1 --metric hamming --out-dir panel_check/
dotmatch panel optimize vendor_barcodes.tsv --n 24 --out-dir optimized_panel/
dotmatch panel simulate barcodes.tsv --reads 1000000 --out-dir simulation/
dotmatch panel layout barcodes.tsv --plate 96 --out plate_layout.tsv
dotmatch panel export barcodes.tsv --format illumina-samplesheet --out-dir sample_sheet_templates/
```

The assignment report preserves DotMatch outcomes: `unique`, `ambiguous`, `none`, and
`invalid`. It fails a configured correction radius if any query in that radius
can map ambiguously or silently to the wrong barcode. The current exact
report enumerates configured error spheres up to `k=2`; larger radii are
refused rather than partially checked.
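
As a rough illustration of what enumerating an error sphere means, the sketch below lists every Hamming variant within `k` substitutions of each barcode and reports queries that fall inside more than one sphere. It is a simplified stand-in, not the DotMatch report: the helper names are hypothetical and only the Hamming metric over `A/C/G/T` is covered.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def hamming_sphere(seq: str, k: int):
    """Yield every sequence within Hamming distance k of seq, including seq."""
    yield seq
    for r in range(1, k + 1):
        for positions in combinations(range(len(seq)), r):
            # At each chosen position, try every base except the original one.
            choices = [[b for b in ALPHABET if b != seq[p]] for p in positions]
            for subs in product(*choices):
                variant = list(seq)
                for p, base in zip(positions, subs):
                    variant[p] = base
                yield "".join(variant)


def ambiguous_queries(barcodes: dict, k: int) -> dict:
    """Map each query that lies in two or more spheres to the barcode ids involved."""
    owners = {}
    for bid, seq in barcodes.items():
        for query in hamming_sphere(seq, k):
            owners.setdefault(query, []).append(bid)
    return {q: ids for q, ids in owners.items() if len(ids) > 1}


print(ambiguous_queries({"a": "AAAA", "b": "AATT"}, k=1))
print(ambiguous_queries({"a": "AAAA", "c": "TTTA"}, k=1))
```

Sphere size grows quickly with `k` and barcode length, which is one plausible reason an exact check would cap the supported radius.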

Outputs include `barcodes.tsv`, `design_report.json`, `design_trace.tsv`,
`panel_check/panel_summary.json`, `target_safety.tsv`, `collision_pairs.tsv`,
`ambiguous_error_spheres.tsv`, `flanked_sequences.tsv`, `plate_layout.tsv`,
`sample_sheet_templates/SampleSheet.csv`, `report.html`, and
`README_FOR_LAB.md`.

See [Barcode Panel Design](docs/barcode-panel-design.md) and the checked smoke
gate in
[docs/benchmarks/barcode_panel_design](docs/benchmarks/barcode_panel_design/README.md).

## When To Use DotMatch

DotMatch is a good fit when you have a table of expected short sequences and the
biological question is "which known guide, barcode, primer, feature tag, or
panel target did this read contain?"

Common uses include:

- CRISPR pooled-screen guide counting with MAGeCK-compatible output;
- fixed-position barcode demultiplexing from FASTQ/FASTQ.gz;
- per-read assignment of 10x guide-capture or feature-barcode windows;
- primer-start, amplicon-panel, adapter-prefix, or whitelist-style assays;
- designing, optimizing, checking, simulating, and exporting barcode panels;
- target-library audits before allowing one-edit correction;
- validating an indexed assignment run against an exhaustive scan or Edlib.

DotMatch is not a genome aligner or basecaller. It does not produce SAM/BAM,
CIGAR strings, variant calls, cell/UMI quantification, UMI entropy designs,
expression matrices, or screen-level hit-calling statistics. It works on
extracted short windows and known target lists.

## Installation

DotMatch currently supports source builds and local Python package installs on
Linux and macOS. You need a C compiler, `make`, Python 3.9 or newer for the
Python package, and zlib for FASTQ.gz support.

```bash
git clone https://github.com/dnncha/dotmatch.git
cd dotmatch
make

./dotmatch --version
./dotmatch dist ACGT AGGT
./dotmatch leq 1 ACGT AGGT
```

Python install from a checkout:

```bash
python3 -m pip install .
python3 -c "import dotmatch; print(dotmatch.distance('ACGT', 'AGGT'))"
```

Docker build from the repository:

```bash
docker build -t dotmatch:dev .
docker run --rm -v "$PWD:/work" dotmatch:dev dist ACGT AGGT
```

Install the published package from Bioconda on any platform visible in Bioconda
repodata. The release recipe opts into `osx-arm64` builds for Apple Silicon,
but treat that platform as available for a release only after Bioconda metadata
and install smoke tests verify it:

```bash
conda create -n dotmatch -c conda-forge -c bioconda dotmatch=0.1.4
conda activate dotmatch
dotmatch --version
```

Package status for PyPI, Bioconda, containers, and release archives is tracked
in [Packaging Notes](docs/packaging.md), the
[Release Process](docs/release-process.md), and the machine-readable
[Distribution Status](docs/distribution-release.json). Only claim a channel as
available for a release after `make distribution-channels` verifies public
metadata and install smoke tests.

The GitHub release workflow builds and smoke-tests the source distribution, the
native macOS wheel, and repaired manylinux/musllinux Linux wheels before any
wheel is considered for PyPI. PyPI trusted publishing is configured for those
artifacts, but PyPI wheel availability is described only after the tagged
release is visible on PyPI.

Bioconda provides the `dotmatch` command-line tool, Python workflow namespaces,
Python imports, and C header/library artifacts for the published package
version. The installed `dotmatch` console script exposes the native assignment
commands plus `dotmatch assay ...`, `dotmatch barcode ...`, and
`dotmatch panel ...`. The next release is packaging-ready in this repository,
but do not cite a newer Bioconda version until the channel metadata and install
smoke tests verify it.

Optional local Workbench: DotMatch also includes a desktop Workbench under
`apps/workbench` for local AssaySpec design, inference, planning, running, and
report review. It is separate from the Bioconda recipe and keeps FASTQ, target,
barcode, spec, and output paths inside a user-selected local workspace. See
[Workbench](docs/workbench.md).

## Quick Example

The core operation is many-read versus many-target assignment. Target files and
read files can be simple TSVs with `id<TAB>sequence`.

```bash
cat > targets.tsv <<'EOF'
bc0	ACGT
bc1	AGGT
bc2	ACGA
EOF

cat > reads.tsv <<'EOF'
r0	ACGT
r1	ACGC
r2	TTTT
EOF

./dotmatch assign 1 targets.tsv reads.tsv
```

Expected output:

```text
mode	read_id	read_seq	target_index	target_seq	distance	status	match_count	second_best_distance
assign	r0	ACGT	0	ACGT	0	ambiguous	3	1
assign	r1	ACGC	0	ACGT	1	ambiguous	2	-1
assign	r2	TTTT	-1		-1	none	0	-1
```

`r0` is an exact match to `bc0`, but two other targets are also within the
configured one-edit radius. DotMatch's default `radius` ambiguity policy
therefore reports it as ambiguous instead of forcing an assignment. Use
`--ambiguity-policy best` or Python `policy="best"` only when best-distance
assignment is the intended compatibility mode.
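
The difference between the two policies can be sketched in pure Python using the targets and reads from the example above. This is an illustration, not the native implementation: the `assign` helper is hypothetical, and treating ties at the best distance as ambiguous under `best` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (substitution, insertion, deletion)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def assign(read, targets, k, policy="radius"):
    """Return (status, candidate target indices) for one read (hypothetical helper)."""
    distances = [levenshtein(read, t) for t in targets]
    within = [i for i, d in enumerate(distances) if d <= k]
    if not within:
        return "none", []
    if policy == "best":
        # Keep only the targets that share the smallest distance.
        best = min(distances[i] for i in within)
        within = [i for i in within if distances[i] == best]
    return ("unique" if len(within) == 1 else "ambiguous"), within


targets = ["ACGT", "AGGT", "ACGA"]
for read in ["ACGT", "ACGC", "TTTT"]:
    print(read, assign(read, targets, 1, "radius"), assign(read, targets, 1, "best"))
```

Under `radius`, the exact match `ACGT` stays ambiguous because three targets lie within one edit. Under `best`, it resolves to the distance-0 target, while `ACGC` remains ambiguous under both policies because two targets tie at distance 1.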

## CRISPR Guide Counting

For pooled CRISPR screens, `crispr-count` wraps the FASTQ counting engine and
writes a MAGeCK-style count matrix.

```bash
cat > samples.tsv <<'EOF'
sample_id	fastq
plasmid	plasmid_R1.fastq.gz
treatment	treatment_R1.fastq.gz
EOF

./dotmatch crispr-count \
  --library guides.csv \
  --samples samples.tsv \
  --guide-start 23 \
  --guide-length 20 \
  --k 1 \
  --metric hamming \
  --ambiguity-policy radius \
  --out counts.mageck.tsv \
  --summary qc.json \
  --ambiguous discard
```

Use `--metric hamming` for one-mismatch, no-indel counting in the style of
guide-counter. Switch to `--ambiguity-policy best` only when you intentionally
want guide-counter's best-distance assignment instead of the radius policy.
Use `--metric levenshtein --indel-window 1` when one-base insertions and
deletions around the guide window should be considered. Ambiguous reads are not
added to guide counts unless you explicitly request diagnostic reporting.

A small worked example is available in
[examples/crispr_guides](examples/crispr_guides/README.md), and a step-by-step
fixture walkthrough is in
[docs/tutorials/crispr-count-first-run.md](docs/tutorials/crispr-count-first-run.md).
The public Sanson/Brunello paper-data lane used by guide-counter is available
in [examples/crispr_sanson_brunello](examples/crispr_sanson_brunello/README.md).
The reproducible DotMatch-vs-guide-counter comparison report is in
[docs/benchmarks/crispr_comparison](docs/benchmarks/crispr_comparison/README.md).

![CRISPR guide-counting throughput comparison](benchmarks/figures/crispr_comparison_throughput.svg)

![CRISPR Hamming k2/k3 Bowtie 1 comparison](benchmarks/figures/crispr_hamming_k23_comparison.svg)

## GuideCounter-Compatible Counting

DotMatch also has a GuideCounter-compatible command shape for labs that already
have `guide-counter count` scripts. The wrapper delegates assignment to
DotMatch's deterministic CPU count engine and rewrites the result into
GuideCounter-style output files.

```bash
dotmatch guide-counter count \
  --input plasmid.fastq.gz treatment.fastq.gz \
  --samples plasmid treatment \
  --library guides.tsv \
  --output guide_counts
```

Supported entrypoints are `dotmatch guide-counter count`,
`dotmatch guide-counter-count`, and `dotmatch guide-count`. The wrapper accepts
GuideCounter-style flags including `--input/-i`, `--samples/-s`,
`--library/-l`, `--output/-o`, `--exact-match/-x`,
`--offset-sample-size/-N`, `--offset-min-fraction/-f`,
`--essential-genes/-e`, `--nonessential-genes/-n`, `--control-guides/-c`, and
`--control-pattern/-C`.

By default this mode uses GuideCounter-compatible counting semantics: Hamming
matching, one mismatch, no indels, best-distance assignment, automatic
multi-offset guide-window detection, `--offset-sample-size 100000`, and
`--offset-min-fraction 0.0025`. Add `--exact-match` for exact-only counting.
When `--samples` is omitted, sample labels are inferred from input FASTQ file
names.
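
The README does not describe how offsets are detected, so the following is only a guess at the roles the two parameters could play: sample the first reads, count exact guide hits at every offset, and keep the offsets whose hit fraction reaches the threshold. The `detect_offsets` helper and its exact-match rule are assumptions, not DotMatch or GuideCounter code.

```python
from collections import Counter
from itertools import islice


def detect_offsets(reads, guides, guide_length,
                   sample_size=100_000, min_fraction=0.0025):
    """Return offsets where sampled reads contain a known guide often enough."""
    guide_set = set(guides)
    hits = Counter()
    sampled = 0
    for read in islice(reads, sample_size):   # look at the first sample_size reads only
        sampled += 1
        for offset in range(len(read) - guide_length + 1):
            if read[offset:offset + guide_length] in guide_set:
                hits[offset] += 1
    if sampled == 0:
        return []
    return sorted(o for o, n in hits.items() if n / sampled >= min_fraction)


guides = ["ACGTA", "TTGCA"]
reads = ["GGACGTAGG", "CCTTGCACC", "GGACGTAGG", "GGGTTGCAG", "GGGGGGGGG"]
print(detect_offsets(reads, guides, guide_length=5))
print(detect_offsets(reads, guides, guide_length=5, min_fraction=0.5))
```

With a low threshold both offsets 2 and 3 survive, which is the kind of multi-offset result staggered libraries would need. Raising the threshold keeps only the dominant offset.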

For `--output guide_counts`, the wrapper writes:

- `guide_counts.counts.txt`: `guide`, `gene`, then one count column per sample;
- `guide_counts.extended-counts.txt`: the same counts with a `guide_type`
  column derived from essential, nonessential, control-guide, or control-pattern
  annotations;
- `guide_counts.stats.txt`: per-sample totals, mapped reads, mapped fraction,
  mean reads by guide class, and zero-read guide counts.

This compatibility mode is an input/output and policy bridge. DotMatch
assignment remains deterministic and CPU-authoritative. GPU benchmark rows and
backend optimizer recommendations do not change which guide is counted.

## General FASTQ Counting

The lower-level `count` command works with arbitrary known targets and one or
more FASTQ/FASTQ.gz inputs.

```bash
./dotmatch count \
  --targets targets.tsv \
  --reads sample_R1.fastq.gz \
  --sample-label sample_1 \
  --target-start 0 \
  --target-length 20 \
  --k 1 \
  --metric levenshtein \
  --indel-window 1 \
  --ambiguity-policy radius \
  --out counts.tsv \
  --target-counts-long target_counts.long.tsv \
  --sample-qc sample_qc.tsv \
  --assignments assignments.tsv \
  --summary summary.json
```

The count table separates exact matches, one-substitution corrections,
one-insertion corrections, one-deletion corrections, and other accepted
corrections. `sample_qc.tsv` records assignment rate, rescue rate, ambiguous and
unmatched fractions, target coverage, zero-count targets, Gini index, and the
number of candidate targets checked after indexing.
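
For readers unfamiliar with the evenness metrics, here is a small sketch of how a Gini index, a zero-count tally, and target coverage can be computed from per-target counts. The exact formulas DotMatch uses are not stated here, so this uses the standard sorted-rank Gini formula as an assumption, and `evenness_summary` is a hypothetical helper.

```python
def gini(counts) -> float:
    """Gini index of non-negative counts: 0 is perfectly even, values near 1 are skewed."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over ascending values with 1-based ranks.
    weighted = sum(rank * v for rank, v in enumerate(values, 1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def evenness_summary(counts) -> dict:
    """Hypothetical QC summary; assumes at least one target."""
    zero = sum(1 for c in counts if c == 0)
    return {
        "gini_index": gini(counts),
        "zero_count_targets": zero,
        "target_coverage": (len(counts) - zero) / len(counts),
    }


print(evenness_summary([5, 5, 5, 5]))
print(evenness_summary([0, 0, 0, 10]))
```

A pooled library with a rising Gini index between plasmid and treatment samples is usually worth checking before downstream statistics.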

Output schemas are documented in [Public Schemas](docs/schemas.md).

## Barcode Demultiplexing

For fixed-position inline barcodes, `demux` writes one FASTQ per uniquely
assigned barcode and can optionally retain ambiguous and unmatched reads.

```bash
./dotmatch demux \
  --barcodes barcodes.tsv \
  --reads pooled.fastq.gz \
  --barcode-start 0 \
  --barcode-length 8 \
  --k 1 \
  --metric hamming \
  --ambiguity-policy radius \
  --max-correction-qual 20 \
  --out-dir demuxed \
  --summary demux.qc.json \
  --assignments demux.assignments.tsv \
  --ambiguous-out ambiguous.fastq \
  --unmatched-out unmatched.fastq
```

Use `--barcode-length auto` when the barcode sheet contains multiple lengths.
Prefix-overlapping exact matches are reported as ambiguous rather than resolved
by length.
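
The prefix-overlap rule can be pictured with a short sketch. It assumes exact matching at a fixed start position and is not the `demux` implementation. `exact_auto_length` is a hypothetical helper.

```python
def exact_auto_length(read: str, barcodes: dict, start: int = 0):
    """Try every barcode length in the sheet and report all exact matches."""
    lengths = sorted({len(seq) for seq in barcodes.values()})
    hits = []
    for length in lengths:
        window = read[start:start + length]
        if len(window) != length:
            continue                  # read too short for this barcode length
        hits.extend(bid for bid, seq in barcodes.items() if seq == window)
    if not hits:
        return "none", hits
    # A shorter barcode that is a prefix of a longer one yields two hits here.
    return ("unique" if len(hits) == 1 else "ambiguous"), hits


barcodes = {"short": "ACGT", "long": "ACGTAC", "other": "GGGTTT"}
for read in ["ACGTACTTT", "ACGTTTTTT", "GGGTTTAAA", "CCCCCCCCC"]:
    print(read, exact_auto_length(read, barcodes))
```

The first read matches both `short` and `long` exactly, so the sketch reports it as ambiguous instead of preferring the longer barcode.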

DotMatch also includes an early-stage command for classic per-cycle BCL
demultiplexing in small RunInfo/SampleSheet/BCL workflows. CBCL and
NovaSeq-style inputs are not part of the current BCL scope.

## Target Library Audit

Before enabling correction, audit the target set for collisions and near
neighbors. For Hamming guide-counting at `k=2` or `k=3`, exact audit reports
whether any same-length target pair is close enough for error spheres to overlap
(`distance <= 2k`). Fast audit keeps the conservative one-edit report and marks
larger Hamming safety fields as not computed.

```bash
./dotmatch audit \
  --targets guides.tsv \
  --k 3 \
  --audit-mode exact \
  --out-dir audit/
```

Use `--audit-mode exact` when you need Hamming `k=2`/`k=3` safety fields.

```bash
./dotmatch audit \
  --targets guides.tsv \
  --k 1 \
  --audit-mode auto \
  --out-dir audit/
```

The audit output includes duplicate targets, nearby target pairs, collision
clusters, per-target safety, and example query variants that would be ambiguous
at `k=1`. In exact mode, `audit_summary.tsv` and `audit_summary.json` also
include `safe_at_hamming_k2`, `safe_at_hamming_k3`,
`risk_pairs_for_hamming_k2`, and `risk_pairs_for_hamming_k3`.
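
The `distance <= 2k` criterion follows from the triangle inequality: a query within `k` substitutions of two targets implies those targets are at most `2k` apart, and for Hamming distance the converse also holds. The sketch below applies that pairwise test in pure Python. The `risk_pairs` helper and its output tuples are illustrative assumptions, not the audit file schema.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Mismatch count; equal lengths are assumed."""
    return sum(x != y for x, y in zip(a, b))


def risk_pairs(targets: dict, k: int):
    """List same-length target pairs whose radius-k Hamming spheres overlap."""
    pairs = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(targets.items(), 2):
        if len(seq_a) != len(seq_b):
            continue                  # Hamming distance is undefined across lengths
        d = hamming(seq_a, seq_b)
        if d <= 2 * k:
            pairs.append((id_a, id_b, d))
    return pairs


guides = {"g1": "AAAAAA", "g2": "AAAATT", "g3": "TTTTTT"}
for k in (1, 2, 3):
    print(k, risk_pairs(guides, k))
```

A library is safe at a given `k` in this sense only when the list is empty. This quadratic loop is fine for illustration but would be slow on a genome-scale guide library.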

## Python API

The Python package loads the native library through `ctypes`.

```python
import dotmatch

dotmatch.distance("ACGT", "AGGT")
# 1

dotmatch.distance_leq("ACGT", "AGGT", 1)
# True

matcher = dotmatch.Matcher(["ACGT", "AGGT", "ACGA"])
results, stats = matcher.assign_with_stats(["ACGT", "ACGC"], k=1)
```

The Python API also defaults to radius-safe assignment. Pass `policy="best"` to
`assign`, `Matcher.assign`, or `Matcher.assign_with_stats` only for explicit
best-distance compatibility.

When working from a source checkout, build the shared library first:

```bash
make shared
DOTMATCH_LIB=$PWD/libdotmatch.dylib PYTHONPATH=$PWD/python python3
```

On Linux, use `libdotmatch.so` instead of `libdotmatch.dylib`.

The historical `quickdna` Python package, `quickdna` console script, and `qda`
native CLI target remain as compatibility aliases. New workflows should use
`dotmatch`.

## Matching Semantics

DotMatch uses literal-byte DNA matching. `A`, `C`, `G`, `T`, `N`, and IUPAC
ambiguity symbols are ordinary byte symbols; `N` and IUPAC codes are not
expanded as wildcards.

Supported assignment modes include:

- exact matching (`k=0`);
- Hamming matching for fixed-length, substitution-only workflows;
- global Levenshtein matching for substitutions, insertions, and deletions;
- fixed-window `k=2` Levenshtein correction with packed A/C/G/T hash-neighborhood
  pruning for windows up to 32 bases and exhaustive fallback for unsupported
  cases;
- radius-safe ambiguity by default, with explicit `best` policy available for
  best-target compatibility.

The public policy string reported by the C and Python APIs is:

```text
literal-byte; A/C/G/T/N/IUPAC symbols are ordinary byte symbols; no wildcard expansion
```
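
To show what the literal-byte policy means in practice, this sketch contrasts a literal comparison with a wildcard-style comparison that DotMatch does not perform. Both helpers are hypothetical, and the IUPAC table is deliberately partial.

```python
# Partial IUPAC table, for illustration only.
IUPAC_SUBSET = {"A": "A", "C": "C", "G": "G", "T": "T",
                "N": "ACGT", "R": "AG", "Y": "CT"}


def literal_mismatches(a: str, b: str) -> int:
    """Literal-byte comparison: every symbol, including N, must match itself."""
    return sum(x != y for x, y in zip(a, b))


def wildcard_mismatches(a: str, b: str) -> int:
    """Wildcard comparison (not DotMatch behaviour): codes match any base they cover."""
    return sum(not set(IUPAC_SUBSET[x]) & set(IUPAC_SUBSET[y])
               for x, y in zip(a, b))


for read, target in [("ACGN", "ACGT"), ("ACGN", "ACGN"), ("ACRT", "ACAT")]:
    print(read, target, literal_mismatches(read, target),
          wildcard_mismatches(read, target))
```

Under the literal policy a read containing `N` spends one edit of the correction budget on that position, which is why low-quality reads with `N` calls are less likely to be rescued.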

## Checked Examples And Benchmarks

The repository includes native C tests, CLI fixture tests, Python tests,
deterministic fuzz checks against a dynamic-programming oracle, and optional
Edlib validation for assignment runs.

Useful local checks:

```bash
make test
make cli-test
make python-test
make python-package-test
```

Reports with data sources, commands, comparator settings, and checked outputs:

- [Evidence gallery](docs/evidence-gallery/README.md)
- [Benchmark overview](docs/benchmarks/README.md)
- [Native Edlib assignment report](docs/benchmarks/native/README.md)
- [Public CRISPR guide-counting report](docs/benchmarks/public_crispr/README.md)
- [Barcode demultiplexing report](docs/benchmarks/barcode_demux/README.md)
- [Feature-barcode assignment report](docs/benchmarks/feature_barcode/README.md)
- [CRISPR guide-capture assignment report](docs/benchmarks/perturb_seq/README.md)
- [Amplicon/panel primer-start report](docs/benchmarks/amplicon_panel/README.md)
- [Oligo/adapter prefix-assignment report](docs/benchmarks/oligo_adapter/README.md)

For a compact list of what has and has not been checked, see
[Evidence Notes](docs/scientific-claims.md). For methods text and citation
language, see [Methods and Citation](docs/methods-and-citation.md).

## Development

```bash
make
make test
make cli-test
make coverage
```

Workflow-manager examples are included for Galaxy, Nextflow, nf-core-style
modules, Snakemake, and MultiQC custom content under
[examples/workflows](examples/workflows/).

Contributions are welcome. Please read [CONTRIBUTING.md](CONTRIBUTING.md),
[SUPPORT.md](SUPPORT.md), and [SECURITY.md](SECURITY.md) before opening issues
or pull requests.

## Citation

If DotMatch is useful in your work, cite the software release using
[CITATION.cff](CITATION.cff). Installed packages also expose
`dotmatch citation` for a copyable release citation. Suggested methods text is
provided in [docs/methods-and-citation.md](docs/methods-and-citation.md).

## License

DotMatch is released under the [Apache License 2.0](LICENSE).
