Metadata-Version: 2.4
Name: bench3c
Version: 0.0.1
Summary: Synthetic Hi-C / Micro-C / 3C triplet FASTQ benchmark generator and .pairs recovery analyser.
Author: Samir Bertache
License-Expression: AGPL-3.0-or-later
Keywords: bioinformatics,Hi-C,Micro-C,3C,FASTQ,pairs,benchmark,chimeric reads
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: numpy>=2.4.2
Requires-Dist: plotnine>=0.15.4
Requires-Dist: polars>=1.38.1
Requires-Dist: pyarrow>=23.0.1
Requires-Dist: pysam>=0.23.3
Requires-Dist: rich>=14.3.2
Requires-Dist: rich-argparse>=1.7.2
Dynamic: license-file

# bench3c

`bench3c` is a synthetic benchmark generator for 3C-derived sequencing workflows. It generates controlled Hi-C-like and Micro-C-like paired-end FASTQ reads containing known triplet structures, then evaluates whether a mapping / splitting / `.pairs` reconstruction pipeline recovers the expected fragments.

The benchmark is mainly designed to test preprocessing tools for chimeric or multiplex reads in Hi-C, Micro-C, Pore-C-like or split-read 3C workflows.

## Model

Each read pair encodes a known triplet of genomic fragments. In the layout below, R1 is a chimera of fragments A and B, and R2 comes from fragment C:

```text
R1: AAAAAAABBBBBBBBB
R2: CCCCCCCCCCCCCCCC
```

The read name stores the true genomic coordinates:

```text
@chrA-startA-endA:chrB-startB-endB::chrC-startC-endC
```
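As an illustration, a read name in this format can be decoded with a few lines of Python. This is a minimal sketch, not the parser used by `bench3c`. It assumes that `:` separates fragments within one mate, that `::` separates R1 from R2, and that chromosome names contain no `:`. The names `Fragment`, `parse_fragment` and `parse_truth_name` are hypothetical.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_fragment(token: str) -> Fragment:
    # Split from the right so chromosome names containing "-" are preserved.
    chrom, start, end = token.rsplit("-", 2)
    return Fragment(chrom, int(start), int(end))


def parse_truth_name(name: str) -> tuple[list[Fragment], list[Fragment]]:
    """Return (R1 fragments, R2 fragments) encoded in a read name."""
    # Drop the leading FASTQ "@" and any trailing comment field.
    name = name.lstrip("@").split()[0]
    # Assumption: "::" separates the two mates, ":" separates fragments.
    r1_part, r2_part = name.split("::", 1)
    r1 = [parse_fragment(tok) for tok in r1_part.split(":")]
    r2 = [parse_fragment(tok) for tok in r2_part.split(":")]
    return r1, r2
```

Splitting each token from the right keeps chromosome names such as `HLA-DRB1` intact, since only the last two `-` characters delimit the coordinates.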

This allows downstream analysis to compare the expected fragment lengths with the observed alignments recovered in `.pairs` or `.pairsam` files.

## Modes

`bench3c` has three main modes:

* `--hic`: generate Hi-C-like triplet reads from a digested genome.
* `--microc`: generate Micro-C-like triplet reads directly from a FASTA.
* `--analyse`: analyse a `.pairs`, `.pairs.gz` or `.pairsam` file and compare recovered alignments to the encoded truth.

## Installation

From source:

```bash
git clone <repo-url>
cd <repo>
uv sync
uv run bench3c --help
```

Or with pip, once a packaged release is available:

```bash
pip install bench3c
bench3c --help
```

## Hi-C simulation

Generate Hi-C-like paired-end FASTQ reads:

```bash
bench3c --hic \
  --fasta genome.fa \
  --site GATC \
  --out bench/hic_sim \
  --number-reads-pairs 100000 \
  --read-len 150 \
  --min-piece 50 \
  --max-jump 300
```

If `--fasta` is not provided, `bench3c` can generate a random FASTA. If `--digested` is not provided, the genome is digested internally using `--site`.
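To make the idea of internal digestion concrete, here is a minimal sketch of an in-silico digest. It is not the implementation used by `bench3c`. The function name `digest` is hypothetical, and placing the cut at the first base of each site occurrence is an assumption (real enzymes cut at an enzyme-specific offset within the site).

```python
def digest(sequence: str, site: str = "GATC") -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, end) fragments of `sequence`.

    Assumption: the cut is placed at the first base of each site occurrence.
    """
    seq = sequence.upper()
    site = site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos)
        # Step by one base so overlapping occurrences are also found.
        pos = seq.find(site, pos + 1)
    bounds = [0, *cuts, len(seq)]
    # Drop empty fragments, e.g. when a site sits at position 0.
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]
```

For example, `digest("AAGATCCCGATCTT")` yields three fragments, with boundaries at the two `GATC` occurrences.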

Typical outputs:

```text
bench/hic_sim_R1.fq
bench/hic_sim_R2.fq
```

## Micro-C simulation

Generate Micro-C-like paired-end FASTQ reads:

```bash
bench3c --microc \
  --fasta genome.fa \
  --out bench/microc_sim \
  --number-reads-pairs 100000 \
  --read-len 150 \
  --min-piece 50
```

Add non-chimeric pairs:

```bash
bench3c --microc \
  --fasta genome.fa \
  --out bench/microc_mixed \
  --number-reads-pairs 100000 \
  --read-len 150 \
  --prop-nonchimeric 0.2
```

## Analysis mode

Analyse a `.pairs`, `.pairs.gz`, `.pairsam` or `.pairsam.gz` file after mapping and reconstruction:

```bash
bench3c --analyse \
  --pairs output.pairs.gz \
  --read-len 150 \
  --analyse-out-dir benchmark_results \
  --condition my_pipeline
```

The analyser expects read names following the truth-encoding format:

```text
chrA-startA-endA:chrB-startB-endB::chrC-startC-endC
```

It also expects recoverable alignment information, typically through `sam1` and `sam2` columns in the `.pairs` / `.pairsam` file.
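Before running the analysis, it can be useful to check that a file declares these columns in its `#columns:` header line. The sketch below is illustrative and not part of `bench3c`. The helper names `pairs_columns` and `has_sam_columns` are hypothetical, and both take an iterable of text lines rather than a path.

```python
from typing import Iterable


def pairs_columns(lines: Iterable[str]) -> list[str]:
    """Return the column names declared in the '#columns:' header line."""
    for line in lines:
        if not line.startswith("#"):
            break  # header is over, no need to scan the records
        if line.startswith("#columns:"):
            return line[len("#columns:"):].split()
    return []


def has_sam_columns(lines: Iterable[str]) -> bool:
    """True if both 'sam1' and 'sam2' are declared in the header."""
    cols = pairs_columns(lines)
    return "sam1" in cols and "sam2" in cols
```

An open text handle works as input, for instance one obtained from `gzip.open(path, "rt")` for a compressed file.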

## Analysis outputs

The analysis mode writes summary tables and plots, including:

```text
<condition>_read_recovery.tsv
<condition>_problem_fragments.tsv
<condition>_problem_summary.tsv
<condition>_cut_summary.tsv
<condition>_histogram.pdf
<condition>_tolerance_curve.pdf
```

Depending on the installed version, additional outputs may include chimeric-size summaries and chimeric-specific histograms.

## Typical benchmark workflow

```bash
# 1. Generate synthetic reads
bench3c --microc \
  --fasta genome.fa \
  --out bench/microc \
  --number-reads-pairs 10000 \
  --read-len 150

# 2. Map the reads with your mapper (bwa shown here; index the genome first with `bwa index genome.fa`)
bwa mem -SP genome.fa bench/microc_R1.fq bench/microc_R2.fq > bench/microc.sam

# 3. Convert the mappings to .pairs with your pipeline (assumed here to write bench/output.pairs.gz)

# 4. Analyse fragment recovery
bench3c --analyse \
  --pairs bench/output.pairs.gz \
  --read-len 150 \
  --analyse-out-dir bench/results \
  --condition my_pipeline
```

## Interpretation

A read is perfect when every expected fragment is recovered and no extra fragment is observed.

Common failure classes:

* `missing`: an expected fragment was not recovered.
* `too_short`: the observed fragment is shorter than the truth.
* `too_long`: the observed fragment is longer than the truth.
* `over_split`: extra observed fragments were recovered.
* `under_split_or_missing`: one or more truth fragments were not recovered.
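The classes above can be sketched as a simple comparison of fragment lengths. This is a simplified illustration under stated assumptions, not the logic used by `bench3c`: observed fragments are assumed to be already matched to expected fragments by position, the `tolerance` parameter (in bases) and the labels `ok`, `perfect` and `imperfect` are hypothetical, and so are the function names.

```python
from typing import Optional


def classify_fragment(expected_len: int, observed_len: Optional[int],
                      tolerance: int = 0) -> str:
    """Label one expected fragment against its observed counterpart."""
    if observed_len is None:
        return "missing"
    diff = observed_len - expected_len
    if diff < -tolerance:
        return "too_short"
    if diff > tolerance:
        return "too_long"
    return "ok"  # hypothetical label for a fragment within tolerance


def classify_read(expected: list[int], observed: list[int],
                  tolerance: int = 0) -> str:
    """Label a whole read from its expected and observed fragment lengths."""
    if len(observed) > len(expected):
        return "over_split"
    if len(observed) < len(expected):
        return "under_split_or_missing"
    # Same number of fragments: compare them pairwise, in order.
    labels = [classify_fragment(e, o, tolerance)
              for e, o in zip(expected, observed)]
    return "perfect" if all(label == "ok" for label in labels) else "imperfect"
```

With `tolerance=0` a fragment counts as recovered only if its length matches exactly. Raising the tolerance shows how many reads are recovered to within a given number of bases.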

## Limitations

`bench3c` is a controlled synthetic benchmark. It does not fully model all experimental biases of real Hi-C or Micro-C libraries, such as PCR duplicates, GC bias, mappability bias, restriction efficiency, ligation bias, base-quality degradation, optical duplicates, or complex multi-mapping.

It is intended to test whether a pipeline can recover known chimeric or multiplex structures under controlled conditions.

## License

AGPL-3.0-or-later.
