Metadata-Version: 2.4
Name: fertilizer-genomics
Version: 0.1.0
Summary: Identify fertile ground in genomes for minimal-edit regulatory design.
Project-URL: Homepage, https://github.com/jmschrei/fertilizer
Project-URL: Issues, https://github.com/jmschrei/fertilizer/issues
Author: Jacob Schreiber
License: MIT License
        
        Copyright (c) 2026 Jacob Schreiber
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: joblib>=1.3
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: pybigwig>=0.3.22
Requires-Dist: scipy>=1.10
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# fertilizer

The **Fertile Ground Hypothesis** is that genomes are full of "almost-regulatory" regions — sequences that do not do anything on their own, but can be minimally edited to achieve subtle and precise activity. Many near-motifs, for example, sit one or two substitutions away from binding a transcription factor and recruiting its downstream regulatory activity. `fertilizer` helps identify the fertile ground in a genome that is most useful for your design task.

> [!WARNING]
> Everything about this package, including the unit tests and this README (except for this line), was generated by Claude. We are in the process of validating everything, but use with caution for now.

## Installation

`fertilizer` is installable with [uv](https://docs.astral.sh/uv/). The PyPI
distribution is named `fertilizer-genomics` (the `fertilizer` name was taken),
but the importable module is still `fertilizer`.

```bash
# install from PyPI
uv pip install fertilizer-genomics

# or install directly from GitHub
uv pip install git+https://github.com/jmschrei/fertilizer.git

# or install from a local clone
git clone https://github.com/jmschrei/fertilizer.git
cd fertilizer
uv pip install -e .
```

For a development environment with test and lint tooling:

```bash
uv venv
uv pip install -e ".[dev]"
pytest tests/
```

## Usage

`fertilizer` exposes two subcommands. Typical workflow:

```bash
fertilizer extract -w A.bw B.bw C.bw -b regions.bed -o signals.tsv
fertilizer diff    -i signals.tsv -c A B C -o diff.tsv
```

### `fertilizer extract` — signal aggregation

Compute the mean signal in each bigWig over each region in the concatenated BED input. One row per region, one column per bigWig. Input row order is preserved.

| flag | description |
| --- | --- |
| `-w`, `--bigwigs` | one or more bigWig signal tracks |
| `-b`, `--beds` | one or more BED region files (first three columns are used; `#` comments are skipped) |
| `-o`, `--output` | path to the output TSV |
| `-s`, `--stat` | per-region summary statistic: `mean` (default), `max`, `min`, `sum`, `std`, `coverage`. Maps directly to pyBigWig's `stats(type=...)` |
| `-j`, `--n-jobs` | parallel workers (default `-1`, all cores) |

Zeros never mean "missing" — the output is always numeric, never `NaN`. A region whose mean signal genuinely resolves to zero (empty bigWig, uncovered span) is reported as `0.0` silently. A region with a locus-level problem (unknown chromosome, coordinates past the end of the chromosome, zero-length interval, negative start) is also reported as `0.0` but triggers a single `FertilizerWarning` — one warning per distinct issue type per run, regardless of how many rows or bigWigs were affected.

Example `signals.tsv`:

```
chrom   start   end     A       B       C
chr1    0       500     2.0     7.0     3.1
chr1    500     1000    6.0     7.0     5.8
chr1    400     600     4.0     7.0     4.4
chr2    0       100     0.0     0.0     0.0
```

Column names come from each bigWig's filename stem, so passing two bigWigs with the same basename (even from different directories) is rejected up front.

### `fertilizer diff` — differential activity

Identify loci that vary across two or more conditions. Takes a TSV whose columns include (at minimum) the non-negative numeric columns named via `-c` — the output of `fertilizer extract`, which also carries `chrom`/`start`/`end`, is the canonical input and is passed through verbatim — names the columns to compare, and writes a **filtered** TSV containing only loci that pass the significance threshold, with effect size, p-value, q-value, and the name of the condition carrying the highest signal. Any columns present in the input beyond the tested conditions are preserved in the output.

The test is a negative-binomial GLM likelihood-ratio test per locus, inspired by [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) (Love, Huber, Anders, 2014) and adapted to the **one-replicate-per-condition** setting this package targets. Steps:

1. **Size factors.** DESeq2's median-of-ratios, computed jointly across all conditions from loci with positive signal in every condition.
2. **Dispersion.** With no within-condition replicates, per-locus variance cannot be estimated from repeated observations. A dispersion parameter `α` is estimated *across* loci under the null-majority assumption. Default (`--fit-type common`) uses the median of per-locus method-of-moments estimates, scaled by a closed-form correction for the small-df median-of-χ² bias; an alternative parametric trend `α(μ) = a/μ + b` (`--fit-type parametric`) is available, fit by robust (MAD-trimmed) weighted least squares and subject to the same scale correction.
3. **NB-GLM LRT + BH.** Per locus, the null (all conditions share one mean `μ₀`) is fit by vectorized Newton iteration on the intercept-only NB score equation; the full model is saturated (`μⱼ = Xⱼ`). The statistic `T = 2·(ℓ_full − ℓ_null)` is distributed as χ²(K−1) asymptotically under the null. Benjamini–Hochberg q-values control the FDR across loci.

> **FDR assumes independent (or positively dependent) loci.** BH controls FDR under independence or positive regression dependency (PRDS). BigWig signal on adjacent windows is correlated — overlapping tiling windows, shared peaks, or bin sizes smaller than the underlying signal's autocorrelation length all introduce dependence. For typical "one row per peak / per gene" inputs this is fine. For dense sliding-window inputs (e.g. 100 bp windows stepped every 50 bp), q-values will be optimistic; prefer non-overlapping windows or thin by autocorrelation length before trusting the FDR.

Columns appended to the output:

| column | meaning |
| --- | --- |
| `effect_size` | max_k \|log2(X_k/s_k + pc) − log2(grand_mean + pc)\|, i.e. the largest fold change of any single condition vs the per-locus grand mean on the size-factor-normalized scale (computed as a log-difference; equivalent to a log2 ratio when pc is small) |
| `p_value` | χ²(K−1) p-value from the NB-GLM LRT |
| `q_value` | Benjamini–Hochberg q-value |
| `max_condition` | name of the condition column with the highest normalized signal |

Size factors, the estimated dispersion fit, its trend coefficients, and the number of kept/total loci are printed to stderr. The dispersion-fit label is one of:

| label | meaning |
| --- | --- |
| `common` | `--fit-type common` succeeded (single α = median of per-locus MoM estimates) |
| `parametric` | `--fit-type parametric` succeeded (fitted `α(μ) = a/μ + b`) |
| `common-fallback` | `--fit-type parametric` was requested but failed; fell back to the common fit |
| `override` | `--dispersion` was supplied; the fixed α was used at every locus |
| `zero` | `--fit-type zero` was supplied; Poisson was forced at every locus |

Size-factor spread, Poisson fallbacks, `--fit-type zero`, `common-fallback`, and K=2 calibration all emit a `FertilizerDiffWarning` on stderr.

| CLI flag | effect |
| --- | --- |
| `-i`, `--input` | input TSV |
| `-c`, `--conditions` | two or more column names to compare |
| `-o`, `--output` | output TSV, filtered to loci passing the threshold, with extra columns appended |
| `--q-threshold` | keep loci with q ≤ this (default `0.05`; set to `1.0` to keep all rows) |
| `--p-threshold` | additionally keep only loci with raw p ≤ this (default: off) |
| `--fit-type` | dispersion model: `common` (default, median of MoM estimates, bias-corrected), `parametric` (fits `α(μ) = a/μ + b`), or `zero` (forces Poisson — diagnostic only, strictly anti-conservative if real overdispersion exists) |
| `--min-signal` | minimum mean normalized signal for loci included in dispersion estimation (default `5.0`) |
| `--dispersion` | override the fitted α with a fixed value applied to every locus; bypasses `--fit-type` and `--min-signal` entirely and sets `dispersion_fit = override`. Useful for sensitivity analyses (e.g. re-run at 0.05 and 0.10 to see how much calls depend on α) |
| `--pseudocount` | pseudocount for the effect-size log2 transform only; does not affect the LRT. Must be > 0 (default `0.5`) |

**Differences from DESeq2** (non-exhaustive):

- **Per-locus dispersion MLE.** DESeq2 uses Cox-Reid adjusted profile likelihood; we use method-of-moments. MoM is less efficient per locus but is consistent and does not fail to converge.
- **Shrinkage.** DESeq2 shrinks per-locus dispersion toward the trend via a log-normal empirical-Bayes prior and retains dispersion outliers. We use the trend value directly for every locus (equivalent to infinite shrinkage, no outlier retention) — robust for the small-K setting, but cannot capture genuinely heterogeneous per-locus dispersion.
- **Bias correction.** Because median(χ²(K−1))/(K−1) < 1 for small K (0.69 at K=3, 0.84 at K=5, 0.91 at K=8), median-of-MoM underestimates α. We apply the closed-form correction.
- **Log2 fold change shrinkage.** DESeq2 optionally shrinks LFC estimates (apeglm / ashr); we report raw `max |log2FC vs grand mean|` as the effect size.
- **Observation-level outliers.** DESeq2 uses Cook's distance to flag and optionally refit without outliers. We don't.
- **Independent filtering.** DESeq2 filters low-signal loci out of multiple-testing correction to maximize power at a given FDR. We don't — use `--min-signal` (dispersion-only) or pre-filter the input TSV if you want this.
- **Arbitrary designs.** DESeq2's LRT supports any `full` vs `reduced` formula. We hard-code the all-conditions-equal null vs per-condition-mean full.
- **Integer counts.** DESeq2 is designed for integer RNA-seq counts; the NB likelihood here is evaluated with `scipy.special.gammaln` and is numerically correct for any non-negative float input. (This matches the common practice of passing fractional RSEM/salmon expected counts to DESeq2 via `tximport`, and is required here because bigWig region means are real-valued.)

**Known calibration behavior.** Our test suite verifies Type-I error at α=0.05 stays under 0.08 (~1.6× nominal) on Poisson nulls for K ∈ {3, 5, 8} and μ ∈ {30, 100, 500}, and under 0.09 (~1.8× nominal) on NB nulls for K ∈ {3, 5, 8}, μ = 100, α_true ∈ {0.05, 0.10}. At low μ (< ~5) the delta-method and the median-bias correction both degrade, and low-μ loci are excluded from dispersion estimation via `--min-signal`. If you care about exact FDR for a specific dataset, cross-validate against DESeq2.

> **Supply a mix of positive and negative loci.** The size-factor and dispersion steps both assume that *most* loci are **not** differential — the "null majority" is what anchors the normalization and the variance estimate. If you run `diff` on a set of regions pre-filtered to be those you expect to change, you will get sub-optimal results: the estimated library-size differences will absorb real biological differences, and the dispersion estimate will be inflated by the true positives. For best results pass a genome-wide or background-matched set of loci containing both putative-differential and expected-stable regions.

Example (three conditions):

```bash
fertilizer diff -i signals.tsv -c A B C -o diff.tsv --q-threshold 0.05
```

Output `diff.tsv` (filtered, numbers illustrative):

```
chrom   start   end     A     B     C       effect_size     p_value     q_value     max_condition
chr1    0       500     2.0   2.1   8.4     2.07            0.0004      0.0032      C
chr3    1200    1700    15.2  3.6   4.1     2.08            0.0011      0.0060      A
chr7    900     1400    8.8   8.9   30.1    1.78            0.0018      0.0090      C
```

## Python API

Both subcommands are thin wrappers around library functions. The same work can be done from Python:

```python
import numpy as np
import pandas as pd

from fertilizer.extract import bigwig_region_means, load_regions, FertilizerWarning
from fertilizer.diff import (
    differential_analysis,
    size_factors,
    bh_qvalues,
    DiffResult,
    FertilizerDiffWarning,
)

# --- extract: region means for one bigWig -------------------------
regions = load_regions(["regions.bed"])                  # chrom/start/end
values, issues = bigwig_region_means(regions, "A.bw")    # np.ndarray, set[str]
# `issues` is a subset of {"missing_chrom", "out_of_bounds", "invalid_region"}

# --- diff: NB-GLM LRT on a (n_loci, n_conditions) array -----------
counts = pd.read_csv("signals.tsv", sep="\t")[["A", "B", "C"]].to_numpy(float)
res: DiffResult = differential_analysis(counts, fit_type="common")
# res.p_value, res.q_value, res.effect_size, res.size_factors,
# res.per_locus_dispersion, res.dispersion_fit, res.dispersion_trend,
# res.max_condition_idx  (all np.ndarray except the two scalars)
```

`FertilizerWarning` (locus-level issues from `extract`) and `FertilizerDiffWarning` (size-factor spread, Poisson fallbacks from `diff`) are both `UserWarning` subclasses — catch them with `warnings.catch_warnings()` or filter them with `warnings.simplefilter(..., FertilizerWarning)`.

## Project layout

```
fertilizer/
├── pyproject.toml             # package metadata + uv/hatchling build config
├── README.md
├── LICENSE
├── src/
│   └── fertilizer/
│       ├── __init__.py
│       ├── cli.py             # top-level argparse dispatcher
│       ├── extract.py         # signal aggregation + `extract` subcommand
│       └── diff.py            # differential-activity analysis + `diff` subcommand
└── tests/
    ├── test_cli.py
    └── test_diff.py
```
