Metadata-Version: 2.4
Name: bunbun
Version: 0.1.1
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Dist: polars>=1.0
Requires-Dist: numpy>=1.24
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: hypothesis ; extra == 'dev'
Requires-Dist: maturin[develop] ; extra == 'dev'
Provides-Extra: dev
Summary: Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats
Home-Page: https://github.com/nahid18/bunbun
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# 🐰 bunbun

**Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats.**

bunbun reads genotype data from any major format and gives you back the same three things every time:

| output | type | what |
|---|---|---|
| `.samples` | `polars.DataFrame` | sample metadata (ID, sex, family, phenotype, …) |
| `.variants` | `polars.DataFrame` | variant metadata (chrom, pos, ref, alt, …) |
| `.dosages` | `numpy.ndarray` | genotype matrix — `(n_samples × n_variants)`, float32 |

---

## Install

```bash
uv pip install bunbun
```

To build from source:

```bash
git clone https://github.com/nahid18/bunbun.git && cd bunbun
uv venv
source .venv/bin/activate          # Linux/macOS
# source .venv/Scripts/activate    # Windows (Git Bash), use instead of the line above
uv pip install maturin numpy pytest polars

# install from source
uv pip install -e .
```

## Quick start

```python
import bunbun

# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")

data.samples    # polars DataFrame: sample_id, family_id, sex, …
data.variants   # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages    # numpy array: shape (500, 10000), dtype float32

data.shape      # (500, 10000)
```

### Format-specific readers

```python
data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")
```

### Operations

```python
# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies()   # numpy array, length n_variants

# Missing rate per variant
miss = data.missing_rates()

# Mean-impute missing values (in place)
data.mean_impute()

# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)

# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
```
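
For intuition, the per-variant statistics above can be written out in plain Python under the dosage encoding listed under "Output schema" (0/1/2, with `NaN` for missing). This is a minimal sketch, not bunbun's implementation: the functions borrow the method names but operate on a plain list of rows, and they assume that the reported frequency is that of the alternate allele (mean non-missing dosage divided by 2) and that a variant with no observed genotypes stays `NaN` after imputation.

```python
import math


def allele_frequencies(dosages):
    """Alt-allele frequency per variant: mean of non-missing dosages / 2."""
    n_variants = len(dosages[0]) if dosages else 0
    freqs = []
    for j in range(n_variants):
        observed = [row[j] for row in dosages if not math.isnan(row[j])]
        # Each diploid sample carries two alleles, hence the factor of 2.
        freqs.append(sum(observed) / (2 * len(observed)) if observed else math.nan)
    return freqs


def missing_rates(dosages):
    """Fraction of samples with a NaN dosage, per variant."""
    n_samples = len(dosages)
    n_variants = len(dosages[0]) if dosages else 0
    return [
        sum(math.isnan(row[j]) for row in dosages) / n_samples
        for j in range(n_variants)
    ]


def mean_impute(dosages):
    """Replace NaN with the per-variant mean of observed dosages, in place."""
    n_variants = len(dosages[0]) if dosages else 0
    for j in range(n_variants):
        observed = [row[j] for row in dosages if not math.isnan(row[j])]
        if not observed:
            continue  # assumption: nothing to impute from, leave NaN
        mean = sum(observed) / len(observed)
        for row in dosages:
            if math.isnan(row[j]):
                row[j] = mean


nan = math.nan
# 4 samples (rows) x 3 variants (columns); the last variant is fully missing.
dosages = [
    [0.0, 2.0, nan],
    [1.0, nan, nan],
    [2.0, 2.0, nan],
    [1.0, 0.0, nan],
]

freqs = allele_frequencies(dosages)
miss = missing_rates(dosages)
mean_impute(dosages)
```

Computing the frequencies before imputing matters for the missing rates only; mean imputation leaves each variant's mean dosage, and therefore its frequency, unchanged.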

### Works with your existing stack

```python
import bunbun
import polars as pl
import numpy as np
from sklearn.decomposition import PCA

data = bunbun.read("cohort.bed")

# Filter variant metadata to chromosome 1 with polars
chr1_variants = data.variants.filter(pl.col("chrom") == "1")

# Feed dosages straight into sklearn
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))
```

---

## Supported formats

| Format | Extensions | Compression | Dosage |
|--------|-----------|-------------|--------|
| PLINK 1.x | `.bed` `.bim` `.fam` | — | hard calls |
| VCF | `.vcf` `.vcf.gz` | gzip, bgzf | GT or DS field |
| PLINK 2 | `.pgen` `.pvar` `.psam` | — | hard calls |
| BGEN | `.bgen` `.sample` | zlib, zstd | probabilistic |
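
The table above is enough to sketch how extension-based auto-detection could work when any file of a fileset is passed to `bunbun.read`. The following is an illustration only: `detect_format`, `EXTENSION_TO_FORMAT`, and the format labels are hypothetical names, and bunbun's actual detection logic may differ (for example, by also inspecting file contents).

```python
from pathlib import Path

# Hypothetical lookup built from the "Supported formats" table.
EXTENSION_TO_FORMAT = {
    ".bed": "plink1", ".bim": "plink1", ".fam": "plink1",
    ".vcf": "vcf",
    ".pgen": "plink2", ".pvar": "plink2", ".psam": "plink2",
    ".bgen": "bgen", ".sample": "bgen",
}


def detect_format(path):
    """Guess the genotype format from the file name alone."""
    name = Path(path).name.lower()
    # The only compressed extension listed in the table is `.vcf.gz`.
    if name.endswith(".vcf.gz"):
        return "vcf"
    suffix = Path(name).suffix
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unrecognized genotype file extension: {path!r}") from None


for example in ["cohort.bed", "cohort.fam", "calls.vcf.gz", "imputed.pgen", "ukb.bgen"]:
    print(example, "->", detect_format(example))
```

Because every member of a fileset maps to the same label, passing `cohort.bim` or `cohort.fam` resolves to the same reader as `cohort.bed` in this sketch.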

## Output schema

### `data.samples` — polars DataFrame

| column | dtype | notes |
|--------|-------|-------|
| `sample_id` | Utf8 | IID (PLINK) or sample column (VCF) |
| `family_id` | Utf8 | FID or `"."` |
| `paternal_id` | Utf8 | or `"0"` |
| `maternal_id` | Utf8 | or `"0"` |
| `sex` | Utf8 | `"male"` / `"female"` / `"unknown"` |
| `phenotype` | Float64 | or `NaN` |

### `data.variants` — polars DataFrame

| column | dtype | notes |
|--------|-------|-------|
| `chrom` | Utf8 | string for X/Y/MT compatibility |
| `pos` | UInt32 | 1-based bp position |
| `id` | Utf8 | rsID or `chrom:pos:ref:alt` |
| `ref` | Utf8 | reference allele |
| `alt` | Utf8 | alternate allele(s), comma-separated |
| `cm` | Float64 | genetic distance or `NaN` |
| `qual` | Float64 | quality score or `NaN` |
| `filter` | Utf8 | `"PASS"` / filter string |

### `data.dosages` — numpy ndarray

- Shape: `(n_samples, n_variants)`
- Dtype: `float32`
- Encoding: `0.0` = hom-ref, `1.0` = het, `2.0` = hom-alt, `NaN` = missing
- For imputed data (BGEN, VCF with DS): continuous values in `[0, 2]`
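
For imputed data, a continuous value in `[0, 2]` is conventionally the expected alternate-allele count given the three genotype probabilities. The sketch below shows that calculation; `expected_dosage` is a hypothetical helper, and whether bunbun derives BGEN dosages exactly this way is an assumption.

```python
import math


def expected_dosage(p_hom_ref, p_het, p_hom_alt, tol=1e-6):
    """Expected alt-allele count: 0*P(hom-ref) + 1*P(het) + 2*P(hom-alt)."""
    total = p_hom_ref + p_het + p_hom_alt
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"genotype probabilities sum to {total}, expected 1")
    return p_het + 2.0 * p_hom_alt


# Certain genotypes reproduce the hard-call encoding (0.0, 1.0, 2.0).
hard_calls = [
    expected_dosage(1.0, 0.0, 0.0),
    expected_dosage(0.0, 1.0, 0.0),
    expected_dosage(0.0, 0.0, 1.0),
]

# An uncertain genotype lands between the hard-call values.
uncertain = expected_dosage(0.1, 0.3, 0.6)
```

Hard calls are the special case in which one probability equals 1, which is why both kinds of data fit in the same float32 matrix.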

---

## Rust API

```rust
use bunbun::{GenotypeReader, PlinkReader};

// Auto-detect
let data = bunbun::read("cohort.bed").unwrap();

// Or explicit
let reader = PlinkReader::new("cohort.bed");
let data = reader.read_all().unwrap();

println!("{} × {}", data.n_samples(), data.n_variants());
println!("{}", data.samples.df);   // polars DataFrame
let freqs = data.dosages.allele_frequencies();
```

---

## Design

- **One output type.** `GenotypeData` is the same regardless of input format.
- **Zero-copy where possible.** PLINK `.bed` is memory-mapped; numpy gets the ndarray directly.
- **Parallel by default.** Genotype decoding uses rayon across variants.
- **Small-string optimization.** The `Allele` type stores ≤7-byte alleles inline (covers >99% of SNPs).
- **Transparent compression.** gzip, bgzf, zstd, bzip2 detected from magic bytes.
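
Detection from magic bytes, as mentioned in the last point, can be illustrated with the published signatures of each container. This is a sketch rather than bunbun's code: `sniff_compression` is a hypothetical name, and the BGZF check is simplified by assuming the `BC` extra subfield comes first in the gzip header, as it does in standard BGZF blocks.

```python
import bz2
import gzip


def sniff_compression(header: bytes) -> str:
    """Classify a stream by its first bytes (pass at least 14 bytes)."""
    if header.startswith(b"\x28\xb5\x2f\xfd"):  # zstd frame magic
        return "zstd"
    if header.startswith(b"BZh"):  # bzip2 stream magic
        return "bzip2"
    if header.startswith(b"\x1f\x8b"):  # gzip member magic
        # BGZF is gzip with the FEXTRA flag set and a "BC" extra subfield.
        is_bgzf = (
            len(header) >= 14
            and bool(header[3] & 0x04)
            and header[12:14] == b"BC"
        )
        return "bgzf" if is_bgzf else "gzip"
    return "none"


# Hand-built BGZF block header: magic, CM=8, FLG=4, MTIME, XFL, OS, XLEN=6,
# then the "BC" subfield with SLEN=2 and a placeholder BSIZE.
bgzf_header = (
    b"\x1f\x8b\x08\x04" + b"\x00" * 4 + b"\x00\xff"
    + b"\x06\x00" + b"BC" + b"\x02\x00" + b"\x1b\x00"
)

detected = {
    "gzip": sniff_compression(gzip.compress(b"ACGT")),
    "bzip2": sniff_compression(bz2.compress(b"ACGT")),
    "bgzf": sniff_compression(bgzf_header),
    "zstd": sniff_compression(b"\x28\xb5\x2f\xfd" + b"\x00" * 10),
    "none": sniff_compression(b"##fileformat=VCFv4.2\n"),
}
```

Checking content rather than the file extension means a gzip-compressed VCF named `calls.vcf` would still be classified correctly in this sketch.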

---

## License

MIT

