Metadata-Version: 2.4
Name: rrng
Version: 0.1.0
Summary: Reproduce R's native default RNG (set.seed/runif/sample) bit-for-bit in pure Python, no R at runtime
Author: fzhao
License: MIT
Project-URL: Provenance, https://github.com/wch/r-source/blob/trunk/src/main/RNG.c
Keywords: R,RNG,Mersenne-Twister,set.seed,reproducibility,bootstrap
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20
Dynamic: license-file

# `rrng` — reproduce R's native RNG bit-for-bit in pure Python

Run unmodified, *already-published* R randomness (`set.seed()` + `sample()` + `runif()` …)
inside Python and get the **identical** numbers — with **no R installed at runtime**.

```python
from rrng import RRNG

g = RRNG(100)                          # R: set.seed(100)   (sample_kind="rounding" for R < 3.6)
g.unif_rand()                          # one R runif() draw                  -> 0.3077661
g.runif(5)                             # an R runif(5) block (numpy array)
g.rnorm(5)                             # rnorm(5)   (Inversion, R's default)
g.rexp(5, rate=2); g.rpois(5, mu=3)    # rexp / rpois
g.rbinom(5, size=20, prob=0.3)         # rbinom
g.rgamma(5, shape=2.5, scale=2)        # rgamma
g.sample(10, 5, replace=False)         # sample(1:10, 5)            -> 0-based indices
g.sample(10, 8, replace=True, prob=w)  # weighted sample(prob=)
idx = RRNG(100).sample_index(10, 10)   # 0-based sample(1:10, 10, replace=TRUE) -> [9 6 5 2 8 9 6 5 5 3]
```

A `RRNG` is **stateful**, so a single seed stream can be threaded across calls/regions —
essential for reproducing R scripts that `set.seed(s)` *once* and then `map()`/loop, where the
order of consumption must match.

👉 **See [USAGE.md](USAGE.md) for a full guide** (the R→Python cheat-sheet, threaded streams,
0- vs 1-based indexing, common pitfalls).

## Install

Zero hard dependencies beyond NumPy.

```bash
pip install rrng                  # from PyPI (if published)
# or, from a checkout:
pip install -e .                  # editable install of this package
```

It is also importable in place: if the `rrng/` package directory is on your `PYTHONPATH`
(e.g. it sits next to your script), `from rrng import RRNG` just works without installing.

## Quick start

```python
from rrng import RRNG

# --- R ---                          # --- Python (rrng) ---
# set.seed(42)                       g = RRNG(42)
# runif(3)                           g.runif(3)
# sample(1:n, k, replace = TRUE)     g.sample_index(n, k) + 1   # +1 for R's 1-based indices
```

To resample a data array the way R's bootstrap does:

```python
import numpy as np
data = np.array([...])
g = RRNG(123)
resample = data[g.sample_index(data.size, data.size)]   # sample(data, replace=TRUE)
```

## Validation — diff against real R

The library is only worth anything if it is provably identical to R, so the tests diff against
golden vectors generated by **real R** and committed as a fixture:

```bash
# (re)generate fixtures — requires R (any version >= 3.6)
Rscript rrng/tests/generate_golden.R

# run the diff — requires Python + numpy
python rrng/tests/test_rrng.py        # standalone runner
pytest rrng/tests                     # or under pytest
```

`rrng/tests/fixtures/golden_vectors.json` holds `runif` and `sample` (both `sample.kind`s)
across several seeds and `n`/`size`, including `n` large enough to exercise the multi-draw
rejection path. The end-to-end example in `rrng/examples/` reproduces a published PNAS
snow-drought attribution bootstrap (risk ratios, return intervals, confidence bounds) to every
digit, from Python, with no R.

---

## Why this exists (the gap)

Reproducing a published R analysis in Python is common (porting pipelines, checking results,
building Python tooling around an R method). The randomness is almost always the blocker: the
analysis used R's **default** generator via `set.seed(s); sample(...)`, and naive Python ports
silently diverge.

| approach | matches R's default `set.seed`+`sample`? | pure Python (no R)? | reproduces an *existing* published R script unchanged? |
|---|---|---|---|
| **`rpy2`** (embed/call R) | yes (it *is* R) | ❌ needs R installed | yes |
| **`SyncRNG`** (shared Tausworthe RNG in C) | ❌ — a *different* RNG; you must rewrite the R code | ✅ | ❌ |
| **`numpy` MT19937 / `random`** | ❌ — same MT core but different seeding + different `sample()` | ✅ | ❌ |
| **`rrng`** | ✅ | ✅ | ✅ |

The widely-repeated claim "you can't make Python's Mersenne-Twister match R's `set.seed`" is true
for the *naive* approach but not fundamentally true. R's chain is fully specified; three pieces are
usually gotten wrong:

1. **R's seeding scramble** — `set.seed(s)` is *not* the standard MT `init_genrand`. R scrambles the
   seed 50× by `s = 69069*s + 1`, then fills its 625-word state with the same LCG, and sets `mti = 624`.
2. **R's `unif_rand` scaling** — `fixup(MT_genrand() * 1/(2³²−1))`, with R's specific edge `fixup`.
3. **R's `sample()` index method** — since **R 3.6.0** the default is *Rejection* (`R_unif_index` via
   `rbits(ceil(log2 n))` with rejection of draws `≥ n`), **not** the old `floor(n*unif_rand())`
   *Rounding* method. Most ports use Rounding and mismatch on `sample()`.

Get all three right and Python matches R exactly.

## Scope

**Covered (validated bit-for-bit against R 4.5):**
- RNG kind: **Mersenne-Twister** (R's default), exact `set_seed(s)`.
- Uniform: `unif_rand()`, `runif(n)`.
- Normal: `rnorm(n, mean, sd)` — R's default **Inversion** (`qnorm`, Wichura AS 241), 2 draws/value.
- Exponential: `rexp(n, rate)` — Ahrens-Dieter `exp_rand`.
- Poisson: `rpois(n, mu)` — Ahrens-Dieter (both the small-`mu` inversion and big-`mu` rejection branches).
- Binomial: `rbinom(n, size, prob)` — inversion (`np<30`) and **BTPE** (`np≥30`).
- Gamma: `rgamma(n, shape, rate=, scale=)` — **GD** (shape≥1) and **GS** (shape<1).
- Sampling: `sample(n, size, replace=, prob=)` and `sample_index(n, size)` — equal-probability
  with/without replacement, and weighted `prob=` (cumulative, **Walker alias** for >200 heavy
  cells, and sequential no-replacement). Both the **R ≥ 3.6 Rejection** index method
  (`sample_kind="rejection"`, default) and the **R < 3.6 Rounding** method (`sample_kind="rounding"`).
  Returns 0-based indices.

**Not (yet) covered:**
- Other RNG kinds (Wichmann-Hill, Marsaglia-Multicarry, Super-Duper, Knuth-TAOCP, L'Ecuyer-CMRG)
  and non-default `normal.kind`s (Box-Muller, Kinderman-Ramage).
- `rbeta`, `rt`, `rchisq`, `rf`, `rcauchy`, `rlogis`, `rweibull`, `rgeom`, `rnbinom`, `rhyper`, …
- `sample.int(useHash=TRUE)` (only triggers for `n > 1e7` unweighted no-replacement draws).

Honesty about scope is the point: advertise *exactly* what matches R, not "all of R".

## Design

- **Performance:** drive the MT core with NumPy's `MT19937` *seeded to R's exact 624-word state*
  (NumPy's MT == R's MT, so an identical uint32 stream), then apply R's `unif_rand` scaling and the
  rejection `sample` on top, vectorized. Bit-identical to the slow pure-Python reference but ~100×
  faster (a 100-member bootstrap resample runs in <1 s).
- A slow, transparent pure-Python `_genrand` reference is kept in the source for auditability.

## Roadmap

1. MT + `set_seed` + `runif` + `sample_index` (replace=TRUE), both sample kinds. ✅ validated.
2. `rnorm` via **Inversion** (qnorm). ✅ validated.
3. `sample` without replacement; weighted `sample(prob=)` (incl. Walker alias). ✅ validated.
4. `rexp`, `rbinom`, `rpois`, `rgamma`. ✅ validated.
5. Next: more continuous families (`rbeta`, `rchisq`, `rt`, `rf`, `rweibull`, …) and discrete
   (`rgeom`, `rnbinom`, `rhyper`); optional alternate RNG kinds / `normal.kind`s behind flags.

## Positioning

- Need R installed and want everything? → **`rpy2`**.
- Control both sides and just want *a* shared stream? → **`SyncRNG`** (different RNG; rewrite both).
- Need to reproduce an **existing, unmodified, default-RNG R analysis in pure Python**? → **`rrng`**.

## Provenance & license

MIT licensed (see [LICENSE](LICENSE)). The implementation follows R's source
([`src/main/RNG.c`](https://github.com/wch/r-source/blob/trunk/src/main/RNG.c)) and the "Random"
R help page.

*Origin: extracted from a snow-drought attribution project, where it was built to reproduce a
WUS-D3 / PNAS bootstrap bit-for-bit in Python without R.*
