Metadata-Version: 2.4
Name: coxstream
Version: 0.1.0
Summary: Exact out-of-core Cox proportional hazards regression via streaming Newton-Raphson
Author-email: Tommy Carstensen <zenodo@tommycarstensen.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/tommycarstensen/coxstream
Project-URL: Repository, https://github.com/tommycarstensen/coxstream
Project-URL: Issues, https://github.com/tommycarstensen/coxstream/issues
Keywords: survival analysis,cox proportional hazards,out-of-core,streaming,epidemiology,statistics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Provides-Extra: parquet
Requires-Dist: pyarrow>=14; extra == "parquet"
Provides-Extra: test
Requires-Dist: pytest>=7; extra == "test"
Dynamic: license-file

# coxstream

**Exact out-of-core Cox proportional hazards regression via streaming
Newton-Raphson.**

[![PyPI](https://img.shields.io/pypi/v/coxstream.svg)](https://pypi.org/project/coxstream/)
<!-- DOI badge: uncomment once the Zenodo record exists.
[![DOI](https://zenodo.org/badge/DOI/TODO.svg)](https://doi.org/TODO)
-->

Standard CoxPH solvers (`lifelines`, `scikit-survival`, R `survival`) load the
full cohort into memory before fitting, so on registry-scale data they exhaust
RAM long before the computation is hard. `coxstream` computes the **exact** Efron
partial-likelihood estimate by streaming a single time-sorted pass over the data
per Newton-Raphson iteration, holding only `O(p^2)` state for `p` covariates.
Working memory is therefore **independent of the number of observations `n`**:
the model fits on a workstation even when the cohort is far larger than RAM.

The streamed estimate *is* the in-memory maximum-likelihood estimate, and the
Efron tie correction is carried across chunk boundaries, so heavily tied data
are handled exactly.

![coxstream holds peak RAM flat as the cohort grows, while in-memory solvers (lifelines, R survival::coxph) scale with n; coefficients agree to machine precision.](https://raw.githubusercontent.com/tommycarstensen/coxstream/main/docs/benchmark.png)

*Memory vs. speed against `lifelines` and R `survival::coxph`: coxstream's peak
RAM stays flat in the number of rows while in-memory solvers grow with the
cohort, at matching coefficients. See the accompanying paper for the full
methodology.*

## Install

```bash
pip install coxstream             # core (numpy only)
pip install coxstream[parquet]    # + out-of-core fit_parquet (pyarrow)
```

The package builds a small Cython kernel, so a C compiler is required.

## Usage

In memory:

```python
import numpy as np
from coxstream import CoxStream

model = CoxStream().fit(durations, events, X, feature_names=names)
print(model.coef_, model.n_iter_)
```

Out of core, from a Parquet file **pre-sorted by descending event time** (never
materialises the cohort):

```python
from coxstream import CoxStream

# The file must already be sorted by duration DESC. `fit_parquet` verifies this
# from the Parquet footer statistics alone (no full pass) and rejects a file
# that is out of order; pass assume_sorted=True to skip the check.
#
# Sort it once with an out-of-core sorter -- both spill to disk, so they handle
# a cohort larger than RAM (a sort-engine benchmark found these the fastest):
#   duckdb:  COPY (SELECT * FROM 'cohort.parquet' ORDER BY duration DESC)
#            TO 'cohort_desc.parquet' (FORMAT PARQUET);
#   polars:  (pl.scan_parquet("cohort.parquet")
#              .sort("duration", descending=True)
#              .sink_parquet("cohort_desc.parquet"))
#   R:       duckdb via its R client runs the same COPY ... ORDER BY DESC.
# If the cohort fits in RAM, skip the file and call .fit, which sorts for you.

model = CoxStream().fit_parquet(
    "cohort_desc.parquet",
    duration_col="duration",
    event_col="event",
    covariate_cols=["age_std", "sex", "treatment"],
)
print(model.coef_)
```

To validate a file's order ahead of time -- a dry run, e.g. a CI or pipeline
gate right after you sort and before a long fit -- call `check_sorted`, which
runs the same footer-only check without fitting and raises on a file that is
provably out of order:

```python
from coxstream import check_sorted

check_sorted("cohort_desc.parquet", duration_col="duration")  # raises if unsorted
```

It doubles as a shell gate -- it exits non-zero on an out-of-order file, so a
pipeline step can fail fast without a bespoke CLI:

```bash
python -c "import coxstream; coxstream.check_sorted('cohort_desc.parquet', 'duration')"
```

## Validation

`coxstream` is verified against `lifelines` and R `survival::coxph`:

- It reproduces the in-memory maximum-likelihood estimate to **machine
  precision** on synthetic data.
- On the heavily tied Synthea 100K cohort (51 % of event times tied) it matches
  `lifelines` to ~`1e-6`.
- Peak resident memory is flat in `n` while in-memory solvers grow with the
  cohort and eventually exhaust RAM.

The package's own test suite is dependency-free: it checks exactness against a
self-contained plain-numpy Cox Newton-Raphson reference. The cross-checks
against `lifelines` and R `survival::coxph` above live in the accompanying
benchmark and paper.

The methodology and full results are in the accompanying paper (see
[Citation](#citation)).

## Scope

`coxstream` implements the exact Efron partial likelihood for large-`n`,
modest-`p` tabular survival data. It is a focused estimator, not a full survival
suite: it does not provide baseline-hazard estimation, time-varying covariates,
or proportional-hazards diagnostics.

## Testing

```bash
pip install -e '.[test]'           # core suite (numpy only)
pip install -e '.[test,parquet]'   # + the out-of-core fit_parquet test
pytest
```

## Citation

If you use `coxstream`, please cite it via the metadata in
[`CITATION.cff`](CITATION.cff).

## License

MIT. See [LICENSE](LICENSE).
