Metadata-Version: 2.4
Name: skein-glm
Version: 0.8.0
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: scikit-learn>=1.3
Requires-Dist: snakemake>=8 ; extra == 'bench'
Requires-Dist: pyarrow>=15 ; extra == 'bench'
Requires-Dist: pyyaml>=6 ; extra == 'bench'
Requires-Dist: pandas>=2.1 ; extra == 'bench'
Requires-Dist: matplotlib>=3.8 ; extra == 'bench'
Requires-Dist: seaborn>=0.13 ; extra == 'bench'
Requires-Dist: skglm>=0.3 ; extra == 'bench'
Requires-Dist: celer>=0.7 ; extra == 'bench'
Requires-Dist: statsmodels>=0.14 ; extra == 'bench'
Requires-Dist: lifelines>=0.27 ; extra == 'bench'
Requires-Dist: pytest>=7 ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: maturin>=1.5 ; extra == 'dev'
Requires-Dist: ruff ; extra == 'dev'
Requires-Dist: mypy ; extra == 'dev'
Requires-Dist: sphinx>=7.4 ; extra == 'docs'
Requires-Dist: furo>=2024.5 ; extra == 'docs'
Requires-Dist: myst-parser>=4 ; extra == 'docs'
Requires-Dist: sphinx-copybutton>=0.5 ; extra == 'docs'
Requires-Dist: sphinx-design>=0.6 ; extra == 'docs'
Provides-Extra: bench
Provides-Extra: dev
Provides-Extra: docs
License-File: LICENSE
Summary: Weighted structured nonconvex sparse models (Python + Rust)
Keywords: sparse,nonconvex,mcp,scad,group-lasso,regression
Author-email: David Villacis <david@villacis.net>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/dvillacis/skein

# skein

Weighted structured nonconvex sparse models. Rust core + Python API.

> **Documentation:** the [docs site](docs/index.md) has the full
> conceptual reference (penalties, datafits, weights, backends),
> porting guides for `glmnet` / `ncvreg` / `grpreg`, worked examples,
> and an auto-generated API reference. Built with Sphinx + Furo and
> hosted on Read the Docs (config in `.readthedocs.yaml`); preview
> locally with `sphinx-build -b html docs docs/_build/html`. CI builds
> it with `-W` (warnings = errors) on every PR.

`skein` targets a niche that's well-served in R (`grpreg`, `ncvreg`) but
missing in Python at production quality: nonconvex group-structured
penalties (group MCP, group SCAD, sparse-group nonconvex) with first-class
support for *weights along three axes* — per-sample, per-feature, and
per-group.

## Status

v0.7. Full nonconvex / group / GLM stack is in place; M5 model
selection + inference + threaded CV folds shipped; M11 graphical lasso
family (single + joint) shipped; M12 hardening pass closed the
penalty / datafit unit-test gaps and added a CI smoke job for the
PyO3 layer. See [ROADMAP.md](ROADMAP.md) for the full plan and the M13
performance findings opened by the `benches/v2` release-profile run.

**Done so far:**

- **Solvers** — production CD core (path solver, strong rule + KKT
  verification, gap-safe screening, Anderson acceleration, M13.1
  saturation bypass); group block-CD with LLA outer loop for nonconvex
  group penalties (M13.4 Phase 2.3 weight-space short-circuit);
  Rayon-parallel group sweeps; operator-norm Lipschitz via power
  iteration.
- **Datafits** — least squares, binomial logistic, Poisson (log link,
  with offsets), Cox PH (Breslow + Efron ties), multinomial softmax,
  Huber. All glued together by a `GlmDatafit` trait that exposes a
  weighted-LS surrogate; the M1/M2 inner solvers absorb every GLM
  unchanged.
- **Penalties** — lasso, MCP, SCAD, elastic net, bridge `|β|^q`, group
  lasso, group MCP, group SCAD, group elastic net, sparse-group lasso,
  sparse-group MCP, sparse-group SCAD. Per-feature and per-group
  weights honored throughout.
- **Design-matrix backends** — `DenseMatrix`, `SparseCSC`, lazy
  `Standardized<D>`, `MmapMatrix` (f64 + f32), row-block `Chunked<C>`,
  `Augmented<D>`, `MultiTaskDesign<D>` — all behind one trait, freely
  composable.
- **Python** — sklearn-compatible estimators for every (datafit ×
  penalty) combination (~150 classes); type stubs; warm-started
  λ-paths; standardization with original-scale `coef_` / `intercept_`
  recovery on dense and sparse.
- **Model selection + inference (M5)** — K-fold CV across every
  `*PathCV` class (threaded folds via PyO3 GIL release, ~2.3–2.5×
  speedup); AIC/BIC/EBIC tuning; stability selection (MB bootstrap);
  debiased / desparsified lasso for LS + binomial + Poisson with Wald
  CIs and p-values.
- **Graphical models (M11)** — sparse precision matrix estimation
  (`GraphicalLasso` / `GraphicalMCP` / `GraphicalSCAD`) and joint
  estimation across `K` related populations (`JointGraphicalLasso` /
  `JointGraphicalMCP`, Danaher–Wang–Witten 2014 group form via ADMM),
  with EBIC tuning and bootnet-style bootstrap edge stability.
  Nonconvex penalties on edges close the shrinkage-bias gap that
  `sklearn.covariance.GraphicalLasso` and R's `glasso` / `qgraph` /
  `bootnet` leave open.
- **Distribution + docs (M8) + hardening (M12)** — CI + cibuildwheel +
  Read the Docs + Sphinx site (concepts + R-porting + extending +
  examples + API ref) + R numerical regression suite vs glmnet /
  ncvreg / grpreg + stable Rust API contract. M12 added penalty +
  datafit unit-test coverage, an integration test directory, a CI
  smoke job for the PyO3 layer, and an R-fixture gate.

**Coming next:** the M13 performance workstream — `group_mcp` LLA
overhead at medium scale (Phase 2.3 shipped, native group-MCP BCD
scoped as M13.4b), per-λ fixed-cost cut for convex Lasso (M13.2), and
the publication benchmark suite at `benches/v2/` that backs the
software paper.

## Layout

```
crates/skein-core/   pure Rust: traits + algorithms (no Python)
crates/skein-py/     PyO3 bindings (cdylib → skein_glm._core)
python/skein_glm/    sklearn-compatible estimators + ABCs for extensions
tests/               pytest suite (Rust extension required)
benches/             v1 cross-package harness (skein vs sklearn / skglm / celer / glmnet / ncvreg / grpreg)
benches/v2/          publication-quality Snakemake suite backing the paper
crates/skein-core/benches/   internal Rust criterion microbenches
paper/               figure + table bundle regenerated by benches/v2
docs/                Sphinx site (Read the Docs)
```

The Rust traits (`DesignMatrix`, `Datafit`, `GlmDatafit`, `Penalty`,
`GroupPenalty`) and their Python ABC mirrors
(`skein_glm.penalties.Penalty`, etc.) are the extension surface for
downstream per-paper projects.

## Quick start

```python
import numpy as np
from skein_glm import MCPPathRegressor, LogisticGroupMCPPathRegressor, CoxMCPRegressor

# Nonconvex sparse least squares with a λ-path.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([1.5, -2.0, 0.8]) + 0.1 * rng.standard_normal(n)
model = MCPPathRegressor(gamma=3.0, n_lambdas=50, standardize=True).fit(X, y)
print(model.coefs_[-1, :5], model.intercepts_[-1])

# Logistic + group MCP via LLA, with sklearn-style predict/predict_proba.
groups = np.repeat(np.arange(p // 5), 5)  # 5 features per group
y_bin = (X[:, :3].sum(axis=1) > 0).astype(float)
clf = LogisticGroupMCPPathRegressor(groups=groups, gamma=3.0, n_lambdas=20).fit(X, y_bin)
proba = clf.predict_proba(X)  # shape (n, n_lambdas)

# Cox PH with right-censored survival data.
time = rng.exponential(1.0 / np.exp(X[:, :3].sum(axis=1)))
event = rng.uniform(size=n) < 0.7
cox = CoxMCPRegressor(lambda_=0.01, gamma=3.0).fit(X, time, event.astype(float))
risk = cox.predict(X)  # prognostic index η
```

Every regressor follows the same `(datafit) × (penalty)` × `({,Path}Regressor)`
naming scheme. The path variants warm-start across λ; their `coefs_` /
`intercepts_` (where applicable) are 2D arrays indexed by λ.

## Performance

skein is benchmarked against sklearn / skglm / celer / glmnet / ncvreg
on shared λ-grids via the harness under `benches/`. Headline numbers
(Apple M1, 16 GB; median of N timed trials after a warm-up):

Each scenario is run in two regimes that name what the *solution* does
at the tail of the λ-path, not the path geometry:

- **dense** — `λ_min/λ_max = 1e-3`; the active set saturates near the
  smallest λ (typical "I want the full path including the over-fit
  tail" usage).
- **sparse** — `λ_min/λ_max = 5e-2`; the path stops near support
  recovery, support stays small throughout.

| scenario | size | skein | next-fastest comparator |
|---|---|---|---|
| Lasso LS — dense  | medium (n=10k, p=1k)  | 1.17 s     | sklearn 0.125 s |
| Lasso LS — sparse | medium                | 0.78 s     | sklearn 0.099 s |
| MCP   LS — dense  | medium                | **1.37 s** | skglm 3.35 s    |
| MCP   LS — sparse | medium                | **0.75 s** | ncvreg 1.17 s   |
| MCP   LS — dense  | large (n=100k, p=10k) | **510 s**  | skglm 666 s     |
| MCP   LS — sparse | large                 | **497 s**  | skglm 702 s     |
| SCAD  LS — dense  | medium                | **1.78 s** | ncvreg 7.99 s   |
| SCAD  LS — sparse | medium                | **0.90 s** | ncvreg 1.86 s   |

skein is the fastest on every nonconvex row across every size; on
convex lasso/LS the sklearn Cython `lasso_path` remains the floor at
~8–9× faster on the medium bench. See
[`docs/benchmarks/mcp_ls.md`](docs/benchmarks/mcp_ls.md) and
[`docs/benchmarks/scad_ls.md`](docs/benchmarks/scad_ls.md) for the
full nonconvex write-ups (correctness matrices + methodology +
per-size tables) and
[`docs/perf/lasso_ls_profile.md`](docs/perf/lasso_ls_profile.md) for
the lasso/LS profiling work that drove M10.

Reproduce with `python benches/run.py --scenarios mcp_ls
mcp_ls_sparse --sizes small,medium`. The publication-quality
benchmark suite under [`benches/v2/`](benches/v2/README.md) drives the
paper figures and tables; see
[`docs/benchmarks/index.md`](docs/benchmarks/index.md) for the layered
overview.

## Build

```bash
# Rust core only (fast iteration on algorithms)
cargo test -p skein-core --lib

# Full Python package (requires maturin in your env)
maturin develop --release
pytest
```

See [`docs/installation.md`](docs/installation.md) for from-source and
development installs, and [`CLAUDE.md`](CLAUDE.md) for the contributor
quickstart (pre-PR checks, solver-change pre-flight protocol, etc.).

## License

MIT.

