Metadata-Version: 2.4
Name: roost-ai
Version: 1.0.3
Summary: ROOST — deterministic trust layer for AI-written code: calibrated, read-only risk scores for every change + an honest outcome ledger
Project-URL: Homepage, https://github.com/ninoxAI/roost
Project-URL: Repository, https://github.com/ninoxAI/roost
Project-URL: Issues, https://github.com/ninoxAI/roost/issues
Project-URL: Changelog, https://github.com/ninoxAI/roost/blob/main/docs/DECISIONS.md
Author-email: ninoxai <ferbegor@gmail.com>
License: PolyForm-Noncommercial-1.0.0
License-File: LICENSE
License-File: NOTICE
Keywords: calibration,ci,code-review,defect-prediction,devops,github-actions,machine-learning,risk,static-analysis,szz
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Free for non-commercial use
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Version Control :: Git
Classifier: Typing :: Typed
Requires-Python: <3.15,>=3.11
Requires-Dist: duckdb>=1.1
Requires-Dist: lightgbm>=4.3
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.2
Requires-Dist: pydantic>=2.7
Requires-Dist: pydriller>=2.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7
Requires-Dist: scikit-learn>=1.4
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: fastapi>=0.110; extra == 'dev'
Requires-Dist: httpx>=0.27; extra == 'dev'
Requires-Dist: pyjwt[crypto]>=2.8; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.2; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: uvicorn>=0.29; extra == 'dev'
Provides-Extra: discovery
Requires-Dist: db-dtypes>=1.2; extra == 'discovery'
Requires-Dist: google-cloud-bigquery>=3.20; extra == 'discovery'
Requires-Dist: pyarrow>=16; extra == 'discovery'
Provides-Extra: llm
Requires-Dist: anthropic>=0.40; extra == 'llm'
Provides-Extra: serve
Requires-Dist: fastapi>=0.110; extra == 'serve'
Requires-Dist: httpx>=0.27; extra == 'serve'
Requires-Dist: pyjwt[crypto]>=2.8; extra == 'serve'
Requires-Dist: uvicorn>=0.29; extra == 'serve'
Description-Content-Type: text/markdown

<h1 align="center">ROOST</h1>

<p align="center"><b>Know which commit is going to break production — before you merge it.</b></p>

<p align="center">
  The deterministic <b>trust layer</b> for the age of AI-written code: a read-only, calibrated<br>
  <b>risk score</b> for every change, plus an honest <b>ledger</b> of what actually happened.<br>
  LLM-free at the core. It never touches production.
</p>

<p align="center">
  <a href="#quick-start"><img src="https://img.shields.io/badge/Quick_start-60s-black" alt="Quick start"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/License-PolyForm_Noncommercial-blue.svg" alt="License: PolyForm Noncommercial 1.0.0"></a>
  <a href="docs/model-card.md"><img src="https://img.shields.io/badge/verdict-PASS-success" alt="Pre-registered bar: PASS"></a>
  <img src="https://img.shields.io/badge/core-LLM--free-success" alt="LLM-free core">
  <img src="https://img.shields.io/badge/boundary-read--only-success" alt="Read-only">
  <img src="https://img.shields.io/badge/python-3.11–3.14-blue" alt="Python 3.11–3.14">
  <a href="https://discord.gg/p8FSZxrBsW"><img src="https://img.shields.io/badge/Discord-chat-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
</p>

---

## Try it on your own repo — one command

No clone, no signup, no CI, no API key. With [`uv`](https://docs.astral.sh/uv/):

```bash
uvx --from roost-ai roost score-repo https://github.com/<you>/<your-repo>
```

`uvx` pulls the `roost-ai` package into a throwaway env, mines your repo's recent
history, and scores its riskiest commits with the baked-in cold-start model —
**deterministic, LLM-free, and nothing leaves your machine.** Prefer a permanent
install? `pipx install roost-ai` (or `pip install roost-ai`), then
`roost score-repo <url>` — the CLI is always just `roost`.

---

## Contents

- [Try it on your own repo — one command](#try-it-on-your-own-repo--one-command)
- [What it looks like](#what-it-looks-like)
- [Why this exists](#why-this-exists)
- [Why you can trust the number](#why-you-can-trust-the-number)
- [Quick start](#quick-start)
- [Reproduce the pipeline on your own corpus](#reproduce-the-pipeline-on-your-own-corpus)
- [How it works](#how-it-works)
- [Risk tiers](#risk-tiers)
- [The Pellet ledger](#the-pellet-ledger)
- [Command reference](#command-reference)
- [The bigger picture](#the-bigger-picture)
- [Project layout](#project-layout)
- [Contributing](#contributing)
- [License](#license)

---

## What it looks like

Open a pull request. Before a human even reads the diff, ROOST posts a verdict:

```
┌─ Augur risk ───────────────────────────────────────────────┐
│  73%   tier: network        ⚠ high risk                     │
│                                                             │
│  Why this scored high:                                      │
│   • change is spread across 5 subsystems (high diffusion)   │
│   • touches files with 4 prior fix-inducing changes         │
│   • large, scattered diff — not a focused edit              │
│                                                             │
│  73% of changes that look like this needed a fix later.     │
└─────────────────────────────────────────────────────────────┘
```

No LLM wrote that. It's a **calibrated probability** from a trained model — reproducible from a fixed seed, byte-for-byte. When ROOST says *73%*, ~73% of changes like it really did get fixed later. That last sentence is the whole point: **the number means something.**
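
To make "the number means something" concrete, here is a minimal sketch of a reliability check: group predictions into score bins and compare the mean predicted probability with the observed fix-inducing rate in each bin. The function name and the toy data are hypothetical and only illustrate the idea; this is not ROOST's evaluation code.

```python
from collections import defaultdict


def reliability_table(scores, outcomes, n_bins=10):
    """Compare mean predicted probability with the observed rate per score bin."""
    bins = defaultdict(list)
    for score, outcome in zip(scores, outcomes):
        # Clamp so that a score of exactly 1.0 falls into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, outcome))
    rows = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_score = sum(s for s, _ in pairs) / len(pairs)
        observed = sum(o for _, o in pairs) / len(pairs)
        rows.append((index, len(pairs), mean_score, observed))
    return rows


# Toy data, made up for illustration: 1 means the change later needed a fix.
scores = [0.72, 0.73, 0.74, 0.73, 0.10, 0.12]
outcomes = [1, 1, 1, 0, 0, 0]
for index, count, mean_score, observed in reliability_table(scores, outcomes):
    print(f"bin {index}: n={count} predicted={mean_score:.2f} observed={observed:.2f}")
```

For a well-calibrated model the predicted and observed columns stay close in every bin that holds enough changes to be meaningful.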

---

## Why this exists

**AI now writes more code than any human can carefully review.** Agents open PRs in minutes; the diffs are bigger, more frequent, and land faster than a reviewer can keep up with. So review quietly degrades into skimming: you approve, you merge, you hope. The tests are green, so it's probably fine. Right?

Your CI tells you the tests *passed*. It does **not** tell you which of today's twenty green PRs — half of them machine-written — is the one that quietly induces an incident three weeks from now. That call gets left to gut feeling, reviewer fatigue, and "looks fine to me."

ROOST gives that judgment a backbone: it puts a **calibrated number** on each change so you can **spend your scarce review attention where the risk actually is.** Scrutinize the 4% it flags `network`/`destructive`; let the 65% it scores low ride through with a lighter touch. Review *smarter*, not by reading every line a robot wrote.

Risk tools that *do* exist tend to fail in one of two ways:

- **Uncalibrated rankers** — they sort changes "risky → safe" but a "0.8" doesn't mean 80% of anything. You can't set a threshold you trust.
- **LLM black boxes** — non-deterministic, unauditable, and they'll happily hallucinate a rationale for a number they made up.

ROOST is the opposite of both. The score is **calibrated** (probabilities you can act on), **deterministic** (same input → same output, forever), and the core is **LLM-free** (a test literally asserts it never imports an LLM SDK). Then it *remembers every prediction* and checks it against what really happened — so the score sharpens on your own history instead of staying a one-shot guess.

> **Read-only, always.** ROOST reads diffs and posts advisory verdicts. It never writes code, never merges, never blocks a build — unless *you* explicitly opt in with `--fail-at`. Any LLM is an optional, swappable explainer that can only *rephrase* the verdict, never change the score. It's off by default.

---

## Why you can trust the number

Most "AI for code" projects ask you to take their metrics on faith. We did the opposite — and this is the part we're proudest of.

**We wrote down the pass/fail bar *before* we saw any results**, committed it to git, and reported against it honestly. A clean FAIL would have been just as publishable as a PASS.

It passed — and these are the figures for the **shipped cold-start model**, trained and
evaluated on **~200 public OSS repositories spanning 8 languages** (Python, TypeScript,
JavaScript, Java, Go, Ruby, C#, Rust; specific sources withheld — see the [model card](docs/model-card.md)):

| What we measured | Result | Bar we set in advance |
|---|---|---|
| Top-20% riskiest changes vs. base rate | **2.9× more fix-inducing** | ≥ 2.0× |
| Beats a "just count lines changed" baseline (PR-AUC) | **+0.103** | ≥ 0.05 |
| Calibration error (Brier) | **0.099** | < 0.125 (base rate) |
| Ranking quality (ROC-AUC) | **0.839** | — |
| Generalizes to a repo it never trained on (leave-one-repo-out) | **2.8× mean** (across 204 held-out repos) | cold-start sanity |
| Holds up under noisy labels | **2.5×** | robustness |
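
For readers who want to check metrics like these on their own predictions, here is a rough sketch of two of them: lift in the top 20% of scores, and the Brier score next to the "always predict the base rate" reference. It ranks by change count only; the evaluation package also mentions effort-aware lift, which this sketch does not attempt. Function names and toy data are hypothetical.

```python
def top_fraction_lift(scores, labels, fraction=0.2):
    """Fix-inducing rate among the highest-scored changes, divided by the base rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def brier_score(scores, labels):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def base_rate_brier(labels):
    """Brier score of a model that predicts the base rate p for every change: p * (1 - p)."""
    p = sum(labels) / len(labels)
    return p * (1 - p)


# Toy data, made up for illustration: 1 means fix-inducing.
scores = [0.9, 0.8, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print("lift@20%:", top_fraction_lift(scores, labels))
print("brier:", brier_score(scores, labels), "vs base rate:", base_rate_brier(labels))
```

A Brier score below the base-rate reference means the scores carry more information than simply predicting the average for every change.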

The numbers are a touch lower than an earlier Python/JS-only build (lift 3.2×) — exactly
what you'd expect: a far more diverse, messier 8-language corpus is harder to predict, so
this is the more **honest, more general** number, not a worse one. The point is that the
signal *transfers across languages*, which is what a cold-start model needs when it lands on an unfamiliar repo in the wild.

And the caveats we **don't** hide: labels are an SZZ public-OSS proxy, not real production incidents; OSS ≠ your private code; the bespoke "blast-radius" feature honestly *didn't* earn its place, so we dropped it. We hold the same bar for new ideas: an experimental path-signal set (`slim_paths`) improves discrimination (PR-AUC +0.017) and calibration over the noise band, but its effort-aware lift gain stays within noise — a qualified result we report rather than dress up. The full warts-and-all findings log is in **[docs/DECISIONS.md](docs/DECISIONS.md)**; intended use and limits in the **[model card](docs/model-card.md)**.

**Calibration is a first-class output, not a footnote** — the score comes from isotonic-calibrated LightGBM on a strict *time-ordered* split (never shuffled — temporal leakage is a pre-registered failure mode, not a thing we discovered later).
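
As an illustration of what "time-ordered, never shuffled" means in practice, the sketch below sorts changes by commit time and cuts them into train, calibration, and test slices so every later slice is strictly in the future of the earlier ones. The field name `committed_at` and the 60/20/20 fractions are assumptions for this example, not ROOST's actual configuration.

```python
from datetime import datetime, timezone


def time_ordered_split(changes, train_fraction=0.6, calibration_fraction=0.2):
    """Split changes chronologically; nothing is shuffled, so the model never sees the future."""
    ordered = sorted(changes, key=lambda change: change["committed_at"])
    n = len(ordered)
    train_end = int(n * train_fraction)
    calibration_end = train_end + int(n * calibration_fraction)
    return ordered[:train_end], ordered[train_end:calibration_end], ordered[calibration_end:]


# Toy data, made up for illustration: ten changes in arbitrary input order.
changes = [
    {"sha": f"c{day}", "committed_at": datetime(2024, 1, 1 + day, tzinfo=timezone.utc)}
    for day in [3, 0, 7, 1, 9, 4, 2, 8, 6, 5]
]
train, calibration, test = time_ordered_split(changes)
print(len(train), len(calibration), len(test))
print("train ends:", train[-1]["committed_at"].date())
print("test starts:", test[0]["committed_at"].date())
```

Fitting the calibrator on the middle slice and reporting metrics on the last one keeps both steps free of information from later commits.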

---

## Quick start

Requires Python 3.11–3.14. No API key, no cloud, no LLM. The shipped cold-start model
is baked into the package, so scoring works out of the box.

**Just use it** — install once, score anything:

```bash
pipx install roost-ai               # or: pip install roost-ai  /  uvx --from roost-ai roost <cmd>

# Score an entire repo you've never seen (mines history, scores riskiest commits):
roost score-repo https://github.com/some/repo

# …or score one commit of a local checkout, e.g. in CI:
roost ci --commit HEAD --format md
```

That's it — nothing leaves your machine.

<details>
<summary><b>Drop it into GitHub Actions (advisory, ~10 lines)</b></summary>

```yaml
name: Augur risk
on: [pull_request]
jobs:
  risk:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }      # full history is required
      - uses: ninoxAI/roost@v1
        with:
          fail-at: ""                 # advisory by default; set e.g. 0.9 to block
```

`--format md` drops straight into a PR comment or `$GITHUB_STEP_SUMMARY`; `--format json` pipes to `jq`. GitLab CI, a self-contained Docker image (model baked in, nothing leaves your infra), and a read-only GitHub App are in **[docs/ci.md](docs/ci.md)**, **[docs/github-app.md](docs/github-app.md)**, and **[docs/deploy.md](docs/deploy.md)**.
</details>

---

## Reproduce the pipeline on your own corpus

Run these from a clone of this repo (the pipeline targets need the source tree, not just the package).
Every step is deterministic from your repo list + seed; the LLM is off the whole way. Point
`configs/repos.yaml` at the repos you want (see `configs/repos.example.yaml` for the format) —
the **shipped** cold-start model is trained on a larger, withheld multi-language corpus, but
the method below is exactly what produced it.

```bash
make setup       # uv: pinned py3.12 venv + locked deps
make init        # create the local Pellet ledger
make ingest      # mine the configured OSS repos → 'change'
make label       # SZZ fix-inducing labels → 'outcome'
make features    # 11 language-agnostic Kamei features → features.parquet
make train       # calibrated LightGBM, strict time-ordered split → 'prediction'
make eval        # full report + PASS/FAIL vs the pre-registered bar
make test        # the test suite, LLM disabled
```

Beyond the pipeline targets, the CLI exposes `roost robustness` (multi-seed / rolling-origin CV / ablations), `roost thresholds` (data-driven tiers), and `roost package` (shippable cold-start model + model card). `make ablation-paths` runs the research track: the experimental `slim_paths` set vs the shipping `slim` baseline. See [Command reference](#command-reference) for the full surface.

---

## How it works

```
mine → label (SZZ) → features (Kamei) → calibrate → score + tier → verdict
                                                         │
        every score is kept and later checked against the real outcome
                                                         ▼
        Pellet ledger:  change → prediction → action → outcome → recurrence
```

| Step | What happens |
|---|---|
| **mine** | Read-only PyDriller pass over a repo's git history → diff stats, sanitized messages, parents. |
| **label** | An SZZ-style blame trace marks each past change `clean` or `fix_inducing` — the training signal. |
| **features** | 11 language-agnostic Kamei change metrics: diffusion, size, purpose, history. No import graph, no leakage. |
| **calibrate** | LightGBM + isotonic `CalibratedClassifierCV` on a strict time-ordered split. |
| **score** | A calibrated probability + a risk **tier** on the `read_only → write → execute → network → destructive` scale. |
| **record** | The scored change + its prediction land in the **Pellet** ledger, ready to be graded later. |
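
To give a feel for the **features** step, here is a minimal sketch of a few change metrics in the style of the Kamei et al. just-in-time defect prediction features: diffusion (subsystems, directories, files, entropy), size (lines added and deleted), and purpose (whether the message looks like a fix). Which 11 features ROOST ships is not listed here, so the selection, the key names, and the keyword list below are assumptions.

```python
import math
from pathlib import PurePosixPath


def change_features(file_stats, message):
    """file_stats is a list of (path, lines_added, lines_deleted) tuples for one commit."""
    paths = [PurePosixPath(path) for path, _, _ in file_stats]
    # Subsystem = top-level directory; files at the repo root share the "" subsystem.
    subsystems = {p.parts[0] if len(p.parts) > 1 else "" for p in paths}
    directories = {str(p.parent) for p in paths}
    touched = [added + deleted for _, added, deleted in file_stats]
    total = sum(touched)
    # Shannon entropy of modified lines across files: higher means a more scattered change.
    entropy = 0.0
    if total > 0:
        for lines in touched:
            if lines > 0:
                share = lines / total
                entropy -= share * math.log2(share)
    return {
        "ns": len(subsystems),
        "nd": len(directories),
        "nf": len(file_stats),
        "entropy": entropy,
        "la": sum(added for _, added, _ in file_stats),
        "ld": sum(deleted for _, _, deleted in file_stats),
        "fix": any(word in message.lower() for word in ("fix", "bug", "defect")),
    }


# Toy commit, made up for illustration.
features = change_features(
    [("src/roost/a.py", 8, 2), ("docs/ci.md", 10, 0)],
    "Fix ledger migration",
)
print(features)
```

Everything here comes from diff statistics and the commit message, which is what keeps this style of feature language-agnostic.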

---

## Risk tiers

Each tier is a documented operating point you choose between — be conservative or aggressive on purpose, not by accident. The cut points below are the data-driven thresholds from the shipped model card; `roost thresholds` re-derives them for your own data, and `tier_thresholds` in [`configs/default.yaml`](configs/default.yaml) sets the advisory defaults.

| tier | score ≥ | precision | recall | share of changes |
|---|---|---|---|---|
| `write` | 0.086 | 0.31 | 0.99 | 65% |
| `execute` | 0.200 | 0.51 | 0.77 | 31% |
| `network` | 0.750 | 0.89 | 0.17 | 4% |
| `destructive` | 1.000 | 1.00 | 0.09 | 2% |
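
The table can be read as a simple lookup from score to tier. A minimal sketch, using the cut points above and assuming that anything below the `write` threshold maps to `read_only` (inferred from the tier scale in "How it works", not stated in the table); the constant and function names are hypothetical:

```python
# Cut points copied from the table above, highest tier first.
TIER_THRESHOLDS = [
    ("destructive", 1.000),
    ("network", 0.750),
    ("execute", 0.200),
    ("write", 0.086),
]


def tier_for(score):
    """Return the highest tier whose cut point the score reaches."""
    for tier, threshold in TIER_THRESHOLDS:
        if score >= threshold:
            return tier
    # Assumption: scores below every cut point fall into the lowest tier.
    return "read_only"


for score in (0.05, 0.10, 0.50, 0.80, 1.00):
    print(f"{score:.2f} -> {tier_for(score)}")
```

Your own installation may use different cut points, since `tier_thresholds` in the config sets the advisory defaults.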

---

## The Pellet ledger

A prediction nobody checks is a horoscope. **Pellet** is the local system-of-record that closes the loop: every score is stored and later compared to what actually happened, so you build a *verifiable track record* instead of a stream of unaccountable guesses.

```
change  →  prediction  →  action  →  outcome  →  recurrence
(what landed) (Augur's call) (who acted) (what really happened) (did it come back?)
```

- **Built to grow up.** `action` and `recurrence` already exist in the schema (empty for now), so wiring in real incident/rollback signals or autonomous-agent actions later needs **no migration** — the `outcome` label just upgrades from an OSS proxy to production truth.
- **No secrets, no PII.** `author_id` is a salted hash; raw names/emails never land in the ledger; commit messages are sanitized at ingest. Public data only.
- **Zero infra.** It's a local DuckDB file (`data/ledger.duckdb`) — columnar, regenerable, with content-hash keys that make every re-run byte-identical.
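
The two ledger properties above, content-hash keys and a salted `author_id`, can be sketched with the standard library alone. This is an illustration under assumptions: the choice of SHA-256, the length-prefixed field encoding, and the function names are hypothetical and not taken from the Pellet schema.

```python
import hashlib


def content_hash_id(*fields):
    """Derive a key purely from content, so re-ingesting the same data yields the same id."""
    digest = hashlib.sha256()
    for field in fields:
        encoded = str(field).encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


def salted_author_id(email, salt):
    """Hash a normalized email with a secret salt; the raw email is never stored."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(salt + normalized).hexdigest()


change_id = content_hash_id("https://example.org/repo.git", "abc123")
author_id = salted_author_id("Dev@Example.org ", b"local-secret-salt")
print(change_id)
print(author_id)
```

Because neither function depends on time, row order, or random state, running the ingest twice produces identical keys.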

---

## Command reference

The CLI is installed as `roost` (`uv run roost <cmd>`). Pipeline commands have matching `make` targets; the rest are run directly.

| Command | Make target | What it does |
|---|---|---|
| `roost init` | `make init` | Create + migrate the Pellet ledger. `--reset` recreates it. |
| `roost info` | `make info` | Show ledger row counts, seed, and explainer status. |
| `roost ingest` | `make ingest` | Mine the configured OSS repos into the `change` table. |
| `roost label` | `make label` | Write SZZ fix-inducing labels into `outcome`. |
| `roost features` | `make features` | Build the Kamei feature matrix → `features.parquet`. |
| `roost train` | `make train` | Train the calibrated LightGBM model → `prediction`. |
| `roost eval` | `make eval` | Honest eval + PASS/FAIL vs the pre-registered bar. |
| `roost robustness` | — | Multi-seed bands, rolling-origin CV, ablations, importance. |
| `roost thresholds` | — | Derive score→tier cut points from calibration-slice targets. |
| `roost package` | — | Build a shippable cold-start model bundle + model card. |
| `roost ci` | — | Score one commit of a **local** checkout for a CI pipeline. |
| `roost score-repo <url>` | — | Score a repo Augur has never seen with the cold-start model. |
| `roost comment` | `make comment` | Render the deterministic risk comment for a change. |
| `roost serve` | — | Local webhook simulator (needs the `serve` extra). |
| `roost version` | — | Print the version. |

`roost ci` is advisory by default (`--warn-at 0.6`); pass `--fail-at <threshold>` (e.g. `0.9`) to make high-scoring changes fail the job with a non-zero exit code. The optional LLM explainer is off everywhere unless you pass `--explainer` (or set `explainer.enabled` in config) and install the `llm` extra.
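
One way to picture the advisory-versus-blocking behaviour is as a small decision function. This is only a mental model: the `>=` comparisons, the status names, and the exit code `1` are assumptions for illustration, not a description of the CLI's internals.

```python
def ci_outcome(score, warn_at=0.6, fail_at=None):
    """Return (status, exit_code) for one scored change.

    Assumed semantics: without fail_at the result is always advisory
    (exit code 0); with fail_at, reaching it yields a non-zero exit code.
    """
    if fail_at is not None and score >= fail_at:
        return ("fail", 1)
    if score >= warn_at:
        return ("warn", 0)
    return ("ok", 0)


print(ci_outcome(0.95))               # advisory: warns but does not block
print(ci_outcome(0.95, fail_at=0.9))  # opted in: blocks
print(ci_outcome(0.30, fail_at=0.9))  # low score: passes quietly
```

The design point is that blocking is something you opt into explicitly; the default path only informs.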

---

## The bigger picture

ROOST is one corner of a deliberate design. Today's release builds and honestly evaluates the first two pieces; the rest are **designed-for, not built**.

| Module | Role | Status |
|---|---|---|
| **AUGUR** | **score** — calibrated risk over change features, before a change lands | here today |
| **PELLET** | **record** — the outcome ledger / system-of-record | here today |
| PARLIAMENT | grade — cross-vendor evaluation of other AI-ops agents | designed |
| TALON | gate — a permissioned write layer, *earned* only once Augur proves the bar on your own history | designed |

**The thesis:** autonomous agents are unreliable, so the layer that measures and bounds them must itself be deterministic and auditable. LLMs only ever show up as bounded, optional, swappable parts — never load-bearing decision logic.

---

## Project layout

```
src/roost/
  ledger/      Pellet schema, migrations, deterministic ids, DuckDB wrapper
  ingest/      repo mining (PyDriller)
  labeling/    SZZ fix-inducing labels
  features/    Kamei change features
  model/       calibrated LightGBM, feature sets, packaging, thresholds
  evaluation/  PR-AUC, calibration, effort-aware lift, leave-one-repo-out, robustness
  render/      deterministic risk comment
  explain/     optional LLM explainer (no-op default)
  serve/       cold-start scoring + local webhook simulator
  models/      shipped cold-start model bundle
configs/       default.yaml, repos.yaml
docs/          spec, decisions, model card, CI / deploy / GitHub-App guides
```

---

## Contributing

**ROOST is young and contributions move it forward fast.** Whether you fix a typo or add a whole language to the feature extractor, you're welcome here — see **[CONTRIBUTING.md](CONTRIBUTING.md)** for the full guide.

Good places to start:

- **Score a new language.** The feature extractor is intentionally language-agnostic — help us validate it on Go, Rust, TypeScript, Java.
- **Add a repo to the evaluation set.** More repos = a more honest, more general model. Mixed sizes/domains/languages especially.
- **Try a calibration method.** Beat isotonic on the reliability diagram without leaking time.
- **Wire up a real outcome source.** A connector that upgrades Pellet's `outcome` from the SZZ proxy to genuine incident/rollback signals.
- **Reproduce a result and tell us if it doesn't hold.** Honest negative findings are first-class here.

Every PR runs the test suite with the LLM **off** — the deterministic core must stay deterministic. The fastest way to get a change merged is a reproducible command and a test.

New contributors are welcome on **[Discord](https://discord.gg/p8FSZxrBsW)** — say hi, ask anything, or bring a repo you want scored.

## License

[PolyForm Noncommercial 1.0.0](LICENSE) — **free for any noncommercial use**: personal
projects, research, education, hobby OSS, evaluation. **Commercial use requires a
commercial license** from ninoxai — for your company's repos, your CI, or your product.
Get in touch and we'll sort it out quickly:

- 📧 [ferbegor@gmail.com](mailto:ferbegor@gmail.com)
- 💼 [LinkedIn — Egor Ferber](https://www.linkedin.com/in/egor-ferber-73932821a/)
- 🐙 [GitHub — @egorferb](https://github.com/egorferb)

Versions up to and including `v1.0.1` were released under Apache 2.0 and stay that way;
everything from this point forward ships under PolyForm Noncommercial.

<p align="center"><i>Predict honestly. Record everything. Touch nothing.</i></p>
