Metadata-Version: 2.4
Name: ir
Version: 0.1.17
Summary: An information-retrieval substrate for agentic systems
Project-URL: Homepage, https://github.com/i2mint/ir
Project-URL: Repository, https://github.com/i2mint/ir
Project-URL: Documentation, https://i2mint.github.io/ir
Author: Thor Whalen
License: MIT
License-File: LICENSE
Keywords: agents,embeddings,information-retrieval,rag,retrieval,semantic-search
Requires-Python: >=3.10
Requires-Dist: argh
Requires-Dist: dol
Requires-Dist: ef
Requires-Dist: numpy
Requires-Dist: vd
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: sphinx-rtd-theme>=1.0; extra == 'docs'
Requires-Dist: sphinx>=6.0; extra == 'docs'
Provides-Extra: llm
Requires-Dist: oa; extra == 'llm'
Description-Content-Type: text/markdown

# ir

**An information-retrieval substrate for agentic systems** — one uniform
"find the relevant things in this corpus" contract that scales from an ad-hoc
search over an ephemeral list to a maintained capability-discovery engine.

Give an agent *one* search tool, not fifty tool schemas. `ir` retrieves
candidates, **commits to a small high-precision subset** (the distractor problem
is the central selection risk — fewer, better candidates beat more), and
**discloses** each committed item's payload only when asked.

```python
import ir

# Define a corpus, build the index (incremental), then discover:
source = ir.CorpusSource.from_skills()       # or from_packages(), from_md_reports(), from_files(...)
corpus = ir.build(source)                     # embed + persist under XDG dirs
result = ir.discover(corpus, "how do I deploy the app to the server")

for item in result.results:
    print(item.score, item.name)              # the committed few (or result.abstained)
print(result.to_dict())                       # JSON-serializable (qh / HTTP ready)
```

## Install

```bash
pip install ir
```

`ir` is **light by default** — `numpy` + [`dol`](https://github.com/i2mint/dol)
for storage, plus [`ef`](https://github.com/thorwhalen/ef) /
[`vd`](https://github.com/i2mint/vd) for embedding and lexical/hybrid retrieval.
Python ≥ 3.10.

Notes for the default (semantic) path:

- The default embedder is `all-MiniLM-L6-v2` (384-dim, via
  `sentence-transformers`), **downloaded on first build** (needs network) and
  cached under `~/.cache/ir`. For a fast, offline, dependency-light run — tests,
  CI, quick experiments — pass `embedder="light"` (a numpy-only hashing
  embedder): `ir.build(source, embedder="light")`.
- `ir` sets `USE_TF=0` on import so `transformers` does not pull in TensorFlow
  (which crashes on some numpy ABIs); import `ir` before anything that imports
  `transformers`.
- Case generation (`ir.eval_gen`) and the optional LLM selector need an LLM via
  [`oa`](https://github.com/thorwhalen/oa) — install the extra,
  `pip install "ir[llm]"`. Scoring and evaluation themselves stay offline.
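To make the "hashing embedder" idea concrete, here is a minimal sketch of the general feature-hashing technique. It is not `ir`'s actual `"light"` embedder: the names `hash_embed` and `cosine`, the 64-dim default, and the use of SHA-256 are all assumptions for illustration, and the sketch uses plain Python lists rather than numpy to stay self-contained.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed-size, L2-normalized bag-of-tokens vector (hypothetical helper)."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # A stable digest (unlike built-in hash()) keeps vectors reproducible across runs.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        # A pseudo-random sign reduces the bias introduced by bucket collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are unit-length (or all-zero), so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))
```

No model download is involved, which is why an embedder of this kind suits tests and CI; the price is that it only captures token overlap, not meaning.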

## The pipeline

`ir` is a five-stage pipeline, each stage a small, swappable seam:

| Stage | Entry point | What it does |
|-------|-------------|--------------|
| **source** | `CorpusSource` | what is in the corpus + what counts as stale |
| **index** | `ir.build` | decompose artifacts into embeddable *surfaces*, embed, persist (incremental, idempotent) |
| **retrieve** | `ir.search` | hard metadata filter + `dense` / `lexical` / `hybrid` ranking |
| **select** | `ir.select` | commit to a distractor-robust subset, or abstain |
| **disclose** | `ir.disclose` | load the heavy payload (SKILL.md body, package pointer, file text) for committed items — append-only |

`ir.discover` chains retrieve → select → disclose into the single agent-callable
(and `qh`-exposable) tool. Pass a **list** of corpus names for single-shot
*federated* discovery across several corpora:

```python
ir.discover(["skills", "packages"], "deploy the app")    # fan-out → fuse → select
ir.discover(["skills", "packages"], "deploy the app", min_score="auto")  # gate each source on its own floor
```

Each source is searched and gated on its **own** calibrated abstention floor
*before* any merging; the survivors then rank-fuse (weighted RRF via
`ir.fuse_hits`) — raw scores never cross a source boundary, because scores from
different corpora / embedders / modes live on incommensurable scales. Every hit
carries its corpus name as `hit.source`, so same-id artifacts from different
corpora stay distinct, attributable results. The caller names the sources;
`ir` never chooses the set (source planning belongs to the agent layer —
see [`raglab`](https://github.com/thorwhalen/raglab)).
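The rank-fusion step described above can be sketched as follows. This is an illustration of weighted Reciprocal Rank Fusion in general, not the implementation behind `ir.fuse_hits`: the function name `rrf_fuse`, its input shape, and the constant `k=60` (a common convention) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(ranked_by_source, weights=None, k=60):
    """Fuse per-source rankings by weighted Reciprocal Rank Fusion (hypothetical helper).

    ranked_by_source: {source_name: [artifact_id, ...]}, best first, already gated.
    Returns [((source_name, artifact_id), fused_score), ...], best first.
    """
    weights = weights or {}
    fused = defaultdict(float)
    for source, ids in ranked_by_source.items():
        w = weights.get(source, 1.0)
        for rank, artifact_id in enumerate(ids, start=1):
            # Only the rank is used; raw scores never cross a source boundary.
            # Keying on (source, id) keeps same-id artifacts from different corpora distinct.
            fused[(source, artifact_id)] += w / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

For example, `rrf_fuse({"skills": ["deploy", "build"], "packages": ["deploy"]}, weights={"skills": 2.0})` returns three separate entries, with both `skills` hits ranked ahead of the `packages` hit because of the higher source weight.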

### Retrieve

```python
hits = ir.search(corpus, "deploy app", mode="hybrid")   # dense | lexical | hybrid (RRF)
```

Dense is exact brute-force cosine; `lexical` is Okapi BM25; `hybrid` fuses both
by Reciprocal Rank Fusion (the strongest default for short, identifier-heavy
capability text). Lexical/hybrid reuse [`vd`](https://github.com/i2mint/vd);
dense needs only numpy.

Hybrid has a second fusion, `fusion="blend"`: a magnitude-preserving score
blend instead of rank-RRF. RRF discards score magnitude, the very signal that
abstention calibration depends on, so `blend` separates in-scope from
out-of-scope queries far better (and even beats dense). The tradeoff is lower
lexical recall on terse corpora, so RRF stays the default. Use `blend` when
abstention matters; see [`ir_08`](misc/docs/ir_08%20--%20Magnitude-Preserving%20Hybrid%20Fusion%20--%20Trading%20rank-RRF%20for%20abstention%20separability.md).
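A toy comparison shows why rank-only fusion hides the abstention signal. The helpers `rrf_scores` and `blend_scores`, the `alpha` weight, the fixed `lexical_scale` normalizer, and all the scores below are hypothetical; the real `blend` may normalize and weight differently.

```python
def rrf_scores(dense, lexical, k=60):
    """Rank-only fusion: each list contributes 1 / (k + rank) per id."""
    fused = {}
    for scores in (dense, lexical):
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def blend_scores(dense, lexical, alpha=0.5, lexical_scale=10.0):
    """Magnitude-preserving fusion: a weighted sum of (rescaled) raw scores."""
    ids = set(dense) | set(lexical)
    return {
        i: alpha * dense.get(i, 0.0) + (1 - alpha) * lexical.get(i, 0.0) / lexical_scale
        for i in ids
    }


# Made-up (dense, lexical) scores: identical rankings, very different magnitudes.
in_scope = ({"a": 0.82, "b": 0.40}, {"a": 9.0, "b": 3.0})
out_of_scope = ({"a": 0.21, "b": 0.18}, {"a": 1.0, "b": 0.5})

rrf_in, rrf_out = rrf_scores(*in_scope), rrf_scores(*out_of_scope)
blend_in, blend_out = blend_scores(*in_scope), blend_scores(*out_of_scope)
```

Here `rrf_in` and `rrf_out` are identical, so no floor can tell the two queries apart, while the top `blend` score is much higher for the in-scope query.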

### Select

```python
sel = ir.select(hits)                      # conservative default: stay within rel of top, cap at max_k
sel = ir.select(hits, min_score=0.4)       # opt in to abstention ("nothing applies")
sel = ir.select(hits, strategy="score_gap")  # elbow cut, or "top_k" / "rel_threshold" / a callable
```

The abstention floor is mode-specific (dense cosine, BM25, and RRF live on
different scales), so rather than guess `min_score`, **calibrate** it from a case
file and let `discover` load it:

```python
from ir import eval as ev

cases = ev.load_cases("skills_eval.jsonl")                         # query + gold artifact_id(s)
ev.calibrate_min_score(corpus, cases, mode="dense", persist=True)  # learn + store the floor
ir.discover(corpus, query, mode="dense", min_score="auto")         # abstain by the calibrated floor
```

Calibration separates in-scope from out-of-scope query top-scores and picks the
floor that best splits them — see
[`ir_07`](misc/docs/ir_07%20--%20Min-Score%20Calibration%20--%20Abstention%20floors%20from%20score-distribution%20separability.md);
it works best on `dense` / `lexical` (hybrid's RRF scores barely separate).
`min_score` defaults to `None` (never abstain), so abstention stays fully opt-in.
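The idea behind calibration can be sketched in a few lines. This is one simple way to pick a separating floor, assuming a best-accuracy threshold over midpoints between observed scores; `calibrate_floor` is a hypothetical helper, and the criterion `ev.calibrate_min_score` actually uses is described in `ir_07` and may differ.

```python
def calibrate_floor(in_scope_tops, out_of_scope_tops):
    """Pick the threshold that best separates two sets of top-1 scores.

    Returns (floor, accuracy); queries scoring >= floor count as in-scope.
    Assumes at least one score is supplied.
    """
    scores = sorted(set(in_scope_tops) | set(out_of_scope_tops))
    # Candidate floors: midpoints between adjacent observed scores.
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])] or scores
    total = len(in_scope_tops) + len(out_of_scope_tops)

    def accuracy(floor):
        kept = sum(s >= floor for s in in_scope_tops)          # in-scope queries answered
        rejected = sum(s < floor for s in out_of_scope_tops)   # out-of-scope queries abstained
        return (kept + rejected) / total

    best = max(candidates, key=accuracy)
    return best, accuracy(best)
```

With made-up top scores `[0.71, 0.64, 0.58]` (in scope) and `[0.22, 0.31, 0.40]` (out of scope), the chosen floor falls in the gap between 0.40 and 0.58 and separates the two sets perfectly. When the two distributions overlap, as the text notes for RRF scores, no floor reaches high accuracy.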

The conservative defaults (`max_k=3`, `rel=0.9`) are tuned, not guessed — see
[`ir_06`](misc/docs/ir_06%20--%20Selector%20Tuning%20--%20Picking%20conservative-selector%20defaults%20from%20the%20data.md);
re-tune for your own corpus with `ev.sweep_selector` / `ir sweep-select`.

Selection is *relative* (ratios to the top score), so one selector works across
`dense` / `hybrid` / `lexical` whose absolute scales differ by orders of
magnitude. The result carries auditable `signals` and a `reason` — no opaque
"confidence" float. An optional LLM selector (`make_llm_selector`, lazy on
[`oa`](https://github.com/thorwhalen/oa), injectable for tests) falls back to the
heuristic on any failure.
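A minimal sketch of a relative, conservative selector makes the scale-invariance concrete. The function `select_conservative` and its return shape are hypothetical, not `ir.select` itself; the defaults `max_k=3` and `rel=0.9` are the ones quoted above, and scores are assumed positive.

```python
def select_conservative(hits, max_k=3, rel=0.9, min_score=None):
    """hits: [(name, score), ...] sorted best first, scores assumed positive.

    Returns (committed, reason).
    """
    if not hits:
        return [], "no candidates"
    top = hits[0][1]
    # Abstention is opt-in: it only triggers when a floor is supplied.
    if min_score is not None and top < min_score:
        return [], f"abstained: top score {top:.3f} < min_score {min_score}"
    # Keep only hits within a ratio `rel` of the top score, then cap at max_k.
    kept = [(name, score) for name, score in hits if score >= rel * top][:max_k]
    return kept, f"kept {len(kept)} of {len(hits)} within rel={rel} of top"
```

Because the cut is a ratio to the top score, multiplying every score by a constant (as when moving from cosine to BM25-sized values) leaves the committed set unchanged; only the absolute `min_score` floor is scale-dependent.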

### Disclose

```python
payloads = ir.disclose(sel, level="body")  # "metadata" (no I/O) | "body" | "bundled"
```

Disclosure is a *pure* read that follows the pointer already stored on each hit
(`skill_path` / `path`); it never mutates the ranked hits and tolerates a stale
pointer. Keeping the agent's context append-only (to protect the prompt cache)
is then the caller's discipline — `ir` hands back additive payloads.

## Evaluation

`ir.eval` scores discovery quality offline (reusing
[`ef`](https://github.com/thorwhalen/ef)'s retrieval metrics):

```python
from ir import eval as ev

cases = ev.load_cases("skills_eval.jsonl")               # query + gold artifact_id(s)
ev.evaluate_discovery(corpus, cases, mode="hybrid")      # recall@k / NDCG@k / MRR / MAP + failure taxonomy
ev.evaluate_selection(corpus, cases, strategy="conservative")  # conditional commit rate + selection P/R/F1
ev.sweep_selector(corpus, cases)                         # tune max_k × rel; .best() / .frontier() / .table()
ev.distractor_robustness_curve(source.scope, probes)     # accuracy vs catalog size
```

`evaluate_selection`'s headline is the **conditional commit rate**: the
selection decision *isolated* from retrieval (did the selector keep the gold,
*given* that retrieval surfaced it?). `sweep_selector` scores a whole
`max_k × rel` grid against the cases from **one** retrieval pass, so the
selector defaults can be read off the data (`.best()`) rather than guessed.
Generate cases by back-translation with `ir.eval_gen` (needs an LLM; scoring
stays offline).
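As a rough illustration of the metric, the conditional commit rate can be computed as below. The helper `conditional_commit_rate` and its tuple input are assumptions for the sketch (one gold id per case, for simplicity); `ev.evaluate_selection` has its own case format and reporting.

```python
def conditional_commit_rate(cases):
    """cases: iterable of (gold_id, retrieved_ids, committed_ids).

    Counts only cases where retrieval surfaced the gold item, then asks how
    often the selector kept it.
    """
    surfaced = [c for c in cases if c[0] in c[1]]
    if not surfaced:
        return None  # undefined: retrieval never surfaced the gold
    kept = sum(gold in committed for gold, _, committed in surfaced)
    return kept / len(surfaced)
```

A case where retrieval missed the gold item is excluded from the denominator, so a weak retriever cannot drag down the selector's score.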

## CLI

```bash
ir build skills                          # build/update a preset corpus
ir search skills "deploy the app"        # rank candidates (retrieval only)
ir discover skills "deploy the app"      # retrieve -> select
ir discover skills "deploy the app" --disclose       # + load bodies
ir discover skills "deploy the app" --min-score auto # + calibrated abstention
ir ls                                    # list corpora + record counts
ir info skills                           # config, stats, calibrated floors
ir register notes files --root ~/notes --pattern '.*\.md$'  # register a custom corpus
ir rm notes                              # unregister (keeps built data)
ir eval-gen skills skills_eval.jsonl     # generate eval cases (needs oa/LLM)
ir eval skills skills_eval.jsonl         # score retrieval on a case file
ir eval-select skills skills_eval.jsonl  # score the selection stage
ir sweep-select skills skills_eval.jsonl # tune the selector (max_k × rel) on your corpus
ir calibrate-min-score skills skills_eval.jsonl --persist  # calibrate the abstention floor
```

## Design

The design is grounded in a set of capability-discovery research reports and
eval-run findings under `misc/docs/` (`ir_01`–`ir_08`): the single-search-tool
pattern, indexing & embedding strategy, evaluation, the `ef` + `vd` reuse
analysis, the dense-vs-lexical-vs-hybrid eval, selector tuning, abstention-floor
calibration, and magnitude-preserving fusion. `ir` is light by default (numpy /
`dol`) and reuses the ecosystem (`ef`, `vd`, `oa`) only where it composes cleanly.
