Metadata-Version: 2.4
Name: falsify-eval
Version: 0.2.0
Summary: Calibrated falsification harness for retrieval evaluation.
Author-email: Sparsh Sharma <sparshsharma219@gmail.com>
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for describing the origin of the Work and
              reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Support. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or support.
        
           END OF TERMS AND CONDITIONS
        
           Copyright 2026 Sparsh Sharma
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
Project-URL: Homepage, https://github.com/spalsh-spec/falsify-eval
Project-URL: Issues, https://github.com/spalsh-spec/falsify-eval/issues
Project-URL: Repository, https://github.com/spalsh-spec/falsify-eval
Project-URL: Preprint, https://arxiv.org/abs/<TBD-on-submission>
Keywords: retrieval,evaluation,falsification,reproducibility,rag,ir
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == "test"
Requires-Dist: hypothesis>=6.0; extra == "test"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: hypothesis>=6.0; extra == "dev"
Requires-Dist: mutmut>=2.4; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Dynamic: license-file

<div align="center">

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="assets/hero-dark.svg">
  <img alt="falsify-eval — four nulls, one gate, zero inflation" src="assets/hero-light.svg" width="100%">
</picture>

<br>

## Some search engines pretend to be smart.

They look like they understand your question.<br>
They actually just return whatever's most popular in their database.

A student named **Mira** would do the same on her French exam.<br>
She'd score 80% by always picking "C". She doesn't speak French.

**This is a 30-second test that catches them.**

### → Try it without installing anything

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/spalsh-spec/falsify-eval/blob/main/notebooks/quickstart.ipynb)
[![Play with sliders](https://img.shields.io/badge/play%20with%20sliders-▶-9c4a1a?style=for-the-badge)](https://spalsh-spec.github.io/falsify-eval/play.html)
[![Real-data case study](https://img.shields.io/badge/real%20data-CS01%20NFCorpus-3d7a4a?style=for-the-badge)](case_studies/cs01_nfcorpus/CS01_REPORT.md)

The **Colab** runs the actual library on a synthetic bench (60 seconds, no install).<br>
The **Playground** lets you pick a strategy with sliders and watch the gate verdict update live in your browser.<br>
The **Case study** shows the same gate working on a peer-reviewed BEIR benchmark.

### → Or install and run locally

```bash
pip install git+https://github.com/spalsh-spec/falsify-eval
```

Free. Open source. Runs on your laptop. Works on any search system.

Built for **search engines, recommendation systems, the retrieval side of RAG.**<br>
*Not* built for the part of ChatGPT that writes paragraphs — that's a different problem we haven't built a test for.

<br>

[![CI](https://github.com/spalsh-spec/falsify-eval/actions/workflows/ci.yml/badge.svg)](https://github.com/spalsh-spec/falsify-eval/actions/workflows/ci.yml)
[![Tests](https://img.shields.io/badge/tests-91%20passing-brightgreen)](tests/)
[![Release](https://img.shields.io/github/v/release/spalsh-spec/falsify-eval?color=blue&label=release)](https://github.com/spalsh-spec/falsify-eval/releases/latest)
[![Python ≥ 3.10](https://img.shields.io/badge/python-≥3.10-blue.svg)](https://www.python.org/)
[![Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

[**30-second demo**](#30-second-demo) · [**The Mira test**](#the-mira-test) · [**How it works**](#how-it-works) · [**Three surfaces**](#three-surfaces) · [**Preprint**](#preprint)

</div>

---

## The Mira test

Imagine a student named Mira who never studied. She noticed that on past exams, *"C"* is the most common correct answer. So she writes *C* every time and scores 80%. She looks smart on paper. She has zero actual knowledge — she gamed the pattern.

A retrieval or ranking system can do the same thing. If the most popular document in a corpus happens to be relevant for most queries, a system that always returns that popular document will score well on aggregate metrics — without using the query at all. (This is not a hypothetical: see the [CS01 NFCorpus case study](case_studies/cs01_nfcorpus/CS01_REPORT.md) where this exact predictor scores nDCG@10 = 0.066 on a published BEIR benchmark while ignoring every query.)

The published number looks great. It does not mean what you think it means.

**falsify-eval is a Mira-check for retrieval and ranking systems.** It compares your system's score against four "fake students": four null distributions, including one (*Null D*, the marginal-matched random) that is original to this work and catches predictors the standard nulls miss. If your system can't beat all four by a calibrated margin, the gate fails.
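
To make the Mira effect concrete, here is a minimal standard-library sketch. The ten-question answer key and the `accuracy` helper are made up for illustration, and the resampling loop is an assumed reading of "marginal-matched" (the key redrawn i.i.d. from its own label frequencies), not the library's implementation.

```python
import random
from collections import Counter

# Hypothetical answer key: "C" is correct on 8 of 10 questions.
gold = ["C", "C", "A", "C", "C", "B", "C", "C", "C", "C"]

def accuracy(predictions, answers):
    """Fraction of positions where the prediction equals the answer."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# Mira's strategy: always write the most frequent answer.
most_common_label = Counter(gold).most_common(1)[0][0]
mira = [most_common_label] * len(gold)
real_score = accuracy(mira, gold)

# Marginal-matched null (assumed form): redraw the answer key i.i.d. from the
# empirical label frequencies and score Mira against each redrawn key.
rng = random.Random(2026)
n_trials = 2000
null_scores = [
    accuracy(mira, rng.choices(gold, k=len(gold)))
    for _ in range(n_trials)
]
null_mean = sum(null_scores) / n_trials
delta_d = real_score - null_mean

print(f"real = {real_score:.2f}, null mean = {null_mean:.2f}, delta = {delta_d:+.3f}")
```

Mira scores 0.80 on the real key and roughly the same on keys that merely share its label frequencies, so her margin over this null is close to zero.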

→ **Case studies (real numbers, two public benchmarks):**
  - [CS01 — NFCorpus](case_studies/cs01_nfcorpus/CS01_REPORT.md) (323 BEIR queries, dense relevance ~38 docs/query)
  - [CS02 — SciFact](case_studies/cs02_scifact/CS02_REPORT.md) (300 BEIR queries, sparse relevance ~1.1 docs/query)

  Across both: Mira and popularity-only fail at Δ_D ≈ 0; BM25 and dense MiniLM pass at Δ_D = +0.14 to +0.73. Reproducible in 5 minutes each on an M1 laptop. Joint finding: graded metrics (nDCG) on dense-relevance benchmarks can flatten the gate, so pair them with single-gold strict metrics (recall@K against top-1).

---

## 30-second demo

```bash
pip install git+https://github.com/spalsh-spec/falsify-eval
python -c "from falsify_eval.demo import run; run()"
```

Three systems graded on a 50-query synthetic bench:

```
═══ constant_predictor (deliberately broken) ═══
  real mean nDCG@5         = 0.20
    Δ_A (gold-permuted)    = +0.000  ✗
    Δ_B (uniform random)   = +0.001  ✗
    Δ_C (random retrieval) = +0.18   ✓
    Δ_D (marginal-matched) = +0.000  ✗  ← the gate that catches Mira
  GATE: ✗ FAIL  (correctly rejected)

═══ mock_engine (plausible retrieval, 70% top-1) ═══
  real mean nDCG@5         = 0.62
    Δ across all 4 nulls   ≥ +0.40   ✓
  GATE: ✓ PASS  (correctly accepted)

═══ oracle (perfect top-1) ═══
  real mean nDCG@5         = 1.00
  GATE: ✓ PASS by maximum margin
```
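
The verdicts above follow from a simple rule, sketched here under two assumptions: each Δ is the real mean score minus the mean score under one null, and the threshold is τ = 0.05 (the demo's own setting is not shown in the output). The name `passes_gate` is illustrative, not part of the library.

```python
TAU = 0.05  # assumed threshold; the demo output does not state it

def passes_gate(deltas, tau=TAU):
    """Pass only if the margin over every null is at least tau."""
    return all(delta >= tau for delta in deltas.values())

# Margins copied from the constant_predictor block of the demo output.
constant_predictor = {"A": 0.000, "B": 0.001, "C": 0.18, "D": 0.000}

verdict = passes_gate(constant_predictor)
failing_nulls = [name for name, delta in constant_predictor.items() if delta < TAU]

print("PASS" if verdict else "FAIL", "failing nulls:", failing_nulls)
```

A single null below the threshold is enough to fail the gate, which is why the healthy-looking Δ_C of +0.18 does not rescue the constant predictor.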

---

## How it works

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {
    'fontFamily': 'Garamond, EB Garamond, Georgia, serif',
    'primaryColor': '#f3eee5',
    'primaryTextColor': '#1c1611',
    'primaryBorderColor': '#9c4a1a',
    'lineColor': '#9d8147',
    'tertiaryColor': '#faf6ed',
    'tertiaryBorderColor': '#d4c8b2',
    'edgeLabelBackground': '#f3eee5'
}}}%%
flowchart LR
    R([your retriever]) -->|top-K per query| S[real score]
    G([gold labels]) --> S
    G -->|permute π| A[Null A · label-permuted]
    G -->|iid uniform| B[Null B · uniform random]
    P([item pool]) -->|sample K| C[Null C · random retrieval]
    G -->|sample by class freq| D[Null D · marginal-matched ★]
    S --> Δ{Δ ≥ τ on<br/>all four?}
    A --> Δ
    B --> Δ
    C --> Δ
    D --> Δ
    Δ -->|yes| PASS([✓ PASS])
    Δ -->|no| FAIL([✗ FAIL])
    classDef ok    fill:#eef3e8,stroke:#3d7a4a,color:#1a3d22,stroke-width:1.5px;
    classDef fail  fill:#f7e9e3,stroke:#9c4a1a,color:#5a1c0c,stroke-width:1.5px;
    classDef novel fill:#fef9e7,stroke:#9d8147,color:#5a4720,stroke-width:2px;
    classDef gate  fill:#f3eee5,stroke:#1c1611,color:#1c1611,stroke-width:2px;
    class PASS ok
    class FAIL fail
    class D novel
    class Δ gate
```

| Null | How the null is built | Catches |
|---|---|---|
| **A — gold-permuted** | bijection π over class labels | systems that learned label distribution shape, not relevance |
| **B — uniform random** | iid uniform draw of gold per query | systems that exploit class-prior assumption |
| **C — random retrieval** | replace engine output with K random items from pool | systems that score by retrieval coverage, not ranking quality |
| **D — marginal-matched** ★ | iid draw of gold from the empirical class frequency | predictors matched to the gold marginal — *new in this work* |

**Null D is the load-bearing contribution.** It correctly rejects the constant-most-frequent predictor, which Nulls A and B can wrongly let through. *(Definition 1 of the [preprint](PREPRINT.md).)*
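
One way to read the table is as four small sampling routines. The sketch below is an interpretation of the one-line descriptions, not the library's code: the function names are hypothetical, labels are assumed to be sortable and hashable, and Null B is assumed to draw uniformly from the set of labels seen in the gold list.

```python
import random
from collections import Counter

def null_a(gold, rng):
    """Relabel gold through a random bijection over the label set."""
    labels = sorted(set(gold))
    shuffled = labels[:]
    rng.shuffle(shuffled)
    pi = dict(zip(labels, shuffled))
    return [pi[g] for g in gold]

def null_b(gold, rng):
    """Draw each query's gold label uniformly from the observed label set."""
    labels = sorted(set(gold))
    return [rng.choice(labels) for _ in gold]

def null_c(n_queries, item_pool, k, rng):
    """Replace each retrieved list with k distinct random items from the pool."""
    return [rng.sample(item_pool, k) for _ in range(n_queries)]

def null_d(gold, rng):
    """Draw each query's gold label i.i.d. from the empirical label frequencies."""
    counts = Counter(gold)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=len(gold))

rng = random.Random(0)
gold = ["x", "x", "x", "y", "z"]           # toy gold labels, skewed toward "x"
pool = [f"doc{i}" for i in range(20)]       # toy item pool

print(null_a(gold, rng))
print(null_b(gold, rng))
print(null_c(len(gold), pool, 3, rng)[0])
print(null_d(gold, rng))
```

Null A keeps the shape of the label distribution but moves it to different labels, while Null D keeps both the shape and the labels and only breaks the link between a query and its gold.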

---

## Three surfaces

```python
# 1. Library
from falsify_eval import four_null_gate

result = four_null_gate(
    retrieved_lists, gold_list, rel_list, my_metric,
    item_pool=corpus_ids, k=5, n_trials=50, tau=0.05,
    progress=True,                      # stderr per-stage timing
)
assert result["gate_passes"]
```

```bash
# 2. CLI on JSONL benches — no Python knowledge needed
falsify-eval grade --input bench.jsonl --metric ndcg@5 --pool corpus.txt
falsify-eval doctor                     # end-to-end install verification
falsify-eval quickstart ./demo          # writes a sample bench + pool
```

```jsonc
// 3. MCP server — Claude Code, Cursor, any MCP-compatible client
{
  "mcpServers": {
    "falsify-eval": {
      "command": "python",
      "args": ["-m", "falsify_eval.mcp_server"]
    }
  }
}
```

Claude can then call `grade_retrieval` directly on any retrieval pipeline output you give it — no glue code, no separate scoring service.

---

## What it catches

A non-exhaustive list of failure modes the gate flags:

| Broken predictor | Δ_A | Δ_B | Δ_C | Δ_D | Gate |
|---|:-:|:-:|:-:|:-:|:-:|
| Constant most-frequent class | ≈ 0 | ≈ 0 | + | **≈ 0** | ✗ |
| Marginal-matched random | ≈ 0 | + | + | **≈ 0** | ✗ |
| Popularity-only ranker (no query feature) | + | + | + | small | ✗ |
| Lexical-match-only on bag-of-words | + | + | + | + | ✓ |
| Full retriever (BM25 / dense / hybrid) | + | + | + | + | ✓ |
| Full retriever on **drifted** corpus | varies | varies | varies | varies | ✗ via `verify_state` |

The first three score well on bare aggregate metrics (nDCG, MRR, recall@K). The standard reporting practice publishes those numbers. The four-null gate rejects them.

---

## What the gate does NOT prove

A passing gate is *necessary* for credible reporting, not *sufficient*. It does not prove:

- the engine learned the actual relevance signal (only that it learned *something* beyond the four trivial null classes)
- the engine generalises beyond the evaluation set
- per-feature contribution claims are significant *(handled separately by `bootstrap_ci`, `paired_permutation_p`, `cohens_d_paired`)*
- the bench developer didn't overfit query phrasing to engine behaviour
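
For the per-feature significance point in the list above, the general technique is a paired test on per-query scores. The sketch below is a generic sign-flip permutation test written for illustration; `paired_sign_flip_p` is a hypothetical name and is not claimed to match the library's `paired_permutation_p`.

```python
import random

def paired_sign_flip_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for paired per-query scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to be +d or -d.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Toy per-query scores for two system variants (made-up numbers).
with_feature = [0.9] * 20
without_feature = [0.1] * 20

p_same = paired_sign_flip_p(with_feature, with_feature)
p_diff = paired_sign_flip_p(with_feature, without_feature)
print(p_same, p_diff)
```

Identical score lists give a p-value of 1.0, while a large and consistent per-query gap gives a very small one.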

The library is **calibrated for retrieval and ranking evaluation** — search, recommendation top-K, RAG retrieval-side, classification-as-retrieval. It is **not yet** generalised to LLM free-text generation, summarisation, or open-ended QA. Those domains need their own null distributions and are planned for v0.3+.

---

## Validating an LLM-RAG pipeline

```python
from falsify_eval import four_null_gate

# Replace this with whatever your retriever returns. The library doesn't
# care if it's BM25, FAISS, Pinecone, Weaviate, Vespa, or a homegrown
# bag-of-words. It grades the OUTPUT, not the engine.
def my_rag_retriever(query: str) -> list[str]:
    """Return top-K document IDs for a query."""
    ...

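# `queries`, `gold` (one gold ID per query) and `pool` (all candidate IDs)
# come from your own bench.
retrieved = [my_rag_retriever(q) for q in queries]

def recall_at_5(r, g, _rel):
    """1.0 if the gold ID is among the top 5 retrieved IDs, else 0.0."""
    return 1.0 if g in r[:5] else 0.0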

res = four_null_gate(
    retrieved, gold, [3]*len(gold), recall_at_5,
    item_pool=pool, k=5, n_trials=100, tau=0.05, seed=2026,
)
print("GATE:", "PASS" if res["gate_passes"] else "FAIL", res["deltas"])
```

A complete Claude-API worked example with a 50-query bench is in [`examples/llm_rag_validation.py`](examples/llm_rag_validation.py). To adapt it to GPT-4 / Llama / Mistral / Gemini: swap the API call inside `my_rag_retriever`. The gate is identical.
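
The `recall_at_5` metric above takes three positional arguments: the retrieved list, the gold ID, and a relevance grade. Assuming any callable with that shape can be passed as the metric, a rank-sensitive alternative might look like the sketch below. `reciprocal_rank_at_k` is a hypothetical helper written for this example, not a function shipped by the library.

```python
def reciprocal_rank_at_k(retrieved, gold, _rel, k=5):
    """1/rank of the gold ID within the top k retrieved IDs, else 0.0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id == gold:
            return 1.0 / rank
    return 0.0

# Toy bench: three queries with their retrieved lists and gold IDs.
retrieved_lists = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
gold_ids = ["d1", "d2", "d0"]

scores = [
    reciprocal_rank_at_k(r, g, 3)
    for r, g in zip(retrieved_lists, gold_ids)
]
print(scores, sum(scores) / len(scores))
```

Unlike recall@5, this metric rewards placing the gold document higher in the list, so it separates a retriever that ranks well from one that merely covers the answer.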

---

## Why is my run taking so long?

The gate calls your `metric_fn` exactly **N × (1 + 4 × n_trials)** times.

| Metric cost / call | N=500, n_trials=50 |
|---|---|
| In-memory check (~1 µs) | 0.1 s |
| Embedding lookup (~1 ms) | 1.7 min |
| LLM-judge call (~200 ms) | **~5.6 hours** |

If your run is taking hours, your *metric* is the bottleneck — not the gate (which finishes N=5,000 × pool=100k × n_trials=50 in under 2 seconds with a fast metric). Pass `progress=True` to see per-stage timing on stderr. Three options to speed up: (1) drop `n_trials` from 50 → 20 — statistically defensible; (2) cache `metric_fn` calls; (3) parallelise the four nulls with multiprocessing — pure CPU, no shared state.
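
The call-count formula and option (2) can be sketched as follows. Both helpers are hypothetical names written for this example. The cache wrapper assumes the metric is deterministic, that gold and relevance values are hashable, and that the gate re-evaluates some identical (retrieved, gold) pairs across trials; if it does not, caching buys nothing.

```python
from functools import lru_cache

def metric_calls(n_queries, n_trials, n_nulls=4):
    """Number of metric calls: N x (1 + 4 x n_trials) for the four nulls."""
    return n_queries * (1 + n_nulls * n_trials)

def cached_metric(metric_fn):
    """Memoise a (retrieved, gold, rel) metric by converting the list to a tuple."""
    @lru_cache(maxsize=None)
    def _cached(retrieved_key, gold, rel):
        return metric_fn(list(retrieved_key), gold, rel)

    def wrapper(retrieved, gold, rel):
        return _cached(tuple(retrieved), gold, rel)

    return wrapper

# Cost estimate for the LLM-judge row of the table (about 200 ms per call).
calls = metric_calls(500, 50)
estimated_hours = calls * 0.2 / 3600

# Count how often the underlying metric actually runs.
underlying_calls = {"n": 0}

def slow_metric(retrieved, gold, _rel):
    underlying_calls["n"] += 1
    return 1.0 if gold in retrieved[:5] else 0.0

fast_metric = cached_metric(slow_metric)
fast_metric(["d1", "d2"], "d1", 3)
fast_metric(["d1", "d2"], "d1", 3)   # served from the cache

print(calls, round(estimated_hours, 1), underlying_calls["n"])
```

Running the estimate before launching a long job is cheap, and the call counter shows whether the cache is actually being hit for your metric.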

---

## How this compares

| Capability | DVC | MLflow | W&B | Ragas | TruLens | **falsify-eval** |
|---|:-:|:-:|:-:|:-:|:-:|:-:|
| Vendor-free | ✓ | ✓ | ✗ | ✓ | partial | **✓** |
| Pure-text human-readable lock | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** |
| Couples artifact hash + verified score | ✗ | ✗ | partial | ✗ | partial | **✓** |
| Falsification gate (CI-enforceable) | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** |
| **Marginal-matched null** ★ | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** |
| Positive-control self-validation | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** |

The tools above solve different problems (versioning, tracking, observability). They complement falsify-eval; they don't replace it.

---

<details>
<summary><b>Where it actually runs</b></summary>

Pure Python ≥ 3.10 + numpy ≥ 1.24. No GPUs, no native extensions, no internet at runtime.

| Environment | One-liner |
|---|---|
| Local laptop | `pip install git+https://github.com/spalsh-spec/falsify-eval` |
| Google Colab | `!pip install git+https://github.com/spalsh-spec/falsify-eval` |
| Kaggle / Sagemaker / Databricks | same as Colab |
| GitHub Actions | add the `pip install` line to your `run:` block |
| Docker (any base image with Python ≥ 3.10) | `RUN pip install git+...` |
| AWS Lambda / Cloud Functions | bundle as a layer; the wheel is < 50 KB |
| Air-gapped / offline | clone the repo to a USB stick; install from local path |

The library is intentionally minimal so the audit surface is small and the deployment surface is large. No network calls, no telemetry, no opinions about your runtime.

</details>

<details>
<summary><b>What the gate proves (Proposition 1)</b></summary>

If the four-null gate passes (Δ ≥ τ on all four nulls) at `n_trials` = 50, τ = 0.05, then with Bonferroni-corrected confidence ≥ 0.95:

- The engine is **not** equivalent to a label-permutation-invariant ranker (rejected by *G_A*).
- The engine is **not** achieving its score solely via the uniform-class-prior assumption (rejected by *G_B*).
- The engine is **not** equivalent to a uniform-random retriever (rejected by *G_C*).
- The engine is **not** equivalent to a gold-marginal-matched predictor (rejected by *G_D — new in this work*).

The full proof is in [`PREPRINT.md`](PREPRINT.md), §3.

</details>

<details>
<summary><b>Why we built it</b></summary>

Most retrieval-system papers report a single aggregate metric (nDCG@k, MRR) and call it a contribution. Three failure modes make this practice unsafe at any benchmark size and dangerous on small ones:

1. **Null-distribution silence.** A learned ranker can absorb gold-label distribution shape without learning underlying query–document relevance. A constant predictor matched to the empirical class marginal can score non-trivially without using the query at all.
2. **Corpus drift between commits.** ALTER TABLE migrations and feedback-loop side effects mutate runtime artifacts without changing source code. A "score-neutral" annotation can be true about the source diff while false about the runnable system.
3. **Small-sample claims masquerading as significance.** A +0.02 metric gain on N < 50 queries usually sits inside the bench's noise floor.

The four-null gate addresses (1). The integrity-check state lock (`lock_state` / `verify_state`) addresses (2). The statistical-reporting helpers (`bootstrap_ci`, `paired_permutation_p`, `cohens_d_paired`, `power_n_required`) address (3). All in <1,000 lines of Python with `numpy` as the only runtime dependency.
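
The idea behind catching drift (2) can be illustrated with plain content hashing: record a digest of each runtime artifact next to the score it produced, then recompute the digests before trusting that score again. This is a generic sketch with hypothetical helper names. It does not reproduce the library's `lock_state` / `verify_state` or their lock format.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_lock(artifact_paths, score):
    """Record each artifact's digest next to the score it produced."""
    return {
        "score": score,
        "artifacts": {str(p): sha256_of(p) for p in artifact_paths},
    }

def drifted_artifacts(lock):
    """Return the paths whose current digest no longer matches the lock."""
    return [
        path for path, digest in lock["artifacts"].items()
        if sha256_of(path) != digest
    ]

with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.txt"
    corpus.write_text("doc-1\ndoc-2\n", encoding="utf-8")
    lock = make_lock([corpus], score=0.62)

    before = drifted_artifacts(lock)   # nothing has changed yet
    # Simulate a side effect that mutates the corpus without a source diff.
    corpus.write_text("doc-1\ndoc-2\ndoc-3\n", encoding="utf-8")
    after = [Path(p).name for p in drifted_artifacts(lock)]

print(before, after)
```

The source tree is identical before and after the mutation, yet the second check flags the corpus file, which is the gap a "score-neutral" annotation on the source diff cannot see.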

</details>

---

## Preprint

- [`PREPRINT.md`](PREPRINT.md) — *Calibrated Falsification Harnesses for Retrieval Evaluation* (v7, with N=10,000 validation, broken-predictor suite, sensitivity grid, soundness proposition).
- [`SUPPLEMENTARY.md`](SUPPLEMENTARY.md) — extended tables, ablations, bench-size calibration curve.

Submission to arXiv is pending. The DOI will be added to `CITATION.cff` on acceptance. In the interim, the markdown is the canonical source; both files are immutable for v0.1.0 (verifiable via `lock_state` against the v0.1.0 tag).

```bibtex
@article{sharma2026calibrated,
  title  = {Calibrated Falsification Harnesses for Retrieval Evaluation},
  author = {Sharma, Sparsh},
  year   = {2026},
  eprint = {<arxiv-id-when-published>},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR}
}
```

---

## Companion engine — Vāk-Kaṇaja (public release imminent)

**Vāk-Kaṇaja** is the Sanskrit / Pāṇinian retrieval engine built alongside falsify-eval. It is the first retriever (to my knowledge) adversarially verified by the four-null gate via cross-falsification, and the first to wire the **6 classical Pramāṇas** of Nyāya / Mīmāṃsā into a retrieval engine as a **router** — detecting the query's epistemological type (*Pratyakṣa, Anumāna, Upamāna, Arthāpatti, Anupalabdhi, Śabda*) and routing evidence channels accordingly.

It also implements an **Anupalabdhi (non-perception) confidence floor**: when the corpus does not contain the answer, the engine returns *"corpus does not contain this knowledge"* as a positive verdict, refusing to leak weak chunks. This pairs naturally with falsify-eval's Null A, and it addresses the silent-failure mode that load-bearing AI-safety arguments rely on assuming away.

The engine ships with a **calibrated negative result**: bench expansion N=21 → N=141 falsified the lift from the novel rerankers (Poincaré, topological persistence, fractal affinity), which now ship at production weight 0 and are documented as opt-in research components. The 3-channel φ-RRF baseline is the production default. This is the falsify-eval discipline applied to the authoring engine — same calibration that earned three clean rounds of adversarial review on this library.

Public release imminent at `github.com/spalsh-spec/vak-kanaja`, Apache 2.0, under the **Bhardwaj &amp; Sons** brand. *Priority announcement dated 2026-05-08.*

---

## Status

- **v0.1.6.11** — current. **91 tests** passing on a fresh clone (Mayank-battery 31 + property-based 15 + scipy cross-check 11 + smoke 8 + validation 9 + CLI stdin 4 + Windows-encoding 3 + shell-mangled paths 6 + sundry 4); ~10 s on M1. CI matrix green on Ubuntu × {3.10, 3.11, 3.12} and macOS × {3.10, 3.11, 3.12}.
- **v0.1.6.11** — publish-workflow version-sync guard hardened: previously tried to `import falsify_eval` before the package was installed and failed at the version-check step; now reads `__version__` and `pyproject.toml`'s `version` directly via grep/sed so the tag, source files, and built artefact are cross-checked three ways without requiring an install.
- **v0.1.6.10** — distribution + arXiv build prep (infrastructure-only, no gate behaviour change): added `.github/workflows/publish.yml` for OIDC trusted publishing to PyPI on every `v*` tag push; added `tools/build_arxiv.sh` for converting `PREPRINT.md` to an arXiv-submittable LaTeX bundle via pandoc; added `[tool.mutmut]` config + `docs/MUTATION_TESTING.md` documenting the deferred status; added `[project.optional-dependencies] dev` bucket pinning `mutmut`, `build`, and `twine`.
- **v0.1.6.9** — added CS03 case-study scaffold (`case_studies/cs03_aikosh_rag/`) for the AIKosh internal RAG integration (Jasmeet Singh, in flight); added Tested-platforms log to README; renumbered v0.2 case studies (CS03 = AIKosh, CS04 = FiQA, CS05 = Quora).
- **v0.1.6.8** — empirical equivariance certificate: PREPRINT §5.9 + property tests proving the gate is strongly equivariant under order-preserving label-set bijections and Null C / `real_mean` are exactly equivariant under arbitrary bijections.
- **v0.1.6.7** — declared `hypothesis>=6.0` as a test dep so CI installs it. (Caught by CI matrix the moment v0.1.6.6 landed.)
- **v0.1.6.6** — Hypothesis property-based test suite for the four-null gate: 13 universally-true properties (algebraic, deterministic, metric, gate-semantics, validation), each fuzzed against ~80 random benches per CI run.
- **v0.1.6.5** — cross-platform path-mangling hint: when `--input my-bench\bench.jsonl` is copy-pasted into zsh and the backslash gets eaten, the CLI now suggests the corrected forward-slash path instead of a bare `FileNotFoundError`.
- **v0.1.6.4** — Windows console UTF-8 / ASCII output hardening (closes Jasmeet's cp1252 `UnicodeEncodeError` on the `Δ` glyph): reconfigure stdout to UTF-8 with `errors='replace'` at CLI entry, with auto-fallback to ASCII glyphs (`Δ→d`, `τ→tau`, `✓→[ok]`) when the post-reconfigure stream still can't encode them. Also `--ascii` flag and `FALSIFY_ASCII=1` env var.
- **v0.1.6.3** — public priority announcement of companion engine **Vāk-Kaṇaja**.
- **v0.1.6.2** — Mayank round-3 polish: negative-seed validation in `_validate_inputs`.
- **v0.1.6.1** — Mayank round-2: CLI `--input -` now reads from stdin (was `FileNotFoundError: '-'`).
- **v0.1.6** — bonferroni helper, scipy cross-check tests, property-based tests, CS02 SciFact case study, PREPRINT scope-honesty rewrite, AI/retrieval conflation strike across surfaces.
- **v0.1.5.2** — added `progress=True` flag to `four_null_gate` after Mayank's 5-hour AIKosh silent-run incident.
- **v0.1.5.1** — closed `null_a` defect class for tuple / dataclass labels.
- **v0.1.5** — fixed all 14 defects from the Mayank Singh adversarial battery; full credit in [`CHANGELOG.md`](CHANGELOG.md).
- **v0.2 (next)** — PyPI publish; case studies CS03 (AIKosh internal RAG, scaffolded — see [`case_studies/cs03_aikosh_rag/`](case_studies/cs03_aikosh_rag/)), CS04 (FiQA) and CS05 (Quora) for metric-sensitivity triangulation; broken-predictor zoo as a public artifact; `label_order_seed` parameter to break dependency on adversarial label ordering (see PREPRINT §5.9).
- **v0.3+ (planned)** — extension to LLM free-text and summarisation; pre-registration tooling. *(Not yet shipped — do not claim coverage.)*

### Tested platforms

External-verification log. Each entry is a real run by a real person who is
not the package author, dated, with the exact version they ran. New entries
go at the top.

| Date | Tester | OS | Python | Shell | Version | Notes |
|---|---|---|---|---|---|---|
| 2026-05-08 | Jasmeet Singh (AIKosh) | Windows 10 (19045) | 3.14.3 | PowerShell | 0.1.6.7 | install / upgrade 0.1.6.2→0.1.6.7 / `doctor` / `quickstart` / `grade` all clean; original cp1252 defect closed. CS03 integration with AIKosh's internal RAG retriever in flight. |
| 2026-05-07 | Mayank Singh | macOS 14 (M1) | 3.12 | zsh | 0.1.5 → 0.1.6.2 | adversarial 14-defect battery; all closed. |

Issues and PRs welcome. The reference implementation is intentionally minimal; the goal is for the protocol to be small enough that adopters audit the entire library before depending on it.

---

<div align="center">

*A house of standards.*  Released by **[Bhardwaj &amp; Sons](https://bhardwajandsons.com)** under Apache 2.0.<br>
The methodology is free, public, and citable so it can become a standard rather than a product.

</div>
