Metadata-Version: 2.4
Name: frontierlag
Version: 1.0.0
Summary: Audit the capability gap between frontier AI models and the models tested in academic papers.
Author: David Gringras, Misha Salahshoor
License: MIT License
        
        Copyright (c) 2026 David Gringras and Misha Salahshoor
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://frontierlag.org
Project-URL: Pre-registration, https://osf.io/7xm3d/
Keywords: AI evaluation,bibliometric audit,LLM,frontier models,research methodology
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28
Requires-Dist: pyyaml>=6.0
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == "test"
Requires-Dist: pytest-cov>=4.0; extra == "test"
Provides-Extra: dev
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Dynamic: license-file

# frontierlag

**Audit the capability gap between frontier AI and the models tested in academic papers.**

Paste a DOI. Get a report: what model the paper tested, where that model sat relative to the frontier at the evaluation date, what configuration the paper disclosed, and whether it fails all three audit dimensions at the pre-registered thresholds from the companion study.

```
$ pip install frontierlag
$ frontierlag check 10.1038/s41591-024-03425-5
```

This package is the software companion to *Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation* (Gringras and Salahshoor, 2026). The audit dataset embedded here is the frozen snapshot used in that paper; updates ship as point releases.

---

## What it does

`frontierlag` classifies published AI-capability evaluations against the pre-registered audit dimensions from Gringras and Salahshoor (2026): three primary dimensions (the H5 compound-failure outcome), one secondary magnitude (the capability-elicitation shortfall), and one tertiary transparency vector (temporal, tier, elicitation). The three primary dimensions:

| Dimension | What it captures |
|---|---|
| **Capability failure** | `eci_gap ≥ 12 ECI`, anchored to the mean observed within-family major-generation jump on the frozen April-2026 Epoch snapshot. |
| **Elicitation failure** | OR-of-three: reasoning-mode undisclosed for a reasoning-capable model, OR tool-use undisclosed for a tool-capable model, OR scaffolding undisclosed where a scaffolded baseline existed at evaluation date. AND-of-three reported alongside as a strict-conjunction sensitivity. |
| **Interpretive failure** | AND-of-two (pre-registered primary): no human comparator AND `conclusion_framing = ai_generic`. OR-of-two reported alongside as the inclusive sensitivity. Admissibility filter: tasks with machine-verifiable references (oracle code tests, MATH, exact-match QA) have the comparator signal suppressed. |

A paper flagged on all three at the pre-registered thresholds is a **compound failure** (pre-reg §2.2 H5).
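
In boolean terms, the pre-registered primary logic looks like this (a minimal sketch with illustrative variable names and made-up values, not the package's internal API):

```python
# Sketch of the H5 compound-failure logic described in the table above.
# All names and values are illustrative, not frontierlag internals.
eci_gap = 18.0                  # ECI points behind the frontier at eval date
reasoning_undisclosed = True    # reasoning-capable model, mode not reported
tool_use_undisclosed = False
scaffolding_undisclosed = False
human_comparator = False
conclusion_framing = "ai_generic"

capability_failure = eci_gap >= 12                    # frozen-snapshot threshold
elicitation_failure = (reasoning_undisclosed          # OR-of-three primary
                       or tool_use_undisclosed
                       or scaffolding_undisclosed)
interpretive_failure = (not human_comparator          # AND-of-two primary
                        and conclusion_framing == "ai_generic")

compound_failure = capability_failure and elicitation_failure and interpretive_failure
print(compound_failure)  # True: flagged on all three dimensions
```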

The package also returns:

- **`capability_elicitation_shortfall`** — the secondary magnitude `eci_gap × (1 - config_elicitation_index)`, capturing the interaction between capability distance and configuration under-disclosure (a worked sketch follows this list).
- **Three-component vector** `(temporal_gap_months, tier_gap_count, elicitation_gap_fraction)` — readers do their own weighting.
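
The shortfall arithmetic is straightforward. A minimal worked example, assuming `eci_gap` is in ECI points and `config_elicitation_index` is the fraction of applicable configuration fields the paper disclosed (that fractional reading is our assumption):

```python
# Worked example of the secondary magnitude. The fractional semantics of
# config_elicitation_index are an assumption based on the description above.
eci_gap = 18.0                   # ECI points behind the frontier
config_elicitation_index = 0.25  # e.g. 1 of 4 applicable disclosures made

shortfall = eci_gap * (1 - config_elicitation_index)
print(shortfall)  # 13.5: a large gap compounded by under-disclosure
```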

The package does **not** estimate counterfactual capability; it does not claim "the paper's conclusion would have been X if they had used Y." The audit is descriptive, not normative: it documents structural lag; it does not rank authors or score papers as "bad research."

---

## Quick start

```python
import frontierlag as fl

# By DOI (hits the frozen corpus if the paper is in the audit; otherwise
# resolves publication date via CrossRef and leaves you to supply the model).
report = fl.check("10.1038/s41591-024-03425-5")
print(report.to_text())

# Override / supply fields for a paper not in the frozen corpus.
report = fl.check(
    "10.1000/your-doi",
    primary_model="GPT-4",
    evaluation_date="2024-06-01",
    configuration_disclosures={
        "model_version_exact": True,
        "access_date": True,
        "reasoning_mode": None,
        "tool_use": False,
    },
)

# Audit already-extracted metadata.
from frontierlag import audit, PaperMetadata
m = PaperMetadata(
    primary_model="GPT-3.5",
    publication_date="2025-07-01",
    evaluation_date="2025-05-01",
    configuration_disclosures={"reasoning_mode": False, "tool_use": False},
    human_comparator_present=False,
    conclusion_framing="ai_generic",
    task_admissibility="expected",
    domain="medicine",
)
report = audit(m)  # default: AND-of-two pre-registered primary
print(report.compound_failure)                 # pre-registered binary
print(report.capability_elicitation_shortfall) # secondary magnitude
print((report.temporal_gap_months, report.tier_gap_count, report.elicitation_gap_fraction))

# Provenance for false-positive diagnosis.
diag = audit(m, return_provenance=True).provenance
print(diag["classifications"]["compound_failure_prereg"])
print(diag["inputs"])

# Individual lookups.
fl.lookup_model("claude-3.5-sonnet")
fl.get_frontier_at_date("2025-06-01")
fl.list_known_models()
```

## CLI

```
frontierlag check <DOI>               audit a paper
frontierlag lookup <MODEL>            single-model metadata
frontierlag frontier <YYYY-MM-DD>     frontier at a date
frontierlag models                    list known canonical names
frontierlag info                      version + data-freeze date
```

Every command accepts `--json` for machine-readable output. `frontierlag check` accepts `--model`, `--eval-date`, and `--config-file` to override or supply fields a paper does not otherwise provide.
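
For example, auditing a paper outside the frozen corpus while supplying the missing fields:

```
$ frontierlag check 10.1000/your-doi --model "GPT-4" --eval-date 2024-06-01 --json
```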

---

## Data freeze

The embedded dataset is frozen at `FREEZE_DATE = 2026-04-01`. Every report prints this at the top so readers know how stale the comparison is. Updates ship as point releases.

| File | Source |
|---|---|
| `data/eci_scores.csv` | Epoch AI Capabilities Index snapshot (Epoch AI, 2026) |
| `data/monthly_frontier_trajectory.csv` | Derived from ECI + model release dates |
| `data/model_version_lookup.json` | Maintainer-curated, cross-checked against Epoch AI model tracker |
| `data/frozen_audit.json` | Audit-dataset DOI lookup index |
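
The files ship inside the installed package, so you can inspect the frozen snapshot directly. A sketch, assuming the package-internal layout mirrors the paths in the table above:

```python
# Peek at the first record of the frozen ECI snapshot.
# The data/ path inside the installed package is an assumption.
import csv
from importlib.resources import files

snapshot = files("frontierlag") / "data" / "eci_scores.csv"
with snapshot.open() as f:
    first = next(csv.DictReader(f))
print(first)  # one ECI record from the April-2026 freeze
```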

---

## Install

```
pip install frontierlag
```

Requires Python ≥ 3.9. Runtime dependencies are `requests` and `pyyaml`; no heavy scientific stack.

---

## Companion artefacts

- **Empirical audit paper** — *Frontier Lag* (Gringras and Salahshoor, 2026).
- **Reporting checklist** — VERSIO-AI v1.2.
- **Pre-registration** — Open Science Framework, `10.17605/OSF.IO/7XM3D`.
- **Live web tool** — `https://frontierlag.org`.

---

## Citation

```bibtex
@software{gringras2026frontierlag,
  author  = {Gringras, David and Salahshoor, Misha},
  title   = {frontierlag: A {Python} package for auditing the capability gap of published {AI} evaluations},
  year    = {2026},
  version = {1.0.0},
  url     = {https://frontierlag.org}
}
```

## License

MIT. See `LICENSE`.
