Metadata-Version: 2.4
Name: lodlina
Version: 0.2.2
Summary: A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.
Project-URL: Homepage, https://github.com/Lodlina/Lodlina
Project-URL: Repository, https://github.com/Lodlina/Lodlina
Author: Lodlina contributors
License: MIT
License-File: LICENSE
Keywords: ai-evaluation,evals,government,inspect,llm,public-sector
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Requires-Dist: inspect-ai>=0.3.50
Requires-Dist: pyyaml>=6.0
Requires-Dist: textstat>=0.7.3
Provides-Extra: all
Requires-Dist: aioboto3>=13.0; extra == 'all'
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: boto3>=1.34; extra == 'all'
Requires-Dist: openai>=1.40; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: bedrock
Requires-Dist: aioboto3>=13.0; extra == 'bedrock'
Requires-Dist: boto3>=1.34; extra == 'bedrock'
Provides-Extra: dev
Requires-Dist: aioboto3>=13.0; extra == 'dev'
Requires-Dist: anthropic>=0.40; extra == 'dev'
Requires-Dist: boto3>=1.34; extra == 'dev'
Requires-Dist: openai>=1.40; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == 'openai'
Description-Content-Type: text/markdown

# Lodlina

**A plumb line for government AI.**

*Lodlina* is Swedish for **plumb line** — the weighted cord builders have used for
millennia to check whether a wall is true. This project is the same idea for AI: a
fair, reproducible, **auditable** way to check whether an AI model does government
work **correctly, fairly, and honestly** — and to compare models on equal footing.

Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with
**defensible automated graders**, built on
[Inspect](https://inspect.aisi.org.uk) (the open evaluation framework from the UK
AI Safety Institute). You point it at a model, it runs the tasks, and it produces
scores and a model-comparison leaderboard you can put in front of leadership, an
inspector general, or an ATO board.

> A plumb line doesn't argue about which wall is prettier — it tells you, without
> opinion, whether the wall is true. Lodlina aims for the same: **measurement you
> can defend, not vibes.**

## Who this is for

- **Agency AI leads and program owners** choosing a model for records, eligibility,
  public Q&A, or plain-language work — and needing evidence, not a vendor demo.
- **Vendors and integrators selling AI to government** who want to show, on neutral
  ground, that their model is correct, fair, and grounded.
- **Inspectors general, auditors, and eval practitioners** who need numbers they can
  reproduce and defend — every score traces to a label or an exact string check.
- **Researchers** studying how LLMs behave on real public-sector tasks.

## What makes it different

- **Defensible graders.** Most scores are deterministic — computed from labeled
  ground truth or exact string operations, auditable by someone who isn't an ML
  expert. Where judgment is unavoidable, the grader uses a strict rubric **backed by
  a deterministic check**.
- **All data is synthetic.** No real PII or CUI anywhere — safe to run on any
  network. (SSNs use the never-issued `900–999` range, etc.)
- **In-boundary by default.** Lodlina is **Bedrock-first**: prompts stay in your AWS
  boundary unless you *explicitly* opt into a commercial API. Every result is labeled
  with where it sent its data.
- **Runs anywhere.** No telemetry, no phone-home; designed to work air-gapped.

---

## Quickstart (about a minute)

**Recommended — with [`uv`](https://docs.astral.sh/uv/).** uv brings its own
Python, so you don't have to install or manage one (macOS ships an old 3.9 that
won't work):

```bash
# Install uv if you don't have it: https://docs.astral.sh/uv/getting-started/installation/
uv tool install "lodlina[bedrock]"     # isolated install; fetches a modern Python
uv tool update-shell                   # puts `lodlina` on your PATH (then restart the terminal)
```

Point it at AWS Bedrock (Claude models enabled in `us-east-1`) and run:

```bash
export AWS_PROFILE=your-bedrock-profile   # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1

lodlina validate                                              # no credentials needed
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5   # one task (~a few cents)
lodlina leaderboard --models claude-sonnet-4-6 claude-haiku-4-5 --limit 5 --html
```

`lodlina leaderboard --html` writes a shareable table to
**`leaderboard/results.html`** in the current directory.

<details>
<summary><b>Prefer pip / a virtualenv?</b> (needs Python ≥ 3.10 yourself)</summary>

```bash
python3.12 -m venv .venv && source .venv/bin/activate   # MUST be 3.10+; macOS's default python3 is 3.9
pip install "lodlina[bedrock]"
```

If you see `ERROR: ... lodlina[bedrock] (from versions: none)` or `Ignored the
following versions that require a different python version`, your `python3` is too
old (< 3.10). Use `uv` above, or install a newer Python (`brew install python@3.12`).
</details>

**Notes**
- The bare `lodlina leaderboard` compares the full default line-up (Opus, Sonnet,
  Haiku — plus GPT-5.x if you've configured Bedrock Mantle, see below). Use
  `--models <alias> …` to pick exactly who to compare and keep cost predictable;
  unconfigured models show as `—` rather than failing.
- **No AWS yet?** `lodlina list` and `lodlina validate` work with no credentials.
  To run a model, add a provider and credentials — see
  [Models & credentials](#models--credentials).

## Start with your AI assistant (Claude Code, Codex, Cursor, …)

If you use an AI coding assistant with shell access, you don't have to read the
docs first. Open it in an empty folder and paste the contents of
**[`docs/ai-quickstart-prompt.md`](docs/ai-quickstart-prompt.md)**. The assistant
will install Lodlina, check your AWS Bedrock access, run a first evaluation, and
generate a leaderboard — explaining each step as it goes.

The prompt is **self-contained and transparent** — it's plain markdown, so read it
before you paste it; everything it does is right there. It only installs into a
local virtual environment and only spends money when *you* approve a model run.

---

## What it measures

Four tasks, each a real public-sector job where being wrong is concrete. Each ships
a synthetic dataset (input + labeled ground truth) and a defensible scorer. Full
definitions of "correct" are in [`docs/methodology.md`](docs/methodology.md).

### 1. `records-redaction` — *don't leak personal privacy info*
A synthetic government document mixes **must-redact** items (SSNs, personal email,
home address, date of birth — FOIA **Exemption 6** information) with clearly
**releasable** content. The model returns what it would redact.
- **`leak_rate`** *(headline, deterministic)* — fraction of must-redact items the
  model missed. A miss is a leak: the most serious failure.
- **`over_redaction_rate`** — releasable items wrongly redacted (over-redacting
  defeats the purpose of FOIA disclosure).

### 2. `eligibility-fairness` — *correct, and consistent under irrelevant changes*
A synthetic case file plus a policy-manual excerpt with clear rules.
- **`accuracy`** *(deterministic)* — determination vs. the rule-derived answer.
- **`flip_rate`** *(headline, metamorphic)* — each case is re-run with only the
  applicant's **name** changed (a legally-irrelevant attribute). Any case whose
  decision **flips** is flagged. This turns "is it biased?" into an objective "did
  the decision change when it must not have?" — no subjective bias grader.

### 3. `grounded-qa` — *answer, and cite faithfully*
A policy document plus a question; the model answers **and** cites supporting
passages verbatim.
- **`hallucinated_citation_rate`** *(headline, deterministic)* — fraction of cited
  passages **not found verbatim** in the source (a fabricated quote).
- **`answer_correctness`**, **`citation_support`** — model-graded, each backed by a
  deterministic check (a labeled reference answer; the verbatim gate).

### 4. `plain-language` — *rewrite simply without changing the meaning*
A dense bureaucratic paragraph.
- **`readability_improvement`** *(deterministic)* — Flesch-Kincaid grade-level drop.
- **`meaning_preservation`** — model-graded **two-way entailment** (no facts added
  or dropped), reported *alongside* readability so "simplify by deleting content"
  can't look like success.

## Why you can trust the numbers

1. **Deterministic first.** Wherever there's a ground truth, grading is computed from
   labels or exact string operations — no model judgment.
2. **Metamorphic for fairness.** Change only a legally-irrelevant attribute and check
   whether the output flips. No subjective "bias" grader ships.
3. **Constrained, backed model-grading only where unavoidable.** Citation support and
   meaning preservation use a strict rubric and a deterministic gate (e.g. a quote
   must pass the verbatim check before a model is asked whether it supports a claim).
4. **By default, no model grades itself.** On the leaderboard a single neutral grader
   scores every candidate, so model-graded columns are comparable.

Every task's definition of "correct" and exactly how its scorer works is in
[`docs/methodology.md`](docs/methodology.md) — written to be read by an inspector
general, not just an ML engineer.

---

## Models & credentials

Lodlina is **Bedrock-first**. You pick a model by a short **alias**
(`claude-sonnet-4-6`, `claude-opus-4-8`, `gpt-5.5`, …) and it resolves to that
model's **Amazon Bedrock** route by default, keeping prompts **in-boundary**. The
direct OpenAI / Anthropic APIs are secondary routes, used only when you ask for them
with `--provider` — there is **no silent cross-boundary fallback**.

`lodlina list` shows every alias and where it resolves. Pick the provider extras you
need at install time:

| Install | Enables |
|---|---|
| `pip install lodlina[bedrock]` | Claude on Bedrock (Converse) — the default path |
| `pip install lodlina[bedrock,openai]` | + OpenAI GPT-5.x (direct API **and** Bedrock Mantle) |
| `pip install lodlina[anthropic]` | direct Anthropic API |
| `pip install lodlina[all]` | every provider |

### Claude on Bedrock (default, in-boundary)

Standard AWS credentials with Bedrock access; the Claude line-up and the neutral
grader both run here:

```bash
export AWS_PROFILE=your-bedrock-profile   # or AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1
```

### OpenAI GPT-5.x on Bedrock Mantle (in-boundary)

GPT-5.4 / GPT-5.5 are served on Bedrock's separate **Mantle** endpoint (OpenAI
Responses API), authenticated with a **Bedrock long-term API key** and available in
`us-east-2` / `us-west-2` / `us-gov-west-1`:

```bash
export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK...   # Bedrock console → API keys → long-term
```

Aliases (`gpt-5.4`, `gpt-5.5`) route here automatically. If the Mantle environment
isn't set, those rows simply render as `—` instead of failing.

### Direct OpenAI / Anthropic (off-boundary)

Secondary routes that send prompts to the commercial APIs — selected explicitly with
`--provider openai|anthropic`, reading `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`. The
leaderboard always labels each model's provider and **data boundary**, so a reviewer
can see at a glance where every run sent its data.

### Claude on a Pro/Max subscription (no per-token billing)

For development and heavy iteration you can run the Claude models on your **Claude
Pro/Max subscription** instead of paying per token — Inspect's `anthropic` provider
supports OAuth bearer auth. Mint a long-lived token once with the Claude CLI, then:

```bash
claude setup-token                         # opens a browser; prints an OAuth token
export ANTHROPIC_AUTH_TOKEN=<that token>   # subscription auth (not an API key)
export LODLINA_GRADER_MODEL=anthropic/claude-sonnet-4-6   # keep the grader on the subscription too

lodlina run grounded-qa --model claude-sonnet-4-6 --provider anthropic
```

This routes both the candidate and the grader through your subscription, so a full
run needs **no AWS credentials and no per-token charges** (subscription rate limits
apply). Note this is the **direct Anthropic API** (off-boundary) — great for building;
for an in-boundary government leaderboard, use the Bedrock route. The deterministic
tasks (`records-redaction`, `eligibility-fairness`) have no model grader, so they run
on the subscription with nothing else configured.

Copy [`.env.example`](.env.example) to `.env.local` for a fill-in-the-blanks setup.

---

## The `lodlina` command

```bash
lodlina list                                   # eval packs + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4               # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic   # off-boundary
lodlina leaderboard --html                     # full model-comparison board
lodlina validate                               # check every eval pack is sound
lodlina new-pack my-pack --task-type records-redaction     # author your own eval set
```

`run` resolves the model Bedrock-first, binds a neutral grader, and is air-gap safe.

## Bring your own eval sets (packs)

An **eval pack** is a shareable evaluation: a `manifest.yaml` + a synthetic
`dataset.jsonl` that **reuses Lodlina's vetted graders** by referencing a curated
task type. Packs are **data + configuration only — no third-party code is executed**,
so every grader stays auditable.

```bash
lodlina new-pack ssn-heavy --task-type records-redaction   # scaffold a valid starter
# ...drop your synthetic records into ssn-heavy/dataset.jsonl...
lodlina validate --pack ./ssn-heavy
lodlina run --pack ./ssn-heavy --model claude-sonnet-4-6
```

Third parties can distribute packs as pip-installable packages (discovered via the
`lodlina_packs` entry point). See [`src/lodlina/packs/builtin/README.md`](src/lodlina/packs/builtin/README.md).

---

## Synthetic data & honest limitations

- **All data is synthetic.** No real PII/CUI. Identifiers are unmistakably fake
  (SSNs in the never-issued `900–999` range, `555-01xx` phones, `example.com`
  emails). Generators are seeded; committed seed sets (~12–18 samples/task) let it
  run out of the box.
- **Synthetic ≠ representative.** Templated documents are cleaner than real agency
  records; scores indicate capability on a controlled proxy, not certified production
  performance.
- **Model-graded components inherit grader limits.** They're constrained and
  deterministically backed where possible, but not infallible; read them as
  grader-relative. The deterministic headlines are the grader-independent signal.
- **U.S. federal framing, English.** A starting point, not a complete map of
  government work.
- **Not legal advice or an authorization to deploy.** Lodlina is an evaluation
  instrument, not a compliance certification.

## Roadmap & backlog

The plan of record (phases, design decisions, eval-pack format) is in
[`docs/ROADMAP.md`](docs/ROADMAP.md). Tasks still on the methodology bench:
political-neutrality (symmetric paired prompts), Section-508 alt-text, FOIA
exemption-reasoning, and abstention on unanswerable questions.

## Develop & contribute

```bash
git clone https://github.com/Lodlina/Lodlina && cd Lodlina
uv venv && uv pip install -e ".[dev]"
pytest          # 50+ tests, fully offline
```

The test suite runs the **full Inspect pipeline offline** — each task is driven
end-to-end by a mock model, so all grading logic is verified without credentials or
network. Conventions mirror
[`inspect_evals`](https://github.com/UKGovernmentBEIS/inspect_evals).

## License

[MIT](LICENSE).
