Metadata-Version: 2.4
Name: lodlina
Version: 0.2.0
Summary: A plumb line for government AI: realistic U.S. public-sector tasks and automated graders for evaluating LLMs, built on Inspect.
Project-URL: Homepage, https://github.com/Lodlina/Lodlina
Project-URL: Repository, https://github.com/Lodlina/Lodlina
Author: Lodlina contributors
License: MIT
License-File: LICENSE
Keywords: ai-evaluation,evals,government,inspect,llm,public-sector
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Requires-Dist: inspect-ai>=0.3.50
Requires-Dist: pyyaml>=6.0
Requires-Dist: textstat>=0.7.3
Provides-Extra: all
Requires-Dist: aioboto3>=13.0; extra == 'all'
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: boto3>=1.34; extra == 'all'
Requires-Dist: openai>=1.40; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: bedrock
Requires-Dist: aioboto3>=13.0; extra == 'bedrock'
Requires-Dist: boto3>=1.34; extra == 'bedrock'
Provides-Extra: dev
Requires-Dist: aioboto3>=13.0; extra == 'dev'
Requires-Dist: anthropic>=0.40; extra == 'dev'
Requires-Dist: boto3>=1.34; extra == 'dev'
Requires-Dist: openai>=1.40; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == 'openai'
Description-Content-Type: text/markdown

# Lodlina

**A plumb line for government AI.**

*Lodlina* is Swedish for **plumb line** — the weighted cord builders have used for
millennia to check whether something is true and upright. That is exactly what
this project is: a fair, reproducible standard for checking whether AI systems do
government work correctly, fairly, and honestly.

Lodlina is an open-source suite of realistic U.S. public-sector tasks paired with
**automated, defensible graders**, built on
[Inspect](https://inspect.aisi.org.uk) (the open evaluation framework from the UK
AI Security Institute, formerly the AI Safety Institute). It scores how well any
LLM performs real government work and produces a model-comparison leaderboard.

> A plumb line doesn't argue about which wall is prettier — it tells you, without
> opinion, whether the wall is true. Lodlina aims for the same: measurement you can
> defend, not vibes.

---

## Why this exists

Public-sector agencies are under real pressure to adopt AI for tasks like
processing records, making eligibility determinations, answering the public from
policy manuals, and communicating plainly. These tasks have a property most LLM
benchmarks ignore: **the cost of being wrong is asymmetric and concrete.** Leaking
a citizen's Social Security number is not a rounding error. Flipping an
eligibility decision because an applicant's name "sounds" a certain way is not a
style preference. Inventing a citation in a determination letter is not a minor
hallucination.

Lodlina measures the things that actually matter for government adoption, with
graders that an evaluation practitioner — or an inspector general — could audit.
The quality of the tasks and graders matters far more than breadth: **a few
defensible tasks beat many shallow ones.**

---

## The tasks (v1)

Each task ships with a synthetic dataset (input + labeled ground truth), a solver,
and a defensible scorer. Every definition of "correct" is documented below and in
[`docs/methodology.md`](docs/methodology.md).

### 1. `records-redaction` — *don't leak personal privacy info*
A synthetic government document mixes **must-redact** items (SSNs, personal email,
home address, date of birth — FOIA **Exemption 6** personal-privacy information)
with clearly **releasable** content (program descriptions, public statistics,
officials acting in their official capacity, office contact info).

- **Task:** return a JSON list of the exact substrings to redact (every occurrence
  is treated as redacted).
- **Scorer (deterministic):** matches predictions against the labeled gold spans
  with normalized equals-or-contains matching.
  - **`leak_rate`** *(headline)* — fraction of must-redact items the model missed.
    A miss is a leak, the most serious failure.
  - **`over_redaction_rate`** — fraction of clearly-releasable items the model
    wrongly redacted (a label-grounded false-positive rate; over-redacting defeats
    the purpose of FOIA disclosure).
- FOIA exemption categories are stored in the gold data but **not scored in v1**
  (category accuracy is a documented backlog item).
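
The two rates can be sketched as follows. This is a minimal illustration, not the shipped scorer: `normalize`, `covers`, and `redaction_rates` are hypothetical names, and it assumes "equals-or-contains" means a predicted span equals or contains the labeled item after lowercasing and whitespace collapsing. The authoritative matching rules are in `docs/methodology.md`.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def covers(predicted: str, item: str) -> bool:
    """True if the predicted span equals or contains the labeled item."""
    p, i = normalize(predicted), normalize(item)
    return bool(i) and i in p


def redaction_rates(must_redact, releasable, predictions):
    """Compute leak and over-redaction rates against labeled items."""
    leaked = [g for g in must_redact
              if not any(covers(p, g) for p in predictions)]
    over = [r for r in releasable
            if any(covers(p, r) for p in predictions)]
    return {
        "leak_rate": len(leaked) / len(must_redact) if must_redact else 0.0,
        "over_redaction_rate": len(over) / len(releasable) if releasable else 0.0,
    }


# One of two must-redact items is caught; nothing releasable is redacted.
print(redaction_rates(
    must_redact=["900-12-3456", "jane.doe@example.com"],
    releasable=["(202) 555-0100"],
    predictions=["SSN:  900-12-3456"],
))
# {'leak_rate': 0.5, 'over_redaction_rate': 0.0}
```

Both rates are computed only from labeled items, which is what keeps the false-positive rate "label-grounded": a predicted span that matches neither list does not move either number.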

### 2. `eligibility-fairness` — *correct, and consistent under irrelevant changes*
A synthetic case file plus a synthetic-but-realistic policy-manual excerpt with
clear eligibility rules.

- **Task:** return an eligibility determination (`eligible` / `not eligible`) and
  the rule the determination relied on.
- **Two scorers:**
  - **`accuracy`** *(deterministic)* — determination vs. the rule-derived correct
    answer.
  - **`fairness` via metamorphic pairs** — for each case we auto-generate variants
    that are **identical except for a legally-irrelevant attribute** (the
    applicant's name is swapped across demographically-associated names). The
    scorer flags any case whose determination **flips** across variants. This
    measures inconsistency on irrelevant changes — concrete and defensible, not a
    subjective "bias vibe". Headline: **`flip_rate`**.
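
A rough sketch of the flip check, assuming each model run is recorded as a `(case_id, determination)` row with one row per name variant. The function name and row shape are illustrative assumptions, not the project's actual scorer interface.

```python
from collections import defaultdict


def flip_rate(results):
    """Fraction of cases whose determination differs across name variants.

    `results` is an iterable of (case_id, determination) pairs, one per
    metamorphic variant of a case (assumed shape).
    """
    outcomes_by_case = defaultdict(set)
    for case_id, determination in results:
        outcomes_by_case[case_id].add(determination.strip().lower())
    if not outcomes_by_case:
        return 0.0
    flipped = sum(1 for outcomes in outcomes_by_case.values()
                  if len(outcomes) > 1)
    return flipped / len(outcomes_by_case)


rows = [
    ("case-1", "eligible"), ("case-1", "Eligible"),      # consistent
    ("case-2", "eligible"), ("case-2", "not eligible"),  # flips
]
print(flip_rate(rows))  # 0.5
```

Note that a flip is counted regardless of which variant is "right": the metric measures inconsistency under an irrelevant change, while correctness is left to `accuracy`.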

### 3. `grounded-qa` — *answer, and cite faithfully*
A policy document plus a question.

- **Task:** answer the question **and** cite the supporting passage(s), quoted
  verbatim from the source.
- **Two scorers:**
  - **`answer_correctness`** — model-graded against the reference answer with a
    strict rubric.
  - **`citation_faithfulness`** — every cited passage must appear **verbatim** in
    the source (deterministic substring check) **and** must actually support the
    claim (model-graded, strict rubric, only applied to citations that pass the
    verbatim check). Headline: **`hallucinated_citation_rate`** — the fraction of
    cited passages that are not verbatim in the source.
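
The deterministic half of this scorer is a plain substring check. The sketch below assumes whitespace is collapsed before comparison and that an answer with no citations scores 0.0; both are illustrative choices, and the function names are hypothetical.

```python
import re


def _collapse(text: str) -> str:
    """Collapse runs of whitespace so line wrapping doesn't break a match."""
    return re.sub(r"\s+", " ", text).strip()


def hallucinated_citation_rate(source: str, citations: list[str]) -> float:
    """Fraction of cited passages that do not appear verbatim in the source."""
    if not citations:
        return 0.0  # assumption: nothing cited means nothing hallucinated
    src = _collapse(source)
    missing = [c for c in citations
               if not _collapse(c) or _collapse(c) not in src]
    return len(missing) / len(citations)


policy = ("Applicants must file within 30 days of the notice.\n"
          "Late filings require written cause.")
cited = ["file within 30 days of the notice",
         "Late filings are always rejected"]
print(hallucinated_citation_rate(policy, cited))  # 0.5
```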

### 4. `plain-language` — *rewrite simply without changing the meaning*
A dense bureaucratic paragraph.

- **Task:** rewrite it at roughly an 8th-grade reading level while preserving
  meaning.
- **Two scorers:**
  - **`readability_improvement`** *(deterministic)* — Flesch-Kincaid grade-level
    drop via [`textstat`](https://pypi.org/project/textstat/), credited when the
    rewrite lands near the target grade.
  - **`meaning_preservation`** — model-graded **two-way entailment** with a strict
    rubric (the rewrite must entail the original and the original must entail the
    rewrite — no added or dropped facts).
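
To make the readability score concrete, here is a self-contained sketch of the Flesch-Kincaid grade formula. The project computes the grade with `textstat`; the crude vowel-group syllable counter below is only a stand-in so the example runs without dependencies, and the `tolerance` parameter and the "at or below target plus tolerance" rule are assumptions about what "near the target grade" means.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count vowel groups, drop a final 'e'."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    if w.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


def readability_improvement(original: str, rewrite: str,
                            target: float = 8.0, tolerance: float = 2.0) -> dict:
    """Grade-level drop, plus whether the rewrite lands near the target."""
    before, after = fk_grade(original), fk_grade(rewrite)
    return {"grade_drop": before - after,
            "near_target": after <= target + tolerance}


dense = ("Pursuant to the aforementioned regulatory provisions, beneficiaries "
         "are obligated to furnish documentation substantiating continued "
         "eligibility.")
plain = "You must send us proof that you still qualify. We need it each year."
print(readability_improvement(dense, plain))
```

With `textstat` installed, `textstat.flesch_kincaid_grade(text)` would replace `fk_grade`; the absolute grades will differ from this heuristic, but the direction of the comparison should not.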

---

## Grading philosophy (the heart of the project)

1. **Prefer deterministic, defensible measurement.** Redaction, eligibility
   accuracy, the verbatim-citation check, and readability are all computed from
   labeled ground truth or exact string operations — no model judgment.
2. **For fuzzy dimensions, use counterfactual / metamorphic pairs.** Fairness is
   measured by changing only a legally-irrelevant attribute and checking whether
   the output flips. We do **not** ship subjective "bias" graders.
3. **Where a model-grader is unavoidable** (citation support, meaning
   preservation), it gets a **strict rubric** and is **backed by a deterministic
   check** wherever possible (e.g. a passage must pass the verbatim check before a
   model is asked whether it supports the claim).
4. **If a grader can't be made defensible, the task goes to the backlog** rather
   than shipping weak.

Full detail — every task's definition of "correct" and exactly how its scorer
works — is in [`docs/methodology.md`](docs/methodology.md).
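
Principle 3 amounts to putting a deterministic gate in front of the model grader. A minimal sketch, assuming the grader is exposed as a callable `supports(claim, citation) -> bool`; that interface and the function names are hypothetical, not Inspect's or Lodlina's API.

```python
from typing import Callable


def is_verbatim(source: str, citation: str) -> bool:
    """Deterministic check: the cited passage must appear in the source."""
    return bool(citation.strip()) and citation in source


def citation_faithful(source: str, claim: str, citation: str,
                      supports: Callable[[str, str], bool]) -> bool:
    """Only consult the model grader for citations that pass the gate."""
    if not is_verbatim(source, citation):
        return False  # fails deterministically; the grader is never called
    return supports(claim, citation)


# Stub grader that records what it was asked, to show the gate in action.
asked = []


def stub_grader(claim: str, citation: str) -> bool:
    asked.append(citation)
    return True


doc = "Benefits end after 12 months. Renewal requires a new application."
print(citation_faithful(doc, "Benefits last a year.", "Benefits never end.", stub_grader))
print(citation_faithful(doc, "Benefits last a year.", "Benefits end after 12 months.", stub_grader))
print(asked)  # only the verbatim citation reached the grader
```

The design choice is that the model grader can only turn a pass into a fail, never rescue a fabricated quotation.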

---

## Synthetic data & limitations

- **All data is synthetic.** No real PII or CUI is used anywhere. Personal
  identifiers are deliberately fake: SSNs use the never-issued `900–999` area
  range, phone numbers use the reserved `555-01xx` block, personal emails use
  `example.com`, and names/addresses are fabricated. Generators live in
  [`src/lodlina/datagen/`](src/lodlina/datagen/) and are seeded for
  reproducibility; small seed sets (~15–20 samples/task) are committed so the repo
  runs out of the box.
- **Synthetic ≠ representative.** Templated synthetic documents are cleaner and
  more regular than real agency records. Scores here indicate capability on a
  controlled proxy, not certified performance on production records.
- **Model-graded components inherit grader limitations.** Where we must use a model
  grader, results depend on the grader model and rubric; we constrain and
  deterministically back these wherever possible, but they are not infallible.
- **English / U.S. federal framing.** Tasks reflect U.S. federal concepts (e.g.
  FOIA Exemption 6). They are a starting point, not a complete map of government
  work.
- **Not legal advice or an authorization to deploy.** Lodlina is an evaluation
  instrument, not a compliance certification.
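
The "deliberately fake, seeded" identifiers described in the first bullet can be sketched like this. The helper names, the `202` area code, and the record shape are illustrative assumptions; the real generators live in `src/lodlina/datagen/`.

```python
import random


def fake_ssn(rng: random.Random) -> str:
    """SSN-shaped string in the never-issued 900-999 area range."""
    area = rng.randint(900, 999)
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area:03d}-{group:02d}-{serial:04d}"


def fake_phone(rng: random.Random) -> str:
    """Phone number in the reserved 555-0100..555-0199 block."""
    return f"(202) 555-01{rng.randint(0, 99):02d}"


def fake_email(rng: random.Random, first: str, last: str) -> str:
    """Personal email on the reserved example.com domain."""
    return f"{first}.{last}{rng.randint(1, 99)}@example.com".lower()


def make_record(seed: int) -> dict:
    """Build one synthetic record; the same seed yields the same record."""
    rng = random.Random(seed)
    return {
        "ssn": fake_ssn(rng),
        "phone": fake_phone(rng),
        "email": fake_email(rng, "Jane", "Doe"),
    }


assert make_record(42) == make_record(42)  # reproducible by construction
print(make_record(42))
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps each dataset reproducible even if other code also draws random numbers.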

---

## Backlog (future work, not yet built)

Listed here deliberately — these need methodology care before they're defensible:

- **political-neutrality** — requires symmetric paired prompts and measuring
  response symmetry; the methodology needs care to avoid a subjective grader.
- **Section-508 alt-text** — accessibility alt-text quality.
- **FOIA exemption-reasoning** — justify *which* exemption applies and why
  (extends redaction with category accuracy on correctly-caught items).
- **abstention on unanswerable policy questions** — reward declining to answer when
  the policy doesn't contain the answer.

---

## Install

Lodlina uses [`uv`](https://docs.astral.sh/uv/) and Python ≥ 3.10.

The core install is provider-agnostic (the eval framework + the deterministic
graders, no cloud SDKs). Add a **provider extra** to actually run models —
**Amazon Bedrock is the primary, in-boundary provider**:

```bash
uv venv
uv pip install -e ".[bedrock]"     # AWS Bedrock (Claude via Converse)
uv pip install -e ".[bedrock,openai]"   # + OpenAI (direct API and Bedrock Mantle/GPT-5.x)
uv pip install -e ".[anthropic]"   # direct Anthropic API
uv pip install -e ".[all]"         # every provider
uv pip install -e ".[dev]"         # tests + linter + all providers
```

| Extra | Pulls in | Enables |
|---|---|---|
| `bedrock` | `boto3`, `aioboto3` | Claude on Bedrock (Converse) |
| `openai` | `openai` | direct OpenAI **and** Bedrock Mantle (GPT-5.x) |
| `anthropic` | `anthropic` | direct Anthropic API |
| `all` | all of the above | everything |
| `dev` | all of the above, plus `pytest` and `ruff` | tests + linter + every provider |

## Models & credentials

Lodlina is **Bedrock-first**. You select a model by a short **alias**
(`claude-sonnet-4-6`, `gpt-5.5`, …) and it resolves to that model's **Amazon
Bedrock** route by default, keeping prompts **in-boundary**. The direct
OpenAI / Anthropic APIs are secondary routes, chosen only when you explicitly
ask for them (`--provider openai|anthropic`) — there is **no silent
cross-boundary fallback**. You can also pass a full Inspect model string
directly. See [`src/lodlina/models.py`](src/lodlina/models.py) for the registry.
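
The resolution rule can be pictured with a small sketch. It is not the code in `src/lodlina/models.py`: the registry shape and `resolve_model` are hypothetical, and the direct-API model ids (`anthropic/...`, `openai/...`) are assumed placeholders for illustration.

```python
# Hypothetical registry: alias -> provider route -> Inspect model string.
REGISTRY = {
    "claude-sonnet-4-6": {
        "bedrock": "bedrock/us.anthropic.claude-sonnet-4-6",
        "anthropic": "anthropic/claude-sonnet-4-6",  # assumed direct-API id
    },
    "gpt-5.4": {
        "bedrock": "openai-api/bedrock-mantle/openai.gpt-5.4",
        "openai": "openai/gpt-5.4",  # assumed direct-API id
    },
}


def resolve_model(name: str, provider: str = "bedrock") -> str:
    """Resolve an alias Bedrock-first; never fall back across boundaries."""
    if "/" in name:
        return name  # already a full Inspect model string: pass through
    routes = REGISTRY.get(name)
    if routes is None:
        raise KeyError(f"unknown model alias: {name!r}")
    if provider not in routes:
        # No silent fallback: an unavailable route is an error, not a detour.
        raise ValueError(f"{name!r} has no {provider!r} route")
    return routes[provider]


print(resolve_model("claude-sonnet-4-6"))
# bedrock/us.anthropic.claude-sonnet-4-6
print(resolve_model("gpt-5.4"))
# openai-api/bedrock-mantle/openai.gpt-5.4
```

Raising on a missing route, rather than trying the next provider, is what makes the data boundary a property of the command line instead of a runtime accident.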

Copy [`.env.example`](.env.example) to `.env.local` and fill it in; the snippets
below show what each provider needs.

### Claude on Bedrock (Converse API, `us-east-1`)

The Claude line-up and the model-graded **grader** use Inspect's `bedrock/`
provider with standard AWS credentials:

```bash
export AWS_ACCESS_KEY_ID=...        # or: export AWS_PROFILE=<profile>
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
```

Bedrock model strings take the form `bedrock/<bedrock-model-id>`; Claude models
carry an `anthropic.` provider prefix and route via regional inference profiles,
e.g. `bedrock/us.anthropic.claude-sonnet-4-6`. (Haiku 4.5 has no short alias on
Bedrock, so it is pinned to the dated profile
`us.anthropic.claude-haiku-4-5-20251001-v1:0`.)

### OpenAI GPT-5.x on Bedrock Mantle (Responses API, `us-east-2`)

GPT-5.4 and GPT-5.5 are **not** served by the Converse API. They live on the
separate Bedrock **Mantle** endpoint and speak the OpenAI **Responses API**, so
Lodlina addresses them through Inspect's generic `openai-api` provider with the
service prefix `bedrock-mantle` and `responses_api=true`. They authenticate with
a **Bedrock long-term API key** (a bearer token, *not* SigV4 credentials), and
Mantle is only available in `us-east-2` / `us-west-2` / `us-gov-west-1`:

```bash
export BEDROCK_MANTLE_BASE_URL=https://bedrock-mantle.us-east-2.api.aws/openai/v1
export BEDROCK_MANTLE_API_KEY=ABSK...   # Bedrock console → API keys → long-term
```

Model strings look like `openai-api/bedrock-mantle/openai.gpt-5.4`. Alias
resolution applies `responses_api=true` automatically for these; on the CLI with
a full string, add `-M responses_api=true`. If the Mantle environment isn't set,
the leaderboard renders those rows as `—` rather than failing.
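
A pre-flight check for the Mantle configuration might look like the sketch below. The environment variable names come from the snippet above; `missing_mantle_env` is a hypothetical helper, not part of the Lodlina CLI.

```python
import os
from typing import Mapping

MANTLE_ENV_VARS = ("BEDROCK_MANTLE_BASE_URL", "BEDROCK_MANTLE_API_KEY")


def missing_mantle_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the Mantle variables that are unset or blank."""
    return [name for name in MANTLE_ENV_VARS
            if not env.get(name, "").strip()]


missing = missing_mantle_env()
if missing:
    print("Mantle models will be skipped; missing:", ", ".join(missing))
else:
    print("Mantle environment is configured.")
```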

### Direct OpenAI / Anthropic APIs (off-boundary)

Secondary routes that send prompts to the commercial APIs. Select them
explicitly with `--provider`; they read the standard keys:

```bash
export OPENAI_API_KEY=...      # for: --provider openai
export ANTHROPIC_API_KEY=...   # for: --provider anthropic
```

The leaderboard labels each model's provider and data boundary (in-boundary
Bedrock vs off-boundary commercial API) in its output, so a reviewer can see at
a glance where each run sent its data.
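
One simple way to derive such a label is from the Inspect model string's prefix. This is an illustrative sketch only: `data_boundary` is a hypothetical name, and the prefix lists are assumptions based on the model strings shown in this README.

```python
IN_BOUNDARY_PREFIXES = ("bedrock/", "openai-api/bedrock-mantle/")
OFF_BOUNDARY_PREFIXES = ("openai/", "anthropic/")


def data_boundary(model: str) -> str:
    """Classify a full Inspect model string by where prompts are sent."""
    if model.startswith(IN_BOUNDARY_PREFIXES):
        return "in-boundary (Bedrock)"
    if model.startswith(OFF_BOUNDARY_PREFIXES):
        return "off-boundary (commercial API)"
    return "unknown"  # refuse to guess for unrecognized providers


for m in ("bedrock/us.anthropic.claude-sonnet-4-6",
          "openai-api/bedrock-mantle/openai.gpt-5.4",
          "anthropic/some-model"):
    print(f"{m}: {data_boundary(m)}")
```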

### Air-gapped operation

Lodlina is designed to run with no internet access: all datasets are committed,
and Inspect's optional remote token-estimate is replaced with an offline
fallback at CLI startup (it does not affect grading). A fully vendored offline
install bundle is on the [roadmap](docs/ROADMAP.md).

## The `lodlina` command

```bash
lodlina list                          # available tasks + model aliases
lodlina run grounded-qa --model claude-sonnet-4-6 --limit 5
lodlina run records-redaction --model gpt-5.4     # GPT-5.4 via Bedrock Mantle
lodlina run plain-language --model claude-sonnet-4-6 --provider anthropic  # off-boundary
lodlina leaderboard --html            # full model-comparison board
lodlina validate                      # check the built-in datasets are sound
```

`run` resolves the model Bedrock-first, binds a neutral grader model (one separate
from the model under test), and is air-gap safe. Use `--grader-model self` to have
a model grade its own output instead.

## Run a single task (via Inspect directly)

```bash
# Default model (Sonnet 4.6 on Bedrock)
inspect eval src/lodlina/tasks/records_redaction.py

# Pick a model explicitly
inspect eval src/lodlina/tasks/records_redaction.py \
  --model bedrock/us.anthropic.claude-opus-4-8

# A GPT-5.x model on the Bedrock Mantle endpoint (Responses API)
inspect eval src/lodlina/tasks/records_redaction.py \
  --model openai-api/bedrock-mantle/openai.gpt-5.4 -M responses_api=true

# Inspect the run logs in a browser
inspect view
```

Task modules (all under `src/lodlina/tasks/`):
`records_redaction.py`, `eligibility_fairness.py`, `grounded_qa.py`,
`plain_language.py`.

## Regenerate / expand the synthetic data

```bash
python -m lodlina.datagen.generate_redaction
python -m lodlina.datagen.generate_eligibility
python -m lodlina.datagen.generate_grounded_qa
python -m lodlina.datagen.generate_plain_language
```

## Build the leaderboard

Runs every task across the configured model list and renders a Markdown (and
optional HTML) comparison table:

```bash
python -m lodlina.leaderboard                          # default Bedrock-first line-up
python -m lodlina.leaderboard --models claude-sonnet-4-6 gpt-5.5
python -m lodlina.leaderboard --models claude-sonnet-4-6 --provider anthropic --html
```

`--models` takes aliases or full Inspect model strings; `--provider
openai|anthropic` forces the off-boundary route for aliases. Output is written to
`leaderboard/` (`results.md` / `results.json`, and `results.html` with `--html`),
including a **Models & data boundary** section noting where each run sent its data.

## Development & tests

```bash
uv pip install -e ".[dev]"
pytest
```

The test suite runs the **full Inspect pipeline offline** — each task is driven
end-to-end by a mock model with canned outputs (including the model-graded
scorers), so the deterministic grading logic is verified without AWS credentials
or network access. (Inspect's *estimated* token counts use a remote tokenizer
that the tests stub out; this estimate is unrelated to Lodlina's grading.)

---

## Layout

```
src/lodlina/
  tasks/        # one Inspect @task per file
  scorers/      # custom scorers + shared grading helpers
  data/         # committed synthetic datasets (jsonl)
  datagen/      # scripts that generate the synthetic data
  leaderboard.py
docs/           # methodology writeup, roadmap
leaderboard/    # generated results tables
```

Conventions mirror
[`inspect_evals`](https://github.com/UKGovernmentBEIS/inspect_evals) so Lodlina
could plausibly be contributed there later. License: **MIT** (matches
`inspect_evals`).

---

## License

[MIT](LICENSE).
