Lodlina — a plumb line for government AI

How well models perform U.S. public-sector work, scored by defensible automated graders. The per-task rates are auditable; the Lodlina Score rolls them up into a single figure out of 1,600. Suite 2026.1.

Ranking — Lodlina Score

Rank  Model              Lodlina Score ↑
1     claude-opus-4-8    1,364 / 1,600
2     claude-sonnet-4-6  1,343 / 1,600
3     gpt-5.5            1,312 / 1,600
4     nova-2-lite        1,252 / 1,600
5     deepseek-r1        1,177 / 1,600
6     claude-haiku-4-5   1,176 / 1,600
7     llama4-maverick    1,083 / 1,600
8     nemotron-super     1,055 / 1,600
9     nova-premier         982 / 1,600

By mission bucket

B1  Benefits & Eligibility
B2  Records, Disclosure & Privacy
B3  Citizen Services & Correspondence
B4  Authoritative-Source Q&A
B7  National Security Info Protection

Model              B1       B2       B3       B4       B7
claude-opus-4-8    451/500  363/400  150/300  200/200  200/200
claude-sonnet-4-6  500/500  349/400   94/300  200/200  200/200
gpt-5.5            500/500  387/400   25/300  200/200  200/200
nova-2-lite        401/500  349/400  102/300  200/200  200/200
deepseek-r1        376/500  350/400   94/300  200/200  158/200
claude-haiku-4-5   351/500  351/400   75/300  200/200  200/200
llama4-maverick    339/500  318/400   31/300  200/200  194/200
nemotron-super     282/500  340/400   33/300  200/200  199/200
nova-premier       144/500  321/400  117/300  200/200  200/200

Each cell is the difficulty-weighted Lodlina Score a model earned within that mission bucket, out of the maximum attainable for the bucket. Filter to the bucket that matches your mission.
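
The roll-up and the per-bucket filter can be sketched from the published numbers alone. The sketch below assumes the Lodlina Score is the plain sum of the five bucket scores; the published totals match that sum to within one point, and the small differences may be rounding, though the page does not say. The dictionary layout and the function names `rollup` and `rank_by_bucket` are illustrative and not part of the lodlina tool.

```python
# Bucket scores copied from the "By mission bucket" table, in column order.
BUCKETS = ("B1", "B2", "B3", "B4", "B7")

BUCKET_SCORES = {
    "claude-opus-4-8":   (451, 363, 150, 200, 200),
    "claude-sonnet-4-6": (500, 349, 94, 200, 200),
    "gpt-5.5":           (500, 387, 25, 200, 200),
    "nova-2-lite":       (401, 349, 102, 200, 200),
    "deepseek-r1":       (376, 350, 94, 200, 158),
    "claude-haiku-4-5":  (351, 351, 75, 200, 200),
    "llama4-maverick":   (339, 318, 31, 200, 194),
    "nemotron-super":    (282, 340, 33, 200, 199),
    "nova-premier":      (144, 321, 117, 200, 200),
}

# Totals copied from the "Ranking" table.
PUBLISHED = {
    "claude-opus-4-8": 1364, "claude-sonnet-4-6": 1343, "gpt-5.5": 1312,
    "nova-2-lite": 1252, "deepseek-r1": 1177, "claude-haiku-4-5": 1176,
    "llama4-maverick": 1083, "nemotron-super": 1055, "nova-premier": 982,
}


def rollup(model):
    """Sum of bucket scores (assumed roll-up rule, not confirmed by the source)."""
    return sum(BUCKET_SCORES[model])


def rank_by_bucket(bucket):
    """Models ordered best-first on one bucket; ties keep table order."""
    i = BUCKETS.index(bucket)
    return sorted(BUCKET_SCORES, key=lambda m: BUCKET_SCORES[m][i], reverse=True)


if __name__ == "__main__":
    for model in BUCKET_SCORES:
        print(model, rollup(model), PUBLISHED[model])
    print("B3 ranking:", rank_by_bucket("B3"))
```

Ranking on a single bucket can reorder the field considerably: on B3, for example, nova-premier moves from last overall to second.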

Per-task rates (drill-down)

B2  Records redaction (FOIA Exemption 6) (difficulty 2)
Model              leak_rate ↓  over_redaction ↓
claude-opus-4-8    0.13         0.00
claude-sonnet-4-6  0.17         0.00
gpt-5.5            0.04         0.03
nova-2-lite        0.17         0.00
deepseek-r1        0.17         0.00
claude-haiku-4-5   0.16         0.00
llama4-maverick    0.17         0.00
nemotron-super     0.22         0.01
nova-premier       0.17         0.10

B1  Eligibility fairness (metamorphic name-swap) (difficulty 1)
Model              accuracy ↑  flip_rate ↓
claude-opus-4-8    1.00        0.00
claude-sonnet-4-6  1.00        0.00
gpt-5.5            1.00        0.00
nova-2-lite        1.00        0.00
deepseek-r1        0.70        0.44
claude-haiku-4-5   1.00        0.00
llama4-maverick    1.00        0.00
nemotron-super     1.00        0.00
nova-premier       1.00        0.00

B4  Grounded QA (citation faithfulness) (difficulty 2)
Model              hallucinated_citation ↓  answer_correctness ↑  citation_support ↑
claude-opus-4-8    0.00                     1.00                  1.00
claude-sonnet-4-6  0.00                     1.00                  1.00
gpt-5.5            0.00                     1.00                  1.00
nova-2-lite        0.00                     1.00                  1.00
deepseek-r1        0.00                     1.00                  1.00
claude-haiku-4-5   0.00                     1.00                  1.00
llama4-maverick    0.00                     1.00                  1.00
nemotron-super     0.00                     1.00                  1.00
nova-premier       0.00                     1.00                  0.00

B3  Plain language (readability + meaning preservation) (difficulty 3)
Model              readability ↑  meaning_preserved ↑
claude-opus-4-8    21.14          0.67
claude-sonnet-4-6  19.28          0.75
gpt-5.5            13.29          1.00
nova-2-lite        20.44          0.58
deepseek-r1        18.92          0.75
claude-haiku-4-5   20.35          0.50
llama4-maverick    17.68          0.42
nemotron-super     17.31          0.67
nova-premier       20.17          0.67

B1  Benefits eligibility: Household Food Assistance (hard, computational) (difficulty 4)
Model              accuracy ↑  flip_rate ↓
claude-opus-4-8    0.97        0.10
claude-sonnet-4-6  1.00        0.00
gpt-5.5            1.00        0.00
nova-2-lite        0.81        0.07
deepseek-r1        0.96        0.12
claude-haiku-4-5   0.74        0.15
llama4-maverick    0.70        0.15
nemotron-super     0.73        0.38
nova-premier       0.49        0.78

B2  Records redaction, adversarial (over-redaction traps) (difficulty 2)
Model              leak_rate ↓  over_redaction ↓
claude-opus-4-8    0.05         0.00
claude-sonnet-4-6  0.08         0.00
gpt-5.5            0.00         0.00
nova-2-lite        0.08         0.00
deepseek-r1        0.08         0.00
claude-haiku-4-5   0.08         0.00
llama4-maverick    0.08         0.17
nemotron-super     0.07         0.00
nova-premier       0.08         0.07

B7  National security info protection: classification spillage (synthetic) (difficulty 2)
Model              leak_rate ↓  over_redaction ↓
claude-opus-4-8    0.00         0.00
claude-sonnet-4-6  0.00         0.00
gpt-5.5            0.00         0.00
nova-2-lite        0.00         0.00
deepseek-r1        0.21         0.00
claude-haiku-4-5   0.00         0.00
llama4-maverick    0.03         0.00
nemotron-super     0.01         0.00
nova-premier       0.00         0.00
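
The arrows in the column headers give each rate's direction: ↓ means lower is better, ↑ means higher is better. As a sketch of reading one drill-down table with that convention, the code below finds the best model or models per metric for the FOIA Exemption 6 redaction task. The values are copied from that table; the helper name `best_by_metric` and the "down"/"up" encoding are hypothetical and not part of the lodlina tool.

```python
# Rates copied from the "Records redaction (FOIA Exemption 6)" table.
RATES = {
    "claude-opus-4-8":   {"leak_rate": 0.13, "over_redaction": 0.00},
    "claude-sonnet-4-6": {"leak_rate": 0.17, "over_redaction": 0.00},
    "gpt-5.5":           {"leak_rate": 0.04, "over_redaction": 0.03},
    "nova-2-lite":       {"leak_rate": 0.17, "over_redaction": 0.00},
    "deepseek-r1":       {"leak_rate": 0.17, "over_redaction": 0.00},
    "claude-haiku-4-5":  {"leak_rate": 0.16, "over_redaction": 0.00},
    "llama4-maverick":   {"leak_rate": 0.17, "over_redaction": 0.00},
    "nemotron-super":    {"leak_rate": 0.22, "over_redaction": 0.01},
    "nova-premier":      {"leak_rate": 0.17, "over_redaction": 0.10},
}

# Direction per metric, following the header arrows ("down" = lower is better).
DIRECTION = {"leak_rate": "down", "over_redaction": "down"}


def best_by_metric(rates, direction):
    """Return {metric: [models tied for the best value on that metric]}."""
    result = {}
    for metric, way in direction.items():
        pick = min if way == "down" else max
        target = pick(row[metric] for row in rates.values())
        result[metric] = [m for m, row in rates.items() if row[metric] == target]
    return result


BEST = best_by_metric(RATES, DIRECTION)

if __name__ == "__main__":
    for metric, models in BEST.items():
        print(metric, "->", ", ".join(models))
```

On this task the two rates pull apart: gpt-5.5 has the lowest leak_rate but is not among the models with zero over_redaction, so a single "best" model depends on which error matters more for the mission.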

Provenance & reproducibility

Reproduce:

lodlina leaderboard --models claude-opus-4-8 claude-sonnet-4-6 claude-haiku-4-5 gpt-5.5 nova-premier llama4-maverick deepseek-r1 nemotron-super nova-2-lite --graders bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 openai-api/bedrock-mantle/openai.gpt-5.4 bedrock/us.amazon.nova-pro-v1:0 --html
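
For scripted reruns, the same invocation can be assembled from lists instead of one long string. This is a minimal sketch that only rebuilds the argument vector shown above; it assumes nothing about the lodlina CLI beyond that command, and the names `MODELS`, `GRADERS`, `build_command`, and `reproduce` are illustrative.

```python
import shlex
import subprocess

# Model and grader identifiers, in the order they appear in the command above.
MODELS = [
    "claude-opus-4-8", "claude-sonnet-4-6", "claude-haiku-4-5",
    "gpt-5.5", "nova-premier", "llama4-maverick",
    "deepseek-r1", "nemotron-super", "nova-2-lite",
]
GRADERS = [
    "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    "openai-api/bedrock-mantle/openai.gpt-5.4",
    "bedrock/us.amazon.nova-pro-v1:0",
]


def build_command():
    """Return the reproduce command as an argument list."""
    return ["lodlina", "leaderboard",
            "--models", *MODELS,
            "--graders", *GRADERS,
            "--html"]


def reproduce():
    """Run the command; assumes the lodlina CLI is installed and on PATH."""
    subprocess.run(build_command(), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(build_command()))
```

Passing a list to `subprocess.run` avoids shell quoting issues with the colons and slashes in the grader identifiers.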