Metadata-Version: 2.4
Name: omegaprompt
Version: 0.1.0
Summary: Calibration discipline for Claude API prompts - sensitivity-driven coordinate descent and walk-forward validation, ported from omega-lock.
Project-URL: Homepage, https://github.com/hibou04-ops/omegaprompt
Project-URL: Repository, https://github.com/hibou04-ops/omegaprompt
Project-URL: Parent framework, https://github.com/hibou04-ops/omega-lock
Project-URL: Methodology, https://github.com/hibou04-ops/Antemortem
Project-URL: Issues, https://github.com/hibou04-ops/omegaprompt/issues
Author: hibou04-ops
License: MIT License
        
        Copyright (c) 2026 hibou04-ops
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: anthropic,claude,llm-as-judge,overfitting,prompt-calibration,prompt-engineering,walk-forward
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.11
Requires-Dist: anthropic>=0.40.0
Requires-Dist: omega-lock>=0.1.4
Requires-Dist: pydantic>=2.6.0
Requires-Dist: typer>=0.12.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# omegaprompt

[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org)
[![PyPI](https://img.shields.io/badge/pypi-0.1.0-blue.svg)](https://pypi.org/project/omegaprompt/)
[![Status](https://img.shields.io/badge/status-alpha-orange.svg)](#status)
[![Parent framework](https://img.shields.io/badge/framework-omega--lock-blueviolet.svg)](https://github.com/hibou04-ops/omega-lock)

> **Calibration discipline for Claude API prompts.**
> A prompt that scores 5.0 on the training slice and collapses on held-out data is a textbook case of overfitting. omegaprompt ports omega-lock's sensitivity-driven coordinate descent and walk-forward validation to the prompt-engineering setting — so prompts ship only after they generalize, with hard gates that zero out fitness on refusal or malformed output.

```bash
pip install omegaprompt
```

한국어 README: [README_KR.md](README_KR.md)

---

## Table of Contents

- [Why this exists](#why-this-exists)
- [30-second demo](#30-second-demo)
- [The calibratable axes](#the-calibratable-axes)
- [Architecture](#architecture)
- [Design decisions worth defending](#design-decisions-worth-defending)
- [Cost & performance](#cost--performance)
- [Validation](#validation)
- [The 3-layer stack](#the-3-layer-stack)
- [Relation to adjacent tools](#relation-to-adjacent-tools)
- [Status](#status)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [Citing](#citing)
- [License](#license)
- [Colophon](#colophon)

---

## Why this exists

Prompt engineering has a known failure mode: you iterate on a handful of hand-picked examples, the prompt starts looking great, you ship it, and it collapses on the first input outside your head. The failure mode is not new — it is overfitting, the same thing machine learning practitioners have been fighting since the 1990s — but the prompt-engineering toolchain has mostly reinvented evaluation without reinventing the defense.

omega-lock, a sibling project, solved this for numeric parameter calibration. The insight is simple and transfers:

1. **Measure sensitivity.** Which parameters actually matter? Perturb each one around a neutral baseline, rank by Gini coefficient of the fitness delta.
2. **Unlock top-K, lock the rest.** Search only in the subspace that moves the fitness. The rest stay at neutral.
3. **Walk forward.** After the search finishes on the training slice, re-evaluate on a held-out test slice the searcher never saw. Require a Pearson correlation above a pre-declared threshold (KC-4) before the result ships. No tuning the threshold after the fact.
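
Step 1 in code, under one plausible reading of "rank by Gini coefficient of the fitness delta" — the `gini` helper and the per-axis deltas below are illustrative, not omega-lock's actual implementation:

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    # Gini coefficient of absolute fitness deltas. A sketch of the
    # ranking idea only -- omega-lock's exact stress metric may differ.
    x = np.sort(np.abs(x))
    n, total = len(x), x.sum()
    if total == 0:
        return 0.0
    i = np.arange(1, n + 1)
    return float(((2 * i - n - 1) * x).sum() / (n * total))

# Hypothetical fitness deltas from perturbing each axis around neutral:
deltas = {
    "system_prompt_idx": np.array([0.12, -0.30, 0.25]),
    "few_shot_count":    np.array([0.02, 0.01, -0.03]),
    "thinking_enabled":  np.array([0.18, -0.22, 0.15]),
}
ranked = sorted(deltas, key=lambda axis: gini(deltas[axis]), reverse=True)
# Unlock the top-K of `ranked`; lock the rest at neutral.
```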

This repository is the port. `PromptTarget` implements omega-lock's `CalibrableTarget` protocol, so every omega-lock pipeline (`run_p1`, `run_p1_iterative`, `run_p2_tpe`, `run_benchmark`) works on prompt calibration unchanged. The prompt-specific pieces — the LLM-as-judge scorer, the composite `hard_gate × soft_score` fitness, the five calibratable axes — sit on top.

Two guardrails distinguish this from "ask the judge and pick the highest score":

1. **Hard gates collapse fitness to zero.** `no_refusal`, `format_valid`, `no_safety_violation` — if any gate fails on an item, that item contributes zero. A prompt that scores 5.0 on ten tasks but refuses on the eleventh does not rank above one that consistently scores 4.2 with no refusals. (See the sketch after this list.)
2. **Walk-forward is the ship gate.** omega-lock's KC-4 (Pearson correlation between train and test rankings) is pre-declared and enforced. If the training-best prompt does not rank highly on the test slice, the calibration fails with `status = "FAIL:KC-4"` and no candidate ships. You cannot retroactively lower the threshold.
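
The fitness shape behind the first guardrail is small enough to sketch. This is an illustrative reduction, not omegaprompt's `CompositeFitness` verbatim — the field names are assumed:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    # Assumed shape for illustration: gate booleans + a normalized soft score.
    no_refusal: bool
    format_valid: bool
    no_safety_violation: bool
    soft_score: float  # judge's weighted rubric score in [0, 1]

def item_fitness(v: JudgeVerdict) -> float:
    # hard_gate x soft_score: one failed gate zeroes the item outright --
    # no partial credit, no gradient back toward the refusal boundary.
    hard_gate = v.no_refusal and v.format_valid and v.no_safety_violation
    return v.soft_score if hard_gate else 0.0

def composite_fitness(verdicts: list[JudgeVerdict]) -> float:
    return sum(item_fitness(v) for v in verdicts) / len(verdicts) if verdicts else 0.0
```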

---

## 30-second demo

```bash
$ omegaprompt calibrate examples/sample_dataset.jsonl \
    --rubric examples/rubric_example.json \
    --variants examples/variants_example.json \
    --test examples/sample_test.jsonl \
    --target-model claude-haiku-4-5 \
    --judge-model claude-sonnet-4-6 \
    --method p1 \
    --unlock-k 3 \
    --output outcome.json

Loading dataset from examples/sample_dataset.jsonl ...
  5 items
Loading test set from examples/sample_test.jsonl ...
  5 items
Starting omega-lock run_p1 calibration (unlock_k=3, method=p1) ...
This issues Claude API calls. Budget accordingly.
Calibration complete. best_fitness=0.8240, test_fitness=0.7820, gen_gap=5.10%
Artifact: outcome.json
```

The `outcome.json` artifact contains the winning parameter set, the per-slice fitness, the generalization gap, the hard-gate pass rate, and aggregate token usage. It is machine-readable by design — `omegaprompt report` (v0.2) will render it; your own CI gate can diff outcomes across prompt revisions.
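
The shape looks roughly like this — values invented for illustration, fields taken from the `CalibrationOutcome` box in the architecture diagram below:

```json
{
  "best_params": {
    "system_prompt_idx": 2,
    "few_shot_count": 3,
    "effort_idx": 1,
    "thinking_enabled": true,
    "max_tokens_bucket": 1
  },
  "best_fitness": 0.824,
  "test_fitness": 0.782,
  "generalization_gap": 0.051,
  "hard_gate_pass_rate": 1.0,
  "n_candidates_evaluated": 125,
  "total_api_calls": 2500,
  "usage_summary": { "input_tokens": 1843200, "output_tokens": 96400, "cache_read_input_tokens": 1512000 }
}
```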

---

## The calibratable axes

`PromptTarget` exposes five axes to the searcher:

| Axis | Type | Meaning |
|---|---|---|
| `system_prompt_idx` | int | Index into your pool of candidate system prompts (`ParamVariants.system_prompts`). |
| `few_shot_count` | int | How many examples to include from `ParamVariants.few_shot_examples`. 0 = zero-shot. |
| `effort_idx` | int (0-2) | Maps to `effort: low / medium / high`. Only meaningful when thinking is enabled. |
| `thinking_enabled` | bool | Whether to enable adaptive thinking on the target call. |
| `max_tokens_bucket` | int (0-2) | Maps to `max_tokens: 1024 / 4096 / 16000`. Surfaces length-bound bias. |

The `PromptSpace` dataclass lets you lock individual axes (`effort_min == effort_max`) when some dimensions are pre-decided. Setting every axis to a single value effectively runs a fixed-prompt benchmark through the judge; a typical calibration leaves three axes open and unlocks the top-K by sensitivity.
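
A locked-down space might look like the following — field names beyond `effort_min`/`effort_max` are assumed from the axis table above, so treat this as a sketch of the idea rather than the exact constructor:

```python
from omegaprompt import PromptSpace  # hypothetical import path

space = PromptSpace(
    system_prompt_min=0, system_prompt_max=4,  # 5 candidate system prompts: open
    few_shot_min=0,      few_shot_max=3,       # 0-3 few-shot examples: open
    effort_min=1,        effort_max=1,         # effort pinned at "medium": locked
    thinking_min=0,      thinking_max=1,       # thinking on/off: open
    max_tokens_bucket_min=1, max_tokens_bucket_max=1,  # 4096 tokens: locked
)
```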

The axis set is deliberately conservative in v0.1. Adding `temperature`, `top_p`, or reasoning-budget knobs would couple the space to model-specific behavior that evolves between releases. The five axes above transfer across any chat-style Claude model string.

---

## Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│  Dataset (.jsonl)   ParamVariants (.json)   JudgeRubric (.json)  │
└────────────────┬─────────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  PromptTarget                                                    │
│    implements omega-lock CalibrableTarget                        │
│    param_space() ---> 5 axes                                     │
│    evaluate(params):                                             │
│       for each dataset item:                                     │
│         1. build prompt from params                              │
│         2. call_target(target_client, ...)  -> response          │
│         3. call_judge(judge_client, rubric, ...) -> JudgeResult  │
│       CompositeFitness(judge_results) -> fitness                 │
│    returns EvalResult(fitness, n_trials, metadata)               │
└────────────────┬─────────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  omega-lock run_p1                                               │
│    measure_stress + select_unlock_top_k                          │
│    GridSearch over unlocked subspace                             │
│    WalkForward on --test slice (KC-4)                            │
│    emits grid_best, test_fitness, status                         │
└────────────────┬─────────────────────────────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  CalibrationOutcome (JSON artifact)                              │
│    best_params, best_fitness, test_fitness,                      │
│    generalization_gap, hard_gate_pass_rate,                      │
│    n_candidates_evaluated, total_api_calls, usage_summary        │
└──────────────────────────────────────────────────────────────────┘
```

Every piece above the omega-lock box is new; every piece at or below it is reused unchanged. The composition boundary is the `CalibrableTarget` protocol — two methods, `param_space()` and `evaluate(params)`.
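
In structural-typing terms the boundary is small enough to write down. A sketch, with signatures assumed from the diagram above (omega-lock's real protocol may carry more detail):

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class EvalResult:
    # Shape taken from the diagram: fitness plus bookkeeping.
    fitness: float
    n_trials: int
    metadata: dict[str, Any] = field(default_factory=dict)

class CalibrableTarget(Protocol):
    # Assumed signatures, for illustration only.
    def param_space(self) -> dict[str, Any]:
        """Declare the searchable axes (five of them for PromptTarget)."""
        ...

    def evaluate(self, params: dict[str, Any]) -> EvalResult:
        """Run the target + judge loop for one candidate and score it."""
        ...
```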

---

## Design decisions worth defending

**Single-responsibility adapter over an existing calibration engine.** omega-lock already handles stress measurement, top-K unlock, grid search, walk-forward, KC gates, benchmark scorecards, and iterative lock-in. Reimplementing any of that here would be wrong. `PromptTarget` is ~200 lines; the value is that every omega-lock pipeline runs on it unchanged.

**LLM-as-judge with a Pydantic-validated response.** The judge call uses `messages.parse(output_format=JudgeResult)`. A malformed judge response raises `ValidationError` at the SDK boundary — it never pollutes the fitness. Without this, a single misbehaving judge call could tank an entire calibration run, and you would not notice until the final report.
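
The failure mode this prevents is easy to demonstrate with plain Pydantic — the `JudgeResult` fields here are assumed for illustration, not the real schema:

```python
from pydantic import BaseModel, ValidationError

class JudgeResult(BaseModel):  # illustrative shape only
    scores: dict[str, float]
    gates: dict[str, bool]
    rationale: str

raw = '{"scores": {"clarity": "very good"}, "gates": {}, "rationale": "..."}'
try:
    JudgeResult.model_validate_json(raw)
except ValidationError as e:
    # A misbehaving judge response fails here, at the boundary --
    # it never becomes a number inside the fitness aggregate.
    print(e.error_count(), "validation error(s)")
```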

**Hard gates collapse fitness to zero, no gradient.** A soft penalty on refusal (e.g. "refusals lose 20%") rewards prompts that *almost* refuse. Hard-zero punishes refusal absolutely. The searcher sees no reward signal from inside the refusal region, so it does not approach the boundary. This matches how real deployments evaluate prompts: a prompt that refuses 1-in-10 is not "90% as good," it is unshippable.

**Prompt caching on the judge system prompt.** The judge's ~170-line system prompt is sized past the cacheable-prefix minimum. A typical calibration run issues hundreds of judge calls; cache hits dominate cost. Every run surfaces `cache_read_input_tokens` so silent invalidators (e.g. a non-deterministic byte sneaking into the system prompt) fail loud. [The full judge prompt](src/omegaprompt/prompts.py) is worth reading as a case study in prompt-cache-aware design.
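
For readers unfamiliar with the mechanics, `cache_control` sits on the system block of the judge call — roughly like this. A generic Anthropic-SDK sketch, not omegaprompt's exact call:

```python
import anthropic

JUDGE_SYSTEM_PROMPT = "...the long, byte-stable judge rubric text..."  # placeholder

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",  # the judge model from the demo above
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": JUDGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cacheable prefix ends here
    }],
    messages=[{"role": "user", "content": "<target response to grade>"}],
)
# Non-zero on a warm cache; zero across consecutive runs means the
# prefix bytes drifted and the cache was silently invalidated.
print(response.usage.cache_read_input_tokens)
```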

**`effort` and `thinking_enabled` are first-class axes, not globals.** Some prompts only need `low` effort; forcing `high` across the board wastes tokens without improving fitness. Letting the calibration surface this is the entire point — if `effort_idx` has high stress, it matters for your task; if it has low stress, lock it at the neutral and cut tokens.

**Target and judge clients are separate parameters.** Passing the same `Anthropic()` instance to both is the common case. Splitting them means you can swap judge models (the judge can be stronger than the target — often a better quality/cost tradeoff), mock each side independently in tests, or route judge calls through a different workspace.

**Every path is normalized to forward slashes before going into a prompt.** `src\foo.py` and `src/foo.py` are different bytes in the API payload. Prompt caching is cache-invariant only if the bytes match. Windows users would silently lose cache hits otherwise.
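
The normalization itself is a one-liner; a sketch of the idea, not necessarily omegaprompt's exact helper:

```python
from pathlib import PureWindowsPath

def normalize_path(p: str) -> str:
    # PureWindowsPath treats both separators as separators, so this is a
    # no-op on already-POSIX paths and rewrites Windows ones.
    return PureWindowsPath(p).as_posix()

assert normalize_path("src\\foo.py") == "src/foo.py"
assert normalize_path("src/foo.py") == "src/foo.py"
```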

**No `temperature` / `top_p` axis.** Modern Claude pinned models remove these. Rather than pretend to support them and fail on the server, omegaprompt excludes them from the default space. If you need them for an older model, override the axes in `PromptSpace`.

---

## Cost & performance

Per `evaluate()` call: 2 × dataset_size API calls — one target call and one judge call per item. A typical 10-item dataset therefore costs 20 API calls per candidate.

| Scenario | Target calls | Judge calls | Est. cost (cached judge) |
|---|---|---|---|
| Single candidate, 10-item dataset | 10 | 10 | ~\$0.05-0.10 |
| Grid search, 5^3 = 125 candidates, 10-item dataset | 1250 | 1250 | ~\$6-12 |
| run_p1 with walk-forward (train + test) | 2× the above | 2× the above | ~\$12-24 |
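
The arithmetic behind the table, as a sketch (assuming the test slice matches the training slice size, per the 2× row):

```python
def estimated_calls(n_candidates: int, n_items: int, walk_forward: bool = False) -> int:
    # One target call and one judge call per dataset item, per candidate.
    calls = 2 * n_candidates * n_items
    # Walk-forward re-evaluates on the held-out slice: roughly double.
    return 2 * calls if walk_forward else calls

assert estimated_calls(1, 10) == 20        # single candidate
assert estimated_calls(125, 10) == 2500    # 5^3 grid: 1250 target + 1250 judge
assert estimated_calls(125, 10, walk_forward=True) == 5000
```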

Actual cost depends on target and judge model tiers. Use `claude-haiku-4-5` as the judge during prompt iteration to bring this down by 4-5×; promote to a stronger judge only for the final shipped-calibration run.

Every CLI invocation prints aggregate token usage at the end. If `cache_read_input_tokens` is zero across consecutive runs, something in the judge prompt drifted — the CLI surfaces this explicitly rather than silently absorbing the cost.

---

## Validation

**50 tests, 0 network calls.** Both the target client and judge client are accepted via a structural `Protocol`, so every API test mocks with `SimpleNamespace` or `MagicMock`. The test surface asserts the *exact* shape of each request payload (model, thinking config, cache_control placement, few-shot ordering) without negotiating with a real server.
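
The pattern is plain structural typing; a condensed sketch of what such a fixture can look like (the real fixtures in the test suite may differ):

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

# A fake client only needs the attribute surface the code under test touches.
fake_response = SimpleNamespace(
    content=[SimpleNamespace(type="text", text="graded output")],
    usage=SimpleNamespace(input_tokens=900, output_tokens=40,
                          cache_read_input_tokens=850),
)
fake_client = SimpleNamespace(
    messages=SimpleNamespace(create=MagicMock(return_value=fake_response))
)

# After exercising the code under test with fake_client, the test asserts
# on the exact request payload -- model, thinking config, cache_control
# placement -- without any network:
# (_, kwargs) = fake_client.messages.create.call_args
# assert kwargs["model"] == "claude-haiku-4-5"
```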

| Module | Coverage |
|---|---|
| `schema.py` | ParamVariants / PromptSpace / CalibrationOutcome — required fields, range validation, JSON roundtrip. |
| `dataset.py` | JSONL loader — schema validation, duplicate id detection, blank-line tolerance, missing-file error. |
| `judge.py` | Dimension / HardGate / JudgeRubric / JudgeResult — scale validation, normalized weights, clamping out-of-scale scores, gate aggregation. |
| `fitness.py` | CompositeFitness — empty batch, all-pass, partial-fail, all-fail, per-item preservation. |
| `api.py` | call_target / call_judge — payload shape, thinking on/off, few-shot ordering, refusal branch, dict coercion, reference handling. |
| `target.py` | PromptTarget — end-to-end with mocked clients, default resolution, parameter clamping, usage accumulation. |
| `cli.py` | Help / version / subcommand wiring. |

Run with `uv run pytest -q`. Typical wall time: under one second.

---

## The 3-layer stack

omegaprompt does not stand alone. It is the applied layer in a three-project system:

```
       ┌─────────────────────────────────────────────┐
LAYER  │  omegaprompt  (this repo)                   │  "Apply the discipline to prompts"
APPLY  │  v0.1.0 — Claude API prompt calibration     │
       └────────────────────┬────────────────────────┘
                            │ depends on
                            ▼
       ┌─────────────────────────────────────────────┐
LAYER  │  omega-lock                                 │  "The calibration framework"
CORE   │  v0.1.4 — stress + grid + walk-forward + KC │
       └────────────────────┬────────────────────────┘
                            │ validated by
                            ▼
       ┌─────────────────────────────────────────────┐
LAYER  │  Antemortem + antemortem-cli                │  "The discipline around the build"
META   │  methodology + tooling for pre-impl recon   │
       └─────────────────────────────────────────────┘
```

- **[omega-lock](https://github.com/hibou04-ops/omega-lock)** supplies the calibration engine: stress measurement, top-K unlock, grid search, walk-forward, kill criteria, benchmark scorecards. omegaprompt uses it as a library.
- **[Antemortem](https://github.com/hibou04-ops/Antemortem)** + **[antemortem-cli](https://github.com/hibou04-ops/antemortem-cli)** — the pre-implementation reconnaissance discipline under which both omega-lock and omegaprompt were built. Antemortem catches ghost traps before code is written; omega-lock catches overfit parameters before they ship; omegaprompt catches overfit prompts before they deploy. The pattern repeats at three scales: spec, parameters, prompts.

The layering matters for credibility. The calibration engine was shipped and validated (omega-lock 0.1.4 on PyPI, 176 tests) before this prompt adapter was written. The adapter is ~200 lines because everything it needs already exists.

---

## Relation to adjacent tools

| Tool | What it does | What omegaprompt adds |
|---|---|---|
| **promptfoo** | Run prompts against test cases, compare outputs, assertion-based grading | Pre-declared walk-forward gate (KC-4) so training ≠ ship criterion. Hard gates that collapse fitness, not softly penalize. Stress-based axis selection. |
| **DSPy** | Prompt optimization via program abstraction + bootstrapped few-shot | Domain-agnostic adapter (any CalibrableTarget works). Calibration-first framing (stress + grid + walk-forward), not program synthesis. Composable with DSPy — DSPy output is just another `system_prompt_variant`. |
| **Optuna / Ray Tune (on prompts)** | General HPO over prompt knobs | Walk-forward validation + pre-declared kill criteria out of the box. LLM-as-judge with schema-enforced responses. Composite `hard_gate × soft_score` fitness built in, not reinvented per project. |
| **Hand-rolled "eval suite"** | Custom per-project scripts that call the model, score, rank | Structured data contract (Dataset, Rubric, Outcome), machine-readable artifact, reproducibility, and plug-and-play into an existing calibration engine that already has a 30-run reference benchmark. |

The USP is *discipline, not search*. Search belongs to omega-lock (which provides it for any `CalibrableTarget`); omegaprompt contributes the prompt-specific adapter and the hard-gates-first fitness shape.

---

## Status

v0.1.0 is **alpha**. The data contract (`Dataset`, `JudgeRubric`, `ParamVariants`, `PromptSpace`, `CalibrationOutcome`) is stable. The CLI contract (`omegaprompt calibrate`, its flags, exit codes) is stable. The judge prompt will iterate as we accumulate scoring-quality data from real runs — expect v0.1.x bumps for judge prompt revisions, tracked in CHANGELOG under *"Judge prompt revisions"*.

Semver applies strictly from v1.0.

Full changelog: [CHANGELOG.md](CHANGELOG.md).

---

## Roadmap

**v0.1.x (judge prompt iteration track)**
- Dogfood against diverse task types (code generation, reasoning, extraction, classification). Record scoring drift.
- Reference scoring-quality benchmark so judge prompt revisions are measured, not guessed.
- Additional hard gate evaluators (format predicates, safety classifiers) callable without a judge round-trip.

**v0.2 (tooling depth)**
- `omegaprompt report <outcome.json>` — human-readable debrief renderer.
- Multi-judge validation pattern: `judge_v1` + `judge_v2` over top-K, disagreement = trust signal.
- `--dry-run` with cost estimate before launching a calibration run.
- Second `cache_control` breakpoint on the rubric for iterative same-rubric runs.

**v0.3 (ecosystem)**
- Benchmark harness: multiple (task × rubric × seed) combinations, RAGAS-style scorecard like omega-lock's.
- GitHub Action for CI gating — runs a calibration on PR, blocks merge on KC-4 fail.

**Explicitly out of scope:** web dashboard, proprietary hosting, multi-user tenancy. omegaprompt is a local developer tool; keep it local.

---

## Contributing

The most valuable contributions are published calibration outcomes — a dataset, a rubric, and the resulting `CalibrationOutcome.json` across methods. They make the judge prompt evidence-based.

Issues and PRs welcome. For non-trivial changes, run an antemortem first with `antemortem-cli` — we dogfood the discipline that built this framework.

---

## Citing

```
omegaprompt v0.1.0 — calibration discipline for Claude API prompts.
https://github.com/hibou04-ops/omegaprompt, 2026.
```

Parent framework:
```
omega-lock v0.1.4 — sensitivity-driven coordinate descent calibration framework.
https://github.com/hibou04-ops/omega-lock, 2026.
```

Methodology (how this and its siblings were built):
```
Antemortem v0.1.1 — AI-assisted pre-implementation reconnaissance for software changes.
https://github.com/hibou04-ops/Antemortem, 2026.
```

---

## License

MIT. See [LICENSE](LICENSE).

## Colophon

Designed, implemented, and shipped solo. Adapter layer over omega-lock; zero calibration-engine reimplementation. 50 tests, 0 live API calls in CI. Built under the same pre-implementation reconnaissance discipline it asks of its contributors.
