Metadata-Version: 2.4
Name: codexopt
Version: 0.2.0
Summary: CodexOpt: Improve AGENTS.md and Skills for Codex with SkillOpt-style validation
Author-email: Shashi <shashi@super-agentic.ai>
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Requires-Dist: build>=1.2.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.5.0; extra == "docs"
Requires-Dist: pymdown-extensions>=10.0.0; extra == "docs"
Dynamic: license-file

<p align="center">
  <img src="assets/codexopt_logo.png" alt="CodexOpt logo" width="280">
</p>

<p align="center">
  Benchmark and optimize <code>AGENTS.md</code> and <code>SKILL.md</code> for Codex.
</p>

# CodexOpt

[![PyPI version](https://img.shields.io/pypi/v/codexopt)](https://pypi.org/project/codexopt/)
[![Python](https://img.shields.io/pypi/pyversions/codexopt)](https://pypi.org/project/codexopt/)
[![Docs](https://img.shields.io/badge/docs-mkdocs-blue)](https://superagenticai.github.io/CodexOpt/)
[![Demo Repo](https://img.shields.io/badge/demo-codexopt--demo-0f766e)](https://github.com/SuperagenticAI/codexopt-demo)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)

<p align="center">
  <a href="https://superagenticai.github.io/CodexOpt/"><img src="https://img.shields.io/badge/View-Documentation-2563eb?style=for-the-badge" alt="View Documentation"></a>
  <a href="https://github.com/SuperagenticAI/codexopt-demo"><img src="https://img.shields.io/badge/Try-Demo-0f766e?style=for-the-badge" alt="Try Demo"></a>
  <a href="https://pypi.org/project/codexopt/"><img src="https://img.shields.io/badge/Install-PyPI-111827?style=for-the-badge" alt="Install from PyPI"></a>
</p>

CodexOpt is a lightweight CLI for benchmarking and optimizing Codex instruction assets:

- `AGENTS.md`
- `.codex/skills/**/SKILL.md`
- `.agents/skills/**/SKILL.md`

## Quick Links

- Documentation: [superagenticai.github.io/CodexOpt](https://superagenticai.github.io/CodexOpt/)
- Codex user workflow: [docs/codex-users.md](docs/codex-users.md)
- Demo repository: [github.com/SuperagenticAI/codexopt-demo](https://github.com/SuperagenticAI/codexopt-demo)
- PyPI package: [pypi.org/project/codexopt](https://pypi.org/project/codexopt/)
- Docs source: [docs/](docs/)

CodexOpt gives teams a repeatable workflow to:

1. Scan instruction files.
2. Benchmark quality.
3. Generate optimized candidates.
4. Apply only improvements.
5. Produce a report.

## Why CodexOpt

Most teams edit `AGENTS.md` and `SKILL.md` manually, but struggle to answer:

- Did quality actually improve?
- Did we increase prompt bloat?
- Did we break skill frontmatter conventions?

CodexOpt turns these edits into measurable runs with artifacts you can inspect and version.

## Features

- Project scan with issue detection for agents and skills.
- Benchmark scoring with sub-scores and natural-language feedback.
- Optional evidence inputs from repo task files and issue exports.
- Optimization engine `heuristic` (default, local and deterministic).
- Reflective engine for Codex-backed SkillOpt/GEPA-style optimization.
- SkillOpt-inspired `skillopt` engine for SKILL.md files with train/validation evidence splits,
  bounded edits, and validation-gated acceptance.
- Explicit reporting when a model-backed run falls back to heuristic optimization.
- Safe apply flow with automatic backups.
- Markdown reporting from latest runs.
- Minimal OSS CI (lint, test, build).

## Installation

### Requirements

- Python `>=3.10`
- `uv` (recommended) or `pip`

### Recommended: uv (full workflow)

```bash
uv sync --extra dev
```

Run commands through the managed environment:

```bash
uv run codexopt --help
```

`uv.lock` is committed to keep dependency resolution reproducible across machines and CI.

### Alternative: pip

```bash
pip install -e ".[dev]"
```

## Quick Start (uv)

```bash
# 1) Create config
uv run codexopt init

# 2) Inspect what will be evaluated
uv run codexopt scan

# 3) Get baseline scores
uv run codexopt benchmark

# 4) Optimize AGENTS.md
uv run codexopt optimize agents --file AGENTS.md

# 5) Optimize skills
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"

# 6) Review apply impact without writing
uv run codexopt apply --kind agents --dry-run

# 7) Apply selected improvements
uv run codexopt apply --kind agents

# 8) Generate markdown summary
uv run codexopt report --output codexopt-report.md
```

For Codex-specific rollout workflows, including `codex exec --json` validation tasks, see
[Using CodexOpt with Codex](docs/codex-users.md).

## How Teams Use CodexOpt

Developers use CodexOpt in the repository that contains their Codex instruction assets:

- `AGENTS.md`
- `.codex/skills/**/SKILL.md`
- `.agents/skills/**/SKILL.md`

Optional evidence can also be added to improve benchmarking and optimization quality:

- task files (`tasks.md`, task lists, or JSON fixtures)
- issue/review exports (`issues.md` or JSON exports)

Typical workflow:

1. Run `scan` and `benchmark` to measure the current instruction assets.
2. Run `optimize agents` and `optimize skills` to generate improved candidates.
3. Review the generated diffs and report artifacts under `.codexopt/runs/`.
4. Run `apply --dry-run` first, then apply accepted changes.
5. Commit the updated instruction files and, if useful, attach the report to a PR.

Example with optional evidence configured in `codexopt.yaml`:

```yaml
evidence:
  task_files:
    - tasks.md
  issue_files:
    - issues.md
```

With that config in place, `benchmark` and `optimize` use:

- static prompt-quality checks
- repo task alignment
- recurring issue/review themes

Today, task and issue files influence scoring and feedback. With `--engine skillopt`, CodexOpt
uses task evidence as train/validation splits so skill candidates must improve held-out evidence
before they are accepted. JSON task files can also define executable rollout commands; when present,
those rollout pass rates become the held-out validation gate.
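
The train/validation idea described above can be sketched in a few lines of Python. This is an illustration only, not CodexOpt's implementation: `split_tasks` and `accept_candidate` are hypothetical helper names, and the positional split is an assumption. The defaults mirror `skillopt_train_ratio` and `skillopt_validation_delta` from the default config.

```python
from typing import Sequence


def split_tasks(tasks: Sequence[str], train_ratio: float = 0.67) -> tuple[list[str], list[str]]:
    """Deterministically split tasks into (train, validation) by position."""
    if not tasks:
        return [], []
    cut = round(len(tasks) * train_ratio)
    # With two or more tasks, keep at least one task on each side of the split.
    if len(tasks) > 1:
        cut = min(max(cut, 1), len(tasks) - 1)
    return list(tasks[:cut]), list(tasks[cut:])


def accept_candidate(baseline_val: float, candidate_val: float, min_delta: float = 0.01) -> bool:
    """Accept only when the held-out score improves by at least min_delta."""
    return (candidate_val - baseline_val) >= min_delta


train, validation = split_tasks(["t1", "t2", "t3", "t4", "t5", "t6"])
print(train, validation)
print(accept_candidate(baseline_val=0.50, candidate_val=0.75))
```

The point of the gate is that a candidate is proposed using only the training tasks, so a gain on the validation tasks is evidence that the edit generalizes rather than overfits.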

Use `codexopt.example.yaml` as a starting point for committed team config.

## Command Reference

### Global options

```bash
codexopt --config <path-to-codexopt.yaml> <command>
```

### `init`

Create a default config file.

```bash
codexopt init [--path PATH] [--force]
```

### `scan`

Discover AGENTS/SKILL targets and validate shape.

```bash
codexopt scan
```

### `benchmark`

Score current files using built-in heuristics.

```bash
codexopt benchmark
```

### `optimize agents`

Optimize AGENTS files.

```bash
codexopt optimize agents \
  [--file PATTERN] \
  [--engine heuristic|reflective] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]
```

### `optimize skills`

Optimize SKILL files.

```bash
codexopt optimize skills \
  [--glob PATTERN] \
  [--engine heuristic|skillopt|reflective] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]
```

### `improve`

One command for Codex users: discover targets, mine starter tasks, run the
reflective optimizer, and preview the diff.

```bash
codexopt improve                    # offline preview
codexopt improve --live             # Codex-backed reflective preview
codexopt improve --live --apply     # write validated changes with backups
```

### `apply`

Apply best candidates from the latest optimization run (or a provided run id).

```bash
codexopt apply [--kind agents|skills] [--run-id RUN_ID] [--dry-run]
```
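
As a mental model for the safe apply flow (backup first, then write, with `--dry-run` skipping both), here is a minimal Python sketch. It is not the CodexOpt source: `apply_candidate` and its parameters are hypothetical, and the backup layout simply mirrors the relative path under a backup directory.

```python
import shutil
from pathlib import Path


def apply_candidate(repo: Path, rel_path: str, new_text: str,
                    backup_dir: Path, dry_run: bool = False) -> bool:
    """Write new_text to repo/rel_path after backing up the original.

    Returns True when the file would change (dry run) or did change.
    """
    target = repo / rel_path
    if target.exists() and target.read_text(encoding="utf-8") == new_text:
        return False  # content is already equivalent, nothing to apply
    if dry_run:
        return True  # report only, no backup and no write
    if target.exists():
        # Copy the original aside before overwriting it.
        backup = backup_dir / rel_path
        backup.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(target, backup)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(new_text, encoding="utf-8")
    return True
```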

### `report`

Generate a markdown report from the latest runs recorded in `state.json`.

```bash
codexopt report [--output FILE.md]
```

## Configuration

Default `codexopt.yaml`:

```yaml
version: 1
targets:
  agents_files:
    - AGENTS.md
    - "**/AGENTS.md"
    - "**/AGENTS.override.md"
  skills_globs:
    - ".codex/skills/**/SKILL.md"
    - "**/.codex/skills/**/SKILL.md"
    - ".agents/skills/**/SKILL.md"
    - "**/.agents/skills/**/SKILL.md"
  exclude_globs:
    - ".git/**"
    - ".codexopt/**"
    - ".venv/**"
    - "node_modules/**"
    - "reference/**"
output:
  root_dir: ".codexopt"
evidence:
  task_files: []
  issue_files: []
optimization:
  engine: "heuristic"
  min_apply_delta: 0.01
  max_metric_calls: 60
  reflection_model: null
  skillopt_train_ratio: 0.67
  skillopt_edit_budget: 24
  skillopt_validation_delta: 0.01
```

Config notes:

- `targets.agents_files`: glob patterns for AGENTS targets.
- `targets.skills_globs`: glob patterns for `SKILL.md` targets.
- `targets.exclude_globs`: paths ignored during scan.
- `output.root_dir`: run artifacts and backups location.
- `evidence.task_files`: optional markdown/json task lists used for repo-alignment scoring.
- `evidence.issue_files`: optional markdown/json issue or review exports used for theme-aware feedback.
- `optimization.engine`: default optimization engine (`heuristic`, `reflective`, or `skillopt` for skills).
- `optimization.min_apply_delta`: minimum score gain required to apply.
- `optimization.max_metric_calls`: legacy GEPA metric budget.
- `optimization.reflection_model`: legacy GEPA reflection model; also read by the `skillopt` engine, which uses heuristic proposers when it is unset.
- `optimization.skillopt_train_ratio`: task evidence fraction used for skill candidate proposal.
- `optimization.skillopt_edit_budget`: maximum line edit operations allowed for SkillOpt candidates.
- `optimization.skillopt_validation_delta`: minimum held-out validation gain required for SkillOpt acceptance.

## How Scoring Works

CodexOpt computes a `0.0` to `1.0` score per file.

AGENTS scoring factors include:

- Too short or too long content penalties.
- Token-heaviness estimate penalty.
- Empty file penalty.
- Contradictory guidance penalties.
- Missing workflow / verification / output-format guidance penalties.
- Repo-context and task-alignment signals when evidence files are configured.

SKILL scoring factors include:

- Missing frontmatter penalties.
- Missing `name` / `description` penalties.
- Overly long frontmatter fields penalties.
- Too short or too long content penalties.
- Weak trigger/workflow/verification guidance penalties.
- Repo task alignment signals when evidence files are configured.

Each benchmarked file also includes:

- criterion-level sub-scores
- natural-language feedback
- optional evidence summary from configured task/issue files
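
To make the penalty-style scoring concrete, here is a rough Python sketch for a `SKILL.md` file. The penalty weights (`0.4`, `0.15`, `0.1`) and the ten-word threshold are invented for illustration and are not CodexOpt's actual values; `parse_frontmatter` and `score_skill` are hypothetical names.

```python
from typing import Optional


def parse_frontmatter(text: str) -> Optional[dict[str, str]]:
    """Return simple key/value frontmatter fields, or None if no block exists."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            fields = {}
            for entry in lines[1:end]:
                key, sep, value = entry.partition(":")
                if sep:
                    fields[key.strip()] = value.strip()
            return fields
    return None  # opening delimiter without a closing one


def score_skill(text: str) -> tuple[float, list[str]]:
    """Return (score in [0.0, 1.0], feedback messages)."""
    if not text.strip():
        return 0.0, ["File is empty."]
    score = 1.0
    feedback: list[str] = []
    fields = parse_frontmatter(text)
    if fields is None:
        score -= 0.4
        feedback.append("Missing frontmatter block.")
    else:
        for required in ("name", "description"):
            if not fields.get(required):
                score -= 0.15
                feedback.append(f"Missing `{required}` in frontmatter.")
    if len(text.split()) < 10:
        score -= 0.1
        feedback.append("Content is very short.")
    # Clamp so stacked penalties never push the score outside [0.0, 1.0].
    return max(0.0, min(1.0, round(score, 2))), feedback
```

Returning feedback alongside the number mirrors the idea of pairing each score with natural-language feedback.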

## Optimization Behavior

### Heuristic engine

Candidate transforms include:

- Whitespace normalization.
- Blank-line compaction.
- Duplicate adjacent line removal.
- Skill-specific frontmatter synthesis/trimming.

The best candidate is selected by score delta. If the delta is below `min_apply_delta`, the original content is kept.
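
The three text-level transforms and the delta threshold can be approximated as below. This is a sketch under assumptions, not the engine's code: the function names are hypothetical, and the `score` callable passed to `pick_best` stands in for whatever scorer is in use.

```python
import re
from typing import Callable, Sequence


def normalize_whitespace(text: str) -> str:
    """Strip trailing whitespace on each line and end with a single newline."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines).strip("\n") + "\n"


def drop_adjacent_duplicates(text: str) -> str:
    """Remove a non-blank line that exactly repeats the line before it."""
    out: list[str] = []
    for line in text.splitlines():
        if out and line and line == out[-1]:
            continue
        out.append(line)
    return "\n".join(out) + "\n"


def compact_blank_lines(text: str) -> str:
    """Collapse runs of two or more blank lines into one blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


def pick_best(original: str, candidates: Sequence[str],
              score: Callable[[str], float], min_apply_delta: float = 0.01) -> str:
    """Keep the original unless the best candidate clears the delta threshold."""
    if not candidates:
        return original
    best = max(candidates, key=score)
    return best if score(best) - score(original) >= min_apply_delta else original
```

Chaining `normalize_whitespace`, `drop_adjacent_duplicates`, and `compact_blank_lines` on a file with a repeated line and extra blank lines yields one copy of the line and single blank-line separators.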

### Reflective engine

The maintained SkillOpt/GEPA-inspired path is `--engine reflective`, or the
Codex-user shortcut `codexopt improve`. It evaluates a candidate document on
tasks, captures textual feedback, asks an optimizer model to rewrite the
document, and accepts the rewrite only when it improves held-out validation
tasks.

Defaults stay offline and use static/verifier signals. To run the full live
Codex loop, use:

```bash
codexopt improve --live
```

`--live` uses `codex exec` as both optimizer and judge. You can also set
`reflective.optimizer_model` and `reflective.judge_model` to `codex`,
`openai/<model>`, or another OpenAI-compatible model.

### Legacy GEPA engine

`--engine gepa` is deprecated. It targeted an older `gepa.optimize_anything`
API and now falls back with a clear warning. Use `--engine reflective` instead.

### SkillOpt engine

For SkillOpt-style skill optimization:

```yaml
optimization:
  engine: "skillopt"
  reflection_model: "openai/gpt-5-mini"  # optional; without it, heuristic proposers are used
  skillopt_train_ratio: 0.67
  skillopt_edit_budget: 24
  skillopt_validation_delta: 0.01
```

Executable rollout task files can be listed in `evidence.task_files`:

```json
[
  {
    "name": "skill-verifier",
    "description": "Run a repo-local verifier against the candidate skill.",
    "command": ["python", "scripts/verify_skill.py"],
    "timeout_seconds": 30
  }
]
```
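
One way to picture how a command-style task like the one above could be executed and turned into a pass rate is the following sketch. It assumes that "pass" means the command exits with code 0 before `timeout_seconds` elapses; that criterion, and the names `run_rollout` and `pass_rate`, are assumptions rather than documented CodexOpt behavior.

```python
import subprocess
import sys


def run_rollout(task: dict, cwd: str = ".") -> bool:
    """Run one task's command; pass means exit code 0 within the timeout."""
    try:
        result = subprocess.run(
            task["command"],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=task.get("timeout_seconds", 30),
        )
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a missing executable counts as a failed rollout.
        return False
    return result.returncode == 0


def pass_rate(tasks: list[dict], cwd: str = ".") -> float:
    """Fraction of tasks whose command passed; 0.0 when there are no tasks."""
    if not tasks:
        return 0.0
    return sum(run_rollout(task, cwd) for task in tasks) / len(tasks)


# Two stand-in tasks that use the current interpreter instead of a real verifier.
demo_tasks = [
    {"name": "always-passes", "command": [sys.executable, "-c", "raise SystemExit(0)"]},
    {"name": "always-fails", "command": [sys.executable, "-c", "raise SystemExit(1)"]},
]
print(pass_rate(demo_tasks))
```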

Codex-backed rollout tasks can use `backend: "codex"` and `codex_prompt`:

```json
[
  {
    "name": "codex-skill-task",
    "backend": "codex",
    "description": "Run Codex against the candidate skill.",
    "codex_prompt": "Use the local skill to update CHANGELOG.md for a patch release.",
    "timeout_seconds": 120,
    "expected_final_response_contains": "CHANGELOG.md",
    "expected_file_change": "CHANGELOG.md",
    "expected_file_contains": {
      "path": "CHANGELOG.md",
      "contains": "Patch"
    }
  }
]
```

CodexOpt evaluates those commands in a temporary copy of the repo with the candidate `SKILL.md`
written in place, then records pass/fail details in `optimize.json`. For Codex-backed rollouts,
CodexOpt also parses `codex exec --json` events into trajectory metadata: final response,
commands, file changes, token usage, and errors.

For OpenAI-compatible reflective models, set the provider credentials and use
`reflective.optimizer_model` / `reflective.judge_model` values such as
`openai/gpt-5-mini`:

```bash
export OPENAI_API_KEY="your-openai-key"
```

For Gemini-compatible endpoints, set the credentials expected by your OpenAI-compatible
client, or run `codexopt improve --live` to use `codex exec` directly.

```bash
export GEMINI_API_KEY="your-gemini-key"
export GOOGLE_API_KEY="$GEMINI_API_KEY"
```

Fallback behavior:

- If a configured optimizer or judge model is unavailable, CodexOpt records a note and
  falls back to the weaker heuristic/static path.
- Fallbacks are recorded in optimization artifacts, CLI summaries, and reports.

## Artifacts and State

By default, everything is written under `.codexopt/`:

- `runs/<run_id>/scan.json`
- `runs/<run_id>/benchmark.json`
- `runs/<run_id>/optimize.json`
- `runs/<run_id>/apply.json`
- `backups/<timestamp>/...` (created on non-dry-run apply)
- `state.json` (tracks latest run ids per command type)

Run ids are timestamped and namespaced by command kind, for example:

- `20260308T184800123456Z-benchmark`
- `20260308T184812654321Z-optimize-skills`
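
The run id shape shown above (UTC timestamp with microseconds, a `Z`, then the command kind) can be reproduced with the standard library. The helper names `make_run_id` and `split_run_id` are hypothetical; the format string is inferred from the two example ids.

```python
import re
from datetime import datetime, timezone
from typing import Optional

RUN_ID_RE = re.compile(r"^(\d{8}T\d{12}Z)-(.+)$")


def make_run_id(kind: str, now: Optional[datetime] = None) -> str:
    """Build an id such as 20260308T184800123456Z-benchmark."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # %f is the six-digit microsecond field, which keeps ids unique and sortable.
    return f"{now.strftime('%Y%m%dT%H%M%S%f')}Z-{kind}"


def split_run_id(run_id: str) -> tuple[str, str]:
    """Split a run id into (timestamp, kind); raise ValueError if malformed."""
    match = RUN_ID_RE.match(run_id)
    if match is None:
        raise ValueError(f"not a run id: {run_id!r}")
    return match.group(1), match.group(2)
```

Because the timestamp is fixed-width and comes first, sorting run ids as plain strings also sorts them chronologically.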

## Typical Team Workflow

1. Commit current `AGENTS.md` and skills.
2. Run `scan` and `benchmark` to establish a baseline.
3. Run `optimize agents` and/or `optimize skills`.
4. Review `optimize.json` and diffs.
5. Run `apply --dry-run` first, then `apply`.
6. Run `report` and attach the report to the PR.

## Examples

### Example A: `AGENTS.md` cleanup

Before (`AGENTS.md`):

```md
## Coding Rules
Always run tests before commit.
Always run tests before commit.


Keep changes minimal.
```

After optimization (heuristic):

```md
## Coding Rules
Always run tests before commit.

Keep changes minimal.
```

What changed:

- Removed duplicate adjacent line.
- Compacted extra blank lines.

### Example B: `SKILL.md` missing frontmatter

Before (`.codex/skills/my_skill/SKILL.md`):

```md
Use this skill for repository release checks.
Run lint, tests, and changelog validation.
```

After optimization (heuristic):

```md
---
name: my-skill
description: Repository-specific workflow skill.
---

Use this skill for repository release checks.
Run lint, tests, and changelog validation.
```

What changed:

- Added required frontmatter block.
- Generated normalized `name` from folder name.
- Added default `description`.
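
The frontmatter synthesis in Example B can be imitated with a short Python sketch. The normalization rule (lowercase, non-alphanumeric runs become `-`) is an assumption that happens to reproduce `my_skill` to `my-skill`; `normalize_skill_name` and `ensure_frontmatter` are hypothetical names, not CodexOpt functions.

```python
import re
from pathlib import PurePosixPath

DEFAULT_DESCRIPTION = "Repository-specific workflow skill."


def normalize_skill_name(folder: str) -> str:
    """Lowercase the folder name and replace non-alphanumeric runs with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", folder.lower()).strip("-")


def ensure_frontmatter(skill_path: str, text: str) -> str:
    """Prepend a minimal frontmatter block when the file has none."""
    if text.lstrip().startswith("---"):
        return text  # leave existing frontmatter untouched
    name = normalize_skill_name(PurePosixPath(skill_path).parent.name)
    header = f"---\nname: {name}\ndescription: {DEFAULT_DESCRIPTION}\n---\n\n"
    return header + text.lstrip("\n")


body = "Use this skill for repository release checks.\nRun lint, tests, and changelog validation.\n"
print(ensure_frontmatter(".codex/skills/my_skill/SKILL.md", body))
```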

### Example C: Reproduce end-to-end on a repo

```bash
uv run codexopt init
uv run codexopt scan
uv run codexopt benchmark
uv run codexopt optimize agents --file AGENTS.md
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"
uv run codexopt apply --kind skills --dry-run
uv run codexopt apply --kind skills
uv run codexopt report --output codexopt-report.md
```

Files to inspect after running:

- `.codexopt/runs/*/scan.json`
- `.codexopt/runs/*/benchmark.json`
- `.codexopt/runs/*/optimize.json`
- `.codexopt/runs/*/apply.json`
- `.codexopt/backups/*`
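
If you want to locate the newest copy of each artifact without reading `state.json`, a small helper like the one below may be enough. It relies only on the directory layout listed above and on run ids sorting chronologically by name; `latest_artifacts` is a hypothetical helper, not part of the CLI.

```python
from pathlib import Path


def latest_artifacts(root: str = ".codexopt") -> dict[str, Path]:
    """Map each artifact file name to its most recent copy under runs/."""
    latest: dict[str, Path] = {}
    # Run directories start with a timestamp, so name order is time order.
    for run_dir in sorted(Path(root, "runs").glob("*")):
        for artifact in sorted(run_dir.glob("*.json")):
            latest[artifact.name] = artifact  # later runs overwrite earlier ones
    return latest


for name, path in latest_artifacts().items():
    print(f"{name}: {path}")
```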

## CI

A GitHub Actions workflow is included at `.github/workflows/ci.yml` and runs:

- `uv lock --check` for lockfile consistency.
- `uv sync --extra dev` for environment setup.
- Ruff lint checks.
- Pytest tests.
- Package build (`uv build`).

It does not publish packages.

## Development

```bash
uv lock
uv sync --extra dev
uv run --no-sync ruff check src tests
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q
uv build
```

## FAQ / Troubleshooting

### `codexopt apply` says "no optimization run found"

Cause:

- No prior optimization run for the selected kind.
- `state.json` does not contain the expected latest run pointer.

Fix:

```bash
uv run codexopt optimize agents
uv run codexopt apply --kind agents
```

Or pass an explicit run:

```bash
uv run codexopt apply --kind agents --run-id <run_id>
```

### `--engine gepa` did not use GEPA

Cause:

- The legacy GEPA engine targeted an older `gepa.optimize_anything` API.

Behavior:

- CodexOpt falls back to heuristic optimization and records the deprecation reason.

Fix:

```bash
uv run codexopt optimize agents --engine reflective
uv run codexopt improve --live
```

### `apply --dry-run` says files would be applied, but nothing changed

Expected behavior:

- `--dry-run` reports candidate applications without writing files.

To write changes, run again without `--dry-run`:

```bash
uv run codexopt apply --kind agents
```

### Build fails with network/isolation issues

If your environment blocks dependency resolution in isolated builds, use:

```bash
uv build
```

### Pytest fails due to unrelated external plugins

Some environments auto-load global pytest plugins that can break local tests.
Run with plugin autoload disabled:

```bash
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q
```

### Optimization produced no applied changes

Cause:

- Best candidate delta is below `optimization.min_apply_delta`, or
- File content is already equivalent.

Fix:

- Lower `optimization.min_apply_delta` in `codexopt.yaml`, then re-run optimize/apply.

## License

MIT. See `LICENSE`.

## Author

- Shashi (`shashi@super-agentic.ai`)
