Metadata-Version: 2.4
Name: ai-prompt-simulation
Version: 0.1.0
Summary: Prompt simulation and autonomous-agent effectiveness benchmarking framework
Project-URL: Homepage, https://github.com/zaber-dev/ai-prompt-simulation
Project-URL: Repository, https://github.com/zaber-dev/ai-prompt-simulation
Project-URL: Issues, https://github.com/zaber-dev/ai-prompt-simulation/issues
Author: zaber-dev
License: MIT
License-File: LICENSE.md
Keywords: ai-agents,benchmarking,evaluation,prompt-engineering,simulation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click>=8.1.7
Requires-Dist: pydantic>=2.7.0
Requires-Dist: pyyaml>=6.0.1
Provides-Extra: dev
Requires-Dist: black>=24.4.2; extra == 'dev'
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pytest-cov>=5.0.0; extra == 'dev'
Requires-Dist: pytest>=8.2.0; extra == 'dev'
Requires-Dist: ruff>=0.4.4; extra == 'dev'
Description-Content-Type: text/markdown

# ai-prompt-simulation

A professional Python framework for testing prompt strength, quality, and real-world autonomous-agent effectiveness.

This repository helps you answer practical questions before deploying prompts in production:
- Is this prompt strong enough for autonomous execution?
- Which quality dimensions are weak (clarity, specificity, robustness, consistency, efficiency)?
- How does prompt A compare to prompt B under repeatable conditions?
- Can I customize scoring, scenarios, and evaluator logic for my domain?

## Why This Project Exists

Prompt quality is often judged subjectively. This project provides a repeatable simulation pipeline with transparent scoring, configurable weighting, and benchmark workflows that can be run from both the Python API and the CLI.

## Core Capabilities

- Deterministic simulation engine with seed-based runs and retries
- Hybrid quality model:
  - Deterministic heuristics (clarity, specificity, robustness, consistency, efficiency)
  - Optional LLM-as-judge dimensions (reasoning, goal completion)
- Benchmark mode for multi-case prompt suites
- Side-by-side prompt comparison
- Extensible plugin registry for custom evaluators and scenario factories
- JSON report output for automation and CI pipelines

## Architecture

High-level module map:

- `core`: Typed schemas, config validation, report contracts
- `providers`: LLM provider abstraction and deterministic mock provider
- `scoring`: Dimension evaluators and weighted aggregation
- `engine`: Prompt simulation and benchmark orchestration
- `plugins`: Custom evaluator and scenario registration
- `api`: Python-first public interface
- `cli`: Terminal commands for automation and team workflows

For deeper details see `docs/architecture.md`.
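
As a quick orientation, the public names used in the examples later in this README map onto these modules as follows:

```python
# Names used in this README's examples, grouped by the module they come from.
from ai_prompt_simulation.api.public import run_simulation                     # api
from ai_prompt_simulation.core.models import SimulationConfig, DimensionScore  # core
from ai_prompt_simulation.engine.simulator import PromptSimulator              # engine
```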

## Scoring Model

### Base Dimensions (always available)

- `clarity`: readability and structural guidance
- `specificity`: explicit constraints and output requirements
- `robustness`: edge-case and failure-handling guidance
- `consistency`: output stability across runs
- `efficiency`: verbosity and likely token/latency pressure

### Optional Judge Dimensions

- `reasoning`: quality of chain-of-thought style structure
- `goal_completion`: likelihood that prompt drives task completion

### Overall Score

The framework aggregates weighted dimension components into an overall score and maps it to a band, as sketched after the list:

- `production-ready`: 80-100
- `good`: 65-79
- `developing`: 50-64
- `failing`: 0-49
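
A minimal sketch of how dimension scores might roll up into a band, assuming a simple weighted average; the cut-offs come straight from the list above, while the aggregation itself is an assumption (see `docs/scoring.md` for the real formulas):

```python
# Illustrative sketch only: weighted average over dimension scores, then band lookup.
# The band cut-offs mirror the list above; the averaging itself is an assumption.
def overall_score(dimensions: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights.get(name, 0.0) for name in dimensions)
    if total == 0:
        return 0.0
    return sum(s * weights.get(name, 0.0) for name, s in dimensions.items()) / total

def band(score: float) -> str:
    if score >= 80:
        return "production-ready"
    if score >= 65:
        return "good"
    if score >= 50:
        return "developing"
    return "failing"

example = overall_score({"clarity": 82.0, "specificity": 74.0}, {"clarity": 0.5, "specificity": 0.5})
print(example, band(example))  # 78.0 good
```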

For formulas and rationale see `docs/scoring.md`.

## Installation

### Local development

```bash
git clone https://github.com/zaber-dev/ai-prompt-simulation.git
cd ai-prompt-simulation
python -m venv .venv
# Linux/macOS
source .venv/bin/activate
# Windows PowerShell
.venv\Scripts\Activate.ps1
pip install -e ".[dev]"
```

### Verify

```bash
pytest
```

## Optional Real LLM Providers

By default, the framework uses the deterministic `mock` provider for reproducible testing.

You can optionally use real model providers:

- `openai` (default model: `gpt-4o-mini`)
- `gemini` (default model: `gemini-2.0-flash`)

Set API keys via environment variables:

```powershell
# Windows PowerShell
$env:OPENAI_API_KEY = "your-openai-key"
$env:GEMINI_API_KEY = "your-gemini-key"
```

```bash
# Linux/macOS
export OPENAI_API_KEY="your-openai-key"
export GEMINI_API_KEY="your-gemini-key"
```

## Quick Start (Python API)

```python
from ai_prompt_simulation.api.public import run_simulation
from ai_prompt_simulation.core.models import SimulationConfig

config = SimulationConfig(
    runs=4,
    judge={
        "enabled": True,
        "reasoning_weight": 0.1,
        "goal_completion_weight": 0.1,
    },
)

result = run_simulation(
    "You are an autonomous planning agent. Output JSON with fields plan, risks, and next_action. "
    "Include one fallback if required data is missing.",
    case_id="quickstart-1",
    config=config,
    provider_name="openai",
    model="gpt-4o-mini",
)

print(result.report.summary.overall_score, result.report.summary.band)
for d in result.report.dimensions:
    print(d.name, d.score)
```
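
To keep a run fully deterministic, you can omit `provider_name` and `model`; the framework then falls back to the deterministic `mock` provider described above.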

## Quick Start (CLI)

### Simulate one prompt

```bash
prompt-sim simulate \
  --prompt "You must output JSON with keys action and status. Include one fallback." \
  --provider openai \
  --model gpt-4o-mini \
  --runs 4 \
  --config configs/default.yaml \
  --output out/sim_result.json
```

Use Gemini instead:

```bash
prompt-sim simulate \
  --prompt "You must output JSON with keys action and status. Include one fallback." \
  --provider gemini \
  --model gemini-2.0-flash
```

### Run benchmark suite

```bash
prompt-sim benchmark \
  --name "core-suite" \
  --cases-file examples/benchmark_cases.yaml \
  --config configs/default.yaml \
  --output out/benchmark_result.json
```
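
The saved JSON is meant to be consumed by automation. A minimal CI-gate sketch, assuming the report exposes the same `summary.overall_score` and `summary.band` fields used in the Python quick start (adjust the keys to the actual report layout):

```python
# Hypothetical CI gate: fail the job when the benchmark drops below the "good" band.
# The summary field names are assumed from the Python quick start above.
import json
import sys

with open("out/benchmark_result.json") as fh:
    report = json.load(fh)

score = report["summary"]["overall_score"]
print(f"overall score: {score} ({report['summary']['band']})")
sys.exit(0 if score >= 65 else 1)
```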

### Compare two prompts

```bash
prompt-sim compare \
  --prompt-a "Summarize this issue." \
  --prompt-b "Summarize in exactly 3 bullets, include assumptions, output JSON."
```

### Explain a saved score

```bash
prompt-sim explain-score --result-file out/sim_result.json
```

### Validate config

```bash
prompt-sim validate-config --config configs/default.yaml
```

## Customization

### Register custom evaluator

```python
from ai_prompt_simulation.core.models import DimensionScore
from ai_prompt_simulation.engine.simulator import PromptSimulator

def domain_evaluator(prompt, outputs, _config, _provider):
    # Count domain markers that signal readiness for autonomous execution.
    hits = sum(k in prompt.lower() for k in ["goal", "constraints", "fallback", "verify"])
    return DimensionScore(
        name="autonomy_readiness",
        score=min(100.0, 30 + hits * 15),  # 30 base + 15 per marker, capped at 100
        rationale="Domain-specific autonomous readiness score",
        evidence={"marker_hits": hits},
    )

sim = PromptSimulator()
sim.register_evaluator("autonomy_readiness", domain_evaluator)
```

See `examples/custom_evaluator.py`.

## Input File Format

Benchmark case files (`.yaml` or `.json`) must be a list of prompt cases:

```yaml
- id: case-1
  task: qa
  prompt: |
    You are an autonomous support agent.
    Answer in exactly 3 bullet points.
  variables:
    locale: en-US
```
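
To sanity-check a case file before a benchmark run, here is a short sketch using `pyyaml` (already a declared dependency); the required fields are assumed from the example above:

```python
# Minimal case-file check: the file must be a list, and each case should carry
# the fields shown in the example above (id, task, prompt).
import yaml

with open("examples/benchmark_cases.yaml") as fh:
    cases = yaml.safe_load(fh)

assert isinstance(cases, list), "case file must be a list of prompt cases"
for case in cases:
    missing = [field for field in ("id", "task", "prompt") if field not in case]
    assert not missing, f"case {case.get('id', '<no id>')} is missing: {missing}"
```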

## Project Structure

```text
.
|-- configs/
|-- docs/
|-- examples/
|-- src/ai_prompt_simulation/
|   |-- api/
|   |-- cli/
|   |-- core/
|   |-- engine/
|   |-- plugins/
|   |-- providers/
|   `-- scoring/
|-- tests/
|-- LEARN.md
|-- LICENSE.md
`-- README.md
```

## Documentation Index

- `LEARN.md`: progressive learning path and usage curriculum
- `docs/architecture.md`: design and extension points
- `docs/scoring.md`: scoring methodology and formulas
- `docs/testing.md`: testing and validation practices
- `CONTRIBUTING.md`: contribution standards and workflow
- `SECURITY.md`: vulnerability disclosure policy

## Quality Standards

- Typed Pydantic contracts for all major data flows
- Deterministic mock provider for reproducible test runs
- CI-ready test, lint, and type-check configuration
- Structured JSON reports for automation and traceability

## Versioning and Releases

- Versioning follows semantic versioning (MAJOR.MINOR.PATCH)
- Initial target release: `0.1.0` (alpha)
- Release notes are tracked in `CHANGELOG.md`

## Contributing

Contributions are welcome. See `CONTRIBUTING.md` for branch naming, tests, and review requirements.

## License

This project is licensed under MIT. See `LICENSE.md`.
