Metadata-Version: 2.4
Name: promptdebug
Version: 0.2.0
Summary: Find dead tokens in your system prompts. Ablation-based influence analysis for LLM prompts.
Project-URL: Homepage, https://github.com/entropyvector/promptdebug
Project-URL: Documentation, https://github.com/entropyvector/promptdebug#readme
Project-URL: Repository, https://github.com/entropyvector/promptdebug
Project-URL: Issues, https://github.com/entropyvector/promptdebug/issues
Project-URL: Changelog, https://github.com/entropyvector/promptdebug/blob/main/CHANGELOG.md
Author-email: Zaur Jafarov <entropyvector.dev@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: ablation,debugging,llm,optimization,prompt
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: jinja2>=3.1.0
Requires-Dist: litellm>=1.40.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sentence-transformers>=3.0.0
Requires-Dist: tiktoken>=0.7.0
Requires-Dist: typer>=0.12.0
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: python-dotenv>=1.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: twine>=4.0; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Description-Content-Type: text/markdown

# promptdebug

[![PyPI version](https://img.shields.io/pypi/v/promptdebug.svg)](https://pypi.org/project/promptdebug/)
[![Downloads](https://pepy.tech/badge/promptdebug)](https://pepy.tech/project/promptdebug)
[![CI](https://github.com/entropyvector/promptdebug/actions/workflows/ci.yml/badge.svg)](https://github.com/entropyvector/promptdebug/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

Find dead tokens in your system prompts. Ablation-based influence analysis for LLM prompts.

promptdebug systematically removes each section of your system prompt and measures how the model's output changes. Sections that can be removed without measurably changing the output are **dead weight**: tokens you're paying for that do nothing.

## Install

```bash
pip install promptdebug
```

> **Note:** On first run, promptdebug downloads the `all-mpnet-base-v2` sentence-transformers model (~420 MB) for semantic scoring. This happens once and is cached locally by the `sentence-transformers` library.

Set your API key for whichever provider you use:

```bash
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export GEMINI_API_KEY="..."
```

## Quick Start

```bash
# Analyze a system prompt
promptdebug analyze prompt.txt --query "I want a refund"

# HTML report
promptdebug analyze prompt.txt --query "I want a refund" --format html

# Analyze across multiple queries for more robust results
promptdebug analyze prompt.txt --queries queries.txt

# Validate analysis reliability with a counterfactual injection
promptdebug analyze prompt.txt --query "test" --sanity-check

# Get rewrite suggestions for dead sections
promptdebug analyze prompt.txt --query "test" --suggest

# Watch mode — re-analyze automatically on every save
promptdebug watch prompt.txt --query "test"

# Compare influence between git versions
promptdebug diff prompt.txt --ref HEAD~1 --query "test"

# Compare across models
promptdebug compare prompt.txt --query "test query" --models gpt-4o-mini,claude-haiku-4-5

# Strip dead sections and output a cleaned prompt
promptdebug optimize prompt.txt --query "test query"

# Dry run (no API calls, shows cost estimate)
promptdebug analyze prompt.txt --query "test" --dry-run
```

## How It Works

1. **Parse** — Your system prompt is split into sections using automatic strategy detection (markdown headers, XML tags, labeled blocks, numbered lists, or paragraph breaks).

2. **Baseline** — The full prompt is sent to the model N times to establish baseline outputs.

3. **Ablate** — Each section is removed one at a time. The ablated prompt is sent to the model N times.

4. **Score** — Each section gets a composite influence score:

```
influence = 0.60 × semantic + 0.20 × structural + 0.20 × behavioral
```

- **Semantic** — cosine distance between sentence embeddings of baseline vs. ablated output
- **Structural** — character-level diff + paragraph/bullet/code block feature distance
- **Behavioral** — format-appropriate signals (JSON field match, classification exact match, or surface signals for free text)

5. **Classify** — Sections with influence < 0.10 are classified as **dead**.
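
The scoring and classification steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not promptdebug's actual implementation: the function names are hypothetical, the toy vectors stand in for real sentence embeddings, and each component score is assumed to be normalized to the range 0 to 1. Only the weights and the 0.10 threshold come from the description above.

```python
import math

# Weights and threshold as documented above; everything else is illustrative.
WEIGHTS = {"semantic": 0.60, "structural": 0.20, "behavioral": 0.20}
DEAD_THRESHOLD = 0.10


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: a zero vector counts as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def composite_influence(semantic, structural, behavioral):
    """Weighted sum of the three component scores (each assumed in [0, 1])."""
    return (
        WEIGHTS["semantic"] * semantic
        + WEIGHTS["structural"] * structural
        + WEIGHTS["behavioral"] * behavioral
    )


def is_dead(influence, threshold=DEAD_THRESHOLD):
    """A section is dead when its influence is strictly below the threshold."""
    return influence < threshold


# Toy embeddings standing in for sentence-transformers output.
baseline_vec = [0.9, 0.1, 0.0]
ablated_vec = [0.9, 0.1, 0.05]
semantic = cosine_distance(baseline_vec, ablated_vec)

influence = composite_influence(semantic, structural=0.02, behavioral=0.0)
print(f"influence={influence:.3f} dead={is_dead(influence)}")
```

Because the semantic term carries 60% of the weight, a section whose removal barely moves the output embedding will usually land under the threshold unless the structural or behavioral signals change sharply.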

## Output Example

```
Section 1: Role definition          [████████  ] 0.82  HIGH
Section 2: Output format rules      [████      ] 0.44  MEDIUM
Section 3: Tone guidelines          [█         ] 0.12  LOW
Section 4: Legacy constraint note   [          ] 0.03  DEAD
Section 5: Core task instruction    [███████   ] 0.71  HIGH

Dead token rate: 14.2% (127 / 894 tokens)
Estimated savings: ~$0.02 per 1K calls
```
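
The two summary lines follow from simple arithmetic, reproduced in the sketch below. The helper names are hypothetical, and the input price of $0.15 per million tokens is an assumption chosen for illustration; check your provider's current pricing. promptdebug's own cost estimate may be computed differently.

```python
def dead_token_rate(dead_tokens, total_tokens):
    """Fraction of prompt tokens that sit in dead sections."""
    return dead_tokens / total_tokens if total_tokens else 0.0


def savings_per_1k_calls(dead_tokens, usd_per_million_input_tokens):
    """Input cost of sending the dead tokens on 1,000 calls."""
    tokens_per_1k_calls = dead_tokens * 1000
    return tokens_per_1k_calls / 1_000_000 * usd_per_million_input_tokens


rate = dead_token_rate(127, 894)
saved = savings_per_1k_calls(127, usd_per_million_input_tokens=0.15)  # assumed price

print(f"Dead token rate: {rate:.1%} (127 / 894 tokens)")
print(f"Estimated savings: ~${saved:.2f} per 1K calls")
```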

## Commands

### `analyze` — influence heatmap for a prompt

```bash
promptdebug analyze prompt.txt --query "test query"

# Options
--queries FILE       Text file with one query per line (multi-query mode)
--model MODEL        LLM to use (default: gpt-4o-mini)
--runs N             API calls per prompt variant: baseline and each ablation (default: 3)
--temperature FLOAT  Sampling temperature (default: 0.3)
--format FORMAT      terminal | html | json | csv (default: terminal)
--dead-threshold F   Influence below this is dead (default: 0.10)
--sanity-check       Inject a counterfactual section; warn if not detected
--suggest            Generate LLM rewrite suggestions for dead sections
--dry-run            Estimate cost without making API calls
```
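
The number of API calls behind a `--dry-run` estimate can be approximated from the steps in How It Works: one baseline plus one ablated prompt per section, each sent `--runs` times. The sketch below is a back-of-the-envelope count, not promptdebug's own estimator. The function name is hypothetical, it assumes multi-query mode repeats the full ablation for every query, and it ignores cache hits and any extra calls made by `--sanity-check` or `--suggest`.

```python
def estimate_api_calls(n_sections, runs=3, n_queries=1):
    """Upper bound on model calls for one analysis with an empty cache."""
    variants = n_sections + 1  # the full prompt plus one ablation per section
    return variants * runs * n_queries


print(estimate_api_calls(n_sections=5))               # 18
print(estimate_api_calls(n_sections=5, n_queries=3))  # 54
```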

### `watch` — re-analyze on every file save

```bash
promptdebug watch prompt.txt --query "test query"

# Options
--interval SECONDS   Poll interval in seconds (default: 5)
--threshold FLOAT    Re-print only when dead rate changes by this much
```

### `diff` — compare influence between git revisions

```bash
promptdebug diff prompt.txt --ref HEAD~1 --query "test query"

# Options
--ref REF   Git ref to compare against (default: HEAD~1)
```

### `compare` — side-by-side multi-model comparison

```bash
promptdebug compare prompt.txt --query "test" --models gpt-4o-mini,claude-haiku-4-5
```

### `optimize` — output a cleaned prompt with dead sections removed

```bash
promptdebug optimize prompt.txt --query "test"
```
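
Conceptually, optimizing means keeping every section whose influence meets the dead threshold and dropping the rest. The sketch below shows that filter on plain strings. It is an assumption about the general idea, not the tool's implementation: `strip_dead_sections` is a hypothetical name, the example sections and scores are made up, and joining the kept sections with blank lines is a simplification that may not match how promptdebug preserves formatting.

```python
def strip_dead_sections(sections, influences, dead_threshold=0.10):
    """Return the prompt text with sections below the threshold removed."""
    if len(sections) != len(influences):
        raise ValueError("need exactly one influence score per section")
    kept = [
        text
        for text, score in zip(sections, influences)
        if score >= dead_threshold  # dead means strictly below the threshold
    ]
    return "\n\n".join(kept)


# Made-up sections and scores for illustration.
sections = [
    "You are a support agent for Acme.",
    "Respond in JSON with keys `answer` and `confidence`.",
    "Note: the v1 ticketing system was retired in 2021.",
]
influences = [0.82, 0.44, 0.03]

cleaned = strip_dead_sections(sections, influences)
print(cleaned)
```

Since a single query can understate a section's importance, it is worth reviewing the cleaned prompt before replacing the original.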

## Output Formats

| Format | Flag | Description |
|--------|------|-------------|
| Terminal | `--format terminal` | Rich heatmap (default) |
| HTML | `--format html` | Interactive report, opens in browser |
| JSON | `--format json` | Machine-readable export |
| CSV | `--format csv` | Spreadsheet-friendly export |

## Multi-Query Mode

Single-query analysis can be noisy: a section that looks dead for one query may be critical for another. Multi-query mode runs the ablation across several test queries and aggregates the scores, giving a more stable result that depends less on any single query:

```bash
# queries.txt — one query per line
printf "I want a refund\nMy login is broken\nHow do I cancel?\n" > queries.txt
promptdebug analyze prompt.txt --queries queries.txt
```
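
The aggregation method is not spelled out here, so the sketch below assumes a per-section mean and also reports the per-section maximum, which makes it easy to spot a section that looks dead on average but matters for one query. The function name, the data layout, and the scores are all hypothetical.

```python
from statistics import mean


def aggregate_scores(per_query):
    """Combine per-query influence lists into per-section mean and max."""
    score_lists = list(per_query.values())
    n_sections = len(score_lists[0])
    aggregated = []
    for i in range(n_sections):
        scores = [section_scores[i] for section_scores in score_lists]
        aggregated.append({"mean": mean(scores), "max": max(scores)})
    return aggregated


# Made-up influence scores: one list per query, one entry per section.
per_query = {
    "I want a refund": [0.80, 0.02, 0.45],
    "My login is broken": [0.78, 0.41, 0.05],
    "How do I cancel?": [0.85, 0.03, 0.40],
}

for i, agg in enumerate(aggregate_scores(per_query), start=1):
    print(f"Section {i}: mean={agg['mean']:.2f} max={agg['max']:.2f}")
```

In this toy data, section 2 is nearly dead for two queries but clearly matters for the login query, which is the kind of case single-query analysis would miss.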

## Sanity Check

Before acting on dead-section results, verify the scoring engine is working correctly for your specific prompt and query. The sanity check injects a known-high-influence instruction and confirms it scores above 0.5. If it doesn't, the analysis may be unreliable:

```bash
promptdebug analyze prompt.txt --query "test" --sanity-check
# ✓ Sanity check passed (score: 0.73)
# ⚠ Sanity check failed (score: 0.31) — results may be unreliable for this prompt/query
```
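
The idea behind the check can be sketched as follows: append an instruction that should visibly change any response, run the usual ablation on it, and compare its influence with the 0.5 pass mark. The probe text and helper names below are hypothetical and only illustrate the mechanism; the instruction promptdebug actually injects is not documented here.

```python
SANITY_THRESHOLD = 0.5  # pass mark stated above

# Hypothetical probe: an instruction that should visibly change any output.
PROBE = "Begin every response with the exact word SANITY."


def inject_probe(prompt_text, probe=PROBE):
    """Append the probe as its own section at the end of the prompt."""
    return prompt_text.rstrip() + "\n\n" + probe


def sanity_passed(probe_influence, threshold=SANITY_THRESHOLD):
    """The check passes when ablating the probe moves the output enough."""
    return probe_influence > threshold


probed = inject_probe("You are a helpful assistant.\n")

# Scores taken from the example output above.
for score in (0.73, 0.31):
    status = "passed" if sanity_passed(score) else "failed"
    print(f"Sanity check {status} (score: {score:.2f})")
```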

## Watch Mode

Iterate on your prompt and see the influence change in real time:

```bash
promptdebug watch prompt.txt --query "I want a refund" --interval 10
# Watching prompt.txt (every 10s) ...
# [14:32:07] Change detected — re-analyzing ...
# ...heatmap...
# [14:35:22] Change detected — re-analyzing ...
```
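
A polling watcher of this kind can be sketched with the standard library alone. The version below compares SHA-256 digests of the file contents between polls; whether promptdebug detects changes by content hash, modification time, or some other signal is an assumption here, and all function names are hypothetical.

```python
import hashlib
import tempfile
import time
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file contents, used to detect real edits."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def poll_once(path, last_digest):
    """Return (changed, new_digest) for a single poll."""
    digest = file_digest(path)
    return digest != last_digest, digest


def watch(path, on_change, interval=5.0, max_polls=None):
    """Poll `path` every `interval` seconds; call `on_change` on each edit."""
    last = file_digest(path)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        changed, last = poll_once(path, last)
        if changed:
            on_change(path)
        polls += 1


# Demo on a temporary file, calling poll_once directly so nothing sleeps.
with tempfile.TemporaryDirectory() as tmp:
    prompt = Path(tmp) / "prompt.txt"
    prompt.write_text("You are a helpful assistant.")
    _, digest = poll_once(prompt, None)
    changed_without_edit, digest = poll_once(prompt, digest)
    prompt.write_text("You are a terse assistant.")
    changed_after_edit, digest = poll_once(prompt, digest)

print(changed_without_edit, changed_after_edit)  # False True
```

Hashing contents rather than checking timestamps means a save that changes nothing does not trigger a new round of API calls.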

## Configuration

Create a `.promptdebug.yml` in your project directory (or any parent directory):

```yaml
model: gpt-4o-mini
runs: 3
temperature: 0.3
dead_threshold: 0.10
cache_expire_days: 7
weights:
  semantic: 0.6
  structural: 0.2
  behavioral: 0.2
```

All fields are optional. Defaults are shown above.
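
The lookup and defaulting behavior described above can be sketched like this. It is an illustration, not the tool's code: `find_config` and `merge_with_defaults` are hypothetical names, the nearest file is assumed to win when several parent directories contain one, and the parsed settings are passed in as a plain dict (in practice they would come from something like `yaml.safe_load`).

```python
import tempfile
from pathlib import Path

# Defaults copied from the example configuration above.
DEFAULTS = {
    "model": "gpt-4o-mini",
    "runs": 3,
    "temperature": 0.3,
    "dead_threshold": 0.10,
    "cache_expire_days": 7,
    "weights": {"semantic": 0.6, "structural": 0.2, "behavioral": 0.2},
}


def find_config(start, name=".promptdebug.yml"):
    """Search `start` and then each parent directory; nearest file wins."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def merge_with_defaults(overrides):
    """Overlay user settings on the defaults, merging `weights` key by key."""
    overrides = overrides or {}
    merged = {**DEFAULTS, **overrides}
    merged["weights"] = {**DEFAULTS["weights"], **(overrides.get("weights") or {})}
    return merged


# Demo: a config file two levels above the working directory is still found.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".promptdebug.yml").write_text("runs: 5\n")
    nested = root / "prompts" / "support"
    nested.mkdir(parents=True)
    found = find_config(nested)
    found_name = found.name if found else None

config = merge_with_defaults({"runs": 5})
print(found_name, config["runs"], config["model"])
```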

## Supported Models

Any model supported by [LiteLLM](https://docs.litellm.ai/docs/providers):

- **OpenAI**: gpt-4o, gpt-4o-mini, gpt-4-turbo, ...
- **Anthropic**: claude-sonnet-4-5, claude-haiku-4-5, ...
- **Google**: gemini/gemini-2.0-flash, gemini/gemini-1.5-pro, ...
- **Mistral**: mistral/mistral-large-latest, ...
- **Local**: ollama/llama3, ollama/codellama, ...

## Caching

API responses are cached in a local SQLite database (`.promptdebug_cache.db`) under SHA-256 content-hash keys. Entries expire after 7 days by default (set `cache_expire_days` to change this). Re-running an identical analysis before the cache expires makes no API calls.
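
A content-hash cache with expiry can be sketched with `hashlib` and `sqlite3` from the standard library. The table layout, the fields that go into the key, and the `SketchCache` class are assumptions made for illustration; they are not promptdebug's actual schema or its `Cache` class.

```python
import hashlib
import json
import sqlite3
import time

SECONDS_PER_DAY = 86_400


def cache_key(model, system_prompt, query, temperature, run_index):
    """SHA-256 over a canonical JSON encoding of the request (assumed fields)."""
    payload = json.dumps(
        {
            "model": model,
            "system": system_prompt,
            "query": query,
            "temperature": temperature,
            "run": run_index,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SketchCache:
    """Minimal SQLite-backed response cache with time-based expiry."""

    def __init__(self, path=":memory:", expire_days=7):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, created REAL NOT NULL)"
        )
        self.ttl = expire_days * SECONDS_PER_DAY

    def get(self, key, now=None):
        """Return the cached value, or None if missing or expired."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT value, created FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > self.ttl:
            return None
        return row[0]

    def put(self, key, value, now=None):
        """Store or overwrite a response under its content-hash key."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO responses (key, value, created) VALUES (?, ?, ?)",
            (key, value, now),
        )
        self.conn.commit()


cache = SketchCache()
key = cache_key("gpt-4o-mini", "You are a helpful assistant.", "I want a refund", 0.3, run_index=0)
cache.put(key, "Sure, I can help with that.", now=0)

fresh = cache.get(key, now=1 * SECONDS_PER_DAY)  # within 7 days: hit
stale = cache.get(key, now=8 * SECONDS_PER_DAY)  # past 7 days: miss
print(fresh, stale)
```

Including the run index in the key is one way to keep repeated runs of the same prompt from collapsing into a single cached response; it is a design choice of this sketch.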

## Python API

```python
import asyncio
from promptdebug import (
    run_ablation,
    run_ablation_multi_query,
    run_sanity_check,
    generate_all_suggestions,
    render_terminal,
    LLMProvider,
    Cache,
)

async def main():
    provider = LLMProvider(model="gpt-4o-mini")
    cache = Cache()

    # Single-query ablation
    result = await run_ablation(
        prompt_text="You are a helpful assistant. ...",
        query="Hello, how can you help me?",
        provider=provider,
        cache=cache,
        runs=3,
    )

    render_terminal(result, model="gpt-4o-mini", runs=3)

    # Multi-query ablation (aggregated)
    aggregated, per_query = await run_ablation_multi_query(
        prompt_text="...",
        queries=["query 1", "query 2", "query 3"],
        provider=provider,
        runs=3,
    )

    # Sanity check — validate scoring reliability
    passed, score = await run_sanity_check(
        prompt_text="...",
        query="test query",
        provider=provider,
    )
    print(f"Sanity check: {'passed' if passed else 'FAILED'} (score={score:.2f})")

    # Get rewrite suggestions for dead sections
    suggestions = await generate_all_suggestions(
        section_results=result.sections,
        provider=provider,
        threshold=0.2,
    )
    for section_idx, rewrites in suggestions.items():
        print(f"Section {section_idx} suggestions:")
        for s in rewrites:
            print(f"  → {s}")

asyncio.run(main())
```

## Development

```bash
git clone https://github.com/entropyvector/promptdebug.git
cd promptdebug
pip install -e ".[dev]"

# Run unit tests (762 tests, no API key required)
python -m pytest tests/ --ignore=tests/test_integration.py

# Run integration tests (requires OPENAI_API_KEY)
python -m pytest tests/test_integration.py -v
```

## License

[MIT](LICENSE)

## Third-Party Licenses

See [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md) for a full list of dependencies and their licenses.
