Metadata-Version: 2.4
Name: reval-cli
Version: 1.0.0
Summary: Evaluate eval regressions.
Project-URL: Homepage, https://github.com/calebevans/reval
Project-URL: Repository, https://github.com/calebevans/reval
Project-URL: Issues, https://github.com/calebevans/reval/issues
Author: Caleb Evans
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: analysis,eval,evaluation,llm,regression
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: click
Requires-Dist: gitpython
Requires-Dist: httpx
Requires-Dist: jinja2
Requires-Dist: langchain
Requires-Dist: langchain-litellm
Requires-Dist: litellm[google]>=1.83.0
Requires-Dist: pydantic>=2.13.3
Requires-Dist: pyyaml
Requires-Dist: rich
Requires-Dist: unidiff
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pre-commit; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Description-Content-Type: text/markdown

<div align="center">

# reval

</div>

reval correlates your [Langfuse](https://langfuse.com) eval sessions with your
git history and uses a multi-agent LLM pipeline to pinpoint which code changes
caused which metric regressions. It produces a report with explanations,
evidence, and suggested fixes.

## Installation

From PyPI:

```bash
pip install reval-cli
```

From source:

```bash
git clone https://github.com/calebevans/reval.git
cd reval
pip install .
```

For development (includes pytest, mypy, ruff, pre-commit):

```bash
pip install ".[dev]"
```

Requires Python 3.10+.

## Quick Start

1. Generate a starter config:

```bash
reval init
```

2. Set your Langfuse credentials (or add them to `reval.yaml`):

```bash
export LANGFUSE_BASE_URL="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-..."
export LANGFUSE_SECRET_KEY="sk-..."
```

3. Run an analysis against a Langfuse eval session:

```bash
reval analyze --eval-results <session-id>
```

4. Compare two sessions (current vs. baseline) and correlate regressions with
   code changes:

```bash
reval analyze \
  --eval-results <current-session-id> \
  --eval-baseline <baseline-session-id> \
  --base main
```

## Configuration

reval is configured through a `reval.yaml` file in your project root. Every
field has a sensible default, so the file is optional for simple use cases.

```yaml
langfuse:
  api_url: https://cloud.langfuse.com
  public_key: pk-...
  secret_key: sk-...
  project_id: ""                  # auto-detected if omitted
  current_session_id: ""          # or use --eval-results
  baseline_session_id: ""         # or use --eval-baseline
  publish: false                  # post results back to Langfuse

metrics:
  - name: answer_relevancy
    threshold: 0.05               # flag if score drops by more than this
  - name: faithfulness
    threshold: 0.05

relevance:
  include_patterns: []            # empty = include all non-ignored files
  ignore_patterns:
    - "**/tests/**"
    - "**/__pycache__/**"
    - "*.md"
    - "*.lock"
  category_mappings:
    prompt:
      - "**/prompts/**"
      - "**/*.prompt"
    model_config:
      - "**/config/model*"
      - "**/*llm_config*"
    retrieval:
      - "**/retrieval/**"
      - "**/rag/**"
    tool_definition:
      - "**/tools/**"
      - "**/functions/**"
    output_parsing:
      - "**/parsers/**"
      - "**/schema*"
    eval_config:
      - "**/eval*"

llm:
  model: openai/gpt-4o            # any LiteLLM model identifier
  temperature: 0.2
  max_tokens: 4096
  context_window: null             # override the model's default context window
  diff_model: null                 # use a different model for diff analysis
  eval_model: null                 # use a different model for eval analysis
  synthesis_model: null            # use a different model for synthesis

git:
  base: HEAD                       # base commit ref
  head: working                    # "working" = uncommitted changes
```

### Configuration Sections

**langfuse** - Connection settings for your Langfuse instance. Credentials can
also be set through environment variables (see below). Set `publish: true` to
write analysis results back to Langfuse as comments.

**metrics** - List of metric names and their regression thresholds. A metric is
flagged as regressed when `current_score - baseline_score` falls below
`-threshold`. The threshold defaults to 0.05 when not specified.
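
The flag rule can be sketched in a few lines of Python (illustrative only, not
reval's internal code):

```python
def is_regressed(current_score: float, baseline_score: float,
                 threshold: float = 0.05) -> bool:
    """Flag a metric when its score drops by more than the threshold."""
    return (current_score - baseline_score) < -threshold

print(is_regressed(0.80, 0.88))  # drop of 0.08 exceeds 0.05 -> True
print(is_regressed(0.86, 0.88))  # drop of 0.02 is within 0.05 -> False
```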

**relevance** - Controls which files from the git diff are included in analysis.
Files matching `ignore_patterns` are excluded. If `include_patterns` is
non-empty, only files matching at least one include pattern (and no ignore
pattern) are kept. The `category_mappings` section maps glob patterns to
semantic categories (prompt, model_config, retrieval, etc.) so the analysis
agents understand the role of each changed file.
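
The filtering and categorization logic can be approximated like this (a rough
sketch, not reval's implementation; note that `fnmatch`'s `*` also crosses `/`,
so it only approximates strict `**` glob semantics):

```python
from fnmatch import fnmatch

def is_relevant(path, include_patterns, ignore_patterns):
    # Ignore patterns always win
    if any(fnmatch(path, p) for p in ignore_patterns):
        return False
    # An empty include list means "include everything not ignored"
    if not include_patterns:
        return True
    return any(fnmatch(path, p) for p in include_patterns)

def categorize(path, category_mappings):
    # First matching category wins
    for category, patterns in category_mappings.items():
        if any(fnmatch(path, p) for p in patterns):
            return category
    return "other"

print(is_relevant("src/tests/test_foo.py", [], ["**/tests/**"]))     # False
print(categorize("src/prompts/system.txt", {"prompt": ["**/prompts/**"]}))
```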

**llm** - Model configuration. The `model` field accepts any
[LiteLLM model identifier](https://docs.litellm.ai/docs/providers) (e.g.
`openai/gpt-4o`, `anthropic/claude-sonnet-4-20250514`, `vertex_ai/gemini-2.0-flash`).
You can assign different models to each analysis agent using `diff_model`,
`eval_model`, and `synthesis_model`.

**git** - The commit refs to diff. Set `head` to `working` to diff uncommitted
changes against `base`, or set both to commit SHAs/branch names.

## Environment Variables

Langfuse credentials can be provided through environment variables instead of
(or in addition to) `reval.yaml`. A non-empty config field takes precedence; an
environment variable is used only when the corresponding field is left empty.

| Variable | Config equivalent | Description |
|---|---|---|
| `LANGFUSE_BASE_URL` | `langfuse.api_url` | Langfuse API URL |
| `LANGFUSE_PUBLIC_KEY` | `langfuse.public_key` | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | `langfuse.secret_key` | Langfuse secret key |
| `LANGFUSE_PROJECT_ID` | `langfuse.project_id` | Langfuse project ID (auto-detected if omitted) |
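
The fallback order can be sketched as follows (a simplified stand-in for
illustration, not reval's actual config loader):

```python
import os

def resolve(config_value: str, env_name: str) -> str:
    # A non-empty config field wins; an empty one falls back to the environment
    return config_value or os.environ.get(env_name, "")

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-from-env"
print(resolve("", "LANGFUSE_PUBLIC_KEY"))                # pk-from-env
print(resolve("pk-from-config", "LANGFUSE_PUBLIC_KEY"))  # pk-from-config
```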

## CLI Reference

### `reval init`

Generate a starter `reval.yaml` with interactive prompts.

```bash
reval init [--output PATH]
```

| Option | Default | Description |
|---|---|---|
| `--output` | `reval.yaml` | Path for the generated config file |

### `reval analyze`

Run the analysis pipeline. This is the main command.

```bash
reval analyze [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--eval-results` | | Langfuse session ID for the current eval run (required) |
| `--eval-baseline` | | Langfuse session ID for the baseline run (omit for single-session mode) |
| `--base` | From config or `HEAD` | Base commit ref |
| `--head` | From config or `working` | Head ref (`working` for uncommitted changes) |
| `--config` | `reval.yaml` | Path to config file |
| `--output` | `terminal` | Output format: `terminal`, `json`, or `markdown` |
| `--output-file` | | Write the report to a file instead of stdout |
| `--threshold` | `0.05` | Global regression threshold (overrides per-metric config) |
| `--model` | From config | LLM model to use (overrides config) |
| `--publish / --no-publish` | From config | Publish results back to Langfuse |
| `--verbose` | `false` | Show debug information |

### `reval report`

Re-render a previously saved JSON report in a different format.

```bash
reval report REPORT_FILE [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--output` | `terminal` | Output format: `terminal`, `json`, or `markdown` |
| `--output-file` | | Write the report to a file instead of stdout |

Example: save a JSON report, then render it as markdown later:

```bash
reval analyze --eval-results sess-123 --output json --output-file report.json
reval report report.json --output markdown
```

## Analysis Modes

### Compare mode

Activated when you provide both `--eval-results` and `--eval-baseline`. reval
fetches both sessions from Langfuse, diffs the git history between `--base` and
`--head`, and runs three agents:

1. **Diff agent** examines code changes in isolation and forms hypotheses about
   their potential eval impact.
2. **Eval agent** investigates each regressed test case by comparing outputs,
   scores, and evaluator reasoning between current and baseline runs.
3. **Synthesis agent** correlates the diff and eval findings into a final report
   with explanations and suggested fixes.
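
Conceptually, the three stages compose like this (the agent functions below are
hypothetical stand-ins to show the data flow, not reval's actual API):

```python
def diff_agent(changed_files):
    # Stage 1: hypotheses about eval impact, from the code changes alone
    return [f"hypothesis: {f} may affect output quality" for f in changed_files]

def eval_agent(regressions):
    # Stage 2: findings from comparing current vs. baseline test cases
    return [f"finding: {name} dropped by {delta:.2f}" for name, delta in regressions]

def synthesis_agent(hypotheses, findings):
    # Stage 3: correlate both sets of evidence into a final report
    return {"hypotheses": hypotheses, "findings": findings}

report = synthesis_agent(
    diff_agent(["prompts/system.txt"]),
    eval_agent([("answer_relevancy", -0.08)]),
)
print(report["findings"])
```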

### Single-session mode

Activated when you omit `--eval-baseline`. reval analyzes a single eval session
without a baseline comparison. It loads source files matching your relevance
patterns, runs the eval agent on any test cases that fall below threshold, and
produces findings about what may be going wrong.

## Output Formats

| Format | Flag | Description |
|---|---|---|
| Terminal | `--output terminal` | Rich tables and panels with color-coded diffs (default) |
| JSON | `--output json` | Machine-readable output, can be re-rendered with `reval report` |
| Markdown | `--output markdown` | Tables and fenced diff blocks, suitable for PRs or documentation |

All formats can be written to a file with `--output-file PATH`.

## Publishing to Langfuse

When `--publish` is passed (or `langfuse.publish` is set to `true` in config),
reval posts its analysis results back to Langfuse:

- A **session comment** with the full markdown report is added to the current
  session.
- A **trace comment** with relevant findings is added to each failed trace.

This makes it easy to review reval's analysis directly in the Langfuse UI
alongside your eval results.
