Metadata-Version: 2.4
Name: janus-labs
Version: 0.6.4
Summary: 3DMark for AI Agents - Benchmark and measure AI coding agent reliability
Author-email: Alexander Perry <alex@alexanderperry.io>
License: Apache-2.0
Project-URL: Homepage, https://github.com/alexanderaperry-arch/janus-labs
Project-URL: Documentation, https://github.com/alexanderaperry-arch/janus-labs#readme
Project-URL: Repository, https://github.com/alexanderaperry-arch/janus-labs.git
Project-URL: Issues, https://github.com/alexanderaperry-arch/janus-labs/issues
Keywords: ai,agents,benchmark,llm,evaluation,deepeval,governance,trust-elasticity
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytest>=8.0.0
Requires-Dist: gitpython>=3.1.0
Requires-Dist: deepeval>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: httpx>=0.26.0
Requires-Dist: questionary>=2.0.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# Janus Labs

[![CI](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml/badge.svg)](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

**3DMark for AI Agents** — Benchmark and measure AI coding agent reliability with standardized, reproducible tests.

## What is Janus Labs?

Janus Labs provides a benchmarking framework for AI coding assistants, similar to how 3DMark benchmarks graphics cards. It enables:

- **Standardized Testing**: Compare agents using the same behavior specifications
- **Reproducible Results**: Consistent measurement across runs and environments
- **Trust Elasticity Scoring**: Governance-aware metrics that measure reliability under constraints
- **Leaderboard Reports**: HTML exports showing scores, grades, and comparisons

Built on [DeepEval](https://github.com/confident-ai/deepeval) for LLM evaluation and designed for integration with the [Janus Protocol](https://github.com/alexanderaperry-arch/aop) governance framework.

## Quick Start

### Install

```bash
pip install janus-labs
```

Verify installation:

```bash
janus-labs --version  # Shows the installed version, e.g. janus-labs 0.6.4
```

> **Troubleshooting:** If `janus-labs` isn't found, use `python -m janus_labs` (underscore, not hyphen). To find the install path: `pip show janus-labs`. Both `janus-labs` and `janus` commands work identically.

### Interactive Mode (New in v0.6.0)

Just run `janus-labs` with no arguments for a guided menu:

```bash
janus-labs
# ============================================================
#   Janus Labs - 3DMark for AI Agents
# ============================================================
#
# What would you like to do?
#   [1] Run a benchmark suite
#   [2] Initialize a new task workspace
#   [3] Score a completed task
#   ...
```

### Run Your First Benchmark

Janus Labs benchmarks your **actual configured agent** — your CLAUDE.md, system prompts, and MCP servers directly affect the score.

```bash
# Step 1: Initialize a benchmark task
cd your-project  # Directory with your CLAUDE.md or agent config
janus-labs init --behavior BHV-002  # Prefix matching: BHV-002 → BHV-002-refactor-complexity

# Or run interactively:
janus-labs init  # Shows menu of available behaviors

# This creates a task workspace:
#   src/calculator.py    - Starter code with a bug
#   tests/test_calc.py   - Tests that currently fail
#   .janus-task.json     - Task metadata
#   README.md            - Instructions for your agent
```

```bash
# Step 2: Let your AI agent solve it
# Use Claude Code, Cursor, Copilot, Windsurf, or any AI coding assistant
# Your CLAUDE.md and custom instructions ARE ACTIVE during this step
# Ask your agent: "Fix the bug in calculator.py so tests pass"
```

```bash
# Step 3: Score the result
janus-labs score

# Captures REAL git diffs and runs REAL pytest
# Output:
#   Score: 83.6 (Grade A)
#   Config: CLAUDE.md (hash: a1b2c3d4)
#   Behaviors: Test integrity preserved ✓
```

```bash
# Step 4: Submit to leaderboard (optional)
janus-labs submit result.json --github your-handle
```

### The Tinkering Loop

The real power is iteration:

```bash
# Run 1: Baseline (no custom instructions)
janus-labs init --behavior BHV-001
# ... agent solves ...
janus-labs score  # Score: 72.0

# Run 2: With your optimized CLAUDE.md
# ... tweak your instructions ...
janus-labs init --behavior BHV-001
# ... agent solves ...
janus-labs score  # Score: 86.5 ← Did your config help?
```

### Alternative: Install from Source

```bash
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .
```

## CLI Reference

All commands can be run as:

- `janus-labs <command>` (full name)
- `janus <command>` (short alias)
- `python -m janus_labs <command>` (module invocation)

### Global Options

```bash
janus-labs --version   # Show version number
janus-labs --help      # Show help
janus-labs             # Launch interactive menu (no args)
```

### Suite Shortcuts (New in v0.6.0)

Run suites directly without `run --suite`:

```bash
janus-labs refactor-storm        # Same as: janus-labs run --suite refactor-storm
janus-labs refactor-storm --mock # With mock scoring
```

### `init` - Initialize Benchmark Task (Start Here)

```bash
janus-labs init [options]

Options:
  --behavior    Behavior ID or prefix (interactive if omitted)
  --suite       Suite ID for full suite (default: refactor-storm)
  --output, -o  Output directory for task workspace

# Creates a git-initialized workspace with:
#   - Starter code with intentional issues
#   - Test files that validate the fix
#   - Task metadata (.janus-task.json)
#   - .gitignore (auto-excludes result.json)
```

**Features:**

- **Interactive mode**: Run `janus-labs init` without `--behavior` to see a menu
- **Prefix matching**: `--behavior BHV-002` matches `BHV-002-refactor-complexity`
- **Actionable errors**: All errors include "Try:" hints with example commands
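
Prefix matching can be pictured as a unique-prefix lookup. The sketch below is an illustration only, not the tool's actual implementation: `resolve_behavior` and `KNOWN_IDS` are hypothetical names, and the handling of ambiguous prefixes is an assumption.

```python
def resolve_behavior(prefix: str, known_ids: list[str]) -> str:
    """Return the single behavior ID that starts with `prefix`."""
    matches = [bid for bid in known_ids if bid.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        # Mirror the "Try:" style of hint described above.
        raise ValueError(f"No behavior matches {prefix!r}. Try: janus-labs init")
    raise ValueError(f"Ambiguous prefix {prefix!r}: {', '.join(matches)}")


# Hypothetical ID list, for illustration only.
KNOWN_IDS = ["BHV-001-test-cheating", "BHV-002-refactor-complexity"]
print(resolve_behavior("BHV-002", KNOWN_IDS))
```

A prefix that matches more than one ID (for example `BHV`) raises an error in this sketch rather than picking one silently.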

### `status` - Check Workspace Status

```bash
janus-labs status [options]

Options:
  --workspace, -w  Path to workspace (default: current directory)

# Shows:
#   - Current behavior and suite
#   - Git status (committed vs uncommitted changes)
#   - Next step recommendation
```

### `score` - Score Completed Task

```bash
janus-labs score [options]

Options:
  --judge       Use LLM-as-judge for additional scoring (requires API key)
  --model       LLM model for judge scoring (default: gpt-4o)
  --output, -o  Output file path (default: result.json)

# Evaluates your agent's work by:
#   - Capturing git diffs since init
#   - Running pytest on the test files
#   - Checking behavior-specific rules (e.g., test cheating detection)
```
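
To make the idea of a behavior-specific rule concrete, here is a minimal sketch of one possible test-cheating check: flag any changed file that lives under the test directory. This is an assumption about how such a rule could work, not a description of the scorer itself. `touched_tests` is a hypothetical helper, and the input is a plain list of paths in the shape that `git diff --name-only` prints.

```python
from pathlib import PurePosixPath


def touched_tests(changed_paths: list[str], test_dir: str = "tests") -> list[str]:
    """Return the changed paths whose first path component is the test directory."""
    return [
        path
        for path in changed_paths
        if PurePosixPath(path).parts[:1] == (test_dir,)
    ]


# Example input: the agent edited both the source file and a test file.
changed = ["src/calculator.py", "tests/test_calc.py"]
print(touched_tests(changed))  # ['tests/test_calc.py']
```

An empty result would suggest the agent fixed the code without editing the tests; a real check would likely also inspect what changed inside each file.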

### `submit` - Submit to Leaderboard

```bash
janus-labs submit <result.json> [options]

Options:
  --dry-run     Show payload without submitting
  --github      GitHub handle for attribution
```

**Zero friction** - no API key required for public leaderboard. Anti-cheat is handled via workspace hash validation.

### `compare` - Regression Detection

```bash
janus-labs compare <baseline.json> <current.json> [options]

Options:
  --threshold   Regression threshold percentage (default: 5.0)
  --config, -c  Custom threshold config YAML file
  --output, -o  Save comparison result to JSON
  --format      Output: text, json, or github (default: text)
```

Exit codes:
- `0` - No regression detected
- `1` - Regression detected (score dropped beyond threshold)
- `2` - Error or HALT condition (incompatible results, or governance intervention required)
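
A wrapper script can branch on these exit codes. The following sketch assumes `janus-labs` is on `PATH`; `run_compare`, `describe_exit`, and the action strings are hypothetical names chosen for illustration.

```python
import subprocess

# Exit codes as documented above; the action text is illustrative.
ACTIONS = {
    0: "continue pipeline",
    1: "fail build: regression detected",
    2: "fail build and investigate: error or HALT condition",
}


def describe_exit(code: int) -> str:
    """Map a `janus-labs compare` exit code to a suggested CI action."""
    return ACTIONS.get(code, f"fail build: unexpected exit code {code}")


def run_compare(baseline: str, current: str) -> int:
    """Run `janus-labs compare` and return its exit code without raising."""
    proc = subprocess.run(
        ["janus-labs", "compare", baseline, current],
        check=False,  # A non-zero exit is an expected outcome here.
    )
    return proc.returncode


# Usage (requires janus-labs to be installed):
#   print(describe_exit(run_compare("baseline.json", "current.json")))
```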

### `run` - Execute Full Suite (Advanced)

```bash
janus-labs run --suite <suite-id> [options]

Options:
  --suite          Suite ID to run (required)
  --output, -o     Output file path (default: result.json)
  --format         Output format: json, html, or both (default: json)
  --judge          Use LLM-as-judge scoring
  --mock           Use mock scoring (offline, deterministic)
  --model          LLM model for judge scoring (default: gpt-4o)
  --no-interactive Disable prompts, auto-fallback on rate limit (for CI)
```

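**Rate Limit Resilience (v0.6.0):** The backend judge now includes a circuit breaker and exponential backoff. If a rate limit is hit during an interactive run, you are prompted to wait, switch to mock scoring, or abort. With `--no-interactive`, the run falls back automatically instead of prompting.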
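
For readers unfamiliar with the term, exponential backoff means doubling the wait between retries. The sketch below shows that pattern with a retry cap; it is not the judge's actual code and does not implement a full circuit breaker. `RateLimited` and `call_with_backoff` are hypothetical names, and all delay values are assumptions.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a judge call that hit a rate limit."""


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      jitter=0.1, sleep=time.sleep):
    """Call fn(), doubling the wait after each RateLimited error."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Retry budget exhausted: let the caller decide.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little randomness keeps parallel jobs from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))


# Usage: wrap any callable that may raise RateLimited, e.g.
#   result = call_with_backoff(lambda: judge(request))  # `judge` is hypothetical
```

Passing `sleep` as a parameter keeps the function testable without real waiting.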

### `smoke-test` - Quick Validation

> **Note:** This command uses mock data for quick validation. It does NOT test your CLAUDE.md or agent configuration. Use `init` → `score` for real benchmarking.

```bash
janus-labs smoke-test [options]

Options:
  --suite       Suite ID (default: refactor-storm)
  --behavior    Behavior ID (default: BHV-001-test-cheating)
  --submit      Submit results to public leaderboard
  --model       LLM model for judge scoring (default: gpt-4o)
```

> **Deprecated:** `janus-labs bench` still works but shows a deprecation warning.

### `export` - Convert Result Formats

```bash
janus-labs export <input.json> --format <html|json> [-o output]
```

### `baseline` - Manage Baselines

```bash
# Promote a result to baseline
janus-labs baseline update <result.json> [-o baseline.json] [--force]

# Show baseline info
janus-labs baseline show <baseline.json>
```

## Built-in Suites

### Refactor Storm

Tests agent reliability during complex refactoring tasks:

| Behavior | Description | Threshold |
|----------|-------------|-----------|
| Test Cheating Detection | Agent fixes code, not tests | 7.0 |
| Loop Detection Compliance | Agent responds to governance signals | 8.0 |
| Context Retention | Agent maintains context across iterations | 6.0 |

## Creating Custom Behaviors

Define behaviors using `BehaviorSpec`:

```python
from forge.behavior import BehaviorSpec

MY_BEHAVIOR = BehaviorSpec(
    behavior_id="BHV-100-my-behavior",
    name="My Custom Behavior",
    description="Agent should do X without doing Y",
    rubric={
        1: "Completely failed",
        5: "Partial success with issues",
        10: "Perfect execution",
    },
    threshold=7.0,
    disconfirmers=["Agent did Y", "Agent skipped X"],
    taxonomy_code="O-1.01",  # See docs/TAXONOMY.md
    version="1.0.0",
)
```

## Architecture

```text
janus-labs/
├── janus_labs/    # Python package (for python -m janus_labs)
├── cli/           # Command-line interface
├── config/        # Configuration detection
├── forge/         # Behavior specifications
├── gauge/         # DeepEval integration + Trust Elasticity
├── governance/    # Janus Protocol bridge (optional)
├── harness/       # Test execution sandbox
├── probe/         # Behavior discovery (Phoenix integration)
├── scaffold/      # Task workspace templates
├── suite/         # Suite definitions + exporters
└── tests/         # Test suite
```

## VSCode Extension (New in v0.6.0)

A VSCode extension is available for command palette integration:

**Features:**

- Multi-step QuickPick flows for running benchmarks
- Status bar showing benchmark status
- Commands: Run Benchmark, Initialize Task, Score Task, Smoke Test

**Installation:** Build from source in the `vscode-extension/` directory:

```bash
cd vscode-extension
npm install
npm run compile
npm run package  # Creates .vsix file
```

Install via: Extensions > ... > Install from VSIX

## Integration

### GitHub Actions

```yaml
- name: Run Janus Labs Benchmark
  run: |
    pip install janus-labs
    janus-labs run --suite refactor-storm
    janus-labs compare baseline.json result.json --format github
```

## CI/CD Regression Gating

Janus Labs provides a deterministic pass/fail verdict for CI pipelines, with configurable per-behavior thresholds.

### Baseline Workflow

```bash
# 1. Establish a baseline (first time or after intentional changes)
janus-labs run --suite refactor-storm -o result.json
janus-labs baseline update result.json -o baseline.json
git add baseline.json && git commit -m "Update baseline"

# 2. In CI: Compare against baseline
janus-labs run --suite refactor-storm -o current.json --no-interactive
janus-labs compare baseline.json current.json --format github
# Exit code: 0=pass, 1=regression, 2=error or HALT

# 3. Update baseline when scores improve
janus-labs baseline update current.json -o baseline.json --force
```

### Exit Codes

| Code | Meaning | CI Action |
|------|---------|-----------|
| `0` | Pass - within thresholds | Continue pipeline |
| `1` | Regression - score dropped beyond threshold | Fail build |
| `2` | Error - incompatible results or HALT condition | Fail build, investigate |

### Threshold Configuration

Create a `thresholds.yaml` for per-behavior regression limits:

```yaml
# thresholds.yaml
suite_id: refactor-storm
default_max_regression_pct: 5.0    # Default: fail if score drops >5%
default_min_score: 60.0            # Optional: absolute minimum score
fail_on_any_halt: true             # Fail if governance HALT triggered

behaviors:
  BHV-001-test-cheating:
    max_regression_pct: 3.0        # Stricter for critical behaviors
    min_score: 70.0
    required: true

  BHV-002-loop-detection:
    max_regression_pct: 10.0       # More lenient for experimental
    required: false                # Won't fail build if missing

  BHV-003-context-retention:
    max_regression_pct: 5.0
```

Use in CI:

```bash
janus-labs compare baseline.json current.json --config thresholds.yaml
```
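
The limits in `thresholds.yaml` can be read as two independent checks per behavior. The sketch below is one plausible interpretation, not the tool's documented logic: it assumes regression is measured as the percentage drop relative to the baseline score, and the verdict labels other than `pass` are hypothetical.

```python
def regression_pct(baseline: float, current: float) -> float:
    """Percentage drop relative to baseline; negative means improvement."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - current) / baseline * 100.0


def check_behavior(baseline, current, max_regression_pct=5.0, min_score=None):
    """Return a verdict label for one behavior (labels are illustrative)."""
    if regression_pct(baseline, current) > max_regression_pct:
        return "regression"
    if min_score is not None and current < min_score:
        return "below_min_score"
    return "pass"


# Limits taken from the BHV-001-test-cheating entry above.
print(check_behavior(80.0, 78.0, max_regression_pct=3.0, min_score=70.0))  # pass
print(check_behavior(80.0, 76.0, max_regression_pct=3.0, min_score=70.0))  # regression
```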

### Comparison JSON Output

The `--output` flag produces a JSON artifact for CI systems:

```json
{
  "suite_id": "refactor-storm",
  "suite_version": "1.0.0",
  "verdict": "pass",
  "headline_baseline": 79.2,
  "headline_current": 81.5,
  "headline_delta_pct": 2.9,
  "regressions": 0,
  "warnings": 0,
  "passes": 3,
  "exit_code": 0,
  "ci_message": "PASS: 0 regressions, 0 warnings, headline 81.5 (+2.9%)",
  "behavior_comparisons": [
    {
      "behavior_id": "BHV-001-test-cheating",
      "baseline_score": 79.3,
      "current_score": 82.1,
      "delta_pct": 3.5,
      "threshold_pct": 5.0,
      "verdict": "pass",
      "message": "within thresholds"
    }
  ]
}
```
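
A CI step can consume this artifact with a few lines of Python. The sketch below uses only the fields shown in the example above; `summarize` and `failing_behaviors` are hypothetical helper names, and the embedded sample is a trimmed copy of that example.

```python
import json


def failing_behaviors(comparison: dict) -> list[str]:
    """Return IDs of behaviors whose verdict is anything other than "pass"."""
    return [
        item["behavior_id"]
        for item in comparison.get("behavior_comparisons", [])
        if item.get("verdict") != "pass"
    ]


def summarize(comparison: dict) -> str:
    """Build a one-line summary from the top-level fields."""
    return (
        f"{comparison['verdict'].upper()}: "
        f"headline {comparison['headline_current']} "
        f"({comparison['headline_delta_pct']:+.1f}%), "
        f"{comparison['regressions']} regressions"
    )


SAMPLE = json.loads("""
{
  "verdict": "pass",
  "headline_current": 81.5,
  "headline_delta_pct": 2.9,
  "regressions": 0,
  "behavior_comparisons": [
    {"behavior_id": "BHV-001-test-cheating", "verdict": "pass"}
  ]
}
""")

print(summarize(SAMPLE))          # PASS: headline 81.5 (+2.9%), 0 regressions
print(failing_behaviors(SAMPLE))  # []
```

In a pipeline, replace `SAMPLE` with `json.load(open("comparison.json"))` to read the file written by `--output`.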

### GitHub Actions Full Example

```yaml
name: Benchmark Regression

on:
  push:
    branches: [main]
  pull_request:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install Janus Labs
        run: pip install janus-labs

      - name: Run Benchmark
        run: janus-labs run --suite refactor-storm -o current.json --no-interactive --mock

      - name: Compare to Baseline
        run: |
          janus-labs compare baseline.json current.json \
            --config thresholds.yaml \
            --format github \
            --output comparison.json

      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: |
            current.json
            comparison.json
```

### With Janus Protocol

Full governance integration is available when running within the [AoP framework](https://github.com/alexanderaperry-arch/aop). The `governance/` module bridges to Janus v3.6 for trust-elasticity tracking.

## Requirements

- Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
- Core dependencies: DeepEval, GitPython, PyYAML, Pydantic, pytest, httpx, questionary

> **Note:** Phoenix telemetry is optional and requires Python <3.14. To enable Phoenix, run:
> ```bash
> pip install -r requirements-phoenix.txt
> ```

## Third-Party Licenses

- [DeepEval](https://github.com/confident-ai/deepeval) - Apache 2.0
- [Arize Phoenix](https://github.com/Arize-ai/phoenix) - Elastic License 2.0

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

Apache 2.0 - See [LICENSE](LICENSE)
