Metadata-Version: 2.4
Name: janus-labs
Version: 0.1.1
Summary: 3DMark for AI Agents - Benchmark and measure AI coding agent reliability
Author-email: Alexander Perry <alex@alexanderperry.io>
License: Apache-2.0
Project-URL: Homepage, https://github.com/alexanderaperry-arch/janus-labs
Project-URL: Documentation, https://github.com/alexanderaperry-arch/janus-labs#readme
Project-URL: Repository, https://github.com/alexanderaperry-arch/janus-labs.git
Project-URL: Issues, https://github.com/alexanderaperry-arch/janus-labs/issues
Keywords: ai,agents,benchmark,llm,evaluation,deepeval,governance,trust-elasticity
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytest>=8.0.0
Requires-Dist: gitpython>=3.1.0
Requires-Dist: deepeval>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: httpx>=0.26.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# Janus Labs

[![CI](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml/badge.svg)](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

**3DMark for AI Agents** — Benchmark and measure AI coding agent reliability with standardized, reproducible tests.

## What is Janus Labs?

Janus Labs provides a benchmarking framework for AI coding agents, similar to how 3DMark benchmarks graphics cards. It enables:

- **Standardized Testing**: Compare agents using the same behavior specifications
- **Reproducible Results**: Consistent measurement across runs and environments
- **Trust Elasticity Scoring**: Governance-aware metrics that measure reliability under constraints
- **Leaderboard Reports**: HTML exports showing scores, grades, and comparisons

Built on [DeepEval](https://github.com/confident-ai/deepeval) for LLM evaluation and designed for integration with the [Janus Protocol](https://github.com/alexanderaperry-arch/aop) governance framework.

## Quick Start (5 minutes)

### 1. Install

```bash
# Clone the repository
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Run Your First Benchmark

```bash
# Run the built-in "refactor-storm" suite
python -m cli run --suite refactor-storm --format both

# This creates:
#   result.json  - Machine-readable results
#   result.html  - Visual leaderboard report
```

### 3. View Results

Open `result.html` in your browser to see:
- Headline score (0-100) with letter grade (S/A/B/C/D/F)
- Per-behavior breakdown
- Configuration badge showing default vs custom agent config
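
If you'd rather consume the results programmatically, `result.json` carries the same data in machine-readable form. A minimal sketch of reading it (the `score` and `grade` field names here are assumptions for illustration, not a documented schema):

```python
import json

# Load the machine-readable benchmark output.
with open("result.json") as f:
    result = json.load(f)

# NOTE: "score" and "grade" are assumed field names; inspect your own
# result.json for the actual layout.
print(f"Headline score: {result.get('score')}")
print(f"Grade: {result.get('grade')}")
```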

### 4. Compare Runs

```bash
# Save current result as baseline
python -m cli baseline update result.json --output baseline.json

# Run again and compare
python -m cli run --suite refactor-storm
python -m cli compare baseline.json result.json
```

## CLI Reference

### `run` - Execute a Benchmark Suite

```bash
python -m cli run --suite <suite-id> [options]

Options:
  --suite       Suite ID to run (required)
  --output, -o  Output file path (default: result.json)
  --format      Output format: json, html, or both (default: json)
```

### `compare` - Regression Detection

```bash
python -m cli compare <baseline.json> <current.json> [options]

Options:
  --threshold   Regression threshold percentage (default: 5.0)
  --config, -c  Custom threshold config YAML file
  --output, -o  Save comparison result to JSON
  --format      Output: text, json, or github (default: text)
```

Exit codes:
- `0` - No regression detected
- `1` - Regression detected (score dropped beyond threshold)
- `2` - HALT condition (governance intervention required)
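
These exit codes make `compare` straightforward to gate on in scripts. A minimal sketch of such a wrapper (illustrative, not part of the package):

```python
import subprocess
import sys

# compare exits 0 (no regression), 1 (regression), or 2 (HALT).
proc = subprocess.run(
    [sys.executable, "-m", "cli", "compare", "baseline.json", "result.json"]
)

if proc.returncode == 0:
    print("No regression detected")
elif proc.returncode == 1:
    print("Regression detected: failing the pipeline")
else:
    print("HALT condition: governance intervention required")

sys.exit(proc.returncode)  # propagate the result to CI
```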

### `export` - Convert Result Formats

```bash
python -m cli export <input.json> --format <html|json> [-o output]
```

### `baseline` - Manage Baselines

```bash
# Promote a result to baseline
python -m cli baseline update <result.json> [-o baseline.json] [--force]

# Show baseline info
python -m cli baseline show <baseline.json>
```

## Built-in Suites

### Refactor Storm

Tests agent reliability during complex refactoring tasks:

| Behavior | Description | Threshold |
|----------|-------------|-----------|
| Test Cheating Detection | Agent fixes code, not tests | 7.0 |
| Loop Detection Compliance | Agent responds to governance signals | 8.0 |
| Context Retention | Agent maintains context across iterations | 6.0 |
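
Thresholds apply to the 1-10 rubric scale used by behavior specs; a run passes a behavior when its score meets or exceeds the listed threshold (see Creating Custom Behaviors below).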

## Creating Custom Behaviors

Define behaviors using `BehaviorSpec`:

```python
from forge.behavior import BehaviorSpec

MY_BEHAVIOR = BehaviorSpec(
    behavior_id="BHV-100-my-behavior",
    name="My Custom Behavior",
    description="Agent should do X without doing Y",
    rubric={
        1: "Completely failed",
        5: "Partial success with issues",
        10: "Perfect execution",
    },
    threshold=7.0,
    disconfirmers=["Agent did Y", "Agent skipped X"],
    taxonomy_code="O-1.01",  # See docs/TAXONOMY.md
    version="1.0.0",
)
```
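
As noted above, a score passes when it clears `threshold` on the 1-10 rubric scale. A minimal sketch of that gate (the `passes` helper below is illustrative, not part of the `forge` API):

```python
from forge.behavior import BehaviorSpec

def passes(spec: BehaviorSpec, score: float) -> bool:
    """Return True when a rubric score clears the behavior's threshold."""
    return score >= spec.threshold

# Using MY_BEHAVIOR from the example above (threshold=7.0):
print(passes(MY_BEHAVIOR, 7.5))  # True
print(passes(MY_BEHAVIOR, 6.5))  # False
```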

## Architecture

```
janus-labs/
├── cli/           # Command-line interface
├── config/        # Configuration detection
├── forge/         # Behavior specifications
├── gauge/         # DeepEval integration + Trust Elasticity
├── governance/    # Janus Protocol bridge (optional)
├── harness/       # Test execution sandbox
├── probe/         # Behavior discovery (Phoenix integration)
├── suite/         # Suite definitions + exporters
└── tests/         # Test suite (67 tests)
```

## Integration

### GitHub Actions

```yaml
- name: Run Janus Labs Benchmark
  run: |
    python -m cli run --suite refactor-storm
    python -m cli compare baseline.json result.json --format github
```
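
Because `compare` exits `1` on regression and `2` on a HALT condition, the step above fails the workflow automatically when the score drops beyond the threshold. Promote a known-good result with `baseline update` before enabling the gate.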

### With Janus Protocol

Full governance integration is available when running within the [AoP framework](https://github.com/alexanderaperry-arch/aop). The `governance/` module bridges to Janus v3.6 for trust-elasticity tracking.

## Requirements

- Python 3.12+ (3.12–3.13 recommended, 3.14 supported)
- Core dependencies: DeepEval, GitPython, PyYAML, Pydantic

> **Note:** Phoenix telemetry is optional and requires Python <3.14. To enable Phoenix, run:
> ```bash
> pip install -r requirements-phoenix.txt
> ```

## Third-Party Licenses

- [DeepEval](https://github.com/confident-ai/deepeval) - Apache 2.0
- [Arize Phoenix](https://github.com/Arize-ai/phoenix) - Elastic License 2.0

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

Apache 2.0 - See [LICENSE](LICENSE)
