Metadata-Version: 2.4
Name: janus-labs
Version: 1.1.6
Summary: 3DMark for AI Agents - Profile AI coding agent capabilities across code quality, error resilience, and instruction resilience
Author-email: Alexander Perry <alex@alexanderperry.io>
License: Apache-2.0
Project-URL: Homepage, https://github.com/alexanderaperry-arch/janus-labs
Project-URL: Documentation, https://github.com/alexanderaperry-arch/janus-labs#readme
Project-URL: Repository, https://github.com/alexanderaperry-arch/janus-labs.git
Project-URL: Issues, https://github.com/alexanderaperry-arch/janus-labs/issues
Keywords: ai,agents,benchmark,capability-profiling,llm,evaluation,deepeval,radar-chart,agent-comparison
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytest>=8.0.0
Requires-Dist: gitpython>=3.1.0
Requires-Dist: deepeval>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: httpx>=0.26.0
Requires-Dist: questionary>=2.0.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# Janus Labs

[![CI](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml/badge.svg)](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml)
[![Baselines](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/baseline.yml/badge.svg)](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/baseline.yml)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

**3DMark for AI Agents**. Profile AI coding agents across two active axes: Code Quality and Error Resilience. Measure Instruction Resilience separately with `janus-labs diagnose` once you have both configured and vanilla results to compare.

## What Janus Labs Does

Janus Labs benchmarks real coding-agent behavior on reproducible tasks. Instead of reducing everything to one opaque number, it runs a fixed suite of coding tasks and produces a capability profile that shows where an agent is strong, weak, or uneven.

- `refactor-storm` ships 4 built-in behaviors grouped into 2 active axes
- `janus-labs run` is the primary workflow for generating a full suite result
- Backend-hosted judging is available without a local API key, and `--mock` supports offline dry runs
- Results can be compared against bundled baselines and submitted to a public leaderboard

Public leaderboard: <https://fulfilling-courtesy-production-9c2c.up.railway.app>

## Quick Start

### 1. Install

```bash
pip install janus-labs
janus-labs doctor
```

`doctor` checks your Python version, dependencies, API keys, and which agent CLIs are on your PATH. On Windows, use `python -m janus_labs` if `janus-labs` is not on your `PATH`.

### 2. Benchmark Your Agent

The primary workflow: run the full suite with your agent, then submit.

```bash
# Benchmark Codex (or claude, gemini, copilot)
janus-labs run --full --agent codex --suite refactor-storm -o result.json

# Custom agent command (any CLI that accepts a prompt)
janus-labs run --full --agent-cmd "my-agent --prompt {prompt_file}" --suite refactor-storm -o result.json

# Submit to the public leaderboard
janus-labs submit result.json --github your-handle
```

This initializes 4 behavior workspaces, runs your agent on each, scores the outcomes, and produces a single result file. Built-in agent presets: `codex`, `claude`, `gemini`, `copilot`.
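
If your agent needs more time per task, or you want an HTML report alongside the JSON, the flags documented in the CLI reference below combine freely. A sketch with hypothetical values:

```bash
# Give each behavior up to 10 minutes and write both JSON and HTML reports
janus-labs run --full --agent claude --suite refactor-storm --timeout 600 --format both -o result.json
```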

### 3. Try It Without an Agent

Don't have an agent CLI handy? Use mock or backend-hosted scoring to explore the pipeline:

```bash
# Offline mock scoring (instant, deterministic, no API key)
janus-labs run --suite refactor-storm --mock -o result.json

# Backend-hosted judging (no local API key needed)
janus-labs run --suite refactor-storm -o result.json

# Suite alias shortcut
janus-labs refactor-storm -o result.json
```

These modes score the unmodified scaffold code and are useful for testing the pipeline, CI setup, or exploring the output format.
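
For a quick look at the result structure, pretty-printing the file with the Python standard library is enough (field names vary by janus-labs version, so this makes no assumptions about the schema):

```bash
# Pretty-print result.json to explore the output format
python -m json.tool result.json
```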

### 4. Compare and Profile

```bash
# Compare your result against an auto-selected vanilla baseline
janus-labs compare result.json --auto-baseline

# Generate capability profiles from bundled baseline results
janus-labs profile --baselines-dir data/baselines

# Measure optional instruction resilience (needs configured + vanilla results)
janus-labs diagnose result.json
```

### Alternative: Single-Behavior Manual Workflow

Use `init -> status -> score` when you want to hand a single behavior workspace to an external agent and inspect the repo diff yourself.

```bash
janus-labs init --suite refactor-storm --output ./janus-task
cd janus-task/BHV-001-test-cheating
# ... let your agent work ...
janus-labs score --workspace . --output result.json
```
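
To inspect the repo diff yourself before scoring, an ordinary diff from inside the behavior directory should work (this assumes the workspace is a git checkout, which the GitPython dependency suggests):

```bash
# Review the agent's changes before running `janus-labs score`
git status
git diff
```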

### Install From Source

```bash
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .
```
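
The package also declares a `dev` extra (build, twine, ruff) for local packaging and linting:

```bash
# Editable install including the optional dev tools
pip install -e ".[dev]"
```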

## CLI Reference

All commands can be run as:

- `janus-labs <command>`
- `janus <command>`
- `python -m janus_labs <command>`

### Global

```bash
janus-labs --help
janus-labs --version
janus-labs
```

Running `janus-labs` with no arguments opens the interactive menu.

### `run`

Run an end-to-end suite directly.

```bash
janus-labs run --suite refactor-storm --mock -o result.json
janus-labs run --suite refactor-storm -o result.json
janus-labs refactor-storm -o result.json
```

Options:

- `--full`: run the full `init -> agent -> score -> output` pipeline (the primary workflow)
- `--agent`: built-in agent preset (`codex`, `claude`, `gemini`, `copilot`). Requires `--full`
- `--agent-cmd`: custom agent command template. Use `{prompt_file}` for the prompt file path or `{prompt_content}` for the inline prompt text. Requires `--full` (see the sketch after this list)
- `--timeout`: agent timeout in seconds per behavior (default: 300)
- `--suite`: suite ID, required
- `--output`, `-o`: output file, default `result.json`
- `--format`: `json`, `html`, or `both`
- `--mock`: use deterministic offline scoring (no agent needed)
- `--judge`: use local LLM-as-judge scoring
- `--model`: judge model, default `gpt-4o`
- `--no-interactive`: disable prompts on backend rate limits
- `--request-delay`: delay between judge requests
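
To verify the `--agent-cmd` wiring before pointing it at a real agent, any command that accepts the substituted prompt will do. A stand-in like the one below is hypothetical: it only prints the prompt and edits nothing, so expect low scores, but it confirms the template substitution and timeout handling:

```bash
# Stand-in agent command: prints the per-behavior prompt without editing code
janus-labs run --full --agent-cmd "cat {prompt_file}" --suite refactor-storm --timeout 60 -o wiring-check.json
```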

### `init`

Initialize workspaces for every behavior in a suite.

```bash
janus-labs init --suite refactor-storm --output ./janus-task
```

Options:

- `--suite`: suite ID, default `refactor-storm`
- `--output`, `-o`: output directory for generated behavior workspaces

### `status`

Inspect a task workspace and get the recommended next step.

```bash
janus-labs status --workspace ./janus-task/BHV-001-test-cheating
```

### `score`

Score one completed task workspace.

```bash
janus-labs score --workspace ./janus-task/BHV-001-test-cheating --output result.json
```

Options:

- `--workspace`, `-w`: workspace path, default current directory
- `--output`, `-o`: output file, default `result.json`
- `--judge`: enable LLM-as-judge scoring
- `--model`: judge model, default `gpt-4o`
- `--bundle`: optional bundle file for judge scoring
- `--agent`: override detected agent identifier
- `--agent-model`: override detected agent model

### `submit`

Submit a scored result to the public leaderboard.

```bash
janus-labs submit result.json --github your-handle
```

Options:

- `--dry-run`: print the payload without submitting
- `--github`: GitHub handle for attribution
- `--no-judge`: skip backend LLM judging during submission
- `--no-open`: skip opening the share URL in a browser (useful in CI)
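
For example, you can preview the payload before a real submission and skip the browser hand-off in CI:

```bash
# Inspect the exact payload without sending it
janus-labs submit result.json --github your-handle --dry-run

# Submit from CI without opening a browser
janus-labs submit result.json --github your-handle --no-open
```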

### `compare`

Detect regressions between two results, or compare a result against a precomputed vanilla baseline.

```bash
janus-labs compare baseline.json current.json
janus-labs compare result.json --auto-baseline
```

### `profile`

Generate capability profiles from one baseline file or a directory of baselines.

```bash
janus-labs profile --baseline data/baselines/baseline.json
janus-labs profile --baselines-dir data/baselines
janus-labs profile --baselines-dir data/baselines --json
janus-labs profile --leaderboard
janus-labs profile --reliability
```

### `diagnose`

Measure optional instruction resilience by comparing configured results against vanilla baselines.

```bash
janus-labs diagnose result.json
```

### `smoke-test`

Quick validation path using mock data. This does not benchmark your real config.

```bash
janus-labs smoke-test --suite refactor-storm
```

### `baseline`

Inspect or update baseline files.

```bash
janus-labs baseline list
janus-labs baseline show baseline.json
janus-labs baseline update result.json -o baseline.json
```

## Built-In Suite

### Refactor Storm (`v2.0.0`)

`refactor-storm` ships 4 built-in behaviors grouped into 2 active axes:

- `BHV-001` Test Cheating Detection -> `Code Quality`
- `BHV-002` Reduce Cyclomatic Complexity -> `Code Quality`
- `BHV-003` Add Comprehensive Error Handling -> `Error Resilience`
- `BHV-004` Loop Detection Compliance -> `Error Resilience`

Composite score = `avg(Code Quality, Error Resilience)`.
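For example, axis scores of 80 and 60 average to a composite of 70; the same arithmetic applies whatever scale the suite reports on.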

`janus-labs diagnose` can compute optional Instruction Resilience separately by comparing configured runs against vanilla baselines. It is not part of the 2-axis composite.

## GitHub Actions

```yaml
- name: Install Janus Labs
  run: pip install janus-labs

- name: Run mock suite
  run: janus-labs run --suite refactor-storm --mock --no-interactive -o current.json

- name: Compare to baseline
  run: janus-labs compare baseline.json current.json --format github
```

## Requirements

- Python 3.12+
- Core dependencies include DeepEval, GitPython, PyYAML, and Pydantic

Phoenix telemetry is optional and requires Python `<3.14`:

```bash
pip install -r requirements-phoenix.txt
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

Apache 2.0. See [LICENSE](LICENSE).
