Metadata-Version: 2.4
Name: janus-labs
Version: 1.0.0
Summary: 3DMark for AI Agents - Profile AI coding agent capabilities across code quality, error resilience, and instruction resilience
Author-email: Alexander Perry <alex@alexanderperry.io>
License: Apache-2.0
Project-URL: Homepage, https://github.com/alexanderaperry-arch/janus-labs
Project-URL: Documentation, https://github.com/alexanderaperry-arch/janus-labs#readme
Project-URL: Repository, https://github.com/alexanderaperry-arch/janus-labs.git
Project-URL: Issues, https://github.com/alexanderaperry-arch/janus-labs/issues
Keywords: ai,agents,benchmark,capability-profiling,llm,evaluation,deepeval,radar-chart,agent-comparison
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytest>=8.0.0
Requires-Dist: gitpython>=3.1.0
Requires-Dist: deepeval>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: httpx>=0.26.0
Requires-Dist: questionary>=2.0.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# Janus Labs

[![CI](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml/badge.svg)](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/ci.yml)
[![Baselines](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/baseline.yml/badge.svg)](https://github.com/alexanderaperry-arch/janus-labs/actions/workflows/baseline.yml)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

**3DMark for AI Agents**. Profile AI coding agents across code quality, error resilience, and instruction resilience.

## What Janus Labs Does

Janus Labs benchmarks real coding-agent behavior on reproducible tasks. Instead of reducing everything to one score, it produces a capability profile so you can see where an agent is strong, weak, or uneven.

- Multi-axis profiling across code quality, error resilience, and instruction resilience
- Reproducible task workspaces with real source files, tests, and git diffs
- Public leaderboard and shareable result pages
- Baseline comparison against precomputed agent/model runs

Public leaderboard: <https://fulfilling-courtesy-production-9c2c.up.railway.app>

## Quick Start

### Install

```bash
pip install janus-labs
janus-labs --version
```

If `janus-labs` is not on your `PATH`, use:

```bash
python -m janus_labs --version
```

### Benchmark Your Agent

Janus Labs initializes a workspace for the full suite: each behavior in the suite gets its own git-initialized subdirectory.

```bash
# 1. Generate the suite workspaces
janus-labs init --suite refactor-storm --output ./janus-task
```

That creates a structure like:

```text
janus-task/
  BHV-001-test-cheating/
  BHV-002-refactor-complexity/
  BHV-003-error-handling/
  ...
```

```bash
# 2. Pick one behavior workspace and let your agent work inside it
cd janus-task/BHV-001-test-cheating
```

```bash
# 3. Check workspace state
janus-labs status --workspace .
```

```bash
# 4. Score the completed task
janus-labs score --workspace . --output result.json
```

```bash
# 5. Submit to the leaderboard (optional)
janus-labs submit result.json --github your-handle
```
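The numbered steps above score one workspace at a time. If your agent has worked through several behavior workspaces, the scoring step can be batched with a small shell loop. This is a sketch, assuming the `janus-task/` layout produced by `janus-labs init` above:

```bash
# Score every behavior workspace in one pass; each result lands next to
# its workspace. The guard skips cleanly if no workspaces are present.
for dir in janus-task/BHV-*/; do
  [ -d "$dir" ] || continue
  janus-labs score --workspace "$dir" --output "${dir%/}/result.json"
done
```

Each per-workspace `result.json` can then be submitted or compared individually.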

### Compare Against Baselines

```bash
# Generate capability profiles from bundled baseline results
janus-labs profile --baselines-dir data/baselines

# Compare your result against an auto-selected vanilla baseline
janus-labs compare result.json --auto-baseline
```

### Install From Source

```bash
git clone https://github.com/alexanderaperry-arch/janus-labs.git
cd janus-labs
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .  # or: pip install -e ".[dev]" for the dev extras (build, twine, ruff)
```

## CLI Reference

All commands can be run as:

- `janus-labs <command>`
- `janus <command>`
- `python -m janus_labs <command>`

### Global

```bash
janus-labs --help
janus-labs --version
janus-labs
```

Running `janus-labs` with no arguments opens the interactive menu.

### `init`

Initialize workspaces for every behavior in a suite.

```bash
janus-labs init --suite refactor-storm --output ./janus-task
```

Options:

- `--suite`: suite ID, default `refactor-storm`
- `--output`, `-o`: output directory for generated behavior workspaces

### `status`

Inspect a task workspace and get the recommended next step.

```bash
janus-labs status --workspace ./janus-task/BHV-001-test-cheating
```

### `score`

Score one completed task workspace.

```bash
janus-labs score --workspace ./janus-task/BHV-001-test-cheating --output result.json
```

Options:

- `--workspace`, `-w`: workspace path, default current directory
- `--output`, `-o`: output file, default `result.json`
- `--judge`: enable LLM-as-judge scoring
- `--model`: judge model, default `gpt-4o`
- `--bundle`: optional bundle file for judge scoring
- `--agent`: override detected agent identifier
- `--agent-model`: override detected agent model

### `submit`

Submit a scored result to the public leaderboard.

```bash
# Preview the payload with --dry-run, then submit for real
janus-labs submit result.json --github your-handle --dry-run
janus-labs submit result.json --github your-handle
```

Options:

- `--dry-run`: print the payload without submitting
- `--github`: GitHub handle for attribution

### `compare`

Detect regressions between two results, or compare a result against a precomputed vanilla baseline.

```bash
janus-labs compare baseline.json current.json
janus-labs compare result.json --auto-baseline
```

### `profile`

Generate capability profiles from one baseline file or a directory of baselines.

```bash
janus-labs profile --baseline data/baselines/baseline.json
janus-labs profile --baselines-dir data/baselines
janus-labs profile --baselines-dir data/baselines --json
janus-labs profile --leaderboard
janus-labs profile --reliability
```

### `run`

Run an end-to-end suite directly.

```bash
janus-labs run --suite refactor-storm --mock

# Non-interactive, writing results to a file (as in the GitHub Actions example below)
janus-labs run --suite refactor-storm --mock --no-interactive -o current.json
```

### `smoke-test`

Quick validation path using mock data. This does not benchmark your real config.

```bash
janus-labs smoke-test --suite refactor-storm
```

### `diagnose`

Analyze instruction resilience: compare configured scores against vanilla baselines to measure the impact of instruction files.

```bash
janus-labs diagnose result.json
```

### `baseline`

Inspect or update baseline files.

```bash
janus-labs baseline list
janus-labs baseline show baseline.json
janus-labs baseline update result.json -o baseline.json
```
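The `baseline` subcommands compose with `score` and `compare`. As a sketch of refreshing a tracked baseline after a known-good run (using only the commands shown in this reference; the leading guard just skips the script when the CLI is not installed):

```bash
# Skip if janus-labs is not on PATH (e.g. outside the project environment)
command -v janus-labs >/dev/null || exit 0

janus-labs score --workspace . --output result.json       # score the current workspace
janus-labs baseline update result.json -o baseline.json   # promote it to the new baseline
janus-labs compare baseline.json result.json              # later runs compare against it
```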

## Built-In Suite

### Refactor Storm (`v1.6.0`)

`refactor-storm` currently ships **10 built-in behaviors**:

- `BHV-001` Test Cheating Detection
- `BHV-002` Refactor Complexity
- `BHV-003` Error Handling
- `BHV-004` Loop Detection
- `BHV-005` Context Retention
- `BHV-008` Error Propagation Chain
- `BHV-009` Cross-Module Extract Method
- `BHV-010` Integration Contract Change
- `O-2.01` Instruction Adherence
- `O-3.01` Code Quality

The capability profile uses a 3-axis radar (Code Quality, Error Resilience, Instruction Resilience). The Tier 2 multi-file behaviors (BHV-008/009/010) run as part of the suite but are excluded from the profile axes pending further calibration.

## GitHub Actions

```yaml
- name: Install Janus Labs
  run: pip install janus-labs

- name: Run mock suite
  run: janus-labs run --suite refactor-storm --mock --no-interactive -o current.json

- name: Compare to baseline
  run: janus-labs compare baseline.json current.json --format github
```

## Requirements

- Python 3.12+
- Core dependencies: pytest, GitPython, DeepEval, Pydantic, PyYAML, httpx, and questionary

Phoenix telemetry is optional and requires Python `<3.14`:

```bash
pip install -r requirements-phoenix.txt
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

Apache 2.0. See [LICENSE](LICENSE).
