Metadata-Version: 2.4
Name: litmus-llm
Version: 0.2.1
Summary: Litmus — LLM scenario runner TUI
Keywords: llm,benchmark,tui,agents,evaluation,testing
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Dist: rich>=13.0
Requires-Dist: textual>=3.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: openai>=1.0
Requires-Dist: pydantic>=2.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# Litmus 🧪

**Terminal UI for running LLM agent scenarios and comparing their performance.**

Litmus executes coding tasks across multiple AI agents and models, runs tests against the results, and produces detailed evaluation reports — all from a single TUI.

## What it does

1. **Detects agents** installed on your system (Claude Code, Codex, Aider, Cursor Agent, KiloCode, OpenCode)
2. **Runs scenarios** — each scenario is a coding task with tests and scoring criteria
3. **Evaluates results** — an LLM judge scores agent and model performance across 20 criteria each
4. **Generates reports** — HTML reports with per-scenario breakdowns, logs, and scores

## Supported agents

| Agent | Binary | Model listing |
|-------|--------|---------------|
| Claude Code | `claude` | Built-in list |
| Codex | `codex` | Built-in list |
| OpenCode | `opencode` | `opencode models` |
| KiloCode | `kilocode` | `kilocode models` |
| Aider | `aider` | `aider --list-models` |
| Cursor Agent | `agent` | `agent models` |

Litmus auto-detects which agents are available and queries their model lists.
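
For a rough idea of how this works, here is a hypothetical sketch (not the actual `agents.py`): check each binary on `PATH` with `shutil.which`, then run the listing command from the table where one exists.

```python
# Hypothetical sketch of agent detection, not Litmus's real code.
# Binaries and listing commands are taken from the table above.
import shutil
import subprocess

AGENTS = {
    "Claude Code": ("claude", None),  # built-in model list
    "Codex": ("codex", None),         # built-in model list
    "OpenCode": ("opencode", ["opencode", "models"]),
    "KiloCode": ("kilocode", ["kilocode", "models"]),
    "Aider": ("aider", ["aider", "--list-models"]),
    "Cursor Agent": ("agent", ["agent", "models"]),
}

def detect_agents() -> dict[str, list[str]]:
    """Return detected agents mapped to the models they advertise."""
    found: dict[str, list[str]] = {}
    for name, (binary, list_cmd) in AGENTS.items():
        if shutil.which(binary) is None:
            continue  # binary not on PATH: agent not installed
        models: list[str] = []
        if list_cmd is not None:
            out = subprocess.run(list_cmd, capture_output=True, text=True)
            if out.returncode == 0:
                models = [ln.strip() for ln in out.stdout.splitlines() if ln.strip()]
        found[name] = models
    return found
```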

## Quick start

Requires **Python 3.12+** and [uv](https://docs.astral.sh/uv/).

```bash
# Run without installing — uv fetches everything automatically
uvx --from git+https://github.com/ivkond/litmus.git litmus
```

On first launch, Litmus detects installed agents, generates a config file, and opens the TUI.

### Alternative ways to install

```bash
# Install as a global tool
uv tool install git+https://github.com/ivkond/litmus.git
litmus

# Or clone for development
git clone https://github.com/ivkond/litmus.git
cd litmus
uv sync
uv run litmus
```

> Once published to PyPI, installation will simplify to `uvx litmus`.

### TUI workflow

1. 📋 **Models** — select agents and models to test
2. 🧩 **Scenarios** — pick which coding tasks to run
3. ▶️ **Run** — watch execution progress in real time
4. 📊 **Analysis** — review LLM-judged scores
5. 📄 **Reports** — browse generated HTML reports

## How it works

Each scenario lives in `template/<id>/` and contains:

```
template/1-data-structure/
  prompt.txt        # Task description sent to the agent
  task.txt          # Detailed requirements
  scoring.csv       # Evaluation criteria
  project/          # Starter code with tests
```
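
As an illustration, loading one scenario could look roughly like this (a sketch with assumed file handling and an assumed CSV layout, not the real loader):

```python
# Hypothetical scenario loader; file names match the layout above,
# but the actual Litmus loader may differ.
import csv
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Scenario:
    prompt: str                      # sent to the agent
    task: str                        # detailed requirements
    criteria: list[dict[str, str]]   # rows of scoring.csv
    project_dir: Path                # starter code with tests

def load_scenario(root: Path) -> Scenario:
    with open(root / "scoring.csv", newline="") as f:
        criteria = list(csv.DictReader(f))
    return Scenario(
        prompt=(root / "prompt.txt").read_text(),
        task=(root / "task.txt").read_text(),
        criteria=criteria,
        project_dir=root / "project",
    )

scenario = load_scenario(Path("template/1-data-structure"))
```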

Execution pipeline per scenario:

```
uv sync  ->  agent call  ->  pytest  ->  collect logs
```
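
In Python terms, the pipeline is roughly the following (a simplified sketch; the real `run.py` also handles timeouts, agent-specific flags, and richer logging):

```python
# Hypothetical per-scenario pipeline matching the diagram above.
import subprocess
from pathlib import Path

def run_scenario(workdir: Path, agent_cmd: list[str]) -> dict[str, str]:
    """Run sync -> agent -> pytest in workdir and collect logs per step."""
    steps = [
        ("sync", ["uv", "sync"]),            # install the starter project's deps
        ("agent", agent_cmd),                # let the agent attempt the task
        ("tests", ["uv", "run", "pytest"]),  # check the result against the tests
    ]
    logs: dict[str, str] = {}
    for step, cmd in steps:
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        logs[step] = proc.stdout + proc.stderr  # kept for the report
        if step == "sync" and proc.returncode != 0:
            break  # a broken environment makes the later steps meaningless
    return logs
```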

After all runs complete, an LLM judge evaluates the results against 20 agent criteria (such as tool efficiency, error recovery, and reasoning depth) and 20 model criteria (such as code correctness, instruction following, and hallucination resistance).
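
Conceptually, the judge pairs the OpenAI SDK with Pydantic models for structured scores. The sketch below is illustrative only: the criterion names, the 1-10 scale, and the plain-JSON response protocol are assumptions, not what `analysis.py` actually does.

```python
# Hypothetical LLM judge using Pydantic to validate structured scores.
from openai import OpenAI
from pydantic import BaseModel, Field

class CriterionScore(BaseModel):
    name: str                        # e.g. "code correctness" (invented example)
    score: int = Field(ge=1, le=10)  # 1-10 scale is an assumption
    rationale: str

class Evaluation(BaseModel):
    agent_scores: list[CriterionScore]
    model_scores: list[CriterionScore]

client = OpenAI()  # any OpenAI-compatible endpoint via base_url / api_key

def judge(logs: str) -> Evaluation:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the analysis model is set in the TUI
        messages=[
            {"role": "system", "content": "Score this run. Reply with JSON "
             "shaped as {agent_scores: [...], model_scores: [...]}."},
            {"role": "user", "content": logs},
        ],
        response_format={"type": "json_object"},
    )
    # Validate the model's JSON against the schema above.
    return Evaluation.model_validate_json(resp.choices[0].message.content)
```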

## Configuration

On first launch, Litmus generates a config file with detected agents and their settings. Configure the analysis model (any OpenAI-compatible API) through the TUI settings screen.
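
For illustration, the generated config might look something like the structure below; the file location, keys, and values shown here are assumptions, not the real schema.

```python
# Hypothetical illustration of the generated config, parsed with pyyaml.
import yaml

example = """
agents:
  claude:
    binary: claude
    enabled: true
analysis:
  base_url: https://api.openai.com/v1   # any OpenAI-compatible endpoint
  model: gpt-4o-mini
"""
config = yaml.safe_load(example)
assert config["analysis"]["model"] == "gpt-4o-mini"
```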

## Scenario packs

Litmus supports exporting and importing scenario archives (`.litmus-pack` ZIP files) for sharing test suites between machines or teams.
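
Since a pack is a ZIP archive of a scenario directory, a minimal export could be sketched like this (hypothetical; the real `pack/` module may add a manifest or use a different layout):

```python
# Hypothetical .litmus-pack export: zip one scenario directory.
import zipfile
from pathlib import Path

def export_pack(scenario_dir: Path, dest: Path) -> None:
    """Archive a scenario directory into a shareable .litmus-pack file."""
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in scenario_dir.rglob("*"):
            if path.is_file():
                # Keep the scenario folder name as the top-level entry.
                zf.write(path, path.relative_to(scenario_dir.parent))

export_pack(Path("template/1-data-structure"),
            Path("1-data-structure.litmus-pack"))
```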

## Project structure

```
src/litmus/
  __init__.py     # Entry point, PROJECT_ROOT
  app.py          # Textual TUI (screens, widgets)
  agents.py       # Agent registry, detection, model listing
  run.py          # Scenario execution engine
  analysis.py     # LLM-powered evaluation (20+20 criteria)
  report.py       # HTML report generation
  pack/           # Scenario export/import
```

## Tech stack

- [Textual](https://textual.textualize.io/) — TUI framework
- [Rich](https://rich.readthedocs.io/) — terminal formatting
- [Pydantic](https://docs.pydantic.dev/) — structured evaluation models
- [OpenAI SDK](https://github.com/openai/openai-python) — LLM judge (any compatible API)

## License

[MIT](LICENSE)
