Metadata-Version: 2.4
Name: robotframework-chat
Version: 1.4.1
Summary: Robot Framework-based test harness for systematically testing LLMs
Project-URL: Homepage, https://github.com/tkarcheski/robotframework-chat
Project-URL: Repository, https://github.com/tkarcheski/robotframework-chat
Project-URL: Issues, https://github.com/tkarcheski/robotframework-chat/issues
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: benchmark,llm,ollama,openai,robotframework,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Robot Framework
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: certifi==2025.11.12
Requires-Dist: charset-normalizer==3.4.4
Requires-Dist: docker==7.1.0
Requires-Dist: idna==3.11
Requires-Dist: iniconfig==2.3.0
Requires-Dist: packaging==25.0
Requires-Dist: pluggy==1.6.0
Requires-Dist: pygments==2.19.2
Requires-Dist: pytest==9.0.2
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: requests==2.32.5
Requires-Dist: robotframework-lint==1.1
Requires-Dist: robotframework-requests>=0.9.7
Requires-Dist: robotframework==7.4.1
Requires-Dist: urllib3==2.6.2
Provides-Extra: dev
Requires-Dist: build; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pip-audit; extra == 'dev'
Requires-Dist: pre-commit; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: python-dotenv; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Requires-Dist: types-docker; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Requires-Dist: types-requests; extra == 'dev'
Provides-Extra: playwright
Requires-Dist: robotframework-browser>=18.0.0; extra == 'playwright'
Provides-Extra: superset
Requires-Dist: psycopg2-binary>=2.9; extra == 'superset'
Requires-Dist: sqlalchemy>=2.0; extra == 'superset'
Description-Content-Type: text/markdown

# robotframework-chat

A Robot Framework-based test harness for systematically testing Large Language Models (LLMs), with models serving both as the system under test and as automated graders. Test results are archived to SQL and visualized in Apache Superset dashboards.

---

## Quick Start

### Prerequisites

- **Python 3.11+** and **uv** (Astral's package manager) for dependency management
- **Docker** for containerized code execution, LLM testing, and the Superset stack
- **Ollama** (optional) for local LLM testing

### Installation (Linux / macOS)

```bash
make install                # Install all dependencies
pre-commit install          # Install pre-commit hooks
ollama pull phi4:14b         # Pull default LLM model (optional)
```

### Installation (Windows)

The `tasks.py` script provides a cross-platform alternative to the Makefile.
It requires only Python and `uv` — no `make`, `bash`, or Unix tools needed.

```powershell
uv run python tasks.py install      # Install all dependencies
uv run pre-commit install           # Install pre-commit hooks
ollama pull phi4:14b                # Pull default LLM model (optional)
uv run python tasks.py help         # List all available targets
```

> **Note:** Docker-based tests require [Docker Desktop for Windows](https://docs.docker.com/desktop/install/windows-install/) with the WSL 2 backend enabled.

### Running Tests

```bash
# Linux / macOS
make robot                  # Run all Robot Framework test suites
make robot-math             # Run math tests
make robot-docker           # Run Docker tests
make robot-safety           # Run safety tests

# All platforms (including Windows)
uv run python tasks.py robot        # Run all suites
uv run python tasks.py robot-math   # Run math tests
uv run python tasks.py robot-dryrun # Validate tests (dry run)
uv run python tasks.py check        # Lint + typecheck + coverage
```

### Superset Dashboard

```bash
# Linux / macOS
cp .env.example .env        # Configure environment
make docker-up              # Start PostgreSQL + Redis + Superset
make bootstrap              # First-time Superset initialization

# Windows — tasks.py copies .env automatically if missing
uv run python tasks.py docker-up
```

Open <http://localhost:8088> to view the dashboard.

---

## Ollama Configuration

### Pulling Models

The default model is `phi4:14b` (set via `DEFAULT_MODEL` in `.env`).
Pull additional models depending on how many you want to test against:

**Starter (3 models):**

```bash
ollama pull phi4:14b
ollama pull llama3.2:latest
ollama pull gemma2:latest
```

**Standard (4–5 models):**

```bash
ollama pull phi4:14b
ollama pull llama3.2:latest
ollama pull gemma2:latest
ollama pull mistral:latest
ollama pull qwen3.5:27b
```

**Full fleet** — pull all models from `config/test_suites.yaml`:

```bash
make cron-sync-models        # Pulls any master models missing locally
```

### Loading Multiple Models Simultaneously

By default Ollama keeps up to 3 models loaded in memory (3 × number of GPUs,
or 3 for CPU inference). To load more models concurrently, configure these
Ollama server environment variables:

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_MAX_LOADED_MODELS` | 3 × GPUs (or 3) | Max models resident in memory at once |
| `OLLAMA_NUM_PARALLEL` | `1` | Parallel requests per loaded model |
| `OLLAMA_MAX_QUEUE` | `512` | Max queued requests before rejecting |

> **Memory note:** each loaded model consumes VRAM/RAM proportional to its
> size. A 7B Q4 model uses ~4 GB; a 27B model uses ~16 GB. Setting
> `OLLAMA_NUM_PARALLEL` > 1 multiplies context memory per model.

**Linux (systemd):**

```bash
sudo systemctl edit ollama.service
```

Add under `[Service]`:

```ini
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=5"
Environment="OLLAMA_NUM_PARALLEL=2"
```

Then restart:

```bash
sudo systemctl restart ollama
```

**macOS:**

```bash
launchctl setenv OLLAMA_MAX_LOADED_MODELS 5
launchctl setenv OLLAMA_NUM_PARALLEL 2
```

Restart the Ollama application after setting these.

**Windows:**

Set `OLLAMA_MAX_LOADED_MODELS` and `OLLAMA_NUM_PARALLEL` as system environment
variables, then restart Ollama.
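
After restarting, you can confirm the settings took effect by asking the server which models are resident. A minimal sketch using the standard Ollama REST API (`/api/ps` reports currently loaded models):

```python
import json
import urllib.request

def parse_loaded(payload: dict) -> list[str]:
    # Extract model names from an /api/ps response body.
    return [m.get("name", "") for m in payload.get("models", [])]

def loaded_models(endpoint: str = "http://localhost:11434") -> list[str]:
    # /api/ps lists the models currently resident in memory.
    with urllib.request.urlopen(f"{endpoint}/api/ps", timeout=10) as resp:
        return parse_loaded(json.load(resp))
```

Load several models (for example by sending one request to each), then check that `loaded_models()` reports as many entries as `OLLAMA_MAX_LOADED_MODELS` allows.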

### VRAM Sizing Guide

| Models Loaded | Recommended VRAM | Example Hardware |
|---|---|---|
| 3 (default) | 24 GB | RTX 4090, M2 Pro |
| 4 | 32 GB | 2× RTX 4080, M2 Max |
| 5+ | 48+ GB | 2× RTX 4090, M3 Ultra |

Actual requirements depend on model sizes and quantization levels.
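
The rule of thumb behind the table can be sketched as a back-of-the-envelope calculation: weights take roughly parameters × (quantization bits / 8) bytes, plus headroom for the KV cache and runtime overhead. The 20% overhead factor below is an illustrative assumption, not a measured constant:

```python
def estimate_vram_gb(params_b: float, quant_bits: int = 4, overhead: float = 1.2) -> float:
    # Weights: params (billions) * bits / 8 gives GB; overhead covers KV cache etc.
    return round(params_b * quant_bits / 8 * overhead, 1)

def fleet_vram_gb(models: list[tuple[float, int]]) -> float:
    # Total for a set of (params_b, quant_bits) models loaded at once.
    return round(sum(estimate_vram_gb(p, q) for p, q in models), 1)
```

For example, `estimate_vram_gb(7)` gives about 4.2 GB and `estimate_vram_gb(27)` about 16.2 GB, in line with the memory note above.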

### Auto-Discovery and Multi-Model Testing

The test harness auto-discovers the available models at startup and skips tests
for models that are not installed, so a missing model never causes a spurious
failure.

```bash
make discover-local-models   # List models available on all configured nodes
make run-local-models        # Run all test suites against every discovered model

# Windows
uv run python scripts/run_local_models.py --discover-models
uv run python scripts/run_local_models.py
```
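
Conceptually, discovery is a query to each node's `/api/tags` endpoint followed by filtering the requested model list. A minimal sketch (the actual logic in `scripts/run_local_models.py` may differ):

```python
import json
import urllib.request

def installed_models(endpoint: str = "http://localhost:11434") -> set[str]:
    # /api/tags lists the models installed on an Ollama node.
    with urllib.request.urlopen(f"{endpoint}/api/tags", timeout=10) as resp:
        payload = json.load(resp)
    return {m["name"] for m in payload.get("models", [])}

def runnable(requested: list[str], available: set[str]) -> list[str]:
    # Keep only the requested models that are actually installed;
    # tests for the rest are skipped rather than failed.
    return [m for m in requested if m in available]
```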

Use `ITERATIONS` for continuous testing:

```bash
make run-local-models ITERATIONS=-1   # Run forever
make run-local-models ITERATIONS=0    # Stop on first error
```

### Multi-Node Setup (Optional)

To distribute tests across multiple machines running Ollama, set
`OLLAMA_NODES_LIST` in `.env`:

```bash
OLLAMA_NODES_LIST=localhost,gpu-server-1,gpu-server-2
```

Or edit the `nodes` list in `config/test_suites.yaml` directly. Check node
status with:

```bash
make discover-local-nodes
```
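
The same idea extends to node health: a node is reachable if its Ollama API answers. A hypothetical sketch (`parse_nodes` and `node_status` are illustrative names, not the project's actual helpers):

```python
import urllib.request

def parse_nodes(nodes_csv: str) -> list[str]:
    # Split an OLLAMA_NODES_LIST value into hostnames.
    return [h.strip() for h in nodes_csv.split(",") if h.strip()]

def node_status(host: str, port: int = 11434, timeout: float = 5.0) -> bool:
    # A node is "up" if its Ollama API responds to /api/tags.
    try:
        urllib.request.urlopen(f"http://{host}:{port}/api/tags", timeout=timeout)
        return True
    except OSError:
        return False
```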

### Project Environment Variables

| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `ollama` | Provider backend (`ollama` or `openai`) |
| `OLLAMA_ENDPOINT` | `http://localhost:11434` | Ollama API endpoint |
| `DEFAULT_MODEL` | `phi4:14b` | Model used for standard test runs |
| `OLLAMA_TIMEOUT` | `5400` | Request timeout in seconds (90 min) |
| `OLLAMA_NODES_LIST` | `localhost` | Comma-separated Ollama hostnames |
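
These variables are typically populated from `.env`; in Python they can be read with the same defaults as the table above (a sketch, not the project's actual config loader):

```python
import os

def load_config() -> dict:
    # Defaults mirror the table above; any value set in the
    # environment (e.g. via .env) takes precedence.
    return {
        "provider": os.getenv("LLM_PROVIDER", "ollama"),
        "endpoint": os.getenv("OLLAMA_ENDPOINT", "http://localhost:11434"),
        "model": os.getenv("DEFAULT_MODEL", "phi4:14b"),
        "timeout_s": int(os.getenv("OLLAMA_TIMEOUT", "5400")),
        "nodes": [n for n in os.getenv("OLLAMA_NODES_LIST", "localhost").split(",") if n],
    }
```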

---

## Example Test

```robot
*** Test Cases ***
LLM Can Do Basic Math
    ${answer}=    Ask LLM    What is 2 + 2?
    ${score}    ${reason}=    Grade Answer    What is 2 + 2?    4    ${answer}
    Should Be Equal As Integers    ${score}    1
```
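
Under constrained grading, the grader must return a machine-verifiable score rather than prose. A hypothetical sketch of parsing such output; the `score|reason` format here is an illustrative assumption, not the project's actual protocol:

```python
def parse_grade(raw: str) -> tuple[int, str]:
    # Hypothetical "score|reason" grader output, e.g. "1|correct arithmetic".
    # Anything that is not a clean 0/1 score is treated as a failed grade.
    score_part, _, reason = raw.partition("|")
    try:
        score = int(score_part.strip())
    except ValueError:
        return 0, f"unparseable grader output: {raw!r}"
    return (1 if score == 1 else 0), reason.strip()
```

Rejecting unparseable output instead of guessing keeps the evaluation layer deterministic, matching the philosophy below.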

---

## Core Philosophy

- **LLMs are software** — test them like software
- **Determinism before intelligence** — structured, machine-verifiable evaluation first
- **Constrained grading** — scores, categories, pass/fail; no prose from the evaluation layer
- **Modular by design** — composable pieces; new providers and graders plug in without rewriting core
- **Robot Framework as the orchestration layer** — readable, keyword-driven tests
- **Every test run is archived** — listeners always active, results flow to SQL
- **CI-native, regression-focused** — if it can't run unattended, it's not done

See [ai/AGENTS.md](ai/AGENTS.md#core-philosophy) for the full philosophy.

---

## Documentation

| Document | Description |
|----------|-------------|
| [docs/TEST_DATABASE.md](docs/TEST_DATABASE.md) | Database schema and usage |
| [docs/GITLAB_CI_SETUP.md](docs/GITLAB_CI_SETUP.md) | CI/CD setup guide |
| [docs/GRAFANA_SUPERSET_SETUP.md](docs/GRAFANA_SUPERSET_SETUP.md) | Superset visualization stack setup (Grafana deferred to v2+) |
| [docs/SUPERSET_EXPORT_GUIDE.md](docs/SUPERSET_EXPORT_GUIDE.md) | Superset dashboard export, import, and backup |
| [Ollama Configuration](#ollama-configuration) | Multi-model loading, VRAM sizing, and multi-node setup |

---

## Contributing

1. Read [ai/DEV.md](ai/DEV.md) for the development workflow and TDD discipline
2. Follow the code style guidelines in [ai/AGENTS.md](ai/AGENTS.md)
3. Add tests for new features (see [ai/CLAUDE.md](ai/CLAUDE.md) for grading tiers)
4. Run `pre-commit run --all-files` before committing
