Metadata-Version: 2.4
Name: rightsize-cli
Version: 0.1.0
Summary: Benchmark prompts against multiple LLMs via OpenRouter to find the cheapest model that works.
Project-URL: Homepage, https://github.com/NehmeAILabs/rightsize-cli
Project-URL: Repository, https://github.com/NehmeAILabs/rightsize-cli.git
Author-email: Nehme AI Labs <support@nehmeailabs.com>
License: MIT
Keywords: ai,benchmark,cli,cost-optimization,llm,openrouter
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: httpx>=0.27
Requires-Dist: jinja2>=3.1
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12.0
Description-Content-Type: text/markdown

# rightsize-cli

**The Biggest Model for Every Task? That's Just Lazy.**

Stop overpaying for AI. Benchmark your prompts against 200+ models via OpenRouter to find the cheapest one that still works.

This is the production-grade CLI version of the [RightSize web tool](https://nehmeailabs.com/right-size).

## The Problem

You're probably using Claude Sonnet or GPT-5 for everything. That's expensive. Many tasks work just as well with smaller, cheaper models - you just don't know which ones.

## The Solution

RightSize benchmarks your actual prompts against multiple models and shows you:
- **Accuracy** - How well each model performs (judged by a strong LLM)
- **Latency** - Response time (p95)
- **Cost** - Projected cost per 1,000 runs
- **Savings** - How much you save versus your baseline (a quick sketch of the arithmetic follows)
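
The Savings and Latency columns are plain arithmetic over the raw measurements. A quick sketch, using numbers from the sample output below (illustrative only, not the tool's internals):

```python
import statistics

# Savings = how much cheaper a candidate is than the baseline.
# Costs taken from the sample output further down in this README.
baseline_cost_per_1k = 0.0450   # google/gemini-2.5-flash
candidate_cost_per_1k = 0.0180  # deepseek/deepseek-chat-v3.1
savings = 1 - candidate_cost_per_1k / baseline_cost_per_1k
print(f"{savings:+.1%}")  # +60.0%

# Latency (p95) = the time 95% of requests finish under.
latencies_ms = [640, 700, 720, 750, 780, 790, 800, 810, 820, 900]
p95 = statistics.quantiles(latencies_ms, n=20)[-1]
print(f"{p95:.0f}ms")  # 856ms for this sample
```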

## Quick Start

```bash
# Install
uv pip install -e .

# Set your OpenRouter API key
export RIGHTSIZE_OPENROUTER_API_KEY="sk-or-..."

# Run a benchmark
rightsize benchmark data/test_cases.csv \
  --template prompts/classify.j2 \
  --model google/gemma-3-12b-it \
  --model mistralai/mistral-small-3.1-24b-instruct \
  --model deepseek/deepseek-chat-v3.1 \
  --judge google/gemini-2.5-flash \
  --baseline google/gemini-2.5-flash
```

Output:
```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Model                                   ┃ Accuracy ┃ Latency (p95) ┃ Cost/1k  ┃ Savings  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ google/gemma-3-12b-it                   │    71.0% │        4200ms │  $0.0028 │   +93.7% │
│ mistralai/mistral-small-3.1-24b-instruct│    85.0% │        1200ms │  $0.0035 │   +92.2% │
│ deepseek/deepseek-chat-v3.1             │    95.0% │         800ms │  $0.0180 │   +60.0% │
│ google/gemini-2.5-flash                 │   100.0% │        1900ms │  $0.0450 │       —  │
└─────────────────────────────────────────┴──────────┴───────────────┴──────────┴──────────┘
```

## How It Works

1. **You provide test cases** - A CSV with inputs and expected outputs
2. **Candidate models compete** - All models run the same prompts in parallel (sketched below)
3. **LLM-as-Judge scores** - A judge model compares each output to your expected output
4. **You see the results** - Cost, accuracy, latency - pick the cheapest model that meets your bar
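
Step 2 in miniature: the same rendered prompt fans out to every candidate model through OpenRouter's chat-completions endpoint, capped by a concurrency limit. A simplified sketch (illustrative, not rightsize's actual source):

```python
# Fan one prompt out to several models in parallel, with a concurrency cap.
import asyncio
import os

import httpx

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['RIGHTSIZE_OPENROUTER_API_KEY']}"}

async def complete(client: httpx.AsyncClient, sem: asyncio.Semaphore,
                   model: str, prompt: str) -> str:
    async with sem:  # honors the --concurrency limit
        resp = await client.post(API_URL, headers=HEADERS, json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        })
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

async def main() -> None:
    models = ["google/gemma-3-12b-it", "deepseek/deepseek-chat-v3.1"]
    sem = asyncio.Semaphore(10)
    async with httpx.AsyncClient(timeout=60) as client:
        outputs = await asyncio.gather(
            *(complete(client, sem, m, "TICKET: I want a refund!") for m in models)
        )
    print(dict(zip(models, outputs)))

asyncio.run(main())
```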

## CSV Format

Two columns: `input_data` and `expected_output`:

```csv
input_data,expected_output
"My order hasn't arrived.",billing::high
"How do I reset my password?",account::high
"I want a refund!",refund::high
```

The judge model compares each model's output to `expected_output` and assigns a score (one possible judge prompt is sketched after the rubric):
- **1.0** - Exact or semantic match
- **0.8** - Very close with minor differences  
- **0.5** - Partially correct
- **0.0** - Wrong or irrelevant
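
For intuition, a judge prompt built around this rubric might look like the sketch below (hypothetical; the exact prompt rightsize uses may differ):

```python
# Hypothetical LLM-as-judge prompt: the judge sees the expected and actual
# outputs side by side and replies with a bare score from the rubric.
JUDGE_TEMPLATE = """You are grading a model's answer against a reference.

EXPECTED: {expected}
ACTUAL: {actual}

Score the ACTUAL answer:
1.0 = exact or semantic match
0.8 = very close, minor differences
0.5 = partially correct
0.0 = wrong or irrelevant

Respond with the score only."""

def judge_prompt(expected: str, actual: str) -> str:
    return JUDGE_TEMPLATE.format(expected=expected, actual=actual)

print(judge_prompt("billing::high", "billing::medium"))
```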

### Best practices for test data

1. **Use minimal output formats** - Delimiter-separated output (`category::confidence`) keeps responses short and costs low
2. **Consistent task type** - All rows should be the same kind of task
3. **Representative samples** - Use real data from your production use case
4. **Clear expected outputs** - Unambiguous so the judge can score fairly
5. **10-20 test cases** - Enough to compare models meaningfully, small enough to run fast

## Prompt Templates

Templates wrap your inputs with instructions. Both Jinja2 (`.j2`) and Python f-string templates are supported; a rendering sketch follows the examples below.

**Example: Classification template** (`prompts/classify.j2`):
```jinja2
Classify this support ticket.

CATEGORIES: billing, account, refund, subscription, technical
CONFIDENCE: high, medium, low

OUTPUT FORMAT: <category>::<confidence>
OUTPUT ONLY the format above. No explanation. No punctuation. No other text.

TICKET: {{ input_data }}

OUTPUT:
```

**Example: Extraction template** (`prompts/extract.j2`):
```jinja2
Extract the email from this text.

OUTPUT FORMAT: <email or NONE>
OUTPUT ONLY the format above. No explanation. No other text.

TEXT: {{ input_data }}

OUTPUT:
```
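
Rendering a `.j2` template is ordinary Jinja2 substitution: each CSV row's `input_data` value fills the template to produce the prompt sent to every candidate model. A minimal sketch using the `jinja2` dependency (f-string style templates substitute `{input_data}` the same way):

```python
from pathlib import Path

from jinja2 import Template

# Each CSV row's input_data fills the template to produce the final prompt.
template = Template(Path("prompts/extract.j2").read_text())
prompt = template.render(input_data="Reach me at jane@example.com, thanks!")
print(prompt)
```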

### When to use templates

| Use Case | Template Purpose |
|----------|------------------|
| Classification | Define categories, enforce output format |
| Extraction | Specify what to extract, output format |
| Summarization | Set length constraints, style |
| Translation | Specify target language, formality |
| Q&A | Provide context, format requirements |

### Template variable

| Variable | Description |
|----------|-------------|
| `input_data` | The value from your CSV's `input_data` column |

## CLI Reference

### `rightsize benchmark`

```bash
rightsize benchmark <csv_file> [OPTIONS]
```

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--template` | `-t` | (required) | Path to prompt template file |
| `--model` | `-m` | (required) | Model ID to test (repeat for multiple) |
| `--judge` | `-j` | (required) | Model for judging outputs |
| `--baseline` | `-b` | None | Baseline model for savings calculation |
| `--concurrency` | `-c` | 10 | Max parallel requests |
| `--output` | `-o` | `table` | Output format: table, json, csv |

### `rightsize models`

List all available models and their pricing:

```bash
rightsize models
```

## Configuration

Configure via environment variables or a `.env` file (a `pydantic-settings` sketch follows the table):

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `RIGHTSIZE_OPENROUTER_API_KEY` | Yes | - | Your OpenRouter API key |
| `RIGHTSIZE_MAX_CONCURRENCY` | No | 10 | Default concurrency |
| `RIGHTSIZE_TIMEOUT_SECONDS` | No | 60 | Request timeout |
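
The `RIGHTSIZE_` prefix and `.env` loading map naturally onto the `pydantic-settings` dependency. A sketch of how these settings could be modeled (field names are illustrative, not the tool's actual code):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Reads RIGHTSIZE_-prefixed environment variables, falling back to .env
    model_config = SettingsConfigDict(env_prefix="RIGHTSIZE_", env_file=".env")

    openrouter_api_key: str    # RIGHTSIZE_OPENROUTER_API_KEY (required)
    max_concurrency: int = 10  # RIGHTSIZE_MAX_CONCURRENCY
    timeout_seconds: int = 60  # RIGHTSIZE_TIMEOUT_SECONDS

settings = Settings()  # raises a validation error if the API key is unset
```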

## Examples

### Support ticket classification
```bash
rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash
```

### Find the cheapest model that works
```bash
rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m google/gemma-3-12b-it \
  -m google/gemma-3-27b-it \
  -m qwen/qwen3-8b \
  -m meta-llama/llama-3.3-70b-instruct \
  -j anthropic/claude-sonnet-4 \
  -b google/gemini-2.5-flash
```

### Export results to JSON
```bash
rightsize benchmark data/test_cases.csv \
  -t prompts/classify.j2 \
  -m mistralai/mistral-small-3.1-24b-instruct \
  -m deepseek/deepseek-chat-v3.1 \
  -j google/gemini-2.5-flash \
  -b google/gemini-2.5-flash \
  -o json > results.json
```
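
From there you can post-process the export, e.g. to pick the cheapest model above your accuracy bar. The field names below (`accuracy`, `cost_per_1k`, `model`) are assumptions for illustration; check them against your actual `results.json`:

```python
import json

ACCURACY_BAR = 0.95  # your quality bar

with open("results.json") as f:
    results = json.load(f)  # assumed: a list of per-model result objects

passing = [r for r in results if r["accuracy"] >= ACCURACY_BAR]
cheapest = min(passing, key=lambda r: r["cost_per_1k"])
print(cheapest["model"], cheapest["cost_per_1k"])
```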

## Tips

1. **Use minimal output formats** - `category::confidence` is cheaper than JSON, JSON is cheaper than prose
2. **End prompts with "OUTPUT:"** - Primes the model to respond immediately without preamble
3. **Start with 10-20 test cases** - Enough to be representative, fast to iterate
4. **Set a quality bar** - Decide what accuracy % is acceptable (e.g., 95%+)
5. **Consider latency** - Sometimes a slower cheap model isn't worth it
6. **Iterate on prompts** - A better prompt can make cheaper models work better

## License

MIT
