Metadata-Version: 2.4
Name: modelscout-sdk
Version: 0.2.0
Summary: Python SDK for ModelScout - LLM Benchmarking and Evaluation
Author: ModelScout Team
License-Expression: LicenseRef-Proprietary
Project-URL: Documentation, https://docs.modelscout.co
Project-URL: Repository, https://github.com/modelscout/modelscout-python
Project-URL: Changelog, https://github.com/modelscout/modelscout-python/blob/main/CHANGELOG.md
Keywords: llm,benchmarking,evaluation,ai,machine-learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.25.0
Requires-Dist: typing_extensions>=4.0.0
Requires-Dist: cryptography>=41.0.0
Requires-Dist: msgpack>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-httpx>=0.21.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: numpy>=1.24; extra == "dev"
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.18.0; extra == "anthropic"
Provides-Extra: google
Requires-Dist: google-generativeai>=0.3.0; extra == "google"
Provides-Extra: providers
Requires-Dist: openai>=1.0.0; extra == "providers"
Requires-Dist: anthropic>=0.18.0; extra == "providers"
Requires-Dist: google-generativeai>=0.3.0; extra == "providers"
Provides-Extra: graders
Requires-Dist: jsonschema>=4.0.0; extra == "graders"
Provides-Extra: all
Requires-Dist: modelscout-sdk[dev,graders,providers]; extra == "all"
Dynamic: license-file

# ModelScout Python SDK

**Find the best LLM for your product.** Run benchmarks across multiple models on your own data to see which performs best for quality, cost, and latency.

## Installation

```bash
pip install modelscout-sdk
```

## Quick Start

```python
from modelscout import Benchmark

# Set MODELSCOUT_API_KEY in your environment, or pass api_key="ms_..."
# Models are selected at checkout and locked to your purchase
results = Benchmark().run(
    purchased_benchmark_id="pb_...",  # from dashboard checkout
    prompts=["Write a SQL query to find active users", "Explain quantum computing"],
)

print(results.best_model_for("quality"))  # Best quality model
print(results.best_model_for("cost"))     # Cheapest model
```

## Features

### Benchmarking
Compare LLMs side-by-side on your evaluation data. Get quality scores, cost analysis, latency metrics, and statistical significance.
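
The Quick Start above ranks models by quality and cost; the same `best_model_for` accessor can, in this sketch, also be asked about latency. Treat the `"latency"` dimension as an assumption: only `"quality"` and `"cost"` are shown elsewhere in this README.

```python
from modelscout import Benchmark

# Illustrative sketch: "latency" as a ranking dimension is assumed, not documented here.
results = Benchmark().run(
    purchased_benchmark_id="pb_...",
    prompts=["Summarize this support ticket"],
)

print(results.best_model_for("quality"))
print(results.best_model_for("cost"))
print(results.best_model_for("latency"))  # assumed dimension
```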

### Data Generation

Need synthetic test data? Generate evaluation datasets from the [dashboard](https://modelscout.co/dashboard/datasets) — describe your use case and get representative prompts in minutes.

### Dataset Upload
Upload your own evaluation data:

```python
from modelscout import Benchmark

benchmark = Benchmark()  # reads MODELSCOUT_API_KEY from the environment

dataset_id = benchmark.upload_dataset(
    name="My Test Data",
    samples=[
        {"input": "What is machine learning?"},
        {"input": "Explain neural networks"},
    ],
)
```
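
Once uploaded, the returned `dataset_id` can be referenced when running a benchmark. The snippet below is a sketch only: the `dataset_id=` keyword on `Benchmark.run()` is an assumption rather than a documented parameter; check the [SDK docs](https://modelscout.co/docs/sdk) for the exact call.

```python
# Sketch: run the purchased benchmark against the uploaded dataset.
# The dataset_id= keyword is assumed here; the documented examples pass
# prompts=[...] or samples=[...] inline instead.
results = benchmark.run(
    name="Uploaded Data Benchmark",
    purchased_benchmark_id="pb_...",
    dataset_id=dataset_id,
)

print(results.best_model_for("quality"))
```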

### Agentic Evaluation
Test tool-calling capabilities with multi-turn evaluation (SDK-only):

```python
from modelscout import Benchmark, AgenticConfig, ToolDefinition

def my_search_function(query: str) -> str:
    return f"Results for: {query}"

config = AgenticConfig(tools=[
    ToolDefinition(
        name="search",
        description="Search the web",
        parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
        implementation=my_search_function,
    )
])

# Models are locked to your purchase — selected at checkout
results = Benchmark().run(
    name="Agent Eval",
    purchased_benchmark_id="pb_...",
    prompts=["Find information about quantum computing"],
    agentic_config=config,
)
```

### Execution Graders
Pass a `grader=` to `Benchmark.run()` to add a deterministic pass/fail layer that runs on *your* machine, against *your* ground truth. LLM judges score style well but can miss correctness on structured outputs like SQL, JSON, or code; graders fill that gap. The grader's verdict is passed to the judge as primary evidence of correctness and is also surfaced independently as `execution_pass_rate_by_model`.

Pre-built graders: `SQLGrader`, `JSONSchemaGrader`, `NumericGrader`. Subclass `Grader` for anything else.

```python
from modelscout import Benchmark
from modelscout.graders import SQLGrader

grader = SQLGrader(db_path="./warehouse.sqlite")

result = Benchmark(api_key="ms_...").run(
    name="SQL Benchmark",
    purchased_benchmark_id="pb_...",
    samples=[
        {
            "input": "Which products sold the most last quarter?",
            # expected_output is the canonical row set as JSON
            "expected_output": '[["Widget", 1200], ["Gadget", 900]]',
        },
    ],
    grader=grader,
)

print(result.execution_pass_rate_by_model)
# {'anthropic/claude-opus-4-7': 48.0, 'openai/gpt-5.4': 28.0}
```
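
For checks the pre-built graders don't cover, subclass `Grader`. The sketch below is a guess at the shape of that interface: it assumes `Grader` is importable from `modelscout.graders` alongside the pre-built graders and exposes a single `grade()` hook that receives the model output and the sample's `expected_output` and returns pass/fail. The real method name and signature may differ, so see the full guide linked below.

```python
from modelscout import Benchmark
from modelscout.graders import Grader  # assumed import location

class ContainsKeywordGrader(Grader):
    """Passes if the model output mentions the expected keyword (illustrative only)."""

    def grade(self, output: str, expected_output: str) -> bool:
        # Assumed hook: the exact grading method name/signature is not documented here.
        return expected_output.lower() in output.lower()

results = Benchmark().run(
    name="Keyword Benchmark",
    purchased_benchmark_id="pb_...",
    samples=[{"input": "Name a relational database.", "expected_output": "PostgreSQL"}],
    grader=ContainsKeywordGrader(),
)
```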

Full guide: [Graders](https://modelscout.co/docs/sdk#graders).

## Supported Models

32 models across 10 providers:

| Provider | Models |
|----------|--------|
| OpenAI | gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-5-mini, gpt-5-nano, gpt-oss-120b, gpt-oss-20b |
| Anthropic | claude-opus-4-7, claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001 |
| Google | gemini-3.1-pro, gemini-3-flash, gemini-3.1-flash-lite, gemini-2.5-flash-lite, gemma-3-27b-it |
| DeepSeek | deepseek-v3.2, deepseek-v3.2-speciale, deepseek-r1 |
| Qwen | qwen3.5-397b-a17b, qwen3.5-flash-02-23, qwen3-235b-a22b |
| Meta | llama-4-maverick, llama-4-scout |
| Mistral | mistral-large-2512, mistral-small-2603 |
| xAI | grok-4, grok-4.1-fast |
| Zhipu | glm-5, glm-5-turbo, glm-5.1 |
| Moonshot | kimi-k2.5 |

## Pricing

**Pay-as-you-go:** Purchase benchmarks from the [dashboard](https://modelscout.co/dashboard). Price depends on selected models, sample count, and judge tier. Starting from ~$4.50.

**Launch Discount:** 10% off all benchmarks during our launch period. Applied automatically at checkout.

## Documentation

Full documentation: [modelscout.co/docs/sdk](https://modelscout.co/docs/sdk)

---

## License

Proprietary. See [LICENSE](LICENSE) for details.
