Metadata-Version: 2.4
Name: freesolo
Version: 0.2.3
Summary: Tracing, evaluation, and training utilities for LLM applications.
Requires-Python: >=3.10
Requires-Dist: httpx>=0.27.0
Requires-Dist: wandb>=0.17.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.11.0; extra == 'dev'
Provides-Extra: examples
Requires-Dist: anthropic>=0.40.0; extra == 'examples'
Requires-Dist: google-genai>=1.0.0; extra == 'examples'
Requires-Dist: openai>=1.0.0; extra == 'examples'
Provides-Extra: gepa
Requires-Dist: gepa>=0.1.1; extra == 'gepa'
Description-Content-Type: text/markdown

# freesolo

`freesolo` is a Python tracing and evaluation package for LLM apps.

It is built for the lowest-friction integration possible:

1. Install the package
2. Set `FREESOLO_API_KEY`
3. Wrap your OpenAI, Anthropic, Gemini, or OpenAI-compatible client
4. Run traces and evaluations from the package APIs

## Current provider support

`freesolo` currently supports automatic client instrumentation for:

- OpenAI
- Anthropic
- Gemini
- OpenAI-compatible clients via `wrap(...)` / `wrap_provider(...)`

## Install

Install the package plus the provider client you use:

```bash
pip install freesolo openai
```

or

```bash
pip install freesolo anthropic
```

or

```bash
pip install freesolo google-genai
```

## Environment

- `FREESOLO_API_KEY`
- `FREESOLO_BASE_URL` (optional, defaults to `https://api.freesolo.co`)

```bash
export FREESOLO_API_KEY=fslo_...
```
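
If you want to fail fast when the key is missing, a startup check along these lines may help. This is a minimal sketch: `load_freesolo_settings` and `DEFAULT_BASE_URL` are hypothetical names, not part of `freesolo`. The code only mirrors the two variables and the default listed above.

```python
import os
from collections.abc import Mapping

# Default taken from the Environment list above.
DEFAULT_BASE_URL = "https://api.freesolo.co"


def load_freesolo_settings(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (api_key, base_url); raise if the API key is missing.

    Hypothetical helper for illustration, not a freesolo API.
    """
    api_key = env.get("FREESOLO_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("FREESOLO_API_KEY is not set")
    # Fall back to the documented default when no override is given.
    base_url = env.get("FREESOLO_BASE_URL", "").strip() or DEFAULT_BASE_URL
    return api_key, base_url
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.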

## Quickstart

```python
from openai import OpenAI
from freesolo import wrap

client = wrap(OpenAI())

result = client.responses.create(
    model="gpt-4.1-mini",
    instructions="Reply in plain text.",
    input=[
        {
            "role": "user",
            "content": [{"type": "input_text", "text": "How do I reset my password?"}],
        }
    ],
)

print(result.output_text or "")
```

## OpenRouter Quickstart

```python
from openai import OpenAI
from freesolo import wrap

client = wrap(
    OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_API_KEY",
    )
)

response = client.chat.completions.create(
    model="openai/gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "Reply in plain text."},
        {"role": "user", "content": "Write a one-sentence launch blurb."},
    ],
    max_tokens=120,
)

print(response.choices[0].message.content or "")
```

## Gemini Quickstart

```python
from google import genai
from freesolo import instrument_gemini

client = instrument_gemini(genai.Client())

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Write a one-sentence release note for traced Gemini support.",
)

print(response.text or "")
```

## Group Multiple Model Calls

For agentic or long-horizon tasks, strongly prefer wrapping the whole task in `start_trace(...)` so all of the model calls land in one trace.

For a single one-off OpenAI, Anthropic, or Gemini request, you can skip it.

```python
from anthropic import Anthropic
from freesolo import instrument_anthropic, start_trace

client = instrument_anthropic(Anthropic())

with start_trace("support-agent-run"):
    first = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=64,
        messages=[{"role": "user", "content": "Say hello"}],
    )
    second = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=64,
        messages=[{"role": "user", "content": "Say goodbye"}],
    )
```

## What Gets Stored

- Trace title if you explicitly pass it to `start_trace("...")`
- Trace metadata if you explicitly pass it to `start_trace(..., metadata=...)`
- Input payloads with `system_prompt`, `user_prompt`, and `images`
- Output payloads as plain text
- Token usage when available
- Image inputs with inline previews for the trace UI

## Notes

- You do not need `@trace()` for ordinary LLM tracing.
- A single instrumented OpenAI, Anthropic, or Gemini request creates a trace automatically.
- For OpenAI-compatible providers like OpenRouter, prefer `wrap(...)` instead of provider-specific helpers.
- For agentic or long-horizon workflows, we strongly recommend `start_trace("descriptive-title")` so planning, retries, and follow-up calls stay grouped.
- Delivery is best-effort by default. Trace ingestion failures do not break your app.
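
To illustrate what best-effort delivery means in practice, here is a minimal sketch of the general pattern. It is not `freesolo`'s actual implementation: `deliver_best_effort` and the `send` callback are hypothetical names used only for this example.

```python
import logging
from collections.abc import Callable
from typing import Any

logger = logging.getLogger("best_effort_example")


def deliver_best_effort(
    send: Callable[[dict[str, Any]], None],
    payload: dict[str, Any],
) -> bool:
    """Try to deliver a payload; log and swallow any failure.

    Hypothetical helper: returns True on success, False on failure.
    """
    try:
        send(payload)
    except Exception:
        # Delivery problems are logged, never raised to the caller.
        logger.warning("trace delivery failed", exc_info=True)
        return False
    return True
```

The caller's own code path continues either way; only the log records that a payload was dropped.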

## Evaluations

`freesolo` also includes a small evaluation API for CI jobs, GitHub bots, and
eval scripts. All evaluation runs require `FREESOLO_API_KEY` or an explicit
`api_key`.

Evaluation data is a list of plain dictionaries. There is no separate `Example`
class to construct.

Define scorers by subclassing `CustomScorer` and returning `BinaryResponse` or
`NumericResponse`. Scorers run in your process, and Freesolo uploads the final
results with your API key. Pass scorer objects, not strings.

```python
from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class ExactMatch(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
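        actual = str(row.get("actual_output", "")).strip()
        expected = str(row.get("expected_output", "")).strip()
        # An empty actual_output never counts as a match.
        matched = bool(actual) and actual == expected
        return BinaryResponse(
            value=matched,
            reason=(
                "actual_output matched expected_output"
                if matched
                else "actual_output did not match expected_output"
            ),
        )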


client = EvaluationClient()

results = client.run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris",
            "expected_output": "Paris",
        }
    ],
    scorers=[ExactMatch()],
)

print(results[0].success)
```
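
Because evaluation data is just a list of dictionaries, rows kept in a JSON Lines file can be loaded with the standard library alone. The sketch below assumes one JSON object per line; `rows_from_jsonl` is a hypothetical helper, not a `freesolo` function.

```python
import json
from collections.abc import Iterable
from typing import Any


def rows_from_jsonl(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse JSON Lines text into a list of plain dictionaries.

    Hypothetical helper: blank lines are skipped, non-object lines are rejected.
    """
    rows: list[dict[str, Any]] = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        rows.append(row)
    return rows
```

An open file handle is an iterable of lines, so `rows_from_jsonl(handle)` works directly, and the returned list can be passed as `data=`.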

## Tinker Deployment

`freesolo.utils.deployment` is a thin proxy for the Modal deployment server. It posts
a Tinker checkpoint URL to the pinned Modal `/deployments` endpoint and returns
the server's JSON response.

```python
from freesolo.utils.deployment import deploy_tinker_checkpoint

result = deploy_tinker_checkpoint(
    "tinker://<run_id>/sampler_weights/final",
    base_model="Qwen/Qwen3.5-35B-A3B",
)

print(result["repoId"])
```

## Environment-driven evaluations

For training contracts, you can use the same `Environment` adapter for evals,
SFT, and GRPO. `run_environment` loads the examples, builds prompt messages, calls
your model callback, scores each response through the environment, and uploads
the results in the same `scorers_data` shape used by the eval DB.

```python
from typing import Any

from openai import OpenAI

from freesolo.environments import (
    Environment,
    EnvironmentGeneration,
    RewardMetric,
    RewardResult,
    TaskExample,
)
from freesolo.evaluation import EvaluationClient


class ContractEnvironment(Environment):
    def build_prompt_messages(
        self,
        example: TaskExample,
        contract_text: str,
    ):
        return [
            {"role": "system", "content": contract_text},
            {"role": "user", "content": example.task},
        ]

    def score_response(
        self,
        example: TaskExample,
        response_text: str,
    ) -> RewardResult:
        passed = response_text.strip() == str(example.expected_output).strip()
        return RewardResult(
            name="exact_match",
            score=1.0 if passed else 0.0,
            success=passed,
            threshold=1.0,
            reason="matched expected output" if passed else "mismatch",
            return_type="binary",
            metrics=(
                RewardMetric(
                    name="canonical_match",
                    score=1.0 if passed else 0.0,
                    success=passed,
                    threshold=1.0,
                ),
            ),
        )


model = OpenAI()


def generate(messages: list[dict[str, str]], example: TaskExample):
    response = model.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages,
    )
    return EnvironmentGeneration(
        response_text=response.choices[0].message.content or "",
        total_tokens=response.usage.total_tokens if response.usage else None,
    )


results = EvaluationClient().run_environment(
    name="contract-eval",
    source="eval.jsonl",
    contract_path="TRAINING_CONTRACT.md",
    environment=ContractEnvironment(),
    generate=generate,
)
```
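
The per-example flow that `run_environment` is described as performing (build messages, call the model callback, score the response) can be pictured with the simplified stand-in below. It is not the library's implementation: `Example`, `build_messages`, `score`, and `run_loop` are hypothetical names, the callback returns a plain string rather than an `EnvironmentGeneration`, and the loading and upload steps are omitted.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Any


@dataclass
class Example:
    """Stand-in for a task example with a prompt and an expected answer."""

    task: str
    expected_output: str


def build_messages(example: Example, contract_text: str) -> list[dict[str, str]]:
    # The contract becomes the system prompt, the task the user prompt.
    return [
        {"role": "system", "content": contract_text},
        {"role": "user", "content": example.task},
    ]


def score(example: Example, response_text: str) -> dict[str, Any]:
    # Exact-match scoring, mirroring the environment shown above.
    passed = response_text.strip() == example.expected_output.strip()
    return {"name": "exact_match", "score": 1.0 if passed else 0.0, "success": passed}


def run_loop(
    examples: list[Example],
    contract_text: str,
    generate: Callable[[list[dict[str, str]], Example], str],
) -> list[dict[str, Any]]:
    """Build messages, call the model callback, and score each example in order."""
    results = []
    for example in examples:
        messages = build_messages(example, contract_text)
        response_text = generate(messages, example)
        results.append(score(example, response_text))
    return results
```

Keeping prompt building and scoring on the environment is what lets the same adapter serve evals and training: only the `generate` callback changes.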

`RewardResult` is the top-level scorer entry stored in
`eval_tasks.scorers_data`. Its fields are:

- `name`: scorer name shown in the UI.
- `score`: numeric reward value.
- `success`: pass/fail. If omitted, Freesolo derives it from `threshold`, then
  from whether `score > 0`.
- `threshold`, `value`, `reason`, `error`, `return_type`: scorer display and
  pass/fail context.
- `latency_ms`, `total_tokens`: optional per-response usage metadata.
- `metadata`: JSON object for scorer-specific details.
- `metrics`: optional `RewardMetric` components, also JSON-only, with `name`,
  `score`, `value`, `success`, `threshold`, `weight`, `reason`, and `metadata`.
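
The fallback order for `success` can be sketched as follows. `derive_success` is a hypothetical function written for illustration, and the `>=` comparison against `threshold` is an assumption: the text above only says that `success` is derived from `threshold`, not which comparison is used.

```python
from typing import Optional


def derive_success(
    score: float,
    success: Optional[bool] = None,
    threshold: Optional[float] = None,
) -> bool:
    """Resolve pass/fail: explicit success, then threshold, then score > 0."""
    if success is not None:
        return success
    if threshold is not None:
        # Assumption: meeting the threshold exactly counts as passing.
        return score >= threshold
    return score > 0
```

For example, `derive_success(0.5, threshold=0.8)` is `False`, while `derive_success(0.5)` is `True` because only the `score > 0` rule applies.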

Custom scorer:

```python
from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class NoEmptyAnswer(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
        ok = bool(str(row.get("actual_output", "")).strip())
        return BinaryResponse(
            value=ok,
            reason="actual_output is non-empty" if ok else "actual_output is empty",
        )


results = EvaluationClient().run(
    name="support-agent-non-empty",
    data=[{"actual_output": "hello"}],
    scorers=[NoEmptyAnswer()],
)
```

LLM-as-judge is also a custom scorer. The scorer can call your judge model and
return a `NumericResponse`; Freesolo stores the eval run and score output with
your `FREESOLO_API_KEY`. This example uses `OPENAI_API_KEY` for the judge model
call and `FREESOLO_API_KEY` for eval upload.

```python
import json
from typing import Any

from openai import OpenAI

from freesolo import instrument_openai
from freesolo.evaluation import CustomScorer, EvaluationClient, NumericResponse


class CorrectnessJudge(CustomScorer[NumericResponse]):
    name = "correctness_llm_judge"
    threshold = 0.8

    def __init__(self, client: OpenAI) -> None:
        self.client = client

    async def score(self, row: dict[str, Any]) -> NumericResponse:
        response = self.client.responses.create(
            model="gpt-4.1-mini",
            instructions=(
                "Grade correctness from 0.0 to 1.0. "
                "Return JSON only: {\"score\": 0.0, \"reason\": \"...\"}"
            ),
            input=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": json.dumps(
                                {
                                    "input": row.get("input", ""),
                                    "actual_output": row.get("actual_output", ""),
                                    "expected_output": row.get("expected_output", ""),
                                }
                            ),
                        }
                    ],
                }
            ],
        )

        parsed = json.loads(response.output_text or "{}")
        return NumericResponse(
            value=float(parsed["score"]),
            reason=str(parsed.get("reason", "")),
        )


judge_client = instrument_openai(OpenAI())

results = EvaluationClient().run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris is the capital of France.",
            "expected_output": "Paris",
        }
    ],
    scorers=[CorrectnessJudge(judge_client)],
)
```
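
Judge models do not always return well-formed JSON, so it can be worth validating the reply before building a `NumericResponse`. The sketch below is one possible approach using only the standard library; `parse_judge_output` is a hypothetical helper, and clamping to the 0.0 to 1.0 range is a design choice made for this example, not documented `freesolo` behavior.

```python
import json


def parse_judge_output(text: str) -> tuple[float, str]:
    """Parse a '{"score": ..., "reason": ...}' reply into (score, reason).

    Hypothetical helper: raises ValueError when the reply is unusable.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {text!r}") from exc
    if not isinstance(parsed, dict) or "score" not in parsed:
        raise ValueError(f"judge reply has no score: {text!r}")
    try:
        score = float(parsed["score"])
    except (TypeError, ValueError) as exc:
        raise ValueError(f"judge score is not numeric: {text!r}") from exc
    # Clamp to the 0.0-1.0 range that the judge prompt asks for.
    score = min(1.0, max(0.0, score))
    return score, str(parsed.get("reason", ""))
```

Inside a scorer, the returned pair maps directly onto `NumericResponse(value=..., reason=...)`.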

Hosted scorers are also available out of the box and use OpenRouter by default:

- `ReferenceCorrectnessScorer`
- `RubricScorer`
- `GroundednessScorer`
- `InstructionFollowingScorer`
- `PairwisePreferenceScorer`

```python
from freesolo.evaluation import HostedJudgeClient, ReferenceCorrectnessScorer

judge = HostedJudgeClient(api_key="YOUR_OPENROUTER_API_KEY")

scorer = ReferenceCorrectnessScorer(client=judge)
```

The tracing helpers can also be imported from the `freesolo.tracing` namespace:

```python
from freesolo.tracing import start_trace

with start_trace("support-agent-run"):
    ...
```
