Metadata-Version: 2.4
Name: freesolo
Version: 0.2.2
Summary: Tracing and evaluation SDK for LLM applications.
Requires-Python: >=3.10
Requires-Dist: httpx>=0.27.0
Provides-Extra: dev
Requires-Dist: ruff>=0.11.0; extra == 'dev'
Provides-Extra: examples
Requires-Dist: anthropic>=0.40.0; extra == 'examples'
Requires-Dist: google-genai>=1.0.0; extra == 'examples'
Requires-Dist: openai>=1.0.0; extra == 'examples'
Description-Content-Type: text/markdown

# freesolo

`freesolo` is a Python tracing and evaluation package for LLM apps.

It is built for the lowest-friction integration possible:

1. Install the package
2. Set `FREESOLO_API_KEY`
3. Wrap your OpenAI, Anthropic, Gemini, or OpenAI-compatible client
4. Run traces and evaluations from the same SDK

## Current provider support

`freesolo` currently supports automatic client instrumentation for:

- OpenAI
- Anthropic
- Gemini
- OpenAI-compatible clients via `wrap(...)` / `wrap_provider(...)`

## Install

Install the package plus the provider SDK you use:

```bash
pip install freesolo openai
```

or

```bash
pip install freesolo anthropic
```

or

```bash
pip install freesolo google-genai
```

## Environment

- `FREESOLO_API_KEY` (required)
- `FREESOLO_BASE_URL` (optional, defaults to `https://api.freesolo.co`)

```bash
export FREESOLO_API_KEY=fslo_...
```
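
A script can fail fast when the key is missing instead of discovering it at upload time. The sketch below resolves the two variables as the list above describes them. `resolve_freesolo_env` is a hypothetical helper written for illustration, not part of the SDK, and the SDK's own lookup may differ.

```python
import os
from collections.abc import Mapping
from typing import Optional

DEFAULT_BASE_URL = "https://api.freesolo.co"  # default documented above


def resolve_freesolo_env(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Hypothetical helper: read the two documented variables from `env`."""
    env = os.environ if env is None else env
    api_key = env.get("FREESOLO_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("FREESOLO_API_KEY is not set")
    # Fall back to the documented default when no override is given.
    base_url = env.get("FREESOLO_BASE_URL", "").strip() or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}
```

Calling `resolve_freesolo_env()` with no argument reads from the process environment; passing a dictionary makes the helper easy to exercise in tests.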

## Quickstart

```python
from openai import OpenAI
from freesolo import wrap

client = wrap(OpenAI())

result = client.responses.create(
    model="gpt-4.1-mini",
    instructions="Reply in plain text.",
    input=[
        {
            "role": "user",
            "content": [{"type": "input_text", "text": "How do I reset my password?"}],
        }
    ],
)

print(result.output_text or "")
```

## OpenRouter Quickstart

```python
from openai import OpenAI
from freesolo import wrap

client = wrap(
    OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_API_KEY",
    )
)

response = client.chat.completions.create(
    model="openai/gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "Reply in plain text."},
        {"role": "user", "content": "Write a one-sentence launch blurb."},
    ],
    max_tokens=120,
)

print(response.choices[0].message.content or "")
```

## Gemini Quickstart

```python
from google import genai
from freesolo import instrument_gemini

client = instrument_gemini(genai.Client())

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Write a one-sentence release note for traced Gemini support.",
)

print(response.text or "")
```

## Group Multiple Model Calls

For agentic or long-horizon tasks, strongly prefer wrapping the whole task in `start_trace(...)` so all of the model calls land in one trace.

For a single one-off OpenAI, Anthropic, or Gemini request, you can skip it.

```python
from anthropic import Anthropic
from freesolo import instrument_anthropic, start_trace

client = instrument_anthropic(Anthropic())

with start_trace("support-agent-run"):
    first = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=64,
        messages=[{"role": "user", "content": "Say hello"}],
    )
    second = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=64,
        messages=[{"role": "user", "content": "Say goodbye"}],
    )
```

## What Gets Stored

- Trace title if you explicitly pass it to `start_trace("...")`
- Trace metadata if you explicitly pass it to `start_trace(..., metadata=...)`
- Input payloads with `system_prompt`, `user_prompt`, and `images`
- Output payloads as plain text
- Token usage when available
- Image inputs with inline previews for the trace UI

## Notes

- You do not need `@trace()` for ordinary LLM tracing.
- A single instrumented OpenAI, Anthropic, or Gemini request creates a trace automatically.
- For OpenAI-compatible providers like OpenRouter, prefer `wrap(...)` instead of provider-specific helpers.
- For agentic or long-horizon workflows, use `start_trace("descriptive-title")` so planning, retries, and follow-up calls stay grouped.
- Delivery is best-effort by default. Trace ingestion failures do not break your app.
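
Best-effort delivery means a failed upload is reported and dropped rather than raised into the calling code. The sketch below shows the general pattern only; it is not freesolo's implementation, and `send_best_effort` and `flaky_upload` are hypothetical names.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("best_effort_demo")


def send_best_effort(send: Callable[[dict[str, Any]], None], payload: dict[str, Any]) -> bool:
    """Call `send(payload)`; return False on failure instead of raising."""
    try:
        send(payload)
        return True
    except Exception as exc:  # deliberately broad: telemetry must not crash the app
        logger.warning("trace delivery failed: %s", exc)
        return False


def flaky_upload(payload: dict[str, Any]) -> None:
    # Stand-in for a network call that fails.
    raise ConnectionError("ingestion endpoint unreachable")


delivered = send_best_effort(flaky_upload, {"title": "support-agent-run"})
answer = "application code keeps running"
```

The trade-off is that a dropped trace goes unnoticed unless you watch the logs, which is usually acceptable for telemetry but not for data you must keep.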

## Evaluations

`freesolo` also includes a small evaluation SDK for CI jobs, GitHub bots, and
eval scripts. All evaluation runs require `FREESOLO_API_KEY` or an explicit
`api_key`.

Evaluation data is a list of plain dictionaries. There is no separate `Example`
class to construct.
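
Because rows are plain dictionaries, they can come from any source that yields JSON objects. The sketch below loads them from JSON Lines text; `load_rows` and the file name in the comment are hypothetical, and the keys mirror the ones used in this README's examples rather than a schema the SDK is known to enforce.

```python
import io
import json
from typing import Any, TextIO


def load_rows(stream: TextIO) -> list[dict[str, Any]]:
    """Read one JSON object per non-empty line into a list of plain dicts."""
    rows: list[dict[str, Any]] = []
    for line_number, line in enumerate(stream, start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        rows.append(row)
    return rows


# In-memory stand-in for a file such as open("cases.jsonl") (hypothetical name).
sample = io.StringIO(
    '{"input": "What is the capital of France?", "actual_output": "Paris", "expected_output": "Paris"}\n'
    "\n"
    '{"input": "2 + 2?", "actual_output": "4", "expected_output": "4"}\n'
)
data = load_rows(sample)
```

The resulting `data` list has the same shape as the `data=[...]` argument passed to `evals.run(...)` in the examples in this section.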

Define scorers by subclassing `CustomScorer` and returning `BinaryResponse` or
`NumericResponse`. Scorers run in your process, and Freesolo uploads the final
results with your API key. Pass scorer objects, not strings.

```python
from typing import Any

from freesolo import Freesolo
from freesolo.evaluation import BinaryResponse, CustomScorer


class ExactMatch(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
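        actual = str(row.get("actual_output", "")).strip()
        expected = str(row.get("expected_output", "")).strip()
        # An empty actual_output never counts as a match.
        matched = bool(actual) and actual == expected
        return BinaryResponse(
            value=matched,
            reason=(
                "actual_output matched expected_output"
                if matched
                else "actual_output did not match expected_output"
            ),
        )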


client = Freesolo()

results = client.evals.run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris",
            "expected_output": "Paris",
        }
    ],
    scorers=[ExactMatch()],
)

print(results[0].success)
```
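
In a CI job you will usually want to fail the build when too many rows fail. The sketch below computes a pass rate from the `success` attribute printed above. `pass_rate` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for real results, and any other attribute of the result objects is outside what this README documents.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def pass_rate(results: Iterable[Any]) -> float:
    """Fraction of results whose `success` attribute is truthy (0.0 when empty)."""
    items = list(results)
    if not items:
        return 0.0
    return sum(1 for item in items if item.success) / len(items)


# Stand-ins for the objects returned by `client.evals.run(...)`.
fake_results = [
    SimpleNamespace(success=True),
    SimpleNamespace(success=False),
    SimpleNamespace(success=True),
    SimpleNamespace(success=True),
]
rate = pass_rate(fake_results)
if rate < 0.75:
    # A non-zero exit status is what makes the CI step fail.
    raise SystemExit(f"pass rate {rate:.0%} is below threshold")
```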

Another custom scorer, this one checking only that the answer is non-empty:

```python
from typing import Any

from freesolo import Freesolo
from freesolo.evaluation import BinaryResponse, CustomScorer


class NoEmptyAnswer(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
        ok = bool(str(row.get("actual_output", "")).strip())
        # Report the reason that matches the outcome.
        reason = "actual_output is non-empty" if ok else "actual_output is empty"
        return BinaryResponse(value=ok, reason=reason)


results = Freesolo().evals.run(
    name="support-agent-non-empty",
    data=[{"actual_output": "hello"}],
    scorers=[NoEmptyAnswer()],
)
```

LLM-as-judge is also a custom scorer. The scorer can call your judge model and
return a `NumericResponse`; Freesolo stores the eval run and score output with
your `FREESOLO_API_KEY`. This example uses `OPENAI_API_KEY` for the judge model
call and `FREESOLO_API_KEY` for eval upload.

```python
import json
from typing import Any

from openai import OpenAI

from freesolo import Freesolo, instrument_openai
from freesolo.evaluation import CustomScorer, NumericResponse


class CorrectnessJudge(CustomScorer[NumericResponse]):
    name = "correctness_llm_judge"
    threshold = 0.8

    def __init__(self, client: OpenAI) -> None:
        self.client = client

    async def score(self, row: dict[str, Any]) -> NumericResponse:
        # Synchronous call: it blocks the event loop until the judge responds.
        response = self.client.responses.create(
            model="gpt-4.1-mini",
            instructions=(
                "Grade correctness from 0.0 to 1.0. "
                "Return JSON only: {\"score\": 0.0, \"reason\": \"...\"}"
            ),
            input=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": json.dumps(
                                {
                                    "input": row.get("input", ""),
                                    "actual_output": row.get("actual_output", ""),
                                    "expected_output": row.get("expected_output", ""),
                                }
                            ),
                        }
                    ],
                }
            ],
        )

        # Fall back to a zero score if the judge returns no text or omits "score".
        parsed = json.loads(response.output_text or "{}")
        return NumericResponse(
            value=float(parsed.get("score", 0.0)),
            reason=str(parsed.get("reason", "")),
        )


judge_client = instrument_openai(OpenAI())

results = Freesolo().evals.run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris is the capital of France.",
            "expected_output": "Paris",
        }
    ],
    scorers=[CorrectnessJudge(judge_client)],
)
```
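
Judge models do not always return clean JSON, so a scorer like the one above may benefit from defensive parsing. The sketch below is one possible approach: `parse_judge_output` is a hypothetical helper, and scoring unparseable replies as `0.0` is an assumption you may want to replace with a retry or an error.

```python
import json
from typing import Any


def parse_judge_output(text: str) -> tuple[float, str]:
    """Extract (score, reason) from a judge reply, clamping the score to [0.0, 1.0]."""
    cleaned = text.strip()
    # Some models wrap JSON in a Markdown fence; keep only the outermost {...}.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return 0.0, "judge returned no JSON object"
    try:
        parsed: Any = json.loads(cleaned[start : end + 1])
        score = float(parsed["score"])
    except (ValueError, KeyError, TypeError):
        return 0.0, "judge returned malformed JSON"
    return min(1.0, max(0.0, score)), str(parsed.get("reason", ""))
```

Inside a scorer, the returned pair maps directly onto `NumericResponse(value=..., reason=...)`.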

Hosted scorers are also available out of the box and use OpenRouter by default:

- `ReferenceCorrectnessScorer`
- `RubricScorer`
- `GroundednessScorer`
- `InstructionFollowingScorer`
- `PairwisePreferenceScorer`

```python
from freesolo.evaluation import HostedJudgeClient, ReferenceCorrectnessScorer

judge = HostedJudgeClient(
    api_key="YOUR_OPENROUTER_API_KEY",
    model="openai/gpt-oss-120b",
)

scorer = ReferenceCorrectnessScorer(client=judge)
```

Tracing is available from the same root client:

```python
from freesolo import Freesolo

client = Freesolo()

with client.traces.start("support-agent-run"):
    ...
```

You can also import namespaced tracing helpers directly:

```python
from freesolo.tracing import start_trace, wrap
```
