Metadata-Version: 2.4
Name: freesolo
Version: 0.2.4
Summary: Tracing, evaluation, and training utilities for LLM applications.
Requires-Python: >=3.11
Requires-Dist: gepa>=0.1.1
Requires-Dist: httpx>=0.27.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: opentelemetry-api>=1.28.0
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.28.0
Requires-Dist: opentelemetry-sdk>=1.28.0
Requires-Dist: pymongo>=4.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: tinker-cookbook>=0.3.0
Requires-Dist: tinker>=0.19.0
Requires-Dist: wandb>=0.17.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.11.0; extra == 'dev'
Provides-Extra: examples
Requires-Dist: openai>=1.0.0; extra == 'examples'
Description-Content-Type: text/markdown

# freesolo

`freesolo` is a Python package for tracing, evaluating, and training LLM apps.

Integration is meant to be low-friction:

1. Install the package
2. Set `FREESOLO_API_KEY`
3. Configure the tracer
4. Run traces and evaluations from the package APIs

## Install

Install the package:

```bash
pip install freesolo
```

## Environment

- `FREESOLO_API_KEY` (required; evaluation runs also accept an explicit `api_key`)
- `FREESOLO_BASE_URL` (optional, defaults to `https://api.freesolo.co`)

```bash
export FREESOLO_API_KEY=fslo_...
```
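
As a rough illustration of how these two variables resolve, the sketch below reads them from an environment mapping and applies the documented default base URL. `resolve_freesolo_settings` is a hypothetical helper written for this README, not a function exported by the package, and the package's own lookup may differ.

```python
import os

DEFAULT_BASE_URL = "https://api.freesolo.co"  # documented default


def resolve_freesolo_settings(env=None):
    """Hypothetical helper: return (api_key, base_url) from an env mapping."""
    env = os.environ if env is None else env
    api_key = env.get("FREESOLO_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("FREESOLO_API_KEY is not set")
    # Fall back to the documented default when the override is missing or blank.
    base_url = env.get("FREESOLO_BASE_URL", "").strip() or DEFAULT_BASE_URL
    return api_key, base_url.rstrip("/")
```

Calling it early in a CI job is one way to fail fast on a missing key before any traces or evals run.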

## Quickstart

```python
from freesolo.tracing import configure_tracer, get_tracer

configure_tracer(service_name="my-llm-app")
tracer = get_tracer()

with tracer.start_as_current_span(
    "model.call",
    attributes={
        "gen_ai.system": "openai",
        "gen_ai.request.model": "gpt-5.5",
        "freesolo.input": {"prompt": "How do I reset my password?"},
    },
) as span:
    result = "Reset it from account settings."
    span.set_attribute("freesolo.output", result)
```

## Runnable Examples

Copy-pasteable examples live in [`examples/`](examples/):

- `tracing_manual_span.py`: configure OpenTelemetry and send one application span.
- `evaluation_custom_scorer.py`: run custom binary and numeric eval scorers.
- `evaluation_from_files.py`: run evals from a concrete dataset and environment.
- `environment.py`: example environment used by evals, training, and GEPA.
- `support_dataset.py`: example dataset paths and loaders used by evals, SFT, GRPO, and GEPA.
- `gepa_prompt_example.py`: run the Freesolo GEPA adapter over the example dataset.
- `training_sft_grpo.py`: start SFT or GRPO training runs from package APIs.

From a repo checkout:

```bash
cd freesolo-sdk
export PYTHONPATH="$PWD/pypi"
uv run python examples/evaluation_custom_scorer.py --local
```

## Public API

The root `freesolo` module intentionally exports no functions. Import from the
subpackages below; lower-level modules may be importable, but they are
implementation helpers unless they appear here or in an example.

| Import | Use case |
| --- | --- |
| `freesolo.tracing.configure_tracer`, `get_tracer`, `force_flush`, `shutdown` | Send OpenTelemetry traces from an application to Freesolo. |
| `freesolo.evaluation.EvaluationClient` | Run custom-scorer evals or environment evals and upload results to Freesolo. |
| `freesolo.evaluation.run_local_evaluation` | Run custom scorers locally without uploading results. |
| `freesolo.evaluation.CustomScorer`, `BinaryResponse`, `NumericResponse` | Define local scorer logic for eval rows. |
| `freesolo.evaluation.HostedJudgeClient` and hosted scorer classes | Use hosted LLM-as-judge scorers with OpenRouter-compatible credentials. |
| `freesolo.datasets.TaskExample`, `Dataset`, `load_dataset` | Load task examples and construct labeled conversations for evals or training. |
| `freesolo.environments.Environment`, `RewardResult`, `RewardMetric`, `GrpoConfig`, `EnvironmentGeneration` | Define task behavior once for evals, GEPA, SFT, and GRPO. |
| `freesolo.training.SftConfig`, `TrainGrpoOptions`, `train_sft`, `train_grpo` | Start SFT or GRPO training from package APIs. |
| `freesolo.gepa.GEPASetup`, `GEPAConfig`, `DefaultReflectionAgent`, `attach_gepa`, `optimize_gepa` | Optimize prompts through the GEPA adapter using the same environment and dataset abstractions. |
| `freesolo.contracts.load_contract_text`, `extract_contract_spec`, `load_contract_spec`, `build_oracle_messages` | Read contract markdown and build oracle prompt messages. |
| `freesolo.utils.oracle.generate_ground_truth_records` | Generate ground-truth JSONL records from source examples using a contract, environment, and oracle model. |
| `freesolo.utils.upload.upload_tinker_checkpoint_to_huggingface` | Upload a Tinker checkpoint to a private Hugging Face model repo. |

## What Gets Stored

- Native OTLP traces and spans
- Resource attributes like `service.name`
- Span names, timings, parent span ids, status, and errors
- Common model attributes such as `gen_ai.system`, `gen_ai.request.model`, and token counts
- Optional `freesolo.input` and `freesolo.output` span attributes
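
OpenTelemetry attribute values are generally limited to strings, booleans, numbers, and sequences of those. If a structured input or output is not accepted as-is in your setup, one option is to JSON-encode it before setting `freesolo.input` or `freesolo.output`. The helper below is a hypothetical sketch, not part of the package.

```python
import json


def to_span_attribute(value):
    """Hypothetical helper: keep primitives as-is and JSON-encode everything else."""
    if isinstance(value, (str, bool, int, float)):
        return value
    # sort_keys keeps the encoding stable; default=str covers values such as datetimes.
    return json.dumps(value, sort_keys=True, default=str)
```

For example, `span.set_attribute("freesolo.input", to_span_attribute({"prompt": "..."}))` stores the payload as a single JSON string.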

## Notes

- Tracing uses native OpenTelemetry protobuf export to `/api/traces/ingest`.
- Configure third-party OpenTelemetry instrumentors against the provider returned by `configure_tracer(...)`.
- Delivery is handled by the OpenTelemetry span processor you configure.

## Evaluations

`freesolo` also includes a small evaluation API for CI jobs, GitHub bots, and
eval scripts. All evaluation runs require `FREESOLO_API_KEY` or an explicit
`api_key`.

Evaluation data is a list of plain dictionaries. There is no separate `Example`
class to construct.
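
Because rows are plain dictionaries, they can come from any source. The sketch below builds a row list from JSONL lines using only the standard library; `rows_from_jsonl` is a hypothetical helper, and the `input`, `actual_output`, and `expected_output` keys simply mirror the ones used in the examples in this README.

```python
import json


def rows_from_jsonl(lines):
    """Hypothetical helper: parse one JSON object per non-empty line."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        rows.append(row)
    return rows
```

An open file iterates line by line, so `rows_from_jsonl(open("eval.jsonl", encoding="utf-8"))` would produce a list suitable for the `data` argument.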

Define scorers by subclassing `CustomScorer` and returning `BinaryResponse` or
`NumericResponse`. Scorers run in your process, and Freesolo uploads the final
results with your API key. Pass scorer objects, not strings.

```python
from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class ExactMatch(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
        actual = str(row.get("actual_output", "")).strip()
        expected = str(row.get("expected_output", "")).strip()
        matched = bool(actual) and actual == expected
        return BinaryResponse(
            value=matched,
            reason=(
                "actual_output matched expected_output"
                if matched
                else "actual_output did not match expected_output"
            ),
        )


client = EvaluationClient()

results = client.run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris",
            "expected_output": "Paris",
        }
    ],
    scorers=[ExactMatch()],
)

print(results[0].success)
```

## Tinker Hugging Face Upload

`freesolo.utils.upload` posts a Tinker checkpoint URL to the Freesolo upload
service and returns the Hugging Face upload response.

```python
from freesolo.utils.upload import upload_tinker_checkpoint_to_huggingface

result = upload_tinker_checkpoint_to_huggingface(
    "tinker://<run_id>/sampler_weights/final",
    base_model="Qwen/Qwen3.5-35B-A3B",
)

print(result["repoId"])
```

## Environment-driven evaluations

When working from a training contract, `Environment` describes task behavior
for evals and GRPO/RL training: prompt construction, response normalization,
and reward scoring. Dataset loading and labeled conversation construction live
in `freesolo.datasets`. `run_environment` loads task examples, calls your model
callback, scores each response through the environment, and uploads the results
in the same `scorers_data` shape used by the eval DB.

```python
from typing import Any

from openai import OpenAI

from freesolo.datasets import TaskExample
from freesolo.environments import (
    Environment,
    EnvironmentGeneration,
    RewardMetric,
    RewardResult,
)
from freesolo.evaluation import EvaluationClient


class PromptEnvironment(Environment):
    def build_prompt_messages(
        self,
        example: TaskExample,
        prompt_text: str,
    ):
        return [
            {"role": "system", "content": prompt_text},
            {"role": "user", "content": example.task},
        ]

    def score_response(
        self,
        example: TaskExample,
        response_text: str,
    ) -> RewardResult:
        passed = response_text.strip() == str(example.expected_output).strip()
        return RewardResult(
            name="exact_match",
            score=1.0 if passed else 0.0,
            success=passed,
            threshold=1.0,
            reason="matched expected output" if passed else "mismatch",
            return_type="binary",
            metrics=(
                RewardMetric(
                    name="canonical_match",
                    score=1.0 if passed else 0.0,
                    success=passed,
                    threshold=1.0,
                ),
            ),
        )


model = OpenAI()


def generate(messages: list[dict[str, str]], example: TaskExample):
    response = model.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages,
    )
    return EnvironmentGeneration(
        response_text=response.choices[0].message.content or "",
        total_tokens=response.usage.total_tokens if response.usage else None,
    )


results = EvaluationClient().run_environment(
    name="contract-eval",
    source="eval.jsonl",
    contract_path="TRAINING_CONTRACT.md",
    environment=PromptEnvironment(),
    generate=generate,
)
```

`RewardResult` is the top-level scorer entry stored in
`eval_tasks.scorers_data`. Its fields are:

- `name`: scorer name shown in the UI.
- `score`: numeric reward value.
- `success`: pass/fail. If omitted, Freesolo derives it from `threshold`, then
  from whether `score > 0`.
- `threshold`, `value`, `reason`, `error`, `return_type`: scorer display and
  pass/fail context.
- `latency_ms`, `total_tokens`: optional per-response usage metadata.
- `metadata`: JSON object for scorer-specific details.
- `metrics`: optional `RewardMetric` components, also JSON-only, with `name`,
  `score`, `value`, `success`, `threshold`, `weight`, `reason`, and `metadata`.
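
The fallback order for `success` can be sketched in a few lines. This is an illustration of the rule described above, not the package's code; in particular, treating the threshold comparison as inclusive (`score >= threshold`) is an assumption.

```python
def derive_success(score, threshold=None, success=None):
    """Illustrative only: resolve pass/fail in the documented fallback order."""
    if success is not None:
        return bool(success)  # an explicit value wins
    if threshold is not None:
        return score >= threshold  # assumption: inclusive comparison
    return score > 0  # last resort: any positive score passes
```

Setting `success` explicitly, as the example environment does, avoids depending on the fallback at all.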

Custom scorer:

```python
from typing import Any

from freesolo.evaluation import BinaryResponse, CustomScorer, EvaluationClient


class NoEmptyAnswer(CustomScorer[BinaryResponse]):
    async def score(self, row: dict[str, Any]) -> BinaryResponse:
        ok = bool(str(row.get("actual_output", "")).strip())
        reason = "actual_output is non-empty" if ok else "actual_output is empty"
        return BinaryResponse(value=ok, reason=reason)


results = EvaluationClient().run(
    name="support-agent-non-empty",
    data=[{"actual_output": "hello"}],
    scorers=[NoEmptyAnswer()],
)
```

LLM-as-judge is also a custom scorer. The scorer can call your judge model and
return a `NumericResponse`; Freesolo stores the eval run and score output with
your `FREESOLO_API_KEY`. This example uses `OPENAI_API_KEY` for the judge model
call and `FREESOLO_API_KEY` for eval upload.

```python
import json
from typing import Any

from openai import OpenAI

from freesolo.evaluation import CustomScorer, EvaluationClient, NumericResponse


class CorrectnessJudge(CustomScorer[NumericResponse]):
    name = "correctness_llm_judge"
    threshold = 0.8

    def __init__(self, client: OpenAI) -> None:
        self.client = client

    async def score(self, row: dict[str, Any]) -> NumericResponse:
        response = self.client.responses.create(
            model="gpt-4.1-mini",
            instructions=(
                "Grade correctness from 0.0 to 1.0. "
                "Return JSON only: {\"score\": 0.0, \"reason\": \"...\"}"
            ),
            input=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": json.dumps(
                                {
                                    "input": row.get("input", ""),
                                    "actual_output": row.get("actual_output", ""),
                                    "expected_output": row.get("expected_output", ""),
                                }
                            ),
                        }
                    ],
                }
            ],
        )

        parsed = json.loads(response.output_text or "{}")
        return NumericResponse(
            value=float(parsed.get("score", 0.0)),
            reason=str(parsed.get("reason", "")),
        )


judge_client = OpenAI()

results = EvaluationClient().run(
    name="support-agent-correctness",
    data=[
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris is the capital of France.",
            "expected_output": "Paris",
        }
    ],
    scorers=[CorrectnessJudge(judge_client)],
)
```
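
Judge models do not always return clean JSON, so a scorer like the one above may want a more defensive parsing step. The sketch below is one possible approach using only the standard library; `parse_judge_output` is a hypothetical helper, and mapping unparseable output to a score of `0.0` is an assumption (raising an error instead may suit some pipelines better).

```python
import json


def parse_judge_output(text):
    """Hypothetical helper: return (score, reason) with score clamped to [0.0, 1.0]."""
    try:
        parsed = json.loads(text or "{}")
    except json.JSONDecodeError:
        return 0.0, "judge returned invalid JSON"
    if not isinstance(parsed, dict):
        return 0.0, "judge returned non-object JSON"
    try:
        score = float(parsed.get("score", 0.0))
    except (TypeError, ValueError):
        return 0.0, "judge returned a non-numeric score"
    # Clamp so an out-of-range grade cannot exceed the documented 0.0 to 1.0 scale.
    score = min(1.0, max(0.0, score))
    return score, str(parsed.get("reason", ""))
```

Inside `score`, the returned pair could feed `NumericResponse(value=..., reason=...)` directly.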

Hosted scorers are also available out of the box and use OpenRouter by default:

- `ReferenceCorrectnessScorer`
- `RubricScorer`
- `GroundednessScorer`
- `InstructionFollowingScorer`
- `PairwisePreferenceScorer`

```python
from freesolo.evaluation import HostedJudgeClient, ReferenceCorrectnessScorer

judge = HostedJudgeClient(api_key="YOUR_OPENROUTER_API_KEY")

scorer = ReferenceCorrectnessScorer(client=judge)
```

Tracing is available through the OpenTelemetry helpers in `freesolo.tracing`.
