Metadata-Version: 2.4
Name: fairtrace
Version: 0.1.0
Summary: Developer-first fairness regression testing for LLM applications.
Author: fairtrace maintainers
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: yaml
Requires-Dist: PyYAML>=6.0; extra == "yaml"
Dynamic: license-file

# fairtrace

[![ci](https://github.com/nicoalbo0/fairtrace/actions/workflows/ci.yml/badge.svg)](https://github.com/nicoalbo0/fairtrace/actions/workflows/ci.yml)
[![license](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

`fairtrace` is a compact fairness regression library for LLM agents and RAG
pipelines.

It measures counterfactual disparity in the parts of an app that output-only
evals miss:

- tool use parity
- retrieval exposure gaps
- plan length gaps
- escalation parity
- friction point gaps
- escalation reason parity

Text metrics (`refusal_gap`, `helpfulness_gap`, `toxicity_gap`) stay in the
package, but they are supporting signals; the trace metrics above are the focus.

## Why fairtrace?

Output parity is not enough for agentic systems.

- Two requests can get the same final answer while one path uses more tools,
  retrieves worse-ranked documents, escalates more often, or adds more friction.
- Those process differences affect user effort, access, and service quality.
- `fairtrace` turns those differences into CI checks.

## Quick Start

Install in editable mode from a clone of the repository:

```bash
python -m pip install -e .
```

Run the test suite:

```bash
python -m unittest discover -s tests -t .
```

Run the bundled smoke example:

```bash
python -m fairtrace.cli run examples/launch_smoke.json --app examples.launch_smoke_app:respond --output /tmp/fairtrace-smoke
```

Run the bundled text-fairness demo:

```bash
python -m fairtrace.cli run examples/fairness.json --app examples.simple_app:respond --output /tmp/fairtrace-report
```

Generate a starter suite:

```bash
fairtrace init --output fairtrace.json
```

## Smoke Example

`examples/launch_smoke.json` is the public smoke path used by CI.
It exercises the CLI, report writers, and trace metrics with a stable app so
the repo has a clean end-to-end check that passes on a fresh install.

The other `examples/` suites remain useful as regression demos because they
show how the metrics fail when the app behaves asymmetrically.

## Trace Schema

`fairtrace` validates a small trace schema before metrics read it.

Supported trace fields:

- `tool_calls`: list of objects with a non-empty `name`
- `retrieved_documents`: list of objects with a `group` and optional `rank`
- `plan_steps`: list of non-empty strings
- `escalated`: boolean
- `escalation_reason`: non-empty string
- `friction_points`: list of non-empty strings

Accepted aliases:

- `toolCalls` -> `tool_calls`
- `retrievedDocuments` -> `retrieved_documents`
- `planSteps` -> `plan_steps`
- `escalationReason` -> `escalation_reason`
- `frictionPoints` -> `friction_points`

Example app response metadata:

```json
{
  "metadata": {
    "helpfulness_score": 0.8,
    "toxicity_score": 0.0,
    "trace": {
      "tool_calls": [{ "name": "kb_search" }],
      "retrieved_documents": [
        { "group": "policy_docs", "rank": 1 },
        { "group": "support_docs", "rank": 3 }
      ],
      "plan_steps": ["search", "summarize", "respond"],
      "escalated": false,
      "friction_points": ["extra identity check"]
    }
  }
}
```
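The schema and alias rules above can be sketched as a small normalizer. This is an illustration only, not fairtrace's actual validator: `normalize_trace` and `ALIASES` are hypothetical names, and the real error messages and edge-case handling may differ.

```python
# Hypothetical helper that mirrors the documented schema; not fairtrace's own code.
ALIASES = {
    "toolCalls": "tool_calls",
    "retrievedDocuments": "retrieved_documents",
    "planSteps": "plan_steps",
    "escalationReason": "escalation_reason",
    "frictionPoints": "friction_points",
}


def _nonempty_str(value) -> bool:
    return isinstance(value, str) and value.strip() != ""


def normalize_trace(trace: dict) -> dict:
    """Rename camelCase aliases, then check each documented field shape."""
    out = {ALIASES.get(key, key): value for key, value in trace.items()}

    for call in out.get("tool_calls", []):
        if not (isinstance(call, dict) and _nonempty_str(call.get("name"))):
            raise ValueError("tool_calls entries need a non-empty 'name'")
    for doc in out.get("retrieved_documents", []):
        if not (isinstance(doc, dict) and "group" in doc):
            raise ValueError("retrieved_documents entries need a 'group'")
    for field in ("plan_steps", "friction_points"):
        if not all(_nonempty_str(item) for item in out.get(field, [])):
            raise ValueError(f"{field} must contain non-empty strings")
    if "escalated" in out and not isinstance(out["escalated"], bool):
        raise ValueError("escalated must be a boolean")
    if "escalation_reason" in out and not _nonempty_str(out["escalation_reason"]):
        raise ValueError("escalation_reason must be a non-empty string")
    return out


trace = normalize_trace({
    "toolCalls": [{"name": "kb_search"}],
    "planSteps": ["search", "respond"],
    "escalated": False,
})
print(sorted(trace))  # ['escalated', 'plan_steps', 'tool_calls']
```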

## Config Shape

```json
{
  "dataset": {
    "prompts": [
      {
        "id": "support-password-reset",
        "prompt": "Help a {region} customer reset a password",
        "attributes": {
          "region": ["consumer", "enterprise"]
        }
      }
    ]
  },
  "metrics": [
    { "type": "tool_use_parity", "threshold": 0.1 },
    { "type": "retrieval_exposure_gap", "threshold": 0.1 },
    { "type": "plan_length_gap", "threshold": 1.0 },
    { "type": "escalation_parity", "threshold": 0.1 },
    { "type": "friction_point_gap", "threshold": 1.0 },
    { "type": "escalation_reason_parity", "threshold": 0.1 }
  ]
}
```
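To make the `dataset.prompts` template concrete, here is a rough sketch of how one prompt entry could expand into counterfactual variants. `expand_prompt` and the output keys (`group_id`, `assignments`) are assumptions for illustration; the library's internal record shape may differ.

```python
from itertools import product


def expand_prompt(entry: dict) -> list[dict]:
    """Fill the template once per combination of attribute values."""
    names = sorted(entry["attributes"])
    variants = []
    for values in product(*(entry["attributes"][name] for name in names)):
        assignments = dict(zip(names, values))
        variants.append({
            "group_id": entry["id"],  # variants from one template share an id
            "prompt": entry["prompt"].format(**assignments),
            "assignments": assignments,
        })
    return variants


entry = {
    "id": "support-password-reset",
    "prompt": "Help a {region} customer reset a password",
    "attributes": {"region": ["consumer", "enterprise"]},
}
for variant in expand_prompt(entry):
    print(variant["assignments"], "->", variant["prompt"])
```

Each variant differs only in the substituted attribute, so any difference in the app's trace between variants can be attributed to that attribute.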

`helpfulness_gap` reads `response_metadata.helpfulness_score` when present.

`toxicity_gap` reads `response_metadata.toxicity_score` when present, otherwise
it falls back to a small built-in heuristic list.

`tool_use_parity` reads `response_metadata.trace.tool_calls` and compares tool
use rates across groups.

`retrieval_exposure_gap` reads `response_metadata.trace.retrieved_documents`
and compares ranking exposure across document groups.

`plan_length_gap` reads `response_metadata.trace.plan_steps` and compares
average plan length across groups.

`escalation_parity` reads `response_metadata.trace.escalated` and compares
escalation rates across groups.

`friction_point_gap` reads `response_metadata.trace.friction_points` and
compares extra friction counts across groups.

`escalation_reason_parity` reads `response_metadata.trace.escalation_reason`
and compares escalation reasons across groups.
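As a rough illustration of what "compares across groups" can mean, the sketch below reduces each trace to a number, averages per group, and reports the largest difference between group means. The exact formulas fairtrace uses are not specified here, so `max_gap` and the record shape (`group`, `trace`) are assumptions.

```python
from collections import defaultdict
from statistics import mean


def max_gap(records: list[dict], value_of) -> float:
    """Largest difference between per-group means of value_of(trace)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["group"]].append(value_of(record["trace"]))
    means = [mean(values) for values in grouped.values()]
    return max(means) - min(means)


records = [
    {"group": "consumer", "trace": {"tool_calls": [], "plan_steps": ["respond"]}},
    {"group": "consumer", "trace": {"tool_calls": [{"name": "kb_search"}],
                                    "plan_steps": ["search", "respond"]}},
    {"group": "enterprise", "trace": {"tool_calls": [{"name": "kb_search"}],
                                      "plan_steps": ["search", "summarize", "respond"]}},
    {"group": "enterprise", "trace": {"tool_calls": [{"name": "kb_search"}],
                                      "plan_steps": ["search", "respond"]}},
]

# Rate-style metric: did the run use any tool at all?
tool_use_gap = max_gap(records, lambda t: 1.0 if t.get("tool_calls") else 0.0)
# Count-style metric: how many plan steps did the run take?
plan_length_gap = max_gap(records, lambda t: float(len(t.get("plan_steps", []))))

print(tool_use_gap, plan_length_gap)  # 0.5 1.0
```

Under this reading, a tool use gap of 0.5 would sit well above the 0.1 threshold in the config above.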

Metric scores are effect-size estimates. Bootstrap intervals are descriptive,
and optional permutation p-values are there to flag regressions, not to replace
a full statistical study.
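For readers unfamiliar with these two tools, here is a minimal standard-library sketch of a percentile bootstrap interval and a permutation p-value for a two-group gap. It shows the general techniques only; the function names, defaults, and resampling scheme are assumptions and may not match fairtrace's implementation.

```python
import random


def gap(a: list[float], b: list[float]) -> float:
    """Absolute difference between the two group means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def bootstrap_interval(a, b, rounds=2000, alpha=0.05, seed=0):
    """Percentile interval for the gap, resampling each group with replacement."""
    rng = random.Random(seed)
    gaps = sorted(
        gap(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(rounds)
    )
    low = gaps[int((alpha / 2) * rounds)]
    high = gaps[int((1 - alpha / 2) * rounds) - 1]
    return low, high


def permutation_p_value(a, b, rounds=2000, seed=0):
    """Share of label shuffles whose gap is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = gap(a, b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if gap(pooled[:len(a)], pooled[len(a):]) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (rounds + 1)


# 1.0 means the run escalated, 0.0 means it did not (made-up illustration data).
group_a = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
group_b = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print("gap:", gap(group_a, group_b))
print("interval:", bootstrap_interval(group_a, group_b))
print("p-value:", permutation_p_value(group_a, group_b))
```

With suites this small the interval is wide and the p-value coarse, which is why they work as regression flags rather than as evidence for a formal claim.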

Trace metric rationale: [docs/trace_fairness.md](docs/trace_fairness.md)

`refusal_gap`, `helpfulness_gap`, and `toxicity_gap` also accept explicit
evaluator hooks. If you do not provide one, they fall back to the built-in
heuristics and mark that in the metric details.

## Evaluator Hooks

You can point a metric at a `module:function` hook in suite config.

```json
{
  "metrics": [
    {
      "type": "toxicity_gap",
      "threshold": 0.1,
      "toxicity_evaluator": "examples.evaluator_hooks:toxicity_score"
    }
  ]
}
```

Example hook shapes:

```python
def toxicity_score(response: str, record: dict) -> float:
    return 0.0 if "safe" in response.lower() else 1.0

def refusal_detected(response: str, record: dict) -> bool:
    return record["assignments"].get("region") == "restricted"

def helpfulness_score(response: str, record: dict) -> float:
    return 0.9 if "help" in response.lower() else 0.2
```
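A `module:function` string like the one in the config above is typically resolved with `importlib`. The sketch below shows one way to do that; `load_hook` is a hypothetical name, and fairtrace's own loader may validate or report errors differently.

```python
from importlib import import_module


def load_hook(spec: str):
    """Resolve a 'module:function' string to a callable."""
    module_name, sep, attr = spec.partition(":")
    if not (sep and module_name and attr):
        raise ValueError(f"expected 'module:function', got {spec!r}")
    hook = getattr(import_module(module_name), attr)
    if not callable(hook):
        raise TypeError(f"{spec!r} does not point at a callable")
    return hook


# Demonstrated with a stdlib function so the snippet runs anywhere.
sqrt = load_hook("math:sqrt")
print(sqrt(9.0))  # 3.0
```

The module part must be importable from the directory where the command runs, which is why the bundled examples use dotted paths such as `examples.evaluator_hooks`.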

## Adapters

- `CallableAdapter` for plain Python functions
- `OpenAICompatibleAdapter` for clients exposing `client.responses.create(...)`
- `OpenAIAgentsAdapter` for agent objects exposing `run(...)`
- `LangChainAdapter` for objects exposing `invoke(...)`
- `LangGraphAdapter` for graph state objects exposing `invoke(...)`

Each adapter can take a `trace_mapper` callback when the source app emits a
different trace shape.
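As an illustration, a `trace_mapper` could look like the function below. Both the callback signature (one raw trace dict in, one schema-shaped dict out) and the raw shape with `steps` and `handoff` keys are assumptions made for this sketch; check the adapter source for the exact contract.

```python
def map_trace(raw: dict) -> dict:
    """Convert a hypothetical app-specific trace into the documented trace schema."""
    steps = raw.get("steps", [])
    return {
        # Only steps that actually invoked a tool become tool calls.
        "tool_calls": [{"name": step["tool"]} for step in steps if step.get("tool")],
        "plan_steps": [step["label"] for step in steps],
        "escalated": bool(raw.get("handoff")),
    }


raw_trace = {
    "steps": [
        {"label": "search", "tool": "kb_search"},
        {"label": "respond"},
    ],
    "handoff": None,
}
print(map_trace(raw_trace))
```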

Import helpers:

- `load_promptfoo_variants(...)`
- `load_deepeval_variants(...)`
- `assert_fairtrace_passes(...)`

CI wiring example:

- [docs/ci.md](docs/ci.md)

Compare two runs:

```bash
python -m fairtrace.cli compare baseline.json candidate.json --format markdown
```

You can also point `compare` at two report directories and it will read each
directory's `report.json`.

## Validation Rules

Suite files are rejected early when they contain:

- unknown top-level fields
- unknown dataset or metric fields
- duplicate prompt ids
- empty prompt lists or metric lists
- prompt placeholders that do not match defined attributes
- unsupported metric types

The `dataset` section of a suite file may use either:

- `dataset.prompts` for template expansion
- `dataset.variants` for explicit imported cases

Never both in the same suite file.
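A few of the dataset rules above can be sketched in a handful of lines. This is not the library's validator: `validate_dataset` is a hypothetical helper, and "placeholders match attributes" is read here as exact set equality, which is an assumption.

```python
from string import Formatter


def validate_dataset(dataset: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    errors = []
    has_prompts = "prompts" in dataset
    has_variants = "variants" in dataset
    if has_prompts == has_variants:
        errors.append("use exactly one of dataset.prompts or dataset.variants")

    prompts = dataset.get("prompts", [])
    if has_prompts and not prompts:
        errors.append("dataset.prompts is empty")

    seen_ids = set()
    for entry in prompts:
        if entry["id"] in seen_ids:
            errors.append(f"duplicate prompt id: {entry['id']}")
        seen_ids.add(entry["id"])

        # Formatter().parse yields (literal, field_name, format_spec, conversion).
        placeholders = {name for _, name, _, _ in Formatter().parse(entry["prompt"]) if name}
        defined = set(entry.get("attributes", {}))
        if placeholders != defined:
            errors.append(
                f"{entry['id']}: placeholders {sorted(placeholders)} "
                f"do not match attributes {sorted(defined)}"
            )
    return errors


good = {"prompts": [{
    "id": "support-password-reset",
    "prompt": "Help a {region} customer reset a password",
    "attributes": {"region": ["consumer", "enterprise"]},
}]}
print(validate_dataset(good))  # []
```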

## External Imports

To reuse cases from external eval tools, first import them as explicit
variants (`dataset.variants`).

Promptfoo importer accepts:

```json
{
  "tests": [
    {
      "id": "case-1",
      "group_id": "seed-1",
      "prompt": "hello",
      "vars": { "gender": "woman" }
    }
  ]
}
```

DeepEval importer accepts:

```json
{
  "cases": [
    {
      "id": "case-1",
      "input": "hello",
      "metadata": {
        "seed_id": "seed-1",
        "assignments": { "gender": "man" }
      }
    }
  ]
}
```
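The two input shapes above carry the same information under different keys. The sketch below maps both onto one common variant shape to show the correspondence. It does not reproduce `load_promptfoo_variants` or `load_deepeval_variants`: the output keys, and the choice to treat DeepEval's `seed_id` as the grouping id, are assumptions.

```python
def from_promptfoo(doc: dict) -> list[dict]:
    """Map Promptfoo-style tests to a hypothetical explicit-variant shape."""
    return [
        {
            "id": test["id"],
            "group_id": test["group_id"],
            "prompt": test["prompt"],
            "assignments": dict(test.get("vars", {})),
        }
        for test in doc["tests"]
    ]


def from_deepeval(doc: dict) -> list[dict]:
    """Map DeepEval-style cases to the same hypothetical shape."""
    return [
        {
            "id": case["id"],
            "group_id": case["metadata"]["seed_id"],  # assumed equivalent of group_id
            "prompt": case["input"],
            "assignments": dict(case["metadata"].get("assignments", {})),
        }
        for case in doc["cases"]
    ]


promptfoo_doc = {"tests": [
    {"id": "case-1", "group_id": "seed-1", "prompt": "hello", "vars": {"gender": "woman"}},
]}
deepeval_doc = {"cases": [
    {"id": "case-1", "input": "hello",
     "metadata": {"seed_id": "seed-1", "assignments": {"gender": "man"}}},
]}
print(from_promptfoo(promptfoo_doc))
print(from_deepeval(deepeval_doc))
```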
