Metadata-Version: 2.4
Name: syncreus-eval
Version: 0.2.0
Summary: Open-source AI evaluation toolkit — hallucination detection, safety, industry-specific evals
Project-URL: Homepage, https://syncreus.com
Project-URL: Repository, https://github.com/syncreus/syncreus-eval
Project-URL: Documentation, https://docs.syncreus.com/eval-sdk
Project-URL: Issues, https://github.com/syncreus/syncreus-eval/issues
Author-email: Syncreus <hello@syncreus.com>
License: MIT
License-File: LICENSE
Keywords: ai,evaluation,hallucination,llm,safety,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: google-genai>=1.0.0
Requires-Dist: pydantic>=2.0
Provides-Extra: accuracy
Requires-Dist: fastembed>=0.4; extra == 'accuracy'
Requires-Dist: numpy>=1.24; extra == 'accuracy'
Provides-Extra: all
Requires-Dist: fastembed>=0.4; extra == 'all'
Requires-Dist: httpx>=0.27; extra == 'all'
Requires-Dist: llm-guard>=0.3; extra == 'all'
Requires-Dist: numpy>=1.24; extra == 'all'
Requires-Dist: presidio-analyzer>=2.2; extra == 'all'
Requires-Dist: spacy>=3.7; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Provides-Extra: prompt-injection
Requires-Dist: llm-guard>=0.3; extra == 'prompt-injection'
Provides-Extra: safety
Requires-Dist: presidio-analyzer>=2.2; extra == 'safety'
Requires-Dist: spacy>=3.7; extra == 'safety'
Provides-Extra: upload
Requires-Dist: httpx>=0.27; extra == 'upload'
Description-Content-Type: text/markdown

# syncreus-eval

Open-source AI evaluation toolkit that shows you exactly which claims are hallucinated and why. Not just a score -- per-claim forensics with evidence.

## Quick Start

```bash
pip install syncreus-eval
```

```python
from syncreus_eval import check

result = check(
    "Lisinopril is an ACE inhibitor used for hypertension. "
    "It should not be combined with potassium-sparing diuretics. "
    "The standard starting dose is 20mg daily.",
    context="Lisinopril is an ACE inhibitor indicated for hypertension. "
    "Concurrent use with potassium-sparing diuretics increases hyperkalemia risk. "
    "The recommended starting dose is 10mg once daily."
)
print(result)
```

## What You Get

Every AI output is decomposed into individual factual claims. Each claim gets a verdict and an evidence quote from your reference context:

```
CheckResult(passed=False, score=0.67, claims=3)
  x UNSUPPORTED  'The standard starting dose is 20mg daily'
  ~ AMBIGUOUS    'It should not be combined with potassium-sparing diuretics'
  v SUPPORTED    'Lisinopril is an ACE inhibitor used for hypertension'
```

The `CheckResult` object is designed for programmatic use:

```python
result.passed        # False -- at least one unsupported claim
result.score         # 0.67 -- share of claims not marked UNSUPPORTED (2 of 3 here)
result.claims        # list[ClaimVerdict] -- per-claim details
result.unsupported   # filtered list of unsupported claims only
result.supported     # filtered list of supported claims only

if result:           # CheckResult is falsy when hallucinations are found
    deploy()

for claim in result.unsupported:
    print(claim.claim)     # "The standard starting dose is 20mg daily"
    print(claim.verdict)   # "UNSUPPORTED"
    print(claim.evidence)  # "not found" or a quote from context
```
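
To make the relationship between these attributes concrete, here is a minimal stand-in written with plain dataclasses. It is not the library's implementation: the class bodies are hypothetical, and the scoring rule (AMBIGUOUS claims are not penalized) is an assumption chosen only because it reproduces the 0.67 in the sample output above.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimVerdict:
    claim: str
    verdict: str                 # "SUPPORTED", "AMBIGUOUS", or "UNSUPPORTED"
    evidence: str = "not found"  # quote from the context, when one exists


@dataclass
class CheckResult:
    claims: list = field(default_factory=list)

    @property
    def unsupported(self):
        return [c for c in self.claims if c.verdict == "UNSUPPORTED"]

    @property
    def supported(self):
        return [c for c in self.claims if c.verdict == "SUPPORTED"]

    @property
    def score(self) -> float:
        # Assumption: only UNSUPPORTED claims lower the score.
        if not self.claims:
            return 1.0
        return round(1 - len(self.unsupported) / len(self.claims), 2)

    @property
    def passed(self) -> bool:
        return not self.unsupported

    def __bool__(self) -> bool:
        # Lets callers write `if result:` as a deploy gate.
        return self.passed


sketch_result = CheckResult([
    ClaimVerdict("The standard starting dose is 20mg daily", "UNSUPPORTED"),
    ClaimVerdict("It should not be combined with potassium-sparing diuretics", "AMBIGUOUS"),
    ClaimVerdict("Lisinopril is an ACE inhibitor used for hypertension", "SUPPORTED"),
])
print(sketch_result.score, bool(sketch_result))
```

Defining `__bool__` in terms of `passed` is what makes the `if result:` gate work without an explicit comparison.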

## How It Works

A 3-tier pipeline calibrated across 109 ML experiments:

1. **Claim decomposition** -- LLM extracts every atomic factual claim from the AI output
2. **NLI pre-filter** -- Natural language inference catches obvious mismatches cheaply
3. **Chain-of-thought judge** -- Gemini 2.5 Flash reasons through ambiguous cases with structured JSON output

Embedding-based methods (cosine similarity, BERTScore) fail on hallucinations from modern RLHF-aligned models because a fluent, plausible false claim is nearly indistinguishable from the true one at the vector level. Claim decomposition combined with reasoning-based judgment holds up where similarity scores do not.
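
The control flow of the three tiers can be sketched with placeholders. Everything below is a stand-in, not the library's code: `decompose` splits on sentence boundaries instead of calling an LLM, `prefilter` uses zero word overlap instead of an NLI model, and `stub_judge` replaces the Gemini call. The point is only to show how a cheap middle tier keeps obvious cases away from the expensive judge.

```python
import re
from typing import Optional


def decompose(output: str) -> list:
    """Tier 1 stand-in: naive sentence split instead of LLM claim extraction."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]


def prefilter(claim: str, context: str) -> Optional[str]:
    """Tier 2 stand-in: zero word overlap instead of a real NLI model.

    Returns a verdict for an obvious mismatch, or None to escalate.
    """
    claim_words = set(re.findall(r"\w+", claim.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    if not claim_words & context_words:
        return "UNSUPPORTED"
    return None


def run_pipeline(output: str, context: str, judge) -> list:
    """Run each claim through the cheap tier first, then the judge if needed."""
    verdicts = []
    for claim in decompose(output):
        verdict = prefilter(claim, context)
        if verdict is None:
            verdict = judge(claim, context)  # Tier 3: only for unresolved claims
        verdicts.append((claim, verdict))
    return verdicts


judge_calls = []


def stub_judge(claim: str, context: str) -> str:
    """Tier 3 stand-in: records the call and returns a fixed verdict."""
    judge_calls.append(claim)
    return "SUPPORTED"


verdicts = run_pipeline(
    "Paris is the capital of France. Bananas grow underwater.",
    context="Paris is the capital and largest city of France.",
    judge=stub_judge,
)
print(verdicts, len(judge_calls))
```

In this toy run the second claim never reaches the judge, which is the cost saving the pre-filter tier is meant to provide.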

## pytest Integration

syncreus-eval registers a pytest plugin automatically. Use the `syncreus` fixture to gate CI on evaluation thresholds:

```python
# test_my_chatbot.py

def test_no_hallucination(syncreus):
    """Fails the test if any unsupported claims are found."""
    syncreus.check(
        "Paris is the capital of France.",
        context="Paris is the capital and largest city of France.",
    )

def test_medical_accuracy(syncreus):
    """Run a domain-specific evaluator and assert it passes."""
    syncreus.assert_eval(
        "healthcare",
        ai_input="Patient records and clinical guidelines...",
        ai_output="The recommended dosage is 10mg daily...",
    )

def test_custom_threshold(syncreus):
    """Run without asserting -- apply your own logic."""
    result = syncreus.run("hallucination", ai_input="...", ai_output="...")
    assert result.score >= 0.9, f"Score too low: {result.score}"
```

Set a global score threshold:

```bash
pytest --syncreus-threshold=0.95
```

Upload results to the Syncreus platform:

```bash
SYNCREUS_API_KEY=syn_... pytest --syncreus-upload
```

The terminal summary shows evaluation results alongside your test output:

```
======== Syncreus Evaluation Summary ========
  3 passed, 1 failed, 0 errors (4 total evals)
  FAILED evals:
    test_chatbot.py::test_rag_accuracy [hallucination] score=0.67
```

## All Evaluators

### General Purpose

| Evaluator | What it checks | Requires |
|-----------|---------------|----------|
| `HALLUCINATION` | Unsupported factual claims against reference context | Gemini API key |
| `ACCURACY` | Golden dataset comparison via semantic similarity | `[accuracy]` extra |
| `CONSISTENCY` | Pairwise similarity across repeated prompts | `[accuracy]` extra |
| `PERFORMANCE` | Latency, token counts, cost metrics from trace data | Nothing |
| `AGENT_TASK` | Whether an agent's completion claim matches reality | Gemini API key |
| `REGRESSION` | Baseline comparison against previous runs | Syncreus platform |
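
As a rough illustration of what "pairwise similarity across repeated prompts" can mean for `CONSISTENCY`, the sketch below averages cosine similarity over every pair of response vectors. The toy two-dimensional vectors stand in for real embeddings, and both the helper names and the choice to take a plain mean are assumptions rather than the evaluator's documented behavior.

```python
import math
from itertools import combinations


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(vectors) -> float:
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # zero or one response is trivially consistent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Two identical responses and one outlier, as toy embeddings.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
consistency_score = mean_pairwise_similarity(toy_embeddings)
print(round(consistency_score, 3))
```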

### Safety and Compliance

| Evaluator | What it checks | Requires |
|-----------|---------------|----------|
| `SAFETY` | PII/sensitive data detection + content safety | `[safety]` extra |
| `BIAS` | Demographic parity / EEOC four-fifths rule | Nothing |
| `IDEOLOGY` | Political neutrality (OMB M-26-04) | Gemini API key |
| `PROMPT_INJECTION` | Injection attempt detection | `[prompt-injection]` extra |
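
For readers unfamiliar with the four-fifths rule named in the `BIAS` row: each group's selection rate is divided by the highest group's rate, and a ratio below 0.8 is treated as evidence of adverse impact. The sketch below shows that arithmetic only. The function names and the input shape are hypothetical and say nothing about how the evaluator itself takes its data.

```python
def selection_rates(outcomes: dict) -> dict:
    """Map group -> selected / total, skipping groups with no applicants."""
    return {g: sel / total for g, (sel, total) in outcomes.items() if total}


def four_fifths_check(outcomes: dict, threshold: float = 0.8) -> dict:
    """Return group -> (impact ratio, passes) relative to the highest-rate group."""
    rates = selection_rates(outcomes)
    if not rates:
        return {}
    best = max(rates.values())
    report = {}
    for group, rate in rates.items():
        ratio = rate / best if best else 1.0
        report[group] = (ratio, ratio >= threshold)
    return report


# group -> (number selected, number considered); made-up counts for illustration.
impact = four_fifths_check({"group_a": (48, 80), "group_b": (12, 40)})
for group, (ratio, ok) in impact.items():
    print(f"{group}: ratio={ratio:.2f} passes={ok}")
```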

### Industry-Specific

| Evaluator | What it checks | Requires |
|-----------|---------------|----------|
| `HEALTHCARE` | Medical accuracy, drug safety, PHI detection | Gemini API key |
| `LEGAL` | Citation validity, holding fidelity, fabricated case law | Gemini API key |
| `FINANCE` | Regulatory accuracy, numerical precision, fabricated data | Gemini API key |
| `CODE_ACCURACY` | API existence, function signatures, package validity | Gemini API key |

Each industry evaluator returns domain-specific claim types. For example, the healthcare evaluator categorizes claims as `drug_interaction`, `dosage`, `contraindication`, `diagnosis`, `treatment`, `terminology`, or `phi_leak` -- each with a severity level (`critical`, `major`, `minor`).
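
One way to picture these typed, severity-tagged findings is the sketch below. The `Finding` class and `summarize` helper are hypothetical. Only the claim type names, the severity levels, and the `critical_count` / `phi_detected` keys come from this README.

```python
from collections import Counter
from dataclasses import dataclass

CLAIM_TYPES = {
    "drug_interaction", "dosage", "contraindication", "diagnosis",
    "treatment", "terminology", "phi_leak",
}
SEVERITIES = ("critical", "major", "minor")


@dataclass(frozen=True)
class Finding:
    claim_type: str
    severity: str
    text: str

    def __post_init__(self):
        # Reject values outside the documented vocabularies.
        if self.claim_type not in CLAIM_TYPES:
            raise ValueError(f"unknown claim type: {self.claim_type}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


def summarize(findings) -> dict:
    """Roll findings up into the kind of counters a caller might gate on."""
    by_severity = Counter(f.severity for f in findings)
    return {
        "critical_count": by_severity["critical"],
        "phi_detected": any(f.claim_type == "phi_leak" for f in findings),
        "by_severity": dict(by_severity),
    }


summary = summarize([
    Finding("dosage", "critical", "Starting dose stated as 20mg, reference says 10mg"),
    Finding("terminology", "minor", "Drug class described loosely"),
    Finding("phi_leak", "major", "Response repeats a patient identifier"),
])
print(summary)
```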

## Full Evaluator API

For evaluators beyond hallucination, use the `evaluate()` function:

```python
from syncreus_eval import evaluate, EvalType

# Healthcare: drug safety and PHI detection
result = evaluate(
    EvalType.HEALTHCARE,
    ai_input="Clinical documentation and drug references...",
    ai_output="The AI assistant's medical response...",
)
print(result.passed)                     # True/False/None
print(result.details["critical_count"])  # number of critical findings
print(result.details["phi_detected"])    # whether PHI was leaked

# Legal: citation verification
result = evaluate(
    EvalType.LEGAL,
    ai_input="Case law and statute text...",
    ai_output="The court held in Smith v. Jones (2024)...",
)

# Run multiple evaluators at once
results = evaluate(
    [EvalType.HALLUCINATION, EvalType.SAFETY, EvalType.IDEOLOGY],
    ai_input="Context here",
    ai_output="Response here",
)
for r in results:
    print(f"{r.eval_type.value}: passed={r.passed}")
```

The `EvalResult` returned by `evaluate()`:

```python
class EvalResult:
    eval_type: EvalType
    passed: bool | None       # True/False/None (None = error or skipped)
    score: float | None       # Numeric score where applicable
    details: dict[str, Any]   # Evaluator-specific details
    error: bool               # Whether an error occurred
    error_message: str | None # Error description
```
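
Because `passed` has three states, a plain `if not result.passed` would count an errored or skipped evaluation as a failure. The sketch below tallies results with explicit identity checks. `ResultStub` is a hypothetical stand-in that carries only the `passed` and `error` fields described above, and the `skipped` bucket is an assumption about how a caller might want to report `None` without an error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultStub:
    name: str
    passed: Optional[bool]  # None = error or skipped
    error: bool = False


def tally(results) -> dict:
    """Count outcomes without letting None masquerade as a failure."""
    counts = {"passed": 0, "failed": 0, "errors": 0, "skipped": 0}
    for r in results:
        if r.error:
            counts["errors"] += 1
        elif r.passed is None:
            counts["skipped"] += 1
        elif r.passed:
            counts["passed"] += 1
        else:
            counts["failed"] += 1
    return counts


counts = tally([
    ResultStub("hallucination", passed=False),
    ResultStub("safety", passed=True),
    ResultStub("ideology", passed=None, error=True),
    ResultStub("regression", passed=None),
])
print(counts)
```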

## Installation

```bash
# Core (hallucination detection, industry evaluators via Gemini)
pip install syncreus-eval

# With optional extras (quoted so shells such as zsh do not expand the brackets)
pip install "syncreus-eval[accuracy]"          # fastembed for semantic similarity
pip install "syncreus-eval[safety]"            # Presidio PII scanning
pip install "syncreus-eval[prompt-injection]"  # LLM Guard injection detection
pip install "syncreus-eval[upload]"            # Upload results to Syncreus platform
pip install "syncreus-eval[all]"               # Everything
```

Requires Python 3.10+.

## Configuration

The LLM-as-judge evaluators (hallucination, healthcare, legal, finance, code accuracy, ideology, agent task) require a Google Gemini API key. The free tier is sufficient.

Set it as an environment variable:

```bash
export GEMINI_API_KEY=your-key-here
```

Or pass it directly:

```python
from syncreus_eval import check
result = check(output, context=doc, gemini_key="your-key-here")  # output, doc: your own strings
```
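
If you wrap the library in your own tooling, it can help to resolve the key in one place and fail early with a clear message. The helper below is a hypothetical sketch, not part of syncreus-eval. It assumes an explicitly passed key should win over the `GEMINI_API_KEY` environment variable, which this README does not state.

```python
import os
from typing import Mapping, Optional


def resolve_gemini_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit key if given, else GEMINI_API_KEY from env."""
    # Assumption: an explicit argument takes precedence over the environment.
    key = explicit or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "No Gemini API key: set GEMINI_API_KEY or pass the key explicitly"
        )
    return key


# A plain dict stands in for the real environment in this demo.
demo_env = {"GEMINI_API_KEY": "key-from-env"}
print(resolve_gemini_key(env=demo_env))
print(resolve_gemini_key("key-from-arg", env=demo_env))
```

The resolved value can then be passed as `gemini_key=` to `check()`.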

## Upload Results (Optional)

Send evaluation results to the Syncreus platform for dashboards, trend tracking, and regression detection:

```python
from syncreus_eval import upload_results

upload_results(
    results=result,           # EvalResult or list
    api_key="syn_...",        # Syncreus API key
    endpoint="https://api.syncreus.com",
    trace_id="trace-123",     # optional
)
```

Requires: `pip install "syncreus-eval[upload]"`

## Links

- [Documentation](https://docs.syncreus.com/eval-sdk)
- [GitHub](https://github.com/syncreus/syncreus-eval)
- [PyPI](https://pypi.org/project/syncreus-eval/)
- [Issues](https://github.com/syncreus/syncreus-eval/issues)
- [Syncreus Platform](https://syncreus.com)

## License

MIT
