Metadata-Version: 2.4
Name: aferiq-eval
Version: 0.2.1
Summary: Quality observability for RAG and agents — Brazilian-Portuguese vertical with claim-by-claim hallucination diagnosis, trajectory quality, tool-use correctness, and goal completion metrics.
Project-URL: Homepage, https://github.com/ileoh/aferiq
Project-URL: Documentation, https://github.com/ileoh/aferiq#readme
Project-URL: Issues, https://github.com/ileoh/aferiq/issues
Project-URL: Source, https://github.com/ileoh/aferiq
Author-email: Leonardo Pena <leo94pena@gmail.com>
License: Proprietary
Keywords: agent,brazil,evaluation,lgpd,llm,n8n,observability,portuguese,rag,tool-use,trajectory
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Natural Language :: English
Classifier: Natural Language :: Portuguese (Brazilian)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: httpx>=0.28.0
Requires-Dist: litellm>=1.40
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.30.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: langchain-core>=0.3.0; extra == 'dev'
Requires-Dist: llama-index-core>=0.11.0; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3.0; extra == 'langchain'
Provides-Extra: langgraph
Requires-Dist: langchain-core>=0.3.0; extra == 'langgraph'
Requires-Dist: langgraph>=0.2.0; extra == 'langgraph'
Provides-Extra: llamaindex
Requires-Dist: llama-index-core>=0.11.0; extra == 'llamaindex'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: redis
Requires-Dist: redis>=5.0; extra == 'redis'
Description-Content-Type: text/markdown

# aferiq-eval

[![PyPI](https://img.shields.io/pypi/v/aferiq-eval)](https://pypi.org/project/aferiq-eval/)
[![Python](https://img.shields.io/pypi/pyversions/aferiq-eval)](https://pypi.org/project/aferiq-eval/)
[![License](https://img.shields.io/badge/license-Proprietary-orange)](LICENSE)

Quality observability for RAG and agents, built as a vertical for the Brazilian
market: PT-BR judge prompts hand-tuned for Brazilian patterns (Lei nº, CNPJ/CPF,
Receita Federal, INSS, BACEN, ANPD), claim-by-claim hallucination diagnosis, and
trajectory + tool-use evaluation for LangGraph / AgentExecutor / OpenAI
Assistants.

> The internal Python namespace is `rageval` (legacy from when the library
> was called RAG Eval BR). It will be renamed to `aferiq_eval` in a future
> release, with a 90-day import alias for the old name. For now, install with
> `pip install aferiq-eval` and import with `from rageval import …`.

## Install

```bash
pip install aferiq-eval

# Optional integrations:
pip install 'aferiq-eval[langchain]'      # LangChain (RAG + AgentExecutor)
pip install 'aferiq-eval[langgraph]'      # LangGraph
pip install 'aferiq-eval[openai]'         # OpenAI SDK + Assistants
pip install 'aferiq-eval[anthropic]'      # Anthropic SDK
```

## RAG eval (single-turn)

```python
from rageval import evaluate

result = evaluate(
    queries=["Qual o prazo do CDC pra arrependimento?"],
    contexts=[["Art. 49 do CDC: o consumidor pode desistir em 7 dias."]],
    answers=["O CDC garante 7 dias contados do recebimento."],
    metrics=["faithfulness", "hallucination"],
    language="pt",  # default
)

print(result)            # rich-rendered table
result.to_dict()         # JSON-friendly
result.aggregate         # {"faithfulness": 0.95, "hallucination": 0.92}
```
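
Because `result.aggregate` is a plain dict of metric scores, it can gate a CI
job with no extra API; a minimal sketch (the 0.9 threshold is an arbitrary
example):

```python
# Fail the run if faithfulness drops below the chosen threshold.
scores = result.aggregate
if scores["faithfulness"] < 0.9:
    raise SystemExit(f"faithfulness below threshold: {scores}")
```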

## Decorator capture

```python
from rageval import trace, register_trace_handler
from rageval.client import CloudClient

# Ship every captured trace to aferiq cloud:
register_trace_handler(CloudClient.from_env().send)  # uses RAGEVAL_API_KEY

@trace
def my_rag(query: str, context: list[str]) -> str:
    # `retriever` and `llm` are placeholders for your own components
    chunks = retriever.search(query)
    return llm.generate(query, chunks)

# every call is automatically captured + shipped
my_rag("Qual o aviso prévio mínimo na CLT?", ["..."])
```
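
Handlers are plain callables that receive each captured trace (that is how
`.send` is used above), so a test can collect traces in memory instead of
shipping them; a sketch under that assumption:

```python
from rageval import trace, register_trace_handler

captured = []
register_trace_handler(captured.append)  # any one-argument callable works as a handler

@trace
def my_rag_local(query: str, context: list[str]) -> str:
    return "O aviso prévio mínimo é de 30 dias."  # stand-in answer for the sketch

my_rag_local("Qual o aviso prévio mínimo na CLT?", ["O aviso prévio mínimo é de 30 dias."])
print(len(captured))  # one trace collected locally
```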

## Agent eval (multi-step)

For LangChain `AgentExecutor`, LangGraph state machines, or OpenAI Assistants,
no decorator is needed. Run your application under `aferiq-trace-run`, attach
the agent callback, and full trajectories (LLM calls, tool calls, tokens, and
latency) are captured automatically.

```bash
aferiq-trace-run python my_agent.py
```

```python
from rageval.integrations.agent_callback import AferiqAgentCallbackHandler

handler = AferiqAgentCallbackHandler(goal="Qual o prazo do CDC?")
agent_executor.invoke(
    {"input": "Qual o prazo do CDC?"},
    config={"callbacks": [handler]},
)
# handler.last_trace is an AgentTrace; also broadcast to registered handlers
```

The dashboard at https://aferiq.com.br auto-detects trace shape: agent runs
render as a step-by-step tree; RAG traces render claim-by-claim with
PT-BR categorisation (lei_inventada, cnpj_fabricado, etc.).
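
Since the comment above notes that agent traces are also broadcast to
registered handlers, the same `CloudClient` wiring from the decorator section
should get trajectories to that dashboard; a sketch under that assumption,
reusing the `agent_executor` from the snippet above:

```python
from rageval import register_trace_handler
from rageval.client import CloudClient
from rageval.integrations.agent_callback import AferiqAgentCallbackHandler

register_trace_handler(CloudClient.from_env().send)  # ship AgentTraces as well

handler = AferiqAgentCallbackHandler(goal="Qual o prazo do CDC?")
agent_executor.invoke(
    {"input": "Qual o prazo do CDC?"},
    config={"callbacks": [handler]},
)
```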

## CLI

```bash
export OPENAI_API_KEY=sk-...

# RAG eval
aferiq-eval evaluate --dataset rag_traces.jsonl --metrics faithfulness,hallucination

# Agent eval
aferiq-eval evaluate-agent --dataset agent_traces.jsonl \
    --metrics trajectory_quality,goal_completion,cost_efficiency
```
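
The JSONL schema isn't spelled out in this README; as a rough illustration
inferred from the `evaluate()` arguments earlier (the field names are
assumptions, not a documented format), a script could build `rag_traces.jsonl`
like this:

```python
import json

# Hypothetical record shape mirroring evaluate(queries, contexts, answers);
# field names are assumptions, so check the documented dataset format.
record = {
    "query": "Qual o prazo do CDC pra arrependimento?",
    "contexts": ["Art. 49 do CDC: o consumidor pode desistir em 7 dias."],
    "answer": "O CDC garante 7 dias contados do recebimento.",
}

with open("rag_traces.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```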

## Examples

Runnable demos under [`examples/`](https://github.com/leo94pena/rag_eval/tree/main/packages/lib/examples):

- `langchain_agent_executor.py` — calculator + search agent
- `langgraph_agent.py` — state-machine agent
- `openai_assistants.py` — Assistants API streaming

## Supported judge models

Anything [litellm](https://docs.litellm.ai/) supports — OpenAI, Anthropic,
Google, **Maritaca Sabiá**, Ollama. Default: `gpt-4o-mini` (cheapest with
acceptable quality on PT-BR judging).
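
How the judge model is selected isn't shown in the snippets above; as a sketch,
assuming `evaluate()` takes a litellm-style model string via a `model` keyword
(the parameter name is an assumption, not a documented signature):

```python
from rageval import evaluate

# Hypothetical `model=` keyword; the string follows litellm's naming, so any
# provider litellm supports should be usable as the judge.
result = evaluate(
    queries=["Qual o prazo do CDC pra arrependimento?"],
    contexts=[["Art. 49 do CDC: o consumidor pode desistir em 7 dias."]],
    answers=["O CDC garante 7 dias contados do recebimento."],
    metrics=["faithfulness"],
    model="gpt-4o",  # swap for any other litellm-supported model string
)
```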

## Caching

By default, identical `(prompt, model, temperature)` tuples are cached on
disk under `~/.cache/rageval`. Repeated evals don't re-call the LLM.

```python
from rageval.cache import DiskCache, MemoryCache, NullCache

DiskCache()                          # default location
DiskCache("/custom/path")            # custom location
MemoryCache()                        # in-process, tests
NullCache()                          # disable caching
```
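
How a specific cache is injected into an evaluation isn't shown above; as a
sketch, assuming `evaluate()` accepts a `cache` keyword (the parameter name is
an assumption, not part of the documented signature):

```python
from rageval import evaluate
from rageval.cache import NullCache

# Hypothetical `cache=` keyword: verify against your installed version.
result = evaluate(
    queries=["Qual o prazo do CDC pra arrependimento?"],
    contexts=[["Art. 49 do CDC: o consumidor pode desistir em 7 dias."]],
    answers=["O CDC garante 7 dias contados do recebimento."],
    metrics=["faithfulness"],
    cache=NullCache(),  # force fresh judge calls for this run
)
```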

## Dev

```bash
git clone https://github.com/leo94pena/rag_eval
cd rag_eval/packages/lib
pip install -e ".[dev]"

pytest --cov=rageval         # tests
ruff check src tests         # lint
mypy --strict src            # type-check
```

## Documentation

- Full docs: https://docs.aferiq.com.br *(coming soon)*
- Issues + roadmap: https://github.com/ileoh/aferiq/issues
- LGPD + data handling:
  - Auth data in São Paulo (Supabase); traces in ClickHouse Cloud us-east-1
    under SCCs and a signed DPA per LGPD Art. 33.
  - The `redact_pii=True` flag strips CPF/CNPJ/email/phone before transmission
    (see the sketch below).
  - TLS in transit; AES-256-GCM at rest for sensitive fields.
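
Where exactly `redact_pii` is set isn't documented in this README; as a purely
hypothetical sketch, assuming the flag is accepted when constructing the cloud
client (the keyword placement is an assumption):

```python
from rageval.client import CloudClient

# Hypothetical keyword placement: the README names the redact_pii flag but not
# where it is passed, so check your installed version before relying on this.
client = CloudClient.from_env(redact_pii=True)  # strip CPF/CNPJ/email/phone client-side
```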

## License

Proprietary — see [`LICENSE`](../../LICENSE).
