Metadata-Version: 2.4
Name: trailrag
Version: 0.1.0
Summary: Paragraph-level compliance audit trails for RAG pipelines
Author-email: Le Quoc Anh <lequocanh@example.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/LeQuocAnh123/RAG_control
Project-URL: Repository, https://github.com/LeQuocAnh123/RAG_control
Project-URL: Bug Tracker, https://github.com/LeQuocAnh123/RAG_control/issues
Project-URL: Changelog, https://github.com/LeQuocAnh123/RAG_control/blob/main/CHANGELOG.md
Keywords: rag,compliance,audit,eu-ai-act,gdpr,hipaa,llm,governance
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Security
Classifier: Topic :: Office/Business
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: structlog>=24.4.0
Requires-Dist: sqlalchemy>=2.0.49
Requires-Dist: typer>=0.12.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: server
Requires-Dist: fastapi>=0.100.0; extra == "server"
Requires-Dist: uvicorn[standard]>=0.23.0; extra == "server"
Provides-Extra: pdf
Requires-Dist: weasyprint>=68.1; extra == "pdf"
Provides-Extra: hybrid
Requires-Dist: rank-bm25>=0.2.2; extra == "hybrid"
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.33.1; extra == "otel"
Requires-Dist: opentelemetry-sdk>=1.33.1; extra == "otel"
Provides-Extra: full
Requires-Dist: fastapi>=0.100.0; extra == "full"
Requires-Dist: uvicorn[standard]>=0.23.0; extra == "full"
Requires-Dist: weasyprint>=68.1; extra == "full"
Requires-Dist: rank-bm25>=0.2.2; extra == "full"
Requires-Dist: opentelemetry-api>=1.33.1; extra == "full"
Requires-Dist: opentelemetry-sdk>=1.33.1; extra == "full"
Dynamic: license-file

# TrailRAG

TrailRAG is a compliance audit layer for RAG pipelines built on [rag_control](https://github.com/LeQuocAnh123/RAG_control). In industries regulated under the EU AI Act, GDPR, or HIPAA, a response alone is not enough: auditors need to know which document, which page, and which model version produced it. TrailRAG intercepts every retrieval call, traces each chunk to its source, and writes an immutable audit record before returning the result.

## The Problem

> "The AI referenced this document" is not sufficient.
> Regulators need: the AI extracted $2.5M from **page 7, paragraph 3** of **insurance_policy_v2.pdf**, version **2026-01-15**, at **04:32:59 UTC**, by user **analyst_001**.

Starting August 2026, EU AI Act enforcement makes this non-negotiable, on top of GDPR and HIPAA obligations that already apply:

- **EU AI Act Art. 12**: High-risk AI systems must automatically record events (logs) while they operate. For a RAG pipeline, that means the input, output, retrieved sources, model identity, timestamp, and the user who triggered each inference.
- **GDPR Art. 30**: Records of processing activities must document the purposes of processing, the categories of data subjects and personal data, and the envisaged retention periods.
- **HIPAA §164.312(b)**: Audit controls must record and examine activity in systems that contain or use PHI, including AI-generated responses citing medical records.

## Quick Start

```bash
pip install trailrag
```

```python
from trailrag import TrailRAGEngine

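# `your_rag_control_engine` and `ctx` below stand in for your existing
# rag_control engine and user context.
engine = TrailRAGEngine(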
    base_engine=your_rag_control_engine,
    jurisdiction="GDPR",                    # sets retention period automatically
    db_url="postgresql://user:pw@host/db",  # or "sqlite:///audit.db" for dev
)

result = engine.run(query="What is the fire coverage limit?", user_context=ctx)
print(result.audit_id)       # 01KNZZDS0161Z4VEWVVV0REP0A
print(result.chunk_records)  # [{chunk_id, doc_id, page_number, score, ...}, ...]
```

## What Gets Logged

```json
{
  "audit_id":        "01KNZZDS0161Z4VEWVVV0REP0A",
  "timestamp_utc":   "2026-04-12T04:32:59Z",
  "user_id":         "analyst_001",
  "jurisdiction":    "GDPR",
  "retention_until": "2026-10-09T04:32:59Z",
  "model_name":      "gpt-4o",
  "prompt_hash":     "880d6f35...",
  "query_text":      "What is the fire coverage limit?",
  "retrieved_chunks": [
    {
      "doc_id":           "insurance_policy_v2.pdf",
      "doc_version":      "2026-01-15",
      "page_number":      7,
      "similarity_score": 0.91,
      "retrieval_rank":   1
    }
  ],
  "response_hash":        "7eb675c8...",
  "total_tokens":         252,
  "retrieval_latency_ms": 13.4,
  "total_latency_ms":     161.2
}
```
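
The `prompt_hash` and `response_hash` fields let an auditor check that a prompt or response was not altered without the record carrying the full text. The sketch below assumes SHA-256 over UTF-8 text. The digest algorithm TrailRAG actually uses is not stated here, and `content_hash` and `matches` are hypothetical helpers, not part of the TrailRAG API.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a hex SHA-256 digest of the UTF-8 encoded text (assumed scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def matches(text: str, recorded_hash: str) -> bool:
    """Check a candidate text against the hash stored in an audit record."""
    return content_hash(text) == recorded_hash


response = "The fire coverage limit is $2.5M."
recorded = content_hash(response)         # what would be written at audit time
print(matches(response, recorded))        # the unchanged text verifies
print(matches(response + " ", recorded))  # any edit breaks verification
```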

Audit records are **append-only**. The only deletion path is `purge_expired()`, which removes records past their `retention_until`. `store.delete()` always raises `ComplianceViolationError`.
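
To make the append-only contract concrete, here is a minimal in-memory sketch of the behaviour described above. `InMemoryAuditStore` and its `append` method are hypothetical and exist only for illustration. Only the names `purge_expired`, `delete`, and `ComplianceViolationError` come from this README, and the real store is backed by SQLAlchemy rather than a dictionary.

```python
from datetime import datetime, timedelta, timezone


class ComplianceViolationError(Exception):
    """Raised when code tries to modify or delete an audit record."""


class InMemoryAuditStore:
    """Hypothetical append-only store; not TrailRAG's implementation."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def append(self, record: dict) -> None:
        # Refuse overwrites so an existing record can never be changed.
        if record["audit_id"] in self._records:
            raise ComplianceViolationError("audit records cannot be overwritten")
        self._records[record["audit_id"]] = dict(record)

    def delete(self, audit_id: str) -> None:
        # Ad hoc deletion is never allowed.
        raise ComplianceViolationError("use purge_expired() instead")

    def purge_expired(self, now: datetime) -> int:
        # Remove only records whose retention window has fully elapsed.
        expired = [k for k, r in self._records.items() if r["retention_until"] <= now]
        for key in expired:
            del self._records[key]
        return len(expired)

    def __len__(self) -> int:
        return len(self._records)


now = datetime(2026, 4, 12, 4, 32, 59, tzinfo=timezone.utc)
store = InMemoryAuditStore()
store.append({"audit_id": "a1", "retention_until": now - timedelta(days=1)})
store.append({"audit_id": "a2", "retention_until": now + timedelta(days=180)})
purged = store.purge_expired(now)
print(purged, len(store))  # one expired record removed, one retained
```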

## Compliance Coverage

| Regulation | Key Requirement | TrailRAG Field |
|---|---|---|
| EU AI Act Art. 12 | Decision traceability per inference | `retrieved_chunks` → `page_number`, `doc_id`, `similarity_score` |
| GDPR Art. 30 | Records of processing activities | `user_id`, `timestamp_utc`, `retention_until` |
| HIPAA §164.312(b) | Audit controls for PHI systems | `audit_id` + append-only store |
| Basel III | Credit risk model documentation | `model_version`, `prompt_hash`, `temperature` |

Retention is set automatically from the `jurisdiction` argument: GDPR → 180 days, HIPAA → 6 years, EU AI Act → 10 years, SOC 2 → 1 year.
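
As a worked example of how these periods translate into a `retention_until` value, the sketch below adds the period to the record timestamp. The `RETENTION` table and the `retention_until` helper are hypothetical. Apart from `"GDPR"`, the jurisdiction keys are guesses at identifiers, and multi-year periods are approximated as 365-day years, which may differ from what TrailRAG does.

```python
from datetime import datetime, timedelta, timezone

# Retention periods from the paragraph above. Multi-year periods use
# 365-day years, which is an assumption of this sketch.
RETENTION = {
    "GDPR": timedelta(days=180),
    "SOC2": timedelta(days=365),
    "HIPAA": timedelta(days=6 * 365),
    "EU_AI_ACT": timedelta(days=10 * 365),
}


def retention_until(timestamp_utc: datetime, jurisdiction: str) -> datetime:
    """Return the moment a record first becomes eligible for purging."""
    return timestamp_utc + RETENTION[jurisdiction]


logged_at = datetime(2026, 4, 12, 4, 32, 59, tzinfo=timezone.utc)
print(retention_until(logged_at, "GDPR").isoformat())
# 2026-10-09T04:32:59+00:00, the same instant as `retention_until` in the sample record
```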

## GDPR Subject Access Requests (Art. 15)

```python
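# `store` is the audit store; `start` and `end` are datetimes bounding the request window.
records = store.get_by_user("analyst_001", from_date=start, to_date=end)
# Export a single record, identified by its `audit_id`, as JSON.
store.export_json(audit_id, "/tmp/record_for_dpa.json")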
```

## Self-hosting (Docker)

Docker Compose is the fastest way to run the compliance dashboard on your own infrastructure.

### Prerequisites

- Docker ≥ 24 and Docker Compose ≥ 2

### 1 — Clone and configure

```bash
git clone https://github.com/LeQuocAnh123/RAG_control.git
cd RAG_control
```

Create a `.env` file in the project root (never commit this file):

```bash
# Required — set a strong random value; used for POST /api/purge
TRAILRAG_API_KEY=change-me-to-a-long-random-secret

# Optional — defaults shown below
TRAILRAG_DB_URL=sqlite:////data/audit.db
TRAILRAG_CORS_ORIGINS=*
```

### 2 — Build and start

```bash
docker compose up -d --build
```

### 3 — Open the dashboard

```
http://localhost:8080/dashboard
```

The API docs (OpenAPI / Swagger) are at `http://localhost:8080/api/docs`.

### 4 — Verify health

```bash
curl http://localhost:8080/health
# {"status": "ok", "total_records": 0}
```
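
For scripted deployments, the same check can be done from Python with the standard library. This is a sketch that assumes the response shape shown above. `parse_health` and `check_health` are hypothetical helpers, not part of TrailRAG.

```python
import json
import urllib.request


def parse_health(body: str) -> tuple[bool, int]:
    """Return (is_ok, total_records) from a /health response body."""
    payload = json.loads(body)
    return payload.get("status") == "ok", int(payload.get("total_records", 0))


def check_health(base_url: str = "http://localhost:8080", timeout: float = 5.0) -> tuple[bool, int]:
    """Fetch /health from a running instance and parse the result."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return parse_health(resp.read().decode("utf-8"))


# Parse the sample response; call check_health() against a live container.
print(parse_health('{"status": "ok", "total_records": 0}'))
```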

### Persistent storage

Audit records are stored in a Docker named volume (`trailrag_data`) mounted at
`/data` inside the container. Data survives container restarts and upgrades.

To back up:

```bash
# Quote the bind-mount path so working directories containing spaces still work
docker run --rm -v trailrag_data:/data -v "$(pwd)":/backup busybox \
  tar czf /backup/trailrag_backup_$(date +%Y%m%d).tar.gz /data
```

### Switching to PostgreSQL

Set `TRAILRAG_DB_URL` in your `.env`:

```bash
TRAILRAG_DB_URL=postgresql://user:password@postgres_host:5432/trailrag
```

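TrailRAG creates the `audit_log` table automatically on first start. The declared dependencies do not include a PostgreSQL driver, so the environment likely needs one installed (for example `psycopg2-binary`) for this URL to work.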

### Purging expired records

```bash
# Dry run — count eligible records
curl -X POST "http://localhost:8080/api/purge?dry_run=true" \
     -H "X-TrailRAG-Key: $TRAILRAG_API_KEY"

# Execute purge
curl -X POST "http://localhost:8080/api/purge" \
     -H "X-TrailRAG-Key: $TRAILRAG_API_KEY"
```

---

## Built On

- **[rag_control](https://github.com/LeQuocAnh123/RAG_control)** — runtime governance, RBAC, policy enforcement
- **SQLAlchemy 2.0** — SQLite (dev) / PostgreSQL (production)
- **Pydantic v2** — immutable, validated audit schema

## License

Apache 2.0
