Metadata-Version: 2.4
Name: truva
Version: 0.2.0
Summary: Data curation engine for LLM fine-tuning
Author: Turing Spark
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: sentence-transformers>=2.2
Requires-Dist: transformers>=4.30
Requires-Dist: tqdm>=4.60
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: api
Requires-Dist: openai>=1.0; extra == "api"
Requires-Dist: anthropic>=0.20; extra == "api"
Provides-Extra: local
Requires-Dist: vllm>=0.3; extra == "local"
Requires-Dist: ollama>=0.1; extra == "local"
Provides-Extra: hf
Requires-Dist: datasets>=2.14; extra == "hf"
Provides-Extra: all
Requires-Dist: truva[api,hf,local]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Dynamic: license-file

# Truva

**Truva curates your fine-tuning data so you train on signal, not noise.**

A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset by removing redundancy, scoring information density, and detecting contradictions.

**Goal:** Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.

## Quick Install

```bash
pip install truva
```

## 30-Second Example

```bash
# Deduplicate a dataset
truva dedupe ./data.jsonl --output ./deduped.jsonl

# Score for information density
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# Detect contradicting rows
truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json

# Full pipeline in one command
truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --min-quality 6 --detect-contradictions \
  --output ./results/
```

## What It Does

| Before | After |
|--------|-------|
| 50,000 rows | 12,000 rows |
| Redundant examples | Unique, representative samples |
| Unknown quality | Scored and filtered |
| Hidden contradictions | Flagged for review |

## Features

### Semantic Deduplication

Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).

```bash
truva dedupe ./data.jsonl --threshold 0.95
```

- `--threshold 0.95` (default): Catches near-duplicates; safe for most fine-tuning datasets
- `--threshold 0.85`: More aggressive, catches paraphrases
- `--threshold 1.0`: Only removes exact semantic matches
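
The clustering step can be sketched with a tiny Union-Find over pairwise cosine similarities (a simplified, pure-Python illustration, not Truva's actual implementation; the 2-D vectors below stand in for real sentence embeddings):

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

def cluster(embeddings, threshold=0.95):
    """Union rows whose cosine similarity meets the threshold,
    then group row indices by their cluster root."""
    uf = UnionFind(len(embeddings))
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            uf.union(i, j)
    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values())

# Toy 2-D "embeddings": rows 0 and 1 point the same way, row 2 differs.
emb = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(cluster(emb, threshold=0.95))  # → [[0, 1], [2]]
```

In the real pipeline, each cluster then keeps only the row closest to its centroid as the representative.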

### Quality Scoring

Scores each row on a 1–10 scale for educational value using an LLM judge. Drop low-quality rows with `--min-quality`.

```bash
# Free, local (requires Ollama running locally)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# OpenAI
truva score ./data.jsonl --provider openai --model gpt-4o-mini --min-quality 6

# Anthropic
truva score ./data.jsonl --provider anthropic --model claude-haiku-4-5-20251001 --min-quality 6

# With a domain-specific calibration file
truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7

# Interactive dry-run: score 50 samples and give feedback before committing
truva score ./data.jsonl --interactive --sample 50
```

**Fast mode (default):** Scores only cluster representatives from a prior dedup run, then propagates scores to cluster members with a −1 penalty. 5–10x cheaper than scoring every row.

**Thorough mode:** Scores every row individually.

```bash
truva score ./data.jsonl --mode thorough --provider openai --model gpt-4o-mini
```
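
The fast-mode propagation can be sketched as follows (a minimal illustration; the function name and data layout are assumptions, and the floor of 1 keeps propagated scores on the 1–10 scale):

```python
def propagate_scores(clusters, rep_scores, penalty=1):
    """Spread each representative's judged score to its cluster members.

    clusters: list of clusters, each a list of row indices whose first
    element is the representative. rep_scores: one score per cluster.
    Members inherit the representative's score minus a penalty,
    floored at 1 to stay on the 1-10 scale.
    """
    scores = {}
    for members_with_rep, rep_score in zip(clusters, rep_scores):
        rep, *members = members_with_rep
        scores[rep] = rep_score
        for m in members:
            scores[m] = max(1, rep_score - penalty)
    return scores

# Two clusters: rows {0, 3, 7} share representative row 0; row 5 is alone.
print(propagate_scores([[0, 3, 7], [5]], [8, 4]))
# → {0: 8, 3: 7, 7: 7, 5: 4}
```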

### Contradiction Detection

Finds rows that teach conflicting information using a local NLI model (`cross-encoder/nli-deberta-v3-base`). Only semantically similar rows (within the same cluster) are compared, keeping the number of checks tractable.

```bash
truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json
```

Output is a JSON report listing each contradicting pair, the texts, and the NLI confidence score:

```json
{
  "input_rows": 1000,
  "num_clusters": 312,
  "confidence_threshold": 0.8,
  "contradictions_found": 4,
  "contradictions": [
    {
      "row_a_id": "c1a",
      "row_b_id": "c1b",
      "text_a": "Refunds are processed instantly.",
      "text_b": "Refunds take 3 to 5 business days.",
      "confidence": 0.9341
    }
  ]
}
```

No API key needed — NLI runs entirely on CPU.
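
Restricting checks to within-cluster pairs is what keeps this tractable: all-pairs NLI over n rows is O(n²), while per-cluster pairs grow only with cluster size. A sketch of the pair enumeration (illustrative only; the actual NLI scoring step is not shown):

```python
from itertools import combinations

def candidate_pairs(clusters):
    """Yield only within-cluster row pairs for NLI checking."""
    for cluster in clusters:
        yield from combinations(cluster, 2)

# 1000 rows in 250 clusters of 4: the naive all-pairs scan would need
# 1000 * 999 / 2 = 499500 NLI calls; clustering first cuts that to 1500.
clusters = [list(range(i, i + 4)) for i in range(0, 1000, 4)]
pairs = list(candidate_pairs(clusters))
print(len(pairs))  # → 1500
```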

### Full Pipeline (audit)

Run deduplication, quality scoring, and contradiction detection in one command. Writes `gold.jsonl` and `report.json` to the output directory.

```bash
truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --min-quality 6 --detect-contradictions \
  --output ./results/
```

After the run, inspect the results:

```bash
# View top duplicate clusters
truva inspect ./results/report.json --what clusters --top 10

# View contradiction pairs
truva inspect ./results/report.json --what contradictions

# Export gold dataset as CSV
truva export ./results/gold.jsonl --format csv -o ./gold.csv

# Quick health check on any dataset
truva health ./data.jsonl
```

### Embedding Generation

Compute vector embeddings for your dataset using local models or the OpenAI API.

```bash
# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2

# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
```

### Example Dedup Report

When you pass `--report ./report.json` to `truva dedupe`, Truva writes a structured summary:

```json
{
  "input_rows": 50000,
  "kept_rows": 12380,
  "removed_rows": 37620,
  "reduction_pct": 75.24,
  "threshold": 0.95,
  "num_clusters": 12380,
  "clusters": [
    {
      "representative_idx": 41,
      "size": 23,
      "avg_similarity": 0.9812
    }
  ]
}
```
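
The summary fields follow directly from the cluster sizes: one representative survives per cluster, so `kept_rows` equals `num_clusters`. A minimal sketch of the arithmetic (illustrative; not Truva's actual report code):

```python
def dedupe_summary(cluster_sizes, threshold):
    """Compute the top-level dedup report fields from cluster sizes."""
    input_rows = sum(cluster_sizes)
    kept = len(cluster_sizes)          # one representative per cluster
    removed = input_rows - kept
    return {
        "input_rows": input_rows,
        "kept_rows": kept,
        "removed_rows": removed,
        "reduction_pct": round(100 * removed / input_rows, 2),
        "threshold": threshold,
        "num_clusters": kept,
    }

# One 23-row cluster plus two singletons: 25 rows in, 3 out.
print(dedupe_summary([23, 1, 1], threshold=0.95))
```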

## Connecting to API Providers

### OpenAI

1. Get an API key from https://platform.openai.com/api-keys
2. Set the environment variable:

```bash
export OPENAI_API_KEY=sk-...
```

Then use `--provider openai` with any `truva` command.

### Anthropic

1. Get an API key from https://console.anthropic.com/settings/keys
2. Set the environment variable:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

Then use `--provider anthropic` with any `truva` command.

### Ollama (Free, Local)

Install Ollama from https://ollama.com, then pull a model:

```bash
ollama pull llama3:8b
```

No API key needed. Use `--provider ollama` with `truva score`.

## Calibration

Calibration files let you inject domain-specific scoring rules and few-shot examples into the judge prompt. This is useful when the default judge undervalues domain-specific shorthand (e.g., medical abbreviations, legal citations).

```yaml
# calibration_medical.yaml
rubric: |
  Prioritize clinical accuracy over grammar.
  Short medical abbreviations (SOB, CHF, hx) are acceptable.
  Penalize vague or speculative language.

examples:
  - text: "pt presents w/ SOB, hx of CHF, started on furosemide 40mg"
    score: 9
    reasoning: "Clinically precise with actionable treatment detail"
  - text: "The patient is not feeling well and might have a heart problem"
    score: 2
    reasoning: "Vague, no clinical specificity"
```

```bash
truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7
```

See `examples/` for ready-made calibration files for medical and customer support domains.
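
How a calibration file might be folded into the judge prompt can be sketched as follows (an illustration of the idea only; Truva's actual prompt template is internal, and the function below is hypothetical). The calibration dict mirrors the YAML structure above:

```python
def build_judge_prompt(text, calibration=None):
    """Assemble a 1-10 quality-judging prompt, optionally injecting a
    domain rubric and few-shot examples from a calibration dict."""
    parts = ["Rate the educational value of the text below from 1 to 10."]
    if calibration:
        parts.append("Domain rubric:\n" + calibration["rubric"].strip())
        for ex in calibration.get("examples", []):
            parts.append(
                f'Example (score {ex["score"]}): "{ex["text"]}"\n'
                f'Reasoning: {ex["reasoning"]}'
            )
    parts.append(f'Text to rate: "{text}"')
    return "\n\n".join(parts)

calib = {
    "rubric": "Prioritize clinical accuracy over grammar.",
    "examples": [{"text": "pt presents w/ SOB", "score": 9,
                  "reasoning": "Clinically precise"}],
}
print(build_judge_prompt("BP 120/80, afebrile", calibration=calib))
```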

## Supported Formats

- **JSONL** — One JSON object per line (`.jsonl`, `.json`)
- **CSV** — Auto-detects the text column, or specify one with `--text-field`
- **Hugging Face Datasets** — Pass a dataset identifier like `username/dataset`
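
A plausible sketch of the text-field auto-detection (the actual heuristic is internal to Truva; the preferred-name list and average-length fallback below are assumptions):

```python
def guess_text_field(rows, preferred=("text", "content", "prompt", "instruction")):
    """Pick the field most likely to hold the training text: a known
    name if present, otherwise the string field with the longest
    average length."""
    fields = rows[0].keys()
    for name in preferred:
        if name in fields:
            return name
    str_fields = [f for f in fields if isinstance(rows[0][f], str)]
    return max(str_fields,
               key=lambda f: sum(len(r[f]) for r in rows) / len(rows))

rows = [{"id": "a1", "body": "How do I reset my password?"},
        {"id": "a2", "body": "Refunds take 3 to 5 business days."}]
print(guess_text_field(rows))  # → body
```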

## Configuration

**`truva dedupe`**
```
--threshold FLOAT         Cosine similarity threshold (0.0–1.0), default 0.95
--provider [local|api]    Embedding provider
--model TEXT              Embedding model name
--text-field TEXT         Column/field to embed (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
--output, -o TEXT         Output file path
--report TEXT             Path for JSON report
```

**`truva score`**
```
--provider                Judge model provider: ollama, openai, anthropic
--model TEXT              Judge model name
--mode [fast|thorough]    Scoring mode, default fast
--min-quality FLOAT       Drop rows below this score (0.0 = keep all)
--calibration TEXT        Path to calibration YAML file
--text-field TEXT         Column/field to score (auto-detected if not set)
--output, -o TEXT         Output file path
--report TEXT             Path for scoring report JSON
--interactive             Interactive dry-run mode
--sample INT              Number of rows to sample in interactive mode
```

**`truva contradict`**
```
--confidence FLOAT        Min NLI confidence to flag a contradiction, default 0.8
--nli-model TEXT          NLI model name, default cross-encoder/nli-deberta-v3-base
--text-field TEXT         Column/field to compare (auto-detected if not set)
--dedupe-threshold FLOAT  Similarity threshold for clustering before NLI check, default 0.95
--format TEXT             Input format: auto, jsonl, csv, hf
--output, -o TEXT         Output path for contradiction report JSON
```

**`truva audit`**
```
--provider TEXT           Judge model provider: ollama, openai, anthropic
--model TEXT              Judge model name
--embed-provider TEXT     Embedding provider: local, api (default: local)
--embed-model TEXT        Embedding model name
--dedupe-threshold FLOAT  Cosine similarity threshold for dedup, default 0.95
--min-quality FLOAT       Drop rows below this score, default 6.0
--mode [fast|thorough]    Scoring mode, default fast
--detect-contradictions   Run NLI contradiction detection after dedup
--contradiction-confidence FLOAT  Min confidence to flag, default 0.8
--text-field TEXT         Column/field name (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
--calibration TEXT        Path to calibration YAML file
--output, -o TEXT         Output directory (default: ./truva_output)
```

**`truva inspect`**
```
REPORT_PATH               Path to report.json from a prior audit run
--what [clusters|contradictions]  What to inspect (default: clusters)
--top INT                 How many entries to show (default: 20)
```

**`truva export`**
```
GOLD_PATH                 Path to gold.jsonl from a prior audit run
--format [jsonl|csv]      Output format (default: jsonl)
--columns TEXT            Comma-separated fields to include (default: all)
--output, -o TEXT         Output file path
```

**`truva health`**
```
INPUT_PATH                Dataset to inspect
--text-field TEXT         Column/field to analyse (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
```

## Roadmap

- **Caching** — Skip recomputation on re-runs
- **Data leakage detection** — Flag training rows that overlap with your eval set

## Quickstart Script

A runnable demo script is included in the repo:

```bash
bash examples/quickstart.sh
```

It runs a health check, full audit with local embeddings, cluster inspection, and CSV export on the bundled 50-row fixture — no API key required.

## Development

```bash
# Clone and install in editable mode with dev dependencies
git clone https://github.com/turingspark/truva
cd truva
pip install -e ".[dev]"

# Run fast unit tests (no models required)
pytest tests/ -v -m "not integration"

# Run full test suite including end-to-end integration tests
pytest tests/ -v
```

Integration tests load real embedding (`all-MiniLM-L6-v2`) and NLI (`cross-encoder/nli-deberta-v3-base`) models locally — no API keys needed, but they take a few minutes.

## Requirements

- Python 3.10+
- Works on macOS (Apple Silicon) and Linux

## License

Apache 2.0

## Feedback

Found a bug or have a feature request? Send us an email at team@turingspark.com — we'd love to hear from you.
