Metadata-Version: 2.4
Name: truva
Version: 0.1.3
Summary: Data curation engine for LLM fine-tuning
Author: Turing Spark
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: sentence-transformers>=2.2
Requires-Dist: transformers>=4.30
Requires-Dist: tqdm>=4.60
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: api
Requires-Dist: openai>=1.0; extra == "api"
Requires-Dist: anthropic>=0.20; extra == "api"
Provides-Extra: local
Requires-Dist: vllm>=0.3; extra == "local"
Requires-Dist: ollama>=0.1; extra == "local"
Provides-Extra: hf
Requires-Dist: datasets>=2.14; extra == "hf"
Provides-Extra: all
Requires-Dist: truva[api,hf,local]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Dynamic: license-file

# Truva

**Truva curates your fine-tuning data so you train on signal, not noise.**

A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset — starting with semantic deduplication today, with quality scoring and contradiction detection coming soon.

**Goal:** Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.

## Quick Install

```bash
pip install truva
```

## 30-Second Example

```bash
# Deduplicate a dataset with default settings
truva dedupe ./data.jsonl --output ./deduped.jsonl

# Deduplicate with a custom threshold and generate a report
truva dedupe ./data.jsonl --threshold 0.9 --output ./deduped.jsonl --report ./report.json

# Generate embeddings for a dataset
truva embed ./data.jsonl --output ./embeddings.npy
```

## What It Does

| Before | After |
|--------|-------|
| 50,000 rows | 12,000 rows |
| Redundant examples | Unique, representative samples |
| Unknown quality | Scored and filtered *(coming soon)* |
| Hidden contradictions | Flagged for review *(coming soon)* |

## Features

### Semantic Deduplication

Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).

```bash
truva dedupe ./data.jsonl --threshold 0.95
```

- `--threshold 0.95` (default): removes near-identical rows; safe for most fine-tuning datasets
- `--threshold 0.85`: more aggressive; also merges paraphrases
- `--threshold 1.0`: most conservative; removes only rows whose embeddings match exactly
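
The clustering step described above can be sketched in plain Python. This is a hedged illustration of the technique (Union-Find over pairwise cosine similarity, keeping the member closest to each cluster's centroid), not Truva's actual code; the function names are hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find(parent, i):
    """Union-Find root lookup with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def dedupe(embeddings, threshold=0.95):
    """Return sorted indices of the rows to keep, one per cluster."""
    n = len(embeddings)
    parent = list(range(n))
    # Union any pair whose similarity meets the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(parent, i)] = find(parent, j)
    # Group row indices by their cluster root.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    # Keep the member closest to each cluster's centroid.
    kept = []
    for members in clusters.values():
        dim = len(embeddings[members[0]])
        centroid = [sum(embeddings[m][d] for m in members) / len(members)
                    for d in range(dim)]
        kept.append(max(members, key=lambda m: cosine(embeddings[m], centroid)))
    return sorted(kept)
```

The pairwise loop is O(n²); a production implementation would typically use an approximate-nearest-neighbor index before unioning, but the keep-one-representative-per-cluster logic is the same.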

### Embedding Generation

Compute vector embeddings for your dataset using local models or the OpenAI API.

```bash
# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2

# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
```

### Example Report Output

When you pass `--report ./report.json`, Truva writes a structured summary of what it found:

```json
{
  "input_rows": 50000,
  "kept_rows": 12380,
  "removed_rows": 37620,
  "reduction_pct": 75.24,
  "threshold": 0.95,
  "num_clusters": 12380,
  "clusters": [
    {
      "representative_idx": 41,
      "size": 23,
      "avg_similarity": 0.9812
    },
    {
      "representative_idx": 7,
      "size": 14,
      "avg_similarity": 0.9734
    }
  ]
}
```

Each cluster shows the representative row kept, how many duplicates were merged, and the average pairwise similarity within the group.
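
A report like the one above can be consumed with nothing but the standard library. A minimal sketch, assuming the field names shown in the example are stable (`summarize_report` is a hypothetical helper, not part of Truva):

```python
import json

def summarize_report(path):
    """Load a dedup report and return a one-line summary string."""
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    removed = report["input_rows"] - report["kept_rows"]
    pct = 100.0 * removed / report["input_rows"]
    # The largest cluster is where the most duplication was found.
    biggest = max(report["clusters"], key=lambda c: c["size"], default=None)
    line = f"kept {report['kept_rows']}/{report['input_rows']} rows ({pct:.2f}% removed)"
    if biggest is not None:
        line += f"; largest cluster merged {biggest['size']} rows"
    return line
```

For the example report above this yields `kept 12380/50000 rows (75.24% removed); largest cluster merged 23 rows`.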

## Supported Formats

- **JSONL** — One JSON object per line (`.jsonl`, `.json`)
- **CSV** — Auto-detects the text column or use `--text-field`
- **Hugging Face Datasets** — Pass a dataset identifier like `username/dataset`
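
For reference, the JSONL convention is simply one JSON object per line. A minimal reader sketch (not Truva's loader; the auto-detection shown here is a naive guess at the behavior, namely "first string-valued field"):

```python
import json

def read_jsonl(path, text_field=None):
    """Yield the text of each row in a JSONL file.

    If text_field is None, fall back to the first string-valued
    field of the first non-empty row.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if text_field is None:
                text_field = next(k for k, v in row.items()
                                  if isinstance(v, str))
            yield row[text_field]
```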

## Configuration

All options are available as CLI flags:

```
--threshold FLOAT       Cosine similarity threshold for dedup (0.0–1.0)
--provider [local|api]  Embedding provider
--model TEXT            Model name
--text-field TEXT       Column/field to use (auto-detected if not set)
--format TEXT           Input format: auto, jsonl, csv, hf
--output, -o TEXT       Output file path
--report TEXT           Path for JSON report
```

## Roadmap

- **Quality scoring** — LLM-based information density scoring to filter low-value rows
- **Contradiction detection** — Flag rows that teach conflicting information
- **Calibration** — Human-in-the-loop threshold tuning

## Requirements

- Python 3.10+
- Works on macOS (Apple Silicon) and Linux

## License

Apache 2.0

## Feedback

Found a bug or have a feature request? Send us an email at team@turingspark.com — we'd love to hear from you.
