Metadata-Version: 2.4
Name: datacruxai
Version: 0.3.0
Summary: Lightweight data quality toolkit for LLM instruction tuning. Deduplication, PII detection, contamination checking, and quality scoring — no GPU required.
Project-URL: Homepage, https://github.com/stef41/datacruxai
Project-URL: Repository, https://github.com/stef41/datacruxai
Project-URL: Issues, https://github.com/stef41/datacruxai/issues
Author: Zacharie B
License: Apache-2.0
License-File: LICENSE
Keywords: contamination,data-quality,deduplication,fine-tuning,instruction-tuning,llm,pii,rlhf,sft,training-data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: xxhash>=3.0
Provides-Extra: all
Requires-Dist: click>=8.0; extra == 'all'
Requires-Dist: datasketch>=1.6; extra == 'all'
Requires-Dist: pyarrow>=12.0; extra == 'all'
Requires-Dist: rich>=13.0; extra == 'all'
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == 'cli'
Requires-Dist: rich>=13.0; extra == 'cli'
Provides-Extra: formats
Requires-Dist: pyarrow>=12.0; extra == 'formats'
Provides-Extra: fuzzy
Requires-Dist: datasketch>=1.6; extra == 'fuzzy'
Provides-Extra: semantic
Requires-Dist: numpy>=1.21; extra == 'semantic'
Requires-Dist: sentence-transformers>=2.0; extra == 'semantic'
Description-Content-Type: text/markdown

# datacruxai

[![CI](https://github.com/stef41/datacruxai/actions/workflows/ci.yml/badge.svg)](https://github.com/stef41/datacruxai/actions/workflows/ci.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**Data quality toolkit for LLM instruction tuning.**

Clean your training data before fine-tuning. No GPU needed.

<p align="center">
  <img src="assets/quality_report.svg" width="720" alt="datacruxai quality report" />
</p>

```python
from datacruxai import load_dataset, exact_dedup, scan_examples, score_dataset

# Load any instruction-tuning dataset
examples = load_dataset("training_data.jsonl")

# Remove exact duplicates
result = exact_dedup(examples)
print(f"Removed {result.n_duplicates} duplicates")

# Scan for PII
pii_results = scan_examples(result.originals)
print(f"Found PII in {len(pii_results)} examples")

# Score quality
scores = score_dataset(result.originals, min_score=0.5)
print(f"{len(scores)} low-quality examples flagged")
```

## Why datacruxai?

If you're fine-tuning an LLM, your training data quality matters more than quantity. Garbage in, garbage out — except now garbage costs you GPU hours and makes your model worse.

Existing options are either overkill (NeMo Curator needs NVIDIA GPUs and processes terabytes of web crawl data) or too narrow (scattered scripts in random repos). datacruxai fills the gap: a single `pip install` that gives you everything needed to validate and clean instruction-tuning datasets on a laptop.

**What it does:**

- **Deduplication** — exact (hash-based) and near-duplicate (MinHash + LSH) detection
- **PII detection** — regex-based scanning for emails, phones, SSNs, credit cards, IPs
- **PII redaction** — replace detected PII with placeholders
- **Benchmark contamination** — n-gram overlap checking against MMLU, GSM8K, HellaSwag, ARC, TruthfulQA, WinoGrande
- **Quality scoring** — heuristic checks for instruction quality, response completeness, repetition, formatting
- **Format support** — Alpaca, ShareGPT, OpenAI chat format; JSONL, JSON, Parquet
- **Dataset statistics** — length distributions, token estimates, field coverage

Everything runs on CPU. Everything is deterministic. No API keys, no signups, no cloud dependencies.

## Install

```bash
pip install datacruxai
```

With fuzzy deduplication (MinHash + LSH):

```bash
pip install "datacruxai[fuzzy]"
```

With Parquet support:

```bash
pip install "datacruxai[formats]"
```

Everything above, plus the CLI dependencies (`click`, `rich`):

```bash
pip install "datacruxai[all]"
```

## Usage

### Load data

datacruxai auto-detects Alpaca, ShareGPT, and OpenAI chat formats.

```python
from datacruxai import load_dataset, detect_format

examples = load_dataset("training_data.jsonl")  # JSONL, JSON, or Parquet
print(f"Loaded {len(examples)} examples")
print(f"Format: {detect_format(examples[0].raw)}")
```

### Deduplicate

```python
from datacruxai import exact_dedup, fuzzy_dedup, dedup

# Fast exact dedup (hash-based)
result = exact_dedup(examples)
print(f"{result.n_total} → {result.n_unique} (removed {result.n_duplicates})")

# Near-duplicate detection (requires datacruxai[fuzzy])
result = fuzzy_dedup(examples, threshold=0.8)

# Combined: exact first, then fuzzy on the remainder
result = dedup(examples, exact=True, fuzzy=True, fuzzy_threshold=0.8)
clean_examples = result.originals
```
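
For intuition about what the two dedup modes involve, here is a minimal standard-library sketch. It is not datacruxai's implementation: the package depends on `xxhash`, while this sketch swaps in `hashlib` so it has no dependencies. The function names and the whitespace/lowercase normalization are assumptions made for illustration. The second function assumes that `threshold` refers to Jaccard similarity, which is the quantity MinHash estimates.

```python
import hashlib


def exact_dedup_sketch(texts):
    """Return (unique_texts, n_duplicates), keeping the first occurrence of each text."""
    seen = set()
    unique = []
    for text in texts:
        # Hypothetical normalization: collapse whitespace and lowercase before hashing.
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique, len(texts) - len(unique)


def _shingles(text, k=3):
    """Set of k-word shingles; texts shorter than k words yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard_sketch(a, b, k=3):
    """Exact Jaccard similarity of two texts' shingle sets (MinHash approximates this)."""
    sa, sb = _shingles(a, k), _shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Computing exact Jaccard similarity for every pair is quadratic in the dataset size, which is the cost that MinHash signatures and LSH bucketing are designed to avoid.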

### Detect PII

```python
from datacruxai import scan_text, scan_examples, redact_text, redact_examples

# Scan a single string
entities = scan_text("Email me at john@example.com or call 555-0123")
for e in entities:
    print(f"  {e.kind}: '{e.text}' at [{e.start}:{e.end}]")

# Scan an entire dataset
pii_results = scan_examples(examples)
for r in pii_results:
    print(f"  Example {r.example_index}: {[e.kind for e in r.entities]}")

# Redact PII
safe = redact_text("SSN: 123-45-6789")
# "SSN: [SSN]"

safe_examples = redact_examples(examples)
```
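
To illustrate what regex-based scanning and placeholder redaction look like in principle, here is a small sketch using only the `re` module. The patterns, the `PATTERNS` table, and the function names are hypothetical and deliberately simple. They are not the patterns datacruxai ships, and production PII patterns need considerably more care.

```python
import re

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_sketch(text):
    """Return (kind, matched_text, start, end) tuples sorted by position."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


def redact_sketch(text):
    """Replace every match with a bracketed placeholder such as [SSN]."""
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(f"[{kind}]", text)
    return text
```

Regex scanning is fast and deterministic, but it only finds what the patterns describe, so names and free-form addresses are out of reach for this approach.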

### Check benchmark contamination

```python
from datacruxai import check_contamination, list_benchmarks

# Built-in benchmarks
print(list_benchmarks())
# ['arc', 'gsm8k', 'hellaswag', 'mmlu', 'truthfulqa', 'winogrande']

report = check_contamination(examples, ngram_size=8)
print(f"Flagged {report.total_flagged} / {report.total_checked} examples")
for bench, count in report.by_benchmark.items():
    print(f"  {bench}: {count} matches")

# Custom benchmark dataset
custom = {"my_eval": ["question one text", "question two text"]}
report = check_contamination(examples, benchmarks=custom, ngram_size=5)
```
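
The idea behind n-gram overlap checking can be sketched in a few lines. This version assumes word-level n-grams and flags an example as soon as it shares a single n-gram with any benchmark text. Whether datacruxai tokenizes by word or character, and what match rule it applies, is not stated here, so treat `ngrams_sketch` and `overlaps_sketch` as illustrative names rather than the library's API.

```python
def ngrams_sketch(text, n):
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlaps_sketch(example_text, benchmark_texts, n=8):
    """True if the example shares at least one word n-gram with any benchmark text."""
    benchmark_ngrams = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams_sketch(text, n)
    return not ngrams_sketch(example_text, n).isdisjoint(benchmark_ngrams)
```

Under this word-level assumption, a larger `n` produces fewer false positives on common phrases, while a smaller `n` catches paraphrased or truncated benchmark items at the cost of more noise.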

### Score quality

```python
from datacruxai import score_example, score_dataset, filter_by_quality

# Score a single example
score = score_example(examples[0])
print(f"Overall: {score.overall:.2f}")
print(f"Details: {score.details}")
print(f"Flags: {score.flags}")

# Flag low-quality examples
low_quality = score_dataset(examples, min_score=0.5)
for s in low_quality:
    print(f"  [{s.example_index}] {s.overall:.2f} — {s.flags}")

# Filter and keep only good examples
clean = filter_by_quality(examples, min_score=0.5)
print(f"Kept {len(clean)} / {len(examples)}")
```

### Dataset statistics

```python
from datacruxai import compute_stats, length_distribution

stats = compute_stats(examples)
print(f"Examples: {stats['n_examples']}")
print(f"Token estimate: ~{stats['token_estimate']:,}")
print(f"Empty outputs: {stats['empty_outputs']}")
print(f"Avg instruction length: {stats['instruction_lengths']['mean']:.0f} chars")

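# Length histogram (bars scaled to at most 40 characters so large counts stay readable)
hist = length_distribution(examples, field="output", bins=10)
peak = max((bucket['count'] for bucket in hist), default=0) or 1
for bucket in hist:
    bar = '█' * round(40 * bucket['count'] / peak)
    print(f"  {bucket['range']}: {bar} {bucket['count']}")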
```

### Save results

```python
from datacruxai import save_jsonl

save_jsonl(clean_examples, "cleaned_training_data.jsonl")
```

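## CLI

The `cli` extra installs `click` and `rich`, the command-line dependencies: `pip install "datacruxai[cli]"` (also included in `datacruxai[all]`).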

```bash
# Dataset statistics
datacruxai stats training_data.jsonl

# Deduplicate
datacruxai dedup training_data.jsonl -o deduped.jsonl

# Fuzzy dedup
datacruxai dedup training_data.jsonl --fuzzy -t 0.8 -o deduped.jsonl

# PII scan
datacruxai pii training_data.jsonl

# PII redact and save
datacruxai pii training_data.jsonl -o redacted.jsonl

# Contamination check
datacruxai contamination training_data.jsonl -n 8

# Quality scoring
datacruxai quality training_data.jsonl -t 0.5 -o filtered.jsonl
```

## Quality Checks

The quality scorer applies these deterministic heuristics:

| Check | Weight | What it catches |
|-------|--------|-----------------|
| Instruction quality | 25% | Empty, trivial, or all-caps instructions |
| Response completeness | 30% | Empty, trivial, or refusal-only responses |
| Length | 15% | Extremely short or long examples |
| Repetition | 20% | Repeated words, repeated n-grams |
| Language | 10% | Excessive special characters, whitespace |

Scores range from 0.0 (terrible) to 1.0 (clean). The default threshold of 0.5 catches the obvious problems without being overly aggressive.
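
One plausible reading of the weights above is a weighted average of per-check scores, each in the range 0.0 to 1.0. The table suggests this but does not state it, so the following is a sketch under that assumption. The `WEIGHTS` keys and the `overall_sketch` name are hypothetical.

```python
# Weights taken from the table above; the key names are illustrative.
WEIGHTS = {
    "instruction": 0.25,
    "response": 0.30,
    "length": 0.15,
    "repetition": 0.20,
    "language": 0.10,
}


def overall_sketch(sub_scores):
    """Weighted average of per-check scores, each expected in [0.0, 1.0]."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
```

Under this assumption, an example that scores 0.0 on response completeness and 1.0 everywhere else lands at 0.70, so no single failed check is enough to push an example below the 0.5 default threshold.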

## Supported Formats

| Format | Detected by (key fields) | Notes |
|--------|--------------------------|-------|
| Alpaca | `instruction`, `input`, `output` | Standard fine-tuning format |
| ShareGPT | `conversations` | Multi-turn with `from`/`value` |
| OpenAI chat | `messages` | `role`/`content` pairs |

File types: `.jsonl`, `.json`, `.parquet` (with `datacruxai[formats]`)
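
Detection by key fields, as listed in the table above, can be pictured as a simple lookup on a record's keys. This is a sketch, not the library's `detect_format`: the function name, the returned labels, and the order of the checks are all assumptions.

```python
def detect_format_sketch(record):
    """Guess the dataset format of one record from its top-level keys."""
    if "conversations" in record:
        return "sharegpt"  # hypothetical label
    if "messages" in record:
        return "openai"  # hypothetical label
    if "instruction" in record and "output" in record:
        return "alpaca"  # hypothetical label
    return "unknown"
```

This sketch treats Alpaca's `input` field as optional, since many Alpaca-style records leave it empty or omit it.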

## Performance

Everything is single-threaded and CPU-only by design. On a typical laptop:

- **Exact dedup**: ~100k examples/sec
- **PII scan**: ~50k examples/sec
- **Quality scoring**: ~80k examples/sec
- **Contamination check**: depends on n-gram size, ~10k examples/sec for n=8

For datasets under 1M examples, everything runs in seconds to minutes. If you're working with larger datasets, consider NeMo Curator (but you'll need GPUs).

## Contributing

PRs welcome — especially:
- Additional PII patterns (non-US phone formats, EU identifiers)
- More benchmark fingerprints
- New quality heuristics
- Performance improvements

```bash
git clone https://github.com/stef41/datacruxai.git
cd datacruxai
pip install -e ".[all]"
pip install pytest ruff
pytest
```

## See Also

Part of the **stef41 LLM toolkit** — open-source tools for every stage of the LLM lifecycle:

| Project | What it does |
|---------|-------------|
| [tokonomics](https://github.com/stef41/tokonomix) | Token counting & cost management for LLM APIs |
| [castwright](https://github.com/stef41/castwright) | Synthetic instruction data generation |
| [datamix](https://github.com/stef41/datamix) | Dataset mixing & curriculum optimization |
| [toksight](https://github.com/stef41/toksight) | Tokenizer analysis & comparison |
| [trainpulse](https://github.com/stef41/trainpulse) | Training health monitoring |
| [ckpt](https://github.com/stef41/ckptkit) | Checkpoint inspection, diffing & merging |
| [quantbench](https://github.com/stef41/quantbenchx) | Quantization quality analysis |
| [infermark](https://github.com/stef41/infermark) | Inference benchmarking |
| [modeldiff](https://github.com/stef41/modeldiffx) | Behavioral regression testing |
| [vibesafe](https://github.com/stef41/vibesafex) | AI-generated code safety scanner |
| [injectionguard](https://github.com/stef41/injectionguard) | Prompt injection detection |

## License

Apache 2.0
