Metadata-Version: 2.4
Name: cleanllm
Version: 0.4.0
Summary: Streaming JSONL cleaner for LLM fine-tuning datasets.
Author: cleanllm contributors
License: MIT License
        
        Copyright (c) 2026 cleanllm contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: alpaca,audit,chatml,cleaning,dataset,deduplication,fine-tuning,jsonl,llm,sampling,sft,sharegpt
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Requires-Dist: orjson>=3.9.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: typer>=0.12.0
Provides-Extra: dev
Requires-Dist: build>=1.0.0; extra == 'dev'
Requires-Dist: twine>=4.0.2; extra == 'dev'
Provides-Extra: hf
Requires-Dist: datasets>=2.14.0; extra == 'hf'
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.5.0; extra == 'tiktoken'
Description-Content-Type: text/markdown

# cleanllm

**Streaming JSONL cleaner for LLM fine-tuning datasets.** Minimal dependencies, low memory use, and fast: files are processed line by line rather than loaded into memory.

[![PyPI](https://img.shields.io/pypi/v/cleanllm)](https://pypi.org/project/cleanllm/)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)

---

## What it does

cleanllm gives you a pipeline for cleaning, validating, and profiling JSONL datasets before fine-tuning:

```
raw.jsonl → scan → fix → dedup → validate → stats → audit bundle → shards
```

Every step is streaming (no full-file load), resumable, and produces machine-readable JSON reports for CI gating.

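To make "streaming" concrete, here is a minimal sketch of a line-by-line JSONL reader. It is an illustration only, not cleanllm's implementation: `iter_jsonl` is a hypothetical helper, and it uses the standard-library `json` module rather than `orjson`.

```python
import json

def iter_jsonl(lines):
    """Yield (line_number, record) pairs one line at a time.

    `lines` can be an open file handle or any iterable of strings, so only
    the current line is held in memory. Invalid JSON yields None as the record.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, but keep original line numbering
        try:
            yield line_no, json.loads(line)
        except json.JSONDecodeError:
            yield line_no, None  # the caller decides whether to drop or report

# Hypothetical usage with a file on disk:
# with open("data.jsonl", encoding="utf-8") as f:
#     for line_no, record in iter_jsonl(f):
#         ...
```
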
---

## Install

```bash
pip install cleanllm
```

Or from source:

```bash
git clone https://github.com/verma8076/cleanllm
cd cleanllm
pip install -e .
```

---

## Quickstart

```bash
# Scan for issues
cleanllm scan data.jsonl

# Fix: remove URLs, normalize whitespace, redact forbidden patterns
cleanllm fix data.jsonl -o data.cleaned.jsonl

# Deduplicate by prompt content
cleanllm dedup data.cleaned.jsonl -o data.dedup.jsonl --by prompt

# Profile the cleaned dataset
cleanllm stats data.dedup.jsonl --report-json stats.json

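# Compare against a previous stats report, then gate in CI: fail if invalid rows increased
cleanllm compare old_stats.json stats.json --report-json compare.json
cleanllm gate --compare compare.json --rules gate_rules.json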
```

---

## CLI reference

### `scan`
Streaming scan for issues — invalid JSON, missing keys, URLs, forbidden patterns, language distribution, duplicate estimate.

```bash
cleanllm scan data.jsonl
cleanllm scan data.jsonl --report-json scan_report.json --dup-estimate
cleanllm scan data.jsonl --preset usaco_portable
```

### `fix`
Remove URLs, normalize whitespace, redact or drop rows with forbidden patterns.

```bash
cleanllm fix data.jsonl -o cleaned.jsonl
cleanllm fix data.jsonl -o cleaned.jsonl --drop-on forbidden_pattern --drop-on invalid_json
cleanllm fix data.jsonl -o cleaned.jsonl --preset cpp17_clean --report-json fix_report.json
```

Drop rules (values for `--drop-on`): `invalid_json`, `missing_required_keys`, `forbidden_pattern`, `empty_assistant`, `placeholder`, `repetitive_response`, `bad_conversation`.

> **Note on `empty_assistant`:** By default this drops assistant responses shorter than 20 characters — calibrated for code datasets where very short responses are almost always errors. For text/chat datasets, set `--min-assistant-chars 1` to only drop truly blank responses.

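The threshold described above can be sketched as a simple length check. This is an assumption about the behavior, not the library's code: `is_empty_assistant` is a hypothetical function, and whether cleanllm strips whitespace first or inspects every assistant turn is not stated in this README.

```python
def is_empty_assistant(record, min_assistant_chars=20):
    """Return True if any assistant message is shorter than the threshold.

    Assumption: length is measured after stripping surrounding whitespace.
    """
    for msg in record.get("messages", []):
        if msg.get("role") != "assistant":
            continue
        content = str(msg.get("content", "")).strip()
        if len(content) < min_assistant_chars:
            return True
    return False

record = {
    "id": "demo",
    "messages": [
        {"role": "user", "content": "Is 7 prime?"},
        {"role": "assistant", "content": "Yes."},
    ],
}
flag_for_code = is_empty_assistant(record)                         # default threshold of 20
flag_for_chat = is_empty_assistant(record, min_assistant_chars=1)  # only blank responses
```

With the default threshold the short answer is flagged; with a threshold of 1 it is kept, which is why the lower setting suits chat data.
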
### `validate`
Schema validation, line by line. Exit code `0` only if all rows pass.

```bash
cleanllm validate data.jsonl --schema basic_sft
cleanllm validate data.jsonl --schema cp_sft_v1
```

| Schema | Required fields |
|---|---|
| `basic_sft` | `id`, `messages` (list of `role`/`content` dicts) |
| `cp_sft_v1` | `id`, `source`, `problem_id`, `messages`, `tests` (non-empty, with `input`/`output`) |

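As a rough illustration of what the `basic_sft` row in the table implies, the sketch below checks a single record. `validate_basic_sft` is a hypothetical name and the error strings are invented for the example; the real validator may check more than this.

```python
def validate_basic_sft(record):
    """Return a list of problems; an empty list means the record passes."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    if "id" not in record:
        problems.append("missing key: id")
    messages = record.get("messages")
    if not isinstance(messages, list):
        problems.append("messages must be a list")
        return problems
    for i, msg in enumerate(messages):
        # Each message must be a dict carrying both `role` and `content`.
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            problems.append(f"messages[{i}] must be a dict with role and content")
    return problems
```
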
### `dedup`
First-occurrence deduplication — by full record, prompt (system+user), or code (assistant).

```bash
cleanllm dedup data.jsonl -o deduped.jsonl --by record
cleanllm dedup data.jsonl -o deduped.jsonl --by prompt --normalized
cleanllm dedup data.jsonl -o deduped.jsonl --by code --report-json dedup_report.json
```
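The idea behind first-occurrence deduplication can be sketched in a few lines. This is not cleanllm's implementation: `dedup_first_occurrence` is a hypothetical helper, and the choice of SHA-256 over a JSON-serialized key is an assumption made for the example.

```python
import hashlib
import json

def dedup_first_occurrence(records, by="prompt"):
    """Yield each record whose key has not been seen before."""
    roles = {"prompt": ("system", "user"), "code": ("assistant",)}
    seen = set()
    for record in records:
        if by in roles:
            # Key on the (role, content) pairs of the selected roles only.
            key_obj = [
                [m.get("role"), m.get("content", "")]
                for m in record.get("messages", [])
                if m.get("role") in roles[by]
            ]
        else:
            key_obj = record  # by="record": key on the whole object
        digest = hashlib.sha256(
            json.dumps(key_obj, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            continue  # later duplicates are dropped
        seen.add(digest)
        yield record
```

Only the digests are kept in memory, so the cost grows with the number of unique keys rather than with the size of the file.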

### `stats`
Single-pass profiler: distributions, structural stats, schema counts, response lengths, language distribution.

```bash
cleanllm stats data.jsonl
cleanllm stats data.jsonl --schema cp_sft_v1 --keys source,difficulty_bucket --top-k 20
cleanllm stats data.jsonl --report-json stats.json
```

### `compare`
Diff two stats reports to catch regressions between dataset versions.

```bash
cleanllm compare old_stats.json new_stats.json
cleanllm compare old_stats.json new_stats.json --report-json compare.json
cleanllm compare old.jsonl new.jsonl --from-jsonl --schema cp_sft_v1
```

### `gate`
CI-friendly quality gating. Nonzero exit on failures.

```bash
cleanllm gate --stats stats.json --rules gate_rules.json
cleanllm gate --compare compare.json --rules gate_rules.json --strict
cleanllm gate --compare compare.json --inline-rule "counts_diff.invalid_json_rows.delta<=0"
```

Gate rules JSON:

```json
{
  "version": 1,
  "mode": "compare",
  "rules": [
    {"name": "no_new_invalid", "metric": "counts_diff.invalid_json_rows.delta", "op": "<=", "value": 0},
    {"name": "enough_valid",   "metric": "counts_diff.valid_json_rows.new",    "op": ">=", "value": 1000}
  ]
}
```

Supported ops: `==`, `!=`, `<`, `<=`, `>`, `>=`. Severities: `error` (default), `warn`.

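To show how such rules might be evaluated, here is a minimal sketch. It assumes that a `metric` string is a dotted path into the report JSON and that the report is nested accordingly; both are assumptions, and `lookup` and `evaluate_rules` are hypothetical helpers rather than part of cleanllm's API.

```python
import operator

OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}

def lookup(report, dotted_path):
    """Walk a nested dict along a dotted path such as 'a.b.c'."""
    value = report
    for part in dotted_path.split("."):
        value = value[part]
    return value

def evaluate_rules(report, rules):
    """Return the names of the rules that fail against the report."""
    failures = []
    for rule in rules:
        actual = lookup(report, rule["metric"])
        if not OPS[rule["op"]](actual, rule["value"]):
            failures.append(rule["name"])
    return failures

# Assumed report layout, for illustration only.
report = {"counts_diff": {"invalid_json_rows": {"delta": 2},
                          "valid_json_rows": {"new": 1500}}}
rules = [
    {"name": "no_new_invalid", "metric": "counts_diff.invalid_json_rows.delta", "op": "<=", "value": 0},
    {"name": "enough_valid", "metric": "counts_diff.valid_json_rows.new", "op": ">=", "value": 1000},
]
failed = evaluate_rules(report, rules)
```
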
### `run`
Execute a JSON-defined multi-step pipeline with variable substitution.

```bash
cleanllm run --config pipeline.json
cleanllm run --config pipeline.json --set input_path=data.jsonl --set outdir=out/v2
cleanllm run --config pipeline.json --dry-run
```

Supported step types: `fix`, `validate`, `dedup`, `stats`, `audit`, `sample`, `shard`, `manifest`, `scan`, `compare`.

### `sample`
Reservoir sampling — random or stratified, deterministic with `--seed`.

```bash
cleanllm sample data.jsonl -o sample.jsonl -n 500 --seed 42
cleanllm sample data.jsonl -o sample.jsonl -n 500 --stratify source,difficulty_bucket
```
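For readers unfamiliar with reservoir sampling, the classic single-pass version (often called Algorithm R) looks roughly like this. It is a generic sketch with a hypothetical `reservoir_sample` function, not the code behind `cleanllm sample`, and it does not cover the stratified mode.

```python
import random

def reservoir_sample(items, n, seed=None):
    """Pick up to n items uniformly at random from an iterable in one pass."""
    rng = random.Random(seed)  # a private RNG makes the result reproducible
    reservoir = []
    for i, item in enumerate(items):
        if i < n:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < n:
                reservoir[j] = item  # replace with probability n / (i + 1)
    return reservoir

first = reservoir_sample(range(1000), 10, seed=42)
second = reservoir_sample(range(1000), 10, seed=42)
```

Because only the reservoir is held in memory, the input can be arbitrarily large, and the same seed over the same input yields the same sample.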

### `audit`
Build a reproducible audit bundle in one command: sampled JSONL + CSV review index (with original line numbers) + summary + manifest.

```bash
cleanllm audit data.jsonl --outdir audit_bundle -n 200 --seed 42
cleanllm audit data.jsonl --outdir audit_bundle -n 200 --stratify source --schema cp_sft_v1
```

Bundle contents: `audit_sample.jsonl`, `audit_index.csv`, `audit_summary.json`, `AUDIT_README.md`, `manifest.json`.

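### `shard` / `manifest`
Split a JSONL file into fixed-size shards (optionally gzipped), then write a manifest for the shard directory.
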
```bash
cleanllm shard data.jsonl --outdir shards --size 5000 --gzip
cleanllm manifest shards -o manifest.json
```
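Fixed-size sharding with optional gzip can be sketched as follows. The `shard_lines` function and the `shard_00000.jsonl` naming scheme are hypothetical; the file names cleanllm actually writes are not documented in this README.

```python
import gzip
import os

def shard_lines(lines, outdir, shard_size, use_gzip=False):
    """Write lines into consecutive shard files and return their paths."""
    os.makedirs(outdir, exist_ok=True)
    suffix = ".jsonl.gz" if use_gzip else ".jsonl"
    opener = gzip.open if use_gzip else open
    paths = []
    out = None
    for i, line in enumerate(lines):
        if i % shard_size == 0:
            # Start a new shard every `shard_size` lines.
            if out is not None:
                out.close()
            path = os.path.join(outdir, f"shard_{len(paths):05d}{suffix}")
            out = opener(path, "wt", encoding="utf-8")
            paths.append(path)
        out.write(line.rstrip("\n") + "\n")
    if out is not None:
        out.close()
    return paths
```

Only one output file is open at a time, and the input is consumed lazily, so the sketch stays within the streaming model described earlier.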

### `convert`
Convert a JSONL file between `sharegpt`, `alpaca`, and `chatml` formats.

```bash
cleanllm convert data.jsonl -o converted.jsonl --from sharegpt --to chatml
cleanllm convert data.jsonl -o converted.jsonl --from alpaca --to sharegpt
```

Supported formats: `sharegpt` (conversations list), `alpaca` (instruction/output), `chatml` (messages list).

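A single-record conversion from `sharegpt` to `chatml` might look like the sketch below. The `from`/`value` keys and the `human`/`gpt` role names follow the common ShareGPT convention and are assumptions here, as is the hypothetical `sharegpt_to_chatml` helper; cleanllm's exact mapping may differ.

```python
# Assumed role mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {
    "system": "system",
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
}

def sharegpt_to_chatml(record):
    """Convert one ShareGPT-style record to a messages-list record."""
    messages = []
    for turn in record.get("conversations", []):
        speaker = turn.get("from")
        messages.append({
            "role": ROLE_MAP.get(speaker, speaker),  # unknown speakers pass through
            "content": turn.get("value", ""),
        })
    # Keep every other top-level field (for example `id`) unchanged.
    converted = {k: v for k, v in record.items() if k != "conversations"}
    converted["messages"] = messages
    return converted
```
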
### `merge`
Merge multiple JSONL files into one, with optional deduplication.

```bash
cleanllm merge a.jsonl b.jsonl c.jsonl -o merged.jsonl
cleanllm merge a.jsonl b.jsonl -o merged.jsonl --dedup
```

### `split`
Split a JSONL file into train and val sets.

```bash
cleanllm split data.jsonl --outdir splits/
cleanllm split data.jsonl --outdir splits/ --ratio 0.95 --seed 42 --no-shuffle
```

Outputs `<basename>_train.jsonl` and `<basename>_val.jsonl` in the output directory. Default ratio is 0.9 (90% train).

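The effect of `--ratio`, `--seed`, and `--no-shuffle` can be illustrated with a small in-memory sketch. `split_rows` is a hypothetical helper that materializes all rows for simplicity, which the real command may well avoid; rounding the cut point is also an assumption.

```python
import random

def split_rows(rows, ratio=0.9, seed=None, shuffle=True):
    """Return (train, val) lists; `ratio` is the train fraction."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded for reproducibility
    cut = round(len(rows) * ratio)
    return rows[:cut], rows[cut:]

train, val = split_rows(range(10), ratio=0.9, shuffle=False)
```
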
### `recipes`
Bootstrap pipelines and gate rules from built-in templates.

```bash
cleanllm recipes list
cleanllm recipes show cp_pipeline_usaco_portable
cleanllm recipes write cp_bundle --outdir bootstrap/
```

Built-in recipes: `cp_pipeline_basic`, `cp_pipeline_usaco_portable`, `cp_pipeline_fast_audit`, `gate_stats_basic`, `gate_compare_basic`, `gate_compare_strict`, `cp_bundle`.

---

## Python API

```python
from cleanllm import (
    scan_jsonl, fix_jsonl, FixRules,
    dedup_jsonl, validate_jsonl, stats_jsonl,
    sample_jsonl, audit_bundle,
    shard_jsonl, make_manifest,
    download_from_hub, detect_hf_schema,
)
from cleanllm.convert import convert_jsonl
from cleanllm.merge import merge_jsonl
from cleanllm.split import split_jsonl

# Scan
report = scan_jsonl("data.jsonl")

# Fix (code dataset)
rules = FixRules(
    drop_on={"forbidden_pattern", "empty_assistant"},
    max_tokens=4096,
    keep_language="python",
)
summary = fix_jsonl("data.jsonl", "cleaned.jsonl", rules)

# Fix (text/chat dataset — only drop truly blank responses)
rules = FixRules(drop_on={"empty_assistant"}, min_assistant_chars=1, forbidden_patterns=[])

# Dedup
result = dedup_jsonl("cleaned.jsonl", "deduped.jsonl", by="prompt", normalized=True)

# Stats
stats = stats_jsonl("deduped.jsonl", schema="cp_sft_v1", keys=["source", "difficulty_bucket"])

# Sample + audit
sample_jsonl("deduped.jsonl", "sample.jsonl", num_rows=200, seed=42)
audit_bundle("deduped.jsonl", "audit_bundle", num_rows=200, seed=42, stratify=["source"])

# Shard + manifest
shard_jsonl("deduped.jsonl", "shards", shard_size=5000, gzip_output=True)
make_manifest("shards", "manifest.json")

# Convert between formats
convert_jsonl("data.jsonl", "out.jsonl", from_fmt="sharegpt", to_fmt="chatml")

# Merge + split
merge_jsonl(["a.jsonl", "b.jsonl"], "merged.jsonl", dedup=True)
split_jsonl("merged.jsonl", "splits/", ratio=0.9, seed=42)

# Download from HuggingFace Hub (requires pip install cleanllm[hf])
result = download_from_hub("HuggingFaceH4/ultrachat_200k", "data.jsonl", split="train_sft")
```

---

## Presets

| Preset | Description |
|---|---|
| `general` | URL removal + whitespace normalization, no domain-specific forbidden patterns |
| `security_scan` | Redacts secrets: AWS keys, GitHub tokens, API keys, private keys |
| `pii_scan` | Redacts PII: emails, US phone numbers, SSNs, credit cards, IPv4 addresses |
| `cpp17_clean` | URL removal + whitespace normalization + redact C++ portability issues |
| `usaco_portable` | Strict CP portability — drops rows with forbidden patterns |
| `deterministic_only` | Drops rows with non-deterministic APIs (`rand()`, `random_device`, etc.) |

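As a rough idea of what a redacting preset such as `pii_scan` does, the sketch below replaces email addresses in message content. The regular expression, the `[REDACTED]` placeholder, and the `redact_emails` and `redact_record` helpers are all assumptions for illustration; the patterns shipped with cleanllm are not listed in this README.

```python
import re

# A deliberately simple email pattern; real PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(text, placeholder="[REDACTED]"):
    """Replace every email-like substring with the placeholder."""
    return EMAIL_RE.sub(placeholder, text)

def redact_record(record):
    """Return a copy of the record with emails redacted in every message."""
    cleaned = dict(record)
    cleaned["messages"] = [
        {**msg, "content": redact_emails(str(msg.get("content", "")))}
        for msg in record.get("messages", [])
    ]
    return cleaned
```
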
---

## Defaults

- **Required keys:** `id`, `messages`
- **Forbidden patterns (default):** none — use `--preset cpp17_clean` or `--preset usaco_portable` for CP datasets
- **`empty_assistant` threshold:** 20 characters (responses shorter than this are flagged as empty)

> **CP datasets:** To apply competitive-programming forbidden patterns (`freopen`, `ifstream`, `bits/extc++.h`, etc.) use a preset: `cleanllm fix data.jsonl -o out.jsonl --preset usaco_portable`. In Python, pass `forbidden_patterns=list(DEFAULT_FORBIDDEN_PATTERNS)` explicitly.

---

## Data format

cleanllm expects JSONL where each line is a JSON object. By default, each record must contain `id` and `messages`, which matches the `basic_sft` schema:

```json
{
  "id": "unique-id",
  "messages": [
    {"role": "system",    "content": "..."},
    {"role": "user",      "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

Optional fields: `source`, `difficulty_bucket`, `problem_id`, `tests`. The `cp_sft_v1` schema makes `source`, `problem_id`, and `tests` required.

---

## Development

```bash
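pip install -e ".[dev]"   # quoted so shells such as zsh do not expand the brackets
pip install pytest        # the dev extra only lists build and twine
pytest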
python -m build
twine check dist/*
```

See `RELEASE_CHECKLIST.md` for the full release workflow.

---

## License

MIT
