Metadata-Version: 2.4
Name: cureta
Version: 0.1.0
Summary: Cureta: Unified Data Curation Framework by Mira — a pipeline of Taggers that enrich a standard Document object.
Author: Mira Team
License-Expression: MIT
Requires-Python: >=3.10
Requires-Dist: flashtext>=2.7
Requires-Dist: indic-nlp-library>=0.92
Requires-Dist: nltk>=3.9.3
Requires-Dist: pydantic<3.0,>=2.0
Requires-Dist: pyyaml<7.0,>=6.0
Provides-Extra: all
Requires-Dist: fasttext-wheel>=0.9; extra == 'all'
Requires-Dist: huggingface-hub>=0.20; extra == 'all'
Requires-Dist: langdetect>=1.0.9; extra == 'all'
Requires-Dist: numpy<2.0.0; extra == 'all'
Requires-Dist: pandas>=2.0; extra == 'all'
Requires-Dist: pyarrow>=14.0; extra == 'all'
Requires-Dist: ray[data]>=2.10; extra == 'all'
Requires-Dist: skypilot-nightly[kubernetes]; extra == 'all'
Requires-Dist: tokenizers>=0.19; extra == 'all'
Requires-Dist: torch>=2.0; extra == 'all'
Requires-Dist: transformers>=4.40; extra == 'all'
Provides-Extra: datatrove
Requires-Dist: datatrove>=0.3; extra == 'datatrove'
Provides-Extra: dedup
Requires-Dist: datasketch>=1.6; extra == 'dedup'
Provides-Extra: dev
Requires-Dist: pyright>=1.1; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
Provides-Extra: edu
Requires-Dist: torch>=2.0; extra == 'edu'
Requires-Dist: transformers>=4.40; extra == 'edu'
Provides-Extra: embedding
Requires-Dist: sentence-transformers>=3.0; extra == 'embedding'
Requires-Dist: torch>=2.0; extra == 'embedding'
Provides-Extra: guesslang
Requires-Dist: guesslang>=2.2; extra == 'guesslang'
Provides-Extra: langid
Requires-Dist: langdetect>=1.0.9; extra == 'langid'
Provides-Extra: llm
Requires-Dist: fasttext-wheel>=0.9; extra == 'llm'
Requires-Dist: huggingface-hub>=0.20; extra == 'llm'
Requires-Dist: numpy<2.0.0; extra == 'llm'
Requires-Dist: tokenizers>=0.19; extra == 'llm'
Requires-Dist: torch>=2.0; extra == 'llm'
Requires-Dist: transformers>=4.40; extra == 'llm'
Provides-Extra: nemo
Requires-Dist: nemo-curator>=1.1.0; extra == 'nemo'
Provides-Extra: pii
Requires-Dist: presidio-analyzer>=2.2; extra == 'pii'
Requires-Dist: spacy>=3.7; extra == 'pii'
Provides-Extra: quality
Requires-Dist: fasttext-wheel>=0.9; extra == 'quality'
Requires-Dist: numpy<2.0.0; extra == 'quality'
Provides-Extra: ray
Requires-Dist: pandas>=2.0; extra == 'ray'
Requires-Dist: pyarrow>=14.0; extra == 'ray'
Requires-Dist: ray[data]>=2.10; extra == 'ray'
Provides-Extra: readability
Requires-Dist: textstat>=0.7; extra == 'readability'
Provides-Extra: sky
Requires-Dist: skypilot-nightly[kubernetes]; extra == 'sky'
Provides-Extra: toxicity
Requires-Dist: detoxify>=0.5.0; extra == 'toxicity'
Requires-Dist: torch>=2.0; extra == 'toxicity'
Description-Content-Type: text/markdown

# cureta

**Unified Data Curation Framework** — enrich text datasets at scale with a composable pipeline of feature-extraction *Taggers*.

---

## Install

```bash
pip install -e ".[ray,llm,dev]"   # Ray + LLM taggers + dev tools
pip install -e ".[ray,dev]"        # heuristic taggers only (no GPU required)
```

**Requires:** Python ≥ 3.10

---

## Quick example

```python
from cureta import quick_tag

tags = quick_tag(
    "यह एक नमूना दस्तावेज़ है।",
    pipeline_config="pipelines/example/pipeline_config.yaml",
)
# {'cureta_id': {'cureta_id': 'a3f8d2c...'}, 'num_words': {'num_words': 5}}
```

Or from the command line:

```bash
# Create a small sample dataset
python -c "
import json, pathlib
rows = [{'text': f'This is sample document number {i}.'} for i in range(200)]
pathlib.Path('sample.jsonl').write_text('\n'.join(json.dumps(r) for r in rows))
print('Created sample.jsonl with 200 rows')
"

# Run the example pipeline
cureta run --pipeline example --dataset ./sample.jsonl --limit 100
```

For Hugging Face datasets, see [docs/how_to/load_huggingface_data.md](docs/how_to/load_huggingface_data.md).

---

## Documentation

| | |
|---|---|
| **[Tutorial: Your first pipeline](docs/tutorials/first_pipeline.md)** | Get a working result in 5 minutes |
| **[Tutorial: Writing a custom tagger](docs/tutorials/writing_a_custom_tagger.md)** | Build and run your own tagger |
| **[Tutorial: Deploying on cloud](docs/tutorials/deploying_on_cloud.md)** | Run on a GPU cluster with SkyPilot |
| **[Reference: CLI](docs/reference/cli.md)** | `cureta` command reference |
| **[Reference: Python API](docs/reference/python_api.md)** | `run_pipeline`, `quick_tag`, `Pipeline`, ... |
| **[Reference: Taggers](docs/reference/taggers.md)** | All 45 built-in taggers with output schemas |
| **[Concepts: Execution model](docs/concepts/execution_model.md)** | SIMD vs SDMI, two-phase model, Ray Data |
| **[Writing taggers](docs/writing_taggers.md)** | Tier-1 and Tier-2 tagger authoring guide |

---

## Examples

| | |
|---|---|
| **[01_quickstart](examples/01_quickstart/)** | Local data, four heuristic taggers, Parquet output |
| **[02_custom_tagger](examples/02_custom_tagger/)** | Write and run a custom tagger |
| **[03_huggingface](examples/03_huggingface/)** | Load from Hugging Face, use `text_column` |
| **[04_programmatic_api](examples/04_programmatic_api/)** | All four Python API surfaces |
| **[05_streaming](examples/05_streaming/)** | `stream_tagged_batches` → in-memory / Kafka |

---

## Key concepts

- **Document** — canonical data envelope with `raw_content`, `document_id`, `metadata`, and `tags`
- **Tagger** — atomic feature-extraction unit; subclass `CPUTagger` or `GPUTagger`, implement `process_document` (Tier 1) or `run` (Tier 2)
- **Pipeline** — orchestrator that loads config, resolves dependencies, and dispatches taggers
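
The relationship between these three concepts can be sketched in plain Python. This is a self-contained illustration and deliberately does not import `cureta`: the `Document` dataclass below is a stand-in with the four fields listed above, and `WordCountTagger`, `CharCountTagger`, and `run_taggers` are hypothetical names invented for the example. The real base classes (`CPUTagger`, `GPUTagger`) and their exact signatures are described in the tagger authoring guide and may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in for the data envelope described above (not cureta's own class)."""

    raw_content: str
    document_id: str
    metadata: dict[str, Any] = field(default_factory=dict)
    tags: dict[str, dict[str, Any]] = field(default_factory=dict)


class WordCountTagger:
    """Hypothetical per-document tagger: counts whitespace-separated words."""

    name = "num_words"

    def process_document(self, doc: Document) -> dict[str, Any]:
        return {"num_words": len(doc.raw_content.split())}


class CharCountTagger:
    """Hypothetical per-document tagger: counts characters."""

    name = "num_chars"

    def process_document(self, doc: Document) -> dict[str, Any]:
        return {"num_chars": len(doc.raw_content)}


def run_taggers(doc: Document, taggers: list) -> Document:
    """Toy orchestrator: run each tagger in order, store its output under its name."""
    for tagger in taggers:
        doc.tags[tagger.name] = tagger.process_document(doc)
    return doc


doc = run_taggers(
    Document(raw_content="This is a sample document.", document_id="doc-0"),
    [WordCountTagger(), CharCountTagger()],
)
print(doc.tags)
```

Each tagger writes to its own key in `tags`, which matches the nested shape shown in the quick example. A real pipeline additionally loads its configuration and resolves dependencies between taggers, which this sketch omits.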

45 built-in taggers ship with the library across three categories:

**Heuristic taggers (CPU, no model weights)** — word count, line/paragraph/word-length statistics, Gopher/C4/FineWeb quality filters, compression ratio, vocabulary diversity, MinHash/SimHash/exact dedup signals, PII regex, HTML density, Markdown structure, content type, URL parsing, domain blocklist, date extraction, and readability metrics.

**Lightweight ML taggers** — GlotLID language ID (1665 languages), paragraph-level language ID, FastText quality scoring, programming language detection, and Microsoft Presidio PII detection.

**GPU taggers** — FineWeb educational quality (0–5 score), domain classification (26 domains), multi-label toxicity, dense sentence embeddings, LLM-as-judge scoring, and perplexity.

See [Reference: Taggers](docs/reference/taggers.md) for the full list with output schemas.
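
To give a feel for what a heuristic signal looks like, here is a rough standard-library sketch of two of the signals named above: compression ratio and vocabulary diversity. The function names, the choice of `zlib`, the direction of the ratio (compressed size divided by raw size), and the whitespace tokenization are all assumptions made for illustration. The built-in taggers may define and compute these differently.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size / raw size in UTF-8 bytes (assumed convention).

    Highly repetitive text compresses well and therefore scores low.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def type_token_ratio(text: str) -> float:
    """Unique words / total words, using lowercase whitespace tokenization."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


repetitive = "buy now " * 200
varied = "The quick brown fox jumps over the lazy dog near a quiet river bank."

for label, text in [("repetitive", repetitive), ("varied", varied)]:
    print(label, round(compression_ratio(text), 3), round(type_token_ratio(text), 3))
```

Both signals need no model weights and run in a single pass over the text, which is why signals of this kind are cheap enough to apply to every document before any GPU tagger runs.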

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) and [docs/contributing/setup.md](docs/contributing/setup.md).

---

## License

MIT
