Metadata-Version: 2.4
Name: omg-data
Version: 1.0.1
Summary: Automatic HuggingFace dataset download, cleaning and tokenization pipeline for OMGFormer
Author: OMG-Data Contributors
License: Apache-2.0
Project-URL: Homepage, https://github.com/fastloraoffical/OMGformers
Project-URL: Repository, https://github.com/fastloraoffical/OMGformers
Project-URL: Bug Tracker, https://github.com/fastloraoffical/OMGformers/issues
Keywords: dataset,nlp,huggingface,tokenization,data-pipeline,omgformer,deep-learning,preprocessing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: transformers>=4.35.0
Requires-Dist: huggingface-hub>=0.19.0
Provides-Extra: langdetect
Requires-Dist: langdetect>=1.0.9; extra == "langdetect"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: langdetect>=1.0.9; extra == "dev"

# omg-data

**Automatic HuggingFace dataset pipeline for OMGFormer — download, clean, tokenize.**

One call handles everything: finding the right datasets for your language & task, downloading them, cleaning the text, and producing a ready-to-train `OMGDataset`.

---

## Installation

```bash
pip install omg-data
# with language detection support (quoted so shells like zsh don't expand the brackets)
pip install "omg-data[langdetect]"
```

---

## Quick Start

```python
from omg_data import DataPipeline

# Turkish, 5 GB, GPT-2 tokenizer
pipe = DataPipeline(
    language="tr",
    size_gb=5,
    tokenizer="gpt2",
)
dataset = pipe.build()

trainer.fit(dataset)  # directly compatible with the OMGFormer Trainer
```

---

## Examples

### Task-specific pipeline

```python
pipe = DataPipeline(
    language="en",
    task="chat",               # "text" | "chat" | "instruction" | "qa" | "code" | ...
    size_gb=10,
    tokenizer="meta-llama/Llama-2-7b-hf",
    seq_len=2048,
)
dataset = pipe.build()
```

### Custom datasets

```python
pipe = DataPipeline(
    language="en",
    tokenizer="gpt2",
    custom_datasets=["wikitext", "openwebtext"],
)
dataset = pipe.build()
```

### Cleaning disabled

```python
pipe = DataPipeline(language="en", tokenizer="gpt2", clean=False)
dataset = pipe.build()
```

### Fine-grained cleaning control

```python
pipe = DataPipeline(
    language="tr",
    tokenizer="gpt2",
    clean=True,
    clean_options={
        "dedup": True,
        "min_chars": 50,
        "remove_urls": True,
        "remove_html": True,
        "lang_filter": True,
    },
)
```
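
To make the effect of these options concrete, here is a minimal, self-contained sketch of what filters like `dedup`, `min_chars`, `remove_urls`, and `remove_html` typically do. The `clean_texts` helper is hypothetical and not part of the package; the real pipeline's regexes and ordering may differ.

```python
import re

def clean_texts(texts, min_chars=50, dedup=True, remove_urls=True, remove_html=True):
    """Illustrative cleaning pass in the spirit of clean_options (not the package's code)."""
    seen = set()
    out = []
    for text in texts:
        if remove_html:
            text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
        if remove_urls:
            text = re.sub(r"https?://\S+", "", text)  # drop URLs
        text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
        if len(text) < min_chars:
            continue                                  # filter out short fragments
        if dedup:
            if text in seen:
                continue                              # skip exact duplicates
            seen.add(text)
        out.append(text)
    return out

docs = [
    "<p>Visit https://example.com for more info about this fairly long paragraph of text.</p>",
    "short",
    "<p>Visit https://example.com for more info about this fairly long paragraph of text.</p>",
]
print(clean_texts(docs))  # one cleaned string: the short and duplicate entries are dropped
```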

### Raw text (no tokenizer)

```python
pipe = DataPipeline(language="de", size_gb=2)
hf_dataset = pipe.build()   # returns HuggingFace Dataset
```

---

## Supported Languages

`tr` `en` `de` `fr` `es` `ar` `ru` `ja` `zh` `ko` `pt` `it` `nl` `pl` `sv`

## Supported Tasks

`text` · `lm` · `chat` · `conversation` · `instruction` · `instruct` · `qa` · `summarization` · `classification` · `code`

---

## Pipeline Steps

1. **Search** — Finds suitable HuggingFace datasets for your language & task  
2. **Download** — Streams & caches datasets via HuggingFace `datasets`  
3. **Clean** — Removes HTML, URLs, duplicates, fixes Unicode, filters by length  
4. **Tokenize** — Chunks text into fixed-length token windows  
5. **Return** — `OMGDataset` (PyTorch `Dataset`) ready for `trainer.fit()`  
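
The tokenization step (4) amounts to flattening the cleaned corpus into one token-id stream and slicing it into fixed windows of `seq_len`. A minimal sketch, assuming the trailing remainder shorter than `seq_len` is dropped; `chunk_tokens` is a hypothetical helper, not the package's API:

```python
def chunk_tokens(token_ids, seq_len):
    """Slice a flat token-id stream into non-overlapping fixed-length windows,
    dropping any trailing remainder shorter than seq_len."""
    return [token_ids[i:i + seq_len]
            for i in range(0, len(token_ids) - seq_len + 1, seq_len)]

stream = list(range(10))
print(chunk_tokens(stream, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```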

---

## OMGDataset API

```python
import torch
from omg_data import OMGDataset

# `dataset` is the OMGDataset returned by pipe.build()

# Compatible with any PyTorch DataLoader
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Info
print(dataset.info())
# {'num_sequences': 125000, 'seq_len': 512, 'total_tokens': 64000000, 'approx_size_gb': 0.128}
```
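
The `info()` fields above are internally consistent: `num_sequences * seq_len` equals `total_tokens`, and the size figure matches 2 bytes per token id (e.g. `uint16` storage — an assumption; the package does not document its on-disk dtype):

```python
info = {'num_sequences': 125000, 'seq_len': 512,
        'total_tokens': 64_000_000, 'approx_size_gb': 0.128}

assert info['num_sequences'] * info['seq_len'] == info['total_tokens']
# 0.128 GB is what 64M tokens occupy at 2 bytes per token:
assert info['total_tokens'] * 2 / 1e9 == info['approx_size_gb']
```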

---

## License

Apache-2.0
