Metadata-Version: 2.4
Name: suur-data
Version: 1.0.4
Summary: Intelligent data ingestion and tokenization pipeline
Home-page: https://github.com/yourname/suur-data
Author: Your Name
Author-email: your@email.com
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: scikit-learn
Requires-Dist: numpy
Requires-Dist: click
Requires-Dist: chardet
Requires-Dist: tqdm
Provides-Extra: pdf
Requires-Dist: pdfminer.six; extra == "pdf"
Provides-Extra: docx
Requires-Dist: python-docx; extra == "docx"
Provides-Extra: epub
Requires-Dist: ebooklib; extra == "epub"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: tokenizers; extra == "hf"
Provides-Extra: all
Requires-Dist: pdfminer.six; extra == "all"
Requires-Dist: python-docx; extra == "all"
Requires-Dist: ebooklib; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: tokenizers; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Suur Data

**Intelligent data ingestion, filtering, and tokenization pipeline.**

## Installation

```bash
pip install suur-data
```

## See It In Action

```python
from suur_data import suur_data

tokens = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(tokens)
```

That one call downloads the Wikipedia page, keeps only the paragraphs relevant to the topic, and returns a list of token IDs (produced by the GPT-2 tokenizer unless you choose another model).

---

## Full Documentation

### All Installation Options

```bash
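pip install suur-data            # core install, no optional extras
pip install "suur-data[pdf]"     # adds pdfminer.six for .pdf
pip install "suur-data[docx]"    # adds python-docx for .docx
pip install "suur-data[epub]"    # adds ebooklib for .epub
pip install "suur-data[hf]"      # adds transformers and tokenizers
pip install "suur-data[all]"     # all of the extras above
# The quotes keep shells such as zsh from expanding the square brackets.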
```

### Supported Input Formats

| Format | Notes |
|--------|-------|
| .txt .md .rst | Plain text, auto encoding detection |
| .pdf | Requires suur-data[pdf] |
| .docx | Requires suur-data[docx] |
| .csv .tsv | All cells joined as text |
| .json | Recursively flattened key-value pairs |
| .html .htm | Scripts and styles stripped automatically |
| .epub | Requires suur-data[epub] |
| HTTP/HTTPS URL | Auto-downloaded, parsed by extension |
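
The CSV and JSON notes above are terse, so here is a minimal standard-library sketch of what "all cells joined as text" and "recursively flattened key-value pairs" could look like. The function names `csv_to_text` and `flatten_json` and the output layout (space-separated cells, dotted key paths) are assumptions made for illustration; the package's own output may be shaped differently.

```python
import csv
import io
from typing import Any, Iterator


def csv_to_text(raw: str, delimiter: str = ",") -> str:
    """Join the non-empty cells of each row with spaces, one row per line."""
    rows = csv.reader(io.StringIO(raw), delimiter=delimiter)
    return "\n".join(" ".join(cell for cell in row if cell) for row in rows)


def flatten_json(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield 'path: value' strings for every leaf in a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_json(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{prefix}[{index}]")
    else:
        # Leaf: a string, number, boolean, or None.
        yield f"{prefix}: {value}"
```

Passing `delimiter="\t"` to `csv_to_text` would cover the `.tsv` case in the same way.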

### Python API

```python
from suur_data import suur_data

# From a URL
tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

# From a local file
tokens = suur_data("data.txt", topic="machine learning")

# Multiple sources at once
tokens = suur_data(
    ["file1.txt", "file2.pdf", "https://en.wikipedia.org/wiki/Deep_learning"],
    topic="neural networks"
)

# Custom BPE tokenizer trained on your data
tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)

# Strict filter
tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

# Save tokenizer to disk
tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")

# Skip filter entirely
tokens = suur_data("data.txt", no_filter=True)
```

### All Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| data_location | str or List[str] | required | URL, file path, or list of multiple sources |
| topic | str | "" | Subject for relevance filtering. Empty skips filter |
| tokenizer | str | "pretrained" | "pretrained" or "custom" |
| model | str | "gpt2" | HuggingFace model name or Hub ID |
| vocab_size | int | 8000 | BPE vocab size for custom tokenizer |
| threshold | float | 0.05 | Relevance cutoff between 0.0 and 1.0 |
| save_dir | str | None | Directory to save tokenizer files |
| no_filter | bool | False | Skip the relevance filter |
| verbose | bool | True | Show progress output |

### Pretrained Model Shortcuts

| Shortcut | Model |
|----------|-------|
| gpt2 | GPT-2 (OpenAI) |
| bert | BERT base uncased |
| roberta | RoBERTa base |
| distilbert | DistilBERT base uncased |
| t5 | T5 small |

You can also pass any HuggingFace Hub model ID directly:

```python
tokens = suur_data("data.txt", model="facebook/opt-125m")
```
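
One way to picture the shortcut behaviour is a small lookup table with a pass-through for everything else. This is only a sketch: `MODEL_SHORTCUTS` and `resolve_model` are hypothetical names, and the Hub IDs on the right are the commonly used identifiers for the models in the table, not values confirmed by the package source.

```python
# Hypothetical mapping from shortcut to Hugging Face Hub ID (assumed values).
MODEL_SHORTCUTS = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "distilbert": "distilbert-base-uncased",
    "t5": "t5-small",
}


def resolve_model(name: str) -> str:
    """Return the Hub ID for a shortcut, or the name unchanged if it is not one."""
    return MODEL_SHORTCUTS.get(name, name)
```

With this shape, `resolve_model("bert")` gives `"bert-base-uncased"`, while a full ID such as `"facebook/opt-125m"` is returned as-is.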

### How the Filter Works

The filter splits the text into paragraph chunks, converts each chunk and the topic string into TF-IDF vectors, and scores every chunk by its cosine similarity to the topic. Chunks that score below `threshold` are dropped. If the threshold is strict enough that every chunk would be dropped, the filter relaxes automatically and keeps the top 30 percent of chunks instead.

```python
tokens = suur_data("data.txt", topic="AI", threshold=0.10)  # strict
tokens = suur_data("data.txt", topic="AI", threshold=0.02)  # loose
```
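
To make the scoring and the fallback concrete, here is a rough standard-library sketch of the same idea. It is not the package's implementation: the function name `filter_chunks`, the blank-line chunking rule, the smoothed IDF formula, and rounding the 30 percent fallback up to at least one chunk are all assumptions, and the real filter (the package lists scikit-learn as a dependency) may weight terms differently.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def _tfidf(texts: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per text (smoothed IDF, assumed formula)."""
    counts = [Counter(_tokens(t)) for t in texts]
    n = len(texts)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def _cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_chunks(text: str, topic: str, threshold: float = 0.05,
                  fallback_fraction: float = 0.30) -> List[str]:
    # Assumption: paragraphs are separated by one or more blank lines.
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    if not chunks:
        return []
    # Vectorize the chunks and the topic together so they share one vocabulary.
    *chunk_vecs, topic_vec = _tfidf(chunks + [topic])
    scores = [_cosine(vec, topic_vec) for vec in chunk_vecs]
    kept = [i for i, score in enumerate(scores) if score >= threshold]
    if not kept:
        # Everything was dropped: keep the best-scoring fraction instead.
        n_keep = max(1, math.ceil(fallback_fraction * len(chunks)))
        ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
        kept = sorted(ranked[:n_keep])  # restore document order
    return [chunks[i] for i in kept]
```

A chunk that shares no words with the topic scores exactly 0, so it is removed by any positive threshold; raising the threshold mainly trims long chunks that mention the topic only in passing.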

### Saving and Loading Tokens

```python
import json

tokens = suur_data("data.txt", topic="neural networks")
with open("tokens.json", "w") as f:
    json.dump(tokens, f)

with open("tokens.json", "r") as f:
    tokens = json.load(f)
```

### Decoding Tokens Back to Text

```python
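from transformers import AutoTokenizer

# `tokens` is the list returned by suur_data(). Decode it with the same
# tokenizer that produced it; "gpt2" matches the default `model` value.
tok = AutoTokenizer.from_pretrained("gpt2")
text = tok.decode(tokens)
print(text)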
```

---

## Architecture

```
Source (URL or file or list of sources)
        |
        v
Stage 1 — Ingest
Handles 8 file types and HTTP download.
Merges all sources into one text string.
        |
        v
Stage 2 — Relevance Filter
Splits text into paragraph chunks.
Scores each chunk against topic via TF-IDF cosine similarity.
Shows progress bar while scoring.
Drops chunks below the relevance threshold.
        |
        v
Stage 3 — Tokenize
Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
List[int] — token IDs
```
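
The data flow in the diagram can be sketched as three functions chained together. This is an illustrative skeleton only: `run_pipeline` and its callable parameters are hypothetical names, the stages are passed in rather than implemented, and joining sources with blank lines is an assumption about how the merge happens.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    sources: Iterable[str],
    topic: str,
    ingest: Callable[[str], str],
    relevance_filter: Callable[[str, str], List[str]],
    tokenize: Callable[[str], List[int]],
) -> List[int]:
    # Stage 1: read every source and merge the results into one text string.
    text = "\n\n".join(ingest(source) for source in sources)
    # Stage 2: filter by topic; an empty topic skips the filter.
    if topic:
        text = "\n\n".join(relevance_filter(text, topic))
    # Stage 3: turn the remaining text into token IDs.
    return tokenize(text)


if __name__ == "__main__":
    # Toy stages, just to show the flow end to end.
    docs = {"a.txt": "cats purr", "b.txt": "dogs bark"}
    ids = run_pipeline(
        ["a.txt", "b.txt"],
        topic="cats",
        ingest=docs.__getitem__,
        relevance_filter=lambda text, topic: [c for c in text.split("\n\n") if topic in c],
        tokenize=lambda text: [len(word) for word in text.split()],
    )
    print(ids)
```

Keeping the stages as plain callables makes each one easy to swap or test in isolation, which mirrors the pretrained versus custom choice in Stage 3.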

---

## License

MIT
