Metadata-Version: 2.4
Name: suur-data
Version: 1.1.1
Summary: Intelligent data ingestion, filtering and tokenization pipeline
Home-page: https://github.com/yourname/suur-data
Author: Your Name
Author-email: your@email.com
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: scikit-learn
Requires-Dist: numpy
Requires-Dist: click
Requires-Dist: chardet
Requires-Dist: tqdm
Provides-Extra: pdf
Requires-Dist: pdfminer.six; extra == "pdf"
Provides-Extra: docx
Requires-Dist: python-docx; extra == "docx"
Provides-Extra: epub
Requires-Dist: ebooklib; extra == "epub"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: tokenizers; extra == "hf"
Provides-Extra: all
Requires-Dist: pdfminer.six; extra == "all"
Requires-Dist: python-docx; extra == "all"
Requires-Dist: ebooklib; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: tokenizers; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Suur Data

**Intelligent data ingestion, filtering, and tokenization pipeline.**

---

## Installation

```bash
pip install suur-data
```

---

## See It In Action

### One line to fetch, filter and tokenize any web page

```python
from suur_data import suur_data

result = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(result["total_tokens"])
```

### Fetch 3 classic novels and filter by topic in under 10 seconds

```python
from suur_data import suur_data

result = suur_data(
    [
        "https://www.gutenberg.org/cache/epub/1342/pg1342.txt",
        "https://www.gutenberg.org/cache/epub/84/pg84.txt",
        "https://www.gutenberg.org/cache/epub/11/pg11.txt",
    ],
    topic="love",
    workers=3,
)

print(f"Total tokens: {result['total_tokens']}")
print(f"Chunks kept:  {result['num_chunks']}")
```

That single function call fetches 3 classic novels totalling 1.3 million characters, filters 3000+ paragraphs down to the relevant ones, and returns a training-ready tokenized dataset in under 10 seconds.

---

## What It Returns

```python
result = suur_data("data.txt", topic="neural networks")

result["tokens"]        # flat list of all token IDs
result["batch"]         # list of token ID lists, one per chunk
result["chunks"]        # list of kept text chunks as strings
result["num_chunks"]    # number of chunks kept after filtering
result["total_tokens"]  # total token count across all chunks
```
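
The key descriptions above imply a few invariants that are cheap to check before training. The sketch below assumes, without confirmation from the package, that `tokens` is the per-chunk lists in `batch` concatenated in order and that `chunks` and `batch` are aligned. `check_result` is a hypothetical helper, not part of suur-data.

```python
from itertools import chain


def check_result(result: dict) -> bool:
    """Return True if a result dict is internally consistent.

    Assumption: "tokens" equals the lists in "batch" concatenated in order.
    This is inferred from the key descriptions, not confirmed by the package.
    """
    flat = list(chain.from_iterable(result["batch"]))
    return (
        len(result["chunks"]) == len(result["batch"]) == result["num_chunks"]
        and len(result["tokens"]) == result["total_tokens"]
        and flat == result["tokens"]
    )


# Hand-built example with the same shape as the documented return value.
example = {
    "tokens": [1, 2, 3, 4, 5],
    "batch": [[1, 2], [3, 4, 5]],
    "chunks": ["first chunk", "second chunk"],
    "num_chunks": 2,
    "total_tokens": 5,
}
print(check_result(example))
```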

---

## Full Documentation

### All Installation Options

```bash
# Core — supports .txt, .csv, .json, .html, URLs
pip install suur-data

# Add PDF support (quotes keep shells such as zsh from expanding the brackets)
pip install "suur-data[pdf]"

# Add Word document support
pip install "suur-data[docx]"

# Add EPUB support
pip install "suur-data[epub]"

# Add HuggingFace pretrained tokenizers
pip install "suur-data[hf]"

# Everything
pip install "suur-data[all]"
```

### Supported Input Formats

| Format | Notes |
|--------|-------|
| .txt .md .rst | Plain text, auto encoding detection |
| .pdf | Requires suur-data[pdf] |
| .docx | Requires suur-data[docx] |
| .csv .tsv | All cells joined as text |
| .json | Recursively flattened key-value pairs |
| .html .htm | Scripts and styles stripped automatically |
| .epub | Requires suur-data[epub] |
| HTTP/HTTPS URL | Auto-downloaded, parsed by extension |
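
When a source list mixes formats, it can help to know up front which optional extras are needed. The sketch below encodes only the three rows of the table that name an extra. `required_extra` and `EXTRAS_BY_SUFFIX` are hypothetical names for illustration, not part of suur-data.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

# File suffix -> optional extra, taken from the table above.
EXTRAS_BY_SUFFIX = {".pdf": "pdf", ".docx": "docx", ".epub": "epub"}


def required_extra(source: str) -> Optional[str]:
    """Return the extra a source needs, or None if the core install suffices."""
    # For URLs, inspect only the path so query strings do not hide the suffix.
    if source.startswith(("http://", "https://")):
        path = urlparse(source).path
    else:
        path = source
    return EXTRAS_BY_SUFFIX.get(Path(path).suffix.lower())


for src in ["data.txt", "research_paper.pdf", "https://example.com/book.epub?dl=1"]:
    print(src, "->", required_extra(src))
```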

---

## Python API

### Single source

```python
from suur_data import suur_data

# From a local file
result = suur_data("data.txt", topic="machine learning")

# From a URL
result = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")
```

### Multiple sources — NEW in 1.1.0

```python
from suur_data import suur_data

result = suur_data(
    [
        "data.txt",
        "research_paper.pdf",
        "https://en.wikipedia.org/wiki/Deep_learning",
        "https://en.wikipedia.org/wiki/Artificial_neural_network",
    ],
    topic="neural networks",
)
```

All sources are downloaded, merged, filtered together and tokenized in one call.

### Parallel downloading with workers — NEW in 1.1.0

```python
from suur_data import suur_data

result = suur_data(
    [
        "https://www.gutenberg.org/cache/epub/1342/pg1342.txt",
        "https://www.gutenberg.org/cache/epub/84/pg84.txt",
        "https://www.gutenberg.org/cache/epub/11/pg11.txt",
    ],
    topic="love",
    workers=3,   # downloads all 3 simultaneously
)
```

Without workers, sources download one at a time. With workers=3, all 3 download concurrently. The speedup is roughly 40 percent for 3 sources and grows with more sources.
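
The effect of a worker pool can be illustrated with the standard library alone. This is a minimal sketch of the general technique (a thread pool over I/O-bound fetches), not suur-data's actual implementation. `fetch_all` and `fake_fetch` are hypothetical names, and the sleep stands in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all(sources: List[str], fetch: Callable[[str], str], workers: int = 1) -> List[str]:
    """Fetch every source with up to `workers` threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        # Executor.map yields results in the order of `sources`.
        return list(pool.map(fetch, sources))


def fake_fetch(source: str) -> str:
    """Stand-in for a download: sleeps instead of doing network I/O."""
    time.sleep(0.2)
    return f"text from {source}"


sources = ["a.txt", "b.txt", "c.txt"]
for n in (1, 3):
    start = time.perf_counter()
    texts = fetch_all(sources, fake_fetch, workers=n)
    print(f"workers={n}: {time.perf_counter() - start:.2f}s for {len(texts)} sources")
```

Real speedups depend on network conditions and on how much time is spent outside the download step, so they will usually be smaller than this toy example suggests.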

### Batch output per chunk — NEW in 1.1.0

```python
result = suur_data("data.txt", topic="neural networks")

# Iterate chunk by chunk
for i, (chunk, tokens) in enumerate(zip(result["chunks"], result["batch"])):
    print(f"Chunk {i+1} ({len(tokens)} tokens):")
    print(chunk[:80])
    print(tokens[:10])
```

### Custom BPE tokenizer trained on your data

```python
result = suur_data(
    "data.txt",
    topic="machine learning",
    tokenizer="custom",
    vocab_size=4000,
    save_dir="./my_tokenizer",
)
```

### Strict filter — only highly relevant chunks survive

```python
result = suur_data("data.pdf", topic="quantum computing", threshold=0.15)
```

### Loose filter — keep more content

```python
result = suur_data("data.txt", topic="AI", threshold=0.02)
```

### Skip filter entirely

```python
result = suur_data("data.txt", no_filter=True)
```

### Use directly with HuggingFace Transformers

```python
import torch
from suur_data import suur_data
from transformers import AutoModelForCausalLM

result = suur_data("data.txt", topic="neural networks", model="gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for chunk_tokens in result["batch"]:
    input_ids = torch.tensor([chunk_tokens[:1024]])
    with torch.no_grad():
        outputs = model(input_ids)
    print(outputs.logits.shape)
```
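
The loop above truncates each chunk to 1024 tokens, the GPT-2 context length, so anything past that point is discarded. If long chunks matter, one option is to split them into consecutive windows instead. `windows` below is a hypothetical helper, not part of suur-data, and the list of integers stands in for one entry of `result["batch"]`.

```python
from typing import Iterator, List


def windows(token_ids: List[int], max_len: int = 1024) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping slices of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, len(token_ids), max_len):
        yield token_ids[start:start + max_len]


chunk_tokens = list(range(2500))  # stand-in for one entry of result["batch"]
print([len(w) for w in windows(chunk_tokens)])  # [1024, 1024, 452]
```

Each window can then be wrapped in `torch.tensor([window])` exactly as in the loop above.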

### Save and load tokens

```python
import json

from suur_data import suur_data

result = suur_data("data.txt", topic="neural networks")

# Save
with open("tokens.json", "w") as f:
    json.dump(result["tokens"], f)

# Load
with open("tokens.json", "r") as f:
    tokens = json.load(f)
```

### Decode tokens back to text

```python
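from suur_data import suur_data
from transformers import AutoTokenizer

result = suur_data("data.txt", topic="neural networks", model="gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")  # same model that produced the IDs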
text = tok.decode(result["tokens"])
print(text)
```

---

## All Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| data_location | str or List[str] | required | URL, file path, or list of multiple sources |
| topic | str | "" | Subject for relevance filtering. An empty string skips the filter |
| tokenizer | str | "pretrained" | "pretrained" or "custom" |
| model | str | "gpt2" | HuggingFace model name or Hub ID |
| vocab_size | int | 8000 | BPE vocab size for custom tokenizer |
| threshold | float | 0.05 | Relevance cutoff between 0.0 and 1.0 |
| save_dir | str | None | Directory to save tokenizer files |
| no_filter | bool | False | Skip the relevance filter |
| verbose | bool | True | Show progress output |
| workers | int | 1 | Number of parallel download workers |

---

## Pretrained Model Shortcuts

| Shortcut | Model |
|----------|-------|
| gpt2 | GPT-2 (OpenAI) |
| bert | BERT base uncased |
| roberta | RoBERTa base |
| distilbert | DistilBERT base uncased |
| t5 | T5 small |

You can also pass any HuggingFace Hub model ID directly:

```python
result = suur_data("data.txt", model="facebook/opt-125m")
```

---

## How the Filter Works

The filter splits text into paragraph chunks, using blank lines as boundaries. If no paragraphs are found, it automatically falls back to sentence-level splitting, grouping every 3 sentences into a chunk.
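
The splitting step might look roughly like the sketch below. It assumes that "no paragraphs found" means the text contains no blank-line boundary, and it uses a naive punctuation-based sentence splitter. Both are assumptions, and `split_chunks` is a hypothetical name rather than the package's function.

```python
import re
from typing import List


def split_chunks(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """Split on blank lines; fall back to groups of sentences if that fails."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if len(paragraphs) > 1:
        return paragraphs
    # Fallback: naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


print(split_chunks("First paragraph.\n\nSecond paragraph."))
print(split_chunks("One. Two. Three. Four."))
```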

Each chunk is scored against the topic using TF-IDF cosine similarity. A gentle length penalty is applied to very short chunks. Chunks below the threshold are dropped.
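
To make the scoring step concrete, here is a small standard-library sketch of TF-IDF cosine similarity between a topic and each chunk. It is an illustration of the technique, not the package's code: suur-data depends on scikit-learn and may well use its vectorizer, whose tokenization and weighting details differ. The length penalty is left out because its formula is not documented here. `tokenize` and `tfidf_scores` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(chunks: List[str], topic: str) -> List[float]:
    """Cosine similarity between the topic and each chunk in TF-IDF space."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    # Document frequency: the number of chunks each term appears in.
    df = Counter(term for doc in docs for term in set(doc))

    def vectorize(terms: List[str]) -> Dict[str, float]:
        # Smoothed IDF, log((1 + n) / (1 + df)) + 1, stays finite for unseen terms.
        counts = Counter(terms)
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    query = vectorize(tokenize(topic))
    return [cosine(vectorize(doc), query) for doc in docs]


chunks = [
    "Neural networks learn weights from data.",
    "The cat sat on the mat.",
]
for chunk, score in zip(chunks, tfidf_scores(chunks, "neural networks")):
    print(f"{score:.3f}  {chunk}")
```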

If the threshold is too strict and every chunk is dropped, the filter auto-relaxes and keeps the top 10 percent of chunks, so you never get empty output.

```python
result = suur_data("data.txt", topic="AI", threshold=0.10)  # strict
result = suur_data("data.txt", topic="AI", threshold=0.02)  # loose
```
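
The threshold and auto-relax behaviour can be sketched as follows. This assumes chunks scoring exactly at the threshold are kept, that "top 10 percent" rounds up to at least one chunk, and that the survivors stay in document order. None of these details are confirmed by the package, and `keep_relevant` is a hypothetical name.

```python
import math
from typing import List


def keep_relevant(
    chunks: List[str],
    scores: List[float],
    threshold: float = 0.05,
    fallback_fraction: float = 0.10,
) -> List[str]:
    """Keep chunks at or above the threshold; never return empty for non-empty input."""
    kept = [c for c, s in zip(chunks, scores) if s >= threshold]
    if kept or not chunks:
        return kept
    # Auto relax: nothing passed, so keep the top-scoring fraction (at least one).
    k = max(1, math.ceil(len(chunks) * fallback_fraction))
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top)]  # restore document order


chunks = [f"c{i}" for i in range(20)]
scores = [i / 1000 for i in range(20)]  # all far below the threshold used here
print(keep_relevant(chunks, scores, threshold=0.5))  # ['c18', 'c19']
```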

---

## Architecture

```
Source (URL or file or list of sources)
        |
        v
Stage 1 — Ingest
Handles 8 file types and HTTP download.
Parallel downloading with workers parameter.
Merges all sources into one text string.
        |
        v
Stage 2 — Neural Filter
Strips boilerplate headers and footers.
Splits text into paragraph chunks.
Falls back to sentence splitting if no paragraphs found.
Scores each chunk against topic via TF-IDF cosine similarity.
Applies length penalty to very short chunks.
Shows progress bar while scoring.
Drops chunks below the relevance threshold.
Auto relaxes if threshold is too strict.
        |
        v
Stage 3 — Tokenize
Pretrained: HuggingFace AutoTokenizer with caching (loaded once, reused for all chunks).
Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
{tokens, batch, chunks, num_chunks, total_tokens}
```

---

## Changelog

### 1.1.0 — Major Update

- **Multiple sources** — pass a list of URLs and files, all merged into one dataset
- **Parallel workers** — the workers parameter downloads sources concurrently, roughly 40 percent faster for 3 sources
- **Batch output** — the result is now a dict that includes tokens per chunk, not just a flat list
- **Tokenizer caching** — the pretrained tokenizer loads once and is reused for all chunks
- **Sentence-level splitting** — automatically falls back to sentence chunking if no paragraphs are found
- **Boilerplate stripping** — removes Gutenberg headers and footers before filtering
- **Length penalty** — a gentle penalty on very short chunks improves filter quality
- **Auto relax** — if the threshold drops every chunk, the filter keeps the top 10 percent instead of returning empty output

### 1.0.0 — Initial Release

- Single source ingestion from URL or file
- 8 supported file formats
- TF-IDF relevance filter
- Pretrained HuggingFace tokenizer
- Custom BPE tokenizer
- CLI interface

---

## License

MIT
