Metadata-Version: 2.4
Name: suur-data
Version: 1.0.6
Summary: Intelligent data ingestion and tokenization pipeline
Home-page: https://github.com/yourname/suur-data
Author: Your Name
Author-email: your@email.com
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: scikit-learn
Requires-Dist: numpy
Requires-Dist: click
Requires-Dist: chardet
Requires-Dist: tqdm
Provides-Extra: pdf
Requires-Dist: pdfminer.six; extra == "pdf"
Provides-Extra: docx
Requires-Dist: python-docx; extra == "docx"
Provides-Extra: epub
Requires-Dist: ebooklib; extra == "epub"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: tokenizers; extra == "hf"
Provides-Extra: all
Requires-Dist: pdfminer.six; extra == "all"
Requires-Dist: python-docx; extra == "all"
Requires-Dist: ebooklib; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: tokenizers; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Suur Data

**Intelligent data ingestion, filtering, and tokenization pipeline.**

## Installation

```bash
pip install suur-data
```

## See It In Action

### Single Source

```python
from suur_data import suur_data

result = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
print(result["total_tokens"])
```

### Multiple Sources

```python
from suur_data import suur_data

result = suur_data(
    [
        "data.txt",
        "research_paper.pdf",
        "https://en.wikipedia.org/wiki/Deep_learning",
        "https://en.wikipedia.org/wiki/Artificial_neural_network",
    ],
    topic="neural networks",
    threshold=0.05,
)
# The call returns a dict, so read the count from its "total_tokens" key
print(f"Total tokens: {result['total_tokens']}")
```

URLs are downloaded and local files are read, then all sources are merged, filtered together, and tokenized in one call.

---

## Full Documentation

### All Installation Options

```bash
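pip install suur-data            # core: text, CSV/TSV, JSON, HTML, URLs
pip install "suur-data[pdf]"     # adds pdfminer.six for .pdf
pip install "suur-data[docx]"    # adds python-docx for .docx
pip install "suur-data[epub]"    # adds ebooklib for .epub
pip install "suur-data[hf]"      # adds transformers and tokenizers
pip install "suur-data[all]"     # all of the above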
```

### Supported Input Formats

| Format | Notes |
|--------|-------|
| .txt .md .rst | Plain text, auto encoding detection |
| .pdf | Requires suur-data[pdf] |
| .docx | Requires suur-data[docx] |
| .csv .tsv | All cells joined as text |
| .json | Recursively flattened key-value pairs |
| .html .htm | Scripts and styles stripped automatically |
| .epub | Requires suur-data[epub] |
| HTTP/HTTPS URL | Auto-downloaded, parsed by extension |
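
To make the `.json` and `.csv`/`.tsv` rows concrete, here is a minimal stdlib-only sketch of what "recursively flattened key-value pairs" and "all cells joined as text" could look like. `flatten_json` and `csv_to_text` are hypothetical helpers written for illustration; the package's actual output format (separators, key paths) is an assumption here and may differ.

```python
import csv
import io
import json


def flatten_json(value, prefix=""):
    """Recursively flatten nested JSON into 'path: value' lines (assumed format)."""
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(flatten_json(child, child_prefix))
        return lines
    if isinstance(value, list):
        lines = []
        for index, child in enumerate(value):
            lines.extend(flatten_json(child, f"{prefix}[{index}]"))
        return lines
    # Leaf value: emit one "path: value" line
    return [f"{prefix}: {value}"]


def csv_to_text(raw, delimiter=","):
    """Join the non-empty cells of each row with spaces, one row per line."""
    reader = csv.reader(io.StringIO(raw), delimiter=delimiter)
    return "\n".join(" ".join(cell for cell in row if cell) for row in reader)


doc = json.loads('{"title": "Nets", "tags": ["ml", "dl"], "meta": {"year": 2020}}')
json_text = "\n".join(flatten_json(doc))
csv_text = csv_to_text("name,field\nHinton,deep learning\n")
print(json_text)
print(csv_text)
```

Passing `delimiter="\t"` to `csv_to_text` handles the `.tsv` case the same way.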

### Python API

```python
from suur_data import suur_data

# From a URL
tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

# From a local file
tokens = suur_data("data.txt", topic="machine learning")

# Multiple sources at once
tokens = suur_data(
    ["data.txt", "paper.pdf", "https://en.wikipedia.org/wiki/Deep_learning"],
    topic="neural networks"
)

# Custom BPE tokenizer trained on your data
tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)

# Stricter filter: raise the relevance threshold above the 0.05 default
tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

# Save tokenizer to disk
tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")

# Skip filter entirely
tokens = suur_data("data.txt", no_filter=True)
```

### Batch Output

```python
result = suur_data("data.txt", topic="neural networks")

print(result["total_tokens"])    # total token count
print(result["num_chunks"])      # number of chunks kept

# Iterate chunk by chunk
for i, (chunk, tokens) in enumerate(zip(result["chunks"], result["batch"])):
    print(f"Chunk {i+1} ({len(tokens)} tokens):")
    print(chunk[:80])
    print(tokens[:10])
```

### Use Directly With Transformers

```python
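import torch  # installed separately; not pulled in by any suur-data extra
from transformers import AutoModelForCausalLM

from suur_data import suur_data

result = suur_data("data.txt", topic="neural networks", model="gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for chunk_tokens in result["batch"]:
    # GPT-2 accepts at most 1024 positions, so longer chunks are truncated
    input_ids = torch.tensor([chunk_tokens[:1024]])
    with torch.no_grad():
        outputs = model(input_ids)
    print(outputs.logits.shape)  # (1, sequence_length, vocab_size)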
```
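
Slicing with `[:1024]` keeps only the first 1024 tokens of each chunk and discards the rest. If you would rather keep the remainder, one option is to split each chunk into consecutive windows. The sketch below is stdlib-only; `split_into_windows` is a hypothetical helper, not part of `suur_data`, and the sample `batch` is made-up data shaped like `result["batch"]`.

```python
def split_into_windows(token_ids, max_len=1024):
    """Split a list of token ids into consecutive windows of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


# Stand-in for result["batch"]: one long chunk and one short chunk
batch = [list(range(2500)), list(range(10))]

windows = [window for chunk in batch for window in split_into_windows(chunk)]
print([len(window) for window in windows])  # [1024, 1024, 452, 10]
```

Each window can then be wrapped in `torch.tensor([window])` and passed to the model exactly as in the loop above.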

### All Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| data_location | str or List[str] | required | URL, file path, or list of sources |
| topic | str | "" | Subject for relevance filtering. An empty string skips the filter |
| tokenizer | str | "pretrained" | "pretrained" or "custom" |
| model | str | "gpt2" | Hugging Face model name or Hub ID |
| vocab_size | int | 8000 | BPE vocabulary size for the custom tokenizer |
| threshold | float | 0.05 | Relevance cutoff between 0.0 and 1.0. Chunks scoring below it are dropped |
| save_dir | str or None | None | Directory to save tokenizer files |
| no_filter | bool | False | Skip the relevance filter entirely |
| verbose | bool | True | Show progress output |
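
To show what `vocab_size` controls for the custom tokenizer, here is a minimal stdlib-only sketch of byte-pair encoding (BPE) training: start from single characters and repeatedly merge the most frequent adjacent pair until the vocabulary reaches the requested size. `train_bpe` is a hypothetical function written for illustration. The package's own trainer is not shown in this README and may well rely on the `tokenizers` library from the `hf` extra, so treat this as the general technique only.

```python
from collections import Counter


def train_bpe(corpus, vocab_size):
    """Learn BPE merges until the vocabulary holds vocab_size symbols."""
    # Each word starts as a tuple of single-character symbols, weighted by frequency
    words = Counter(tuple(word) for word in corpus.split())
    vocab = {symbol for word in words for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair across the corpus
        pairs = Counter()
        for word, freq in words.items():
            for left, right in zip(word, word[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merged_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
        merges.append(best)
        vocab.add(best[0] + best[1])
    return merges, vocab


merges, vocab = train_bpe("low low low lower lowest", vocab_size=9)
print(merges)  # [('o', 'w'), ('l', 'ow')]
```

The toy corpus has 7 distinct characters, so `vocab_size=9` allows exactly two merges. A larger `vocab_size` permits more merges and therefore longer learned subwords.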

### Pretrained Model Shortcuts

| Shortcut | Model |
|----------|-------|
| gpt2 | GPT-2 (OpenAI) |
| bert | BERT base uncased |
| roberta | RoBERTa base |
| distilbert | DistilBERT base uncased |
| t5 | T5 small |

---

## Architecture

```
Source (URL or file or list of sources)
        |
        v
Stage 1 — Ingest
Handles 8 file types and HTTP download.
Merges all sources into one text string.
        |
        v
Stage 2 — Neural Filter
Splits text into paragraph chunks.
Scores each chunk against topic via TF-IDF cosine similarity.
Shows progress bar while scoring.
Drops chunks below the relevance threshold.
        |
        v
Stage 3 — Tokenize
Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
Custom: trains a BPE tokenizer on the filtered corpus.
        |
        v
{tokens, batch, chunks, num_chunks, total_tokens}
```
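
Stage 2 can be sketched in a few lines. The version below is stdlib-only so the arithmetic is visible; the package lists scikit-learn as a dependency and presumably uses it for this step, so the exact weighting and tokenization here are assumptions. All function names are hypothetical. Note that TF-IDF scoring is lexical: a chunk only scores above zero if it shares words with the topic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for count in counts for term in count)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in count.items()} for count in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_chunks(text, topic, threshold=0.05):
    """Keep paragraph chunks whose similarity to the topic meets the threshold."""
    chunks = split_paragraphs(text)
    vectors = tfidf_vectors(chunks + [topic])  # the topic is scored as one more document
    topic_vector = vectors[-1]
    return [
        chunk
        for chunk, vector in zip(chunks, vectors)
        if cosine(vector, topic_vector) >= threshold
    ]


text = (
    "Neural networks learn weights by gradient descent.\n\n"
    "The recipe needs two eggs and some flour.\n\n"
    "Deep neural networks stack many layers."
)
kept = filter_chunks(text, "neural networks")
print(kept)
```

With the sample text, the recipe paragraph shares no terms with the topic, scores 0.0, and is dropped, while the other two paragraphs are kept.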

---

## License

MIT
