Metadata-Version: 2.4
Name: suur-data
Version: 1.0.2
Summary: Intelligent data ingestion and tokenization pipeline
Home-page: https://github.com/yourname/suur-data
Author: Your Name
Author-email: your@email.com
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: scikit-learn
Requires-Dist: numpy
Requires-Dist: click
Requires-Dist: chardet
Requires-Dist: tqdm
Provides-Extra: pdf
Requires-Dist: pdfminer.six; extra == "pdf"
Provides-Extra: docx
Requires-Dist: python-docx; extra == "docx"
Provides-Extra: epub
Requires-Dist: ebooklib; extra == "epub"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: tokenizers; extra == "hf"
Provides-Extra: all
Requires-Dist: pdfminer.six; extra == "all"
Requires-Dist: python-docx; extra == "all"
Requires-Dist: ebooklib; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: tokenizers; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Suur Data

**Intelligent data ingestion, filtering, and tokenization pipeline.**

## Installation

    pip install suur-data

## See It In Action

    from suur_data import suur_data

    tokens = suur_data("https://en.wikipedia.org/wiki/Neural_network", topic="neural networks")
    print(tokens)

That single call downloads the Wikipedia page, keeps only the paragraphs relevant to the topic, and returns a list of token IDs. By default the IDs come from the GPT-2 tokenizer, so use them with a model that shares that vocabulary.

---

## Full Documentation

### All Installation Options

    pip install suur-data
    pip install "suur-data[pdf]"
    pip install "suur-data[docx]"
    pip install "suur-data[epub]"
    pip install "suur-data[hf]"
    pip install "suur-data[all]"

### Supported Input Formats

| Format | Notes |
| --- | --- |
| .txt .md .rst | Plain text, auto encoding detection |
| .pdf | Requires suur-data[pdf] |
| .docx | Requires suur-data[docx] |
| .csv .tsv | All cells joined as text |
| .json | Recursively flattened key-value pairs |
| .html .htm | Scripts and styles stripped automatically |
| .epub | Requires suur-data[epub] |
| HTTP/HTTPS URL | Auto-downloaded, parsed by extension |
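
To make the notes for .json and .csv concrete, here is a minimal standard-library sketch of how nested JSON could be flattened into key-value lines and how CSV or TSV cells could be joined into one text string. The helper names `flatten_json` and `csv_to_text` and the `key.path: value` output format are assumptions made for this illustration; the package's own ingest code may differ.

```python
import csv
import io
import json


def flatten_json(value, prefix=""):
    """Recursively flatten nested JSON into 'key.path: value' lines (hypothetical format)."""
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(flatten_json(child, path))
        return lines
    if isinstance(value, list):
        lines = []
        for index, child in enumerate(value):
            lines.extend(flatten_json(child, f"{prefix}[{index}]"))
        return lines
    # Scalars (str, int, float, bool, None) end the recursion.
    return [f"{prefix}: {value}"]


def csv_to_text(raw, delimiter=","):
    """Join every non-empty cell of a CSV/TSV string into one space-separated string."""
    reader = csv.reader(io.StringIO(raw), delimiter=delimiter)
    cells = [cell.strip() for row in reader for cell in row if cell.strip()]
    return " ".join(cells)


record = json.loads('{"title": "Neurons", "tags": ["brain", "cells"], "meta": {"year": 2024}}')
json_text = "\n".join(flatten_json(record))
csv_text = csv_to_text("name,field\nAda,mathematics\n")
```

Either result is a plain string, which matches the single raw text string the ingest stage is described as producing.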

### Python API

    from suur_data import suur_data

    # From a URL
    tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience", topic="brain neurons")

    # From a local file
    tokens = suur_data("data.txt", topic="machine learning")

    # Custom BPE tokenizer trained on your data
    tokens = suur_data("data.txt", topic="machine learning", tokenizer="custom", vocab_size=4000)

    # Strict filter: only highly relevant chunks survive
    tokens = suur_data("data.pdf", topic="quantum computing", threshold=0.15)

    # Save the tokenizer to disk for reuse
    tokens = suur_data("data.txt", topic="biology", save_dir="./my_tokenizer")

    # Skip the filter entirely
    tokens = suur_data("data.txt", no_filter=True)

### All Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| data_location | str | required | URL or local file path |
| topic | str | "" | Subject for relevance filtering. Empty skips filter |
| tokenizer | str | "pretrained" | "pretrained" or "custom" |
| model | str | "gpt2" | HuggingFace model name or Hub ID |
| vocab_size | int | 8000 | BPE vocab size for custom tokenizer |
| threshold | float | 0.05 | Relevance cutoff between 0.0 and 1.0 |
| save_dir | str | None | Directory to save tokenizer files |
| no_filter | bool | False | Skip the relevance filter |
| verbose | bool | True | Show progress output |

### Pretrained Model Shortcuts

| Shortcut | Model |
| --- | --- |
| gpt2 | GPT-2 (OpenAI) |
| bert | BERT base uncased |
| roberta | RoBERTa base |
| distilbert | DistilBERT base uncased |
| t5 | T5 small |

You can also pass any HuggingFace Hub model ID directly:

    tokens = suur_data("data.txt", model="facebook/opt-125m")

### How the Filter Works

The filter splits the text into paragraph chunks, converts each chunk and the topic into TF-IDF vectors, then scores each chunk against the topic using cosine similarity. Chunks that score below the threshold are dropped. If the threshold is so strict that every chunk would be dropped, the filter relaxes automatically and keeps the top-scoring 30 percent of chunks instead.

Raise the threshold for stricter filtering, lower it to keep more content:

    tokens = suur_data("data.txt", topic="AI", threshold=0.10)  # strict
    tokens = suur_data("data.txt", topic="AI", threshold=0.02)  # loose
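
The scoring and fallback behavior described in this section can be sketched in plain Python. This is not the package's code. It is a simplified TF-IDF written with the standard library only, using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`. The function names, the regex tokenizer, and the blank-line paragraph split are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric terms (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(documents)
    return [
        {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
         for term, tf in counts.items()}
        for counts in term_counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_chunks(text, topic, threshold=0.05, fallback_fraction=0.3):
    """Keep paragraph chunks whose similarity to the topic meets the threshold."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    if not chunks:
        return []
    vectors = tfidf_vectors(chunks + [topic])
    topic_vector = vectors[-1]
    scores = [cosine(vec, topic_vector) for vec in vectors[:-1]]
    kept = [c for c, s in zip(chunks, scores) if s >= threshold]
    if kept:
        return kept
    # Fallback: nothing survived, so keep the top-scoring fraction (at least one chunk),
    # returned in original document order.
    keep_n = max(1, round(len(chunks) * fallback_fraction))
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:keep_n]
    return [chunks[i] for i in sorted(top)]


sample = (
    "Neural networks learn weights by gradient descent.\n\n"
    "The cafeteria serves soup on Tuesdays.\n\n"
    "Deep neural networks stack many layers."
)
relevant = filter_chunks(sample, topic="neural networks", threshold=0.05)
```

With this sample, the cafeteria paragraph shares no terms with the topic and scores zero, so it is dropped while the two neural-network paragraphs are kept. Calling `filter_chunks(sample, "neural networks", threshold=0.99)` triggers the fallback and returns only the single best-scoring chunk.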

### Saving and Loading Tokens

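    import json

    from suur_data import suur_data

    # Token IDs are a plain list of ints, so they serialize directly to JSON.
    tokens = suur_data("data.txt", topic="neural networks")
    with open("tokens.json", "w") as f:
        json.dump(tokens, f)

    with open("tokens.json", "r") as f:
        tokens = json.load(f)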

### Decoding Tokens Back to Text

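    from transformers import AutoTokenizer

    # Load the same tokenizer that produced the IDs (the default model is "gpt2").
    # `tokens` is the list returned by suur_data().
    tok = AutoTokenizer.from_pretrained("gpt2")
    text = tok.decode(tokens)
    print(text)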

---

## Architecture

    Source (URL or file)
            |
            v
    Stage 1: Ingest
    Handles 8 file types and HTTP download.
    Outputs a single raw text string.
            |
            v
    Stage 2: Relevance Filter
    Splits text into paragraph chunks.
    Scores each chunk against topic via TF-IDF cosine similarity.
    Shows progress bar while scoring.
    Drops chunks below the relevance threshold.
            |
            v
    Stage 3: Tokenize
    Pretrained: HuggingFace AutoTokenizer (GPT-2, BERT, etc.)
    Custom: trains a BPE tokenizer on the filtered corpus.
            |
            v
    List[int]: token IDs
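
For readers unfamiliar with the custom option in Stage 3, the following toy sketch shows what training a BPE tokenizer means: repeatedly merging the most frequent adjacent symbol pair in the corpus. It is an illustration only. The function names are hypothetical, the corpus is split on whitespace for simplicity, and the package presumably relies on a dedicated tokenizer library rather than code like this.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties resolve to the pair counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Split a word into subword pieces by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=2)
pieces = encode("lowest", merges)
```

In a real tokenizer each resulting piece is then mapped to an integer ID, and the number of merges is bounded by the target vocabulary size, which is the role `vocab_size` plays in the parameters above.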

---

## License

MIT
