Metadata-Version: 2.4
Name: suur_data
Version: 1.0.1
Summary: Intelligent data ingestion and tokenization pipeline
Author: Your Name
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: scikit-learn
Requires-Dist: numpy
Requires-Dist: click
Requires-Dist: chardet
Provides-Extra: pdf
Requires-Dist: pdfminer.six; extra == "pdf"
Provides-Extra: docx
Requires-Dist: python-docx; extra == "docx"
Provides-Extra: epub
Requires-Dist: ebooklib; extra == "epub"
Provides-Extra: hf
Requires-Dist: transformers; extra == "hf"
Requires-Dist: tokenizers; extra == "hf"
Provides-Extra: all
Requires-Dist: pdfminer.six; extra == "all"
Requires-Dist: python-docx; extra == "all"
Requires-Dist: ebooklib; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: tokenizers; extra == "all"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Suur Data

**Intelligent data ingestion and tokenization pipeline.**

Suur Data fetches text from a URL or a local file, filters it by topic with a TF-IDF cosine-similarity relevance scorer, then tokenizes it with either a pretrained HuggingFace tokenizer or a custom-trained BPE tokenizer.

---

## Installation

```bash
# Core
pip install suur-data

# With all optional formats + HuggingFace tokenizers
# (quoted so shells such as zsh do not expand the brackets)
pip install "suur-data[all]"
```

---

## Python API

```python
from suur_data import suur_data

# Minimal — fetches URL, no filter, GPT-2 tokenizer
tokens = suur_data("https://en.wikipedia.org/wiki/Neuroscience")

# Filter by topic, custom BPE tokenizer
tokens = suur_data(
    "research_paper.pdf",
    topic="quantum computing",
    tokenizer="custom",
    vocab_size=4000,
    save_dir="./my_tokenizer",
)

# Local file, pretrained BERT tokenizer, strict filter
tokens = suur_data(
    "~/corpus/biology.txt",
    topic="cell biology",
    tokenizer="pretrained",
    model="bert",
    threshold=0.10,
)

# Skip the filter entirely
tokens = suur_data("data.csv", no_filter=True)

print(tokens[:20])   # list of integer token IDs
print(len(tokens))   # total token count
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data_location` | str | — | URL or local file path |
| `topic` | str | `""` | Subject for relevance filtering (empty = skip filter) |
| `tokenizer` | str | `"pretrained"` | `"pretrained"` or `"custom"` |
| `model` | str | `"gpt2"` | HuggingFace model shortcut or full ID |
| `vocab_size` | int | `8000` | BPE vocab size for custom tokenizer |
| `threshold` | float | `0.05` | Cosine similarity cutoff (0.0–1.0) |
| `save_dir` | str | `None` | Path to save tokenizer files |
| `no_filter` | bool | `False` | Skip the relevance filter |
| `verbose` | bool | `True` | Show progress output |

### Returns
`List[int]` — flat list of integer token IDs.

---

## CLI

```bash
# Basic URL fetch
suur_data fetch https://example.com/article --topic "machine learning"

# PDF with custom BPE tokenizer
suur_data fetch paper.pdf --topic "protein folding" --tokenizer custom --vocab-size 6000

# Local file, pretrained BERT, save tokenizer
suur_data fetch corpus.txt --tokenizer pretrained --model bert --save-dir ./bert_tok

# Skip filter, save tokens to file
suur_data fetch data.json --no-filter --output tokens.json

# See supported models
suur_data models

# See supported file formats
suur_data formats
```

---

## Supported Input Formats

| Format | Notes |
|--------|-------|
| `.txt`, `.md`, `.rst` | Plain text |
| `.pdf` | Requires `pdfminer.six` |
| `.docx` | Requires `python-docx` |
| `.csv`, `.tsv` | All cells joined as text |
| `.json` | Recursively flattened key-value pairs |
| `.html`, `.htm` | Scripts/styles stripped (requires `beautifulsoup4`) |
| `.epub` | E-books (requires `ebooklib` + `beautifulsoup4`) |
| HTTP/HTTPS URL | Auto-downloaded, then parsed by extension |
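
The `.json` row says documents are recursively flattened into key-value pairs. A minimal sketch of what that could look like is below; the function names `flatten` and `json_to_text`, the dotted-key format, and the choice to drop `null` values are all assumptions made for illustration, not the package's actual behaviour.

```python
import json
from typing import Any, Iterator


def flatten(value: Any, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (key_path, text) pairs for every non-null leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Nested object keys are joined with dots (hypothetical convention).
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # List items are addressed by position.
            yield from flatten(child, f"{prefix}[{index}]")
    elif value is not None:
        yield prefix, str(value)


def json_to_text(raw: str) -> str:
    """Turn a JSON document into one 'key: value' line per leaf."""
    return "\n".join(f"{key}: {text}" for key, text in flatten(json.loads(raw)))


if __name__ == "__main__":
    print(json_to_text('{"title": "Cells", "tags": ["bio", "lab"]}'))
```

Keeping the key path next to each value preserves some context for the relevance filter; joining values alone would be an equally plausible reading of the table.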

---

## Pretrained Model Shortcuts

| Shortcut | Model |
|----------|-------|
| `gpt2` | GPT-2 (OpenAI) |
| `bert` | BERT base uncased |
| `roberta` | RoBERTa base |
| `distilbert` | DistilBERT base uncased |
| `t5` | T5 small |

You can also pass any HuggingFace Hub model ID directly:
```bash
--model "facebook/opt-125m"
```
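
One way the shortcut table could be resolved is a plain dictionary lookup that falls through to the given string when no shortcut matches. This is only a sketch: `MODEL_SHORTCUTS` and `resolve_model` are hypothetical names, and the Hub IDs on the right are assumed from the model descriptions in the table rather than taken from the package source.

```python
# Hypothetical mapping from README shortcuts to HuggingFace Hub IDs.
MODEL_SHORTCUTS = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "distilbert": "distilbert-base-uncased",
    "t5": "t5-small",
}


def resolve_model(name: str) -> str:
    """Return the Hub ID for a shortcut, or the name unchanged if it is not a shortcut."""
    return MODEL_SHORTCUTS.get(name.lower(), name)


if __name__ == "__main__":
    print(resolve_model("bert"))
    print(resolve_model("facebook/opt-125m"))
```

The resolved ID is the kind of string that `transformers.AutoTokenizer.from_pretrained` accepts.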

---

## Architecture

```
Source (URL / file)
        │
        ▼
  Stage 1: Ingest
  Handles 8 file types + HTTP download
        │
        ▼
  Stage 2: Neural Filter
  Splits into paragraph chunks
  Scores each chunk against topic via TF-IDF cosine similarity
  Drops chunks below threshold
        │
        ▼
  Stage 3: Tokenize
  ┌─────────────────────┐  ┌────────────────────────────┐
  │  Pretrained mode    │  │  Custom mode               │
  │  HuggingFace        │  │  BPE trainer (HF library   │
  │  AutoTokenizer      │  │  or pure-Python fallback)  │
  └─────────────────────┘  └────────────────────────────┘
        │
        ▼
  List[int]  ←  token IDs
```
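
Stage 2 can be sketched with scikit-learn, which is already a core dependency. The function name `filter_chunks`, the blank-line paragraph split, and the use of English stop words are assumptions for illustration; the package's own chunking and scoring details may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def filter_chunks(text: str, topic: str, threshold: float = 0.05) -> list[str]:
    """Keep paragraph chunks whose TF-IDF cosine similarity to `topic` meets `threshold`."""
    # Split on blank lines (assumed paragraph rule) and drop empty chunks.
    chunks = [part.strip() for part in text.split("\n\n") if part.strip()]
    if not chunks or not topic.strip():
        return chunks  # nothing to score, or an empty topic means no filtering

    # Fit on the chunks plus the topic so they share one vocabulary.
    matrix = TfidfVectorizer(stop_words="english").fit_transform(chunks + [topic])

    # Last row is the topic; compare it with every chunk row.
    scores = cosine_similarity(matrix[-1:], matrix[:-1]).ravel()
    return [chunk for chunk, score in zip(chunks, scores) if score >= threshold]


if __name__ == "__main__":
    sample = (
        "Qubits enable quantum computing algorithms.\n\n"
        "The recipe needs two cups of flour."
    )
    print(filter_chunks(sample, topic="quantum computing"))
```

Because the score depends on shared vocabulary, a chunk with no terms in common with the topic scores 0.0 and is dropped at any positive threshold.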

---

## Dependency Matrix

| Feature | Required packages |
|---------|------------------|
| Core pipeline | `requests`, `beautifulsoup4`, `scikit-learn`, `numpy`, `click`, `chardet` |
| PDF support | `pdfminer.six` |
| .docx support | `python-docx` |
| .epub support | `ebooklib` |
| Pretrained tokenizers | `transformers` |
| Fast BPE training | `tokenizers` |

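Only the core packages are required. Everything else is optional and can be installed through the `pdf`, `docx`, `epub`, and `hf` extras (or `all` for everything); when an optional package is missing, the tool degrades gracefully with built-in fallbacks.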
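
A common way to implement this kind of optional-dependency handling is to probe for a module before using it. The sketch below assumes hypothetical helper names (`has_package`, `pick_bpe_backend`, `require`) and a hypothetical error message; only the import names `tokenizers`, `pdfminer`, `docx`, and `ebooklib` correspond to the real packages listed above.

```python
import importlib.util


def has_package(module_name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_bpe_backend() -> str:
    """Prefer the fast `tokenizers` trainer, else fall back to pure Python."""
    return "tokenizers" if has_package("tokenizers") else "pure-python"


def require(module_name: str, extra: str) -> None:
    """Raise a readable error that names the extra to install (hypothetical message)."""
    if not has_package(module_name):
        raise ImportError(
            f"'{module_name}' is not installed; try: pip install \"suur-data[{extra}]\""
        )


if __name__ == "__main__":
    print("BPE backend:", pick_bpe_backend())
    for module, extra in [("pdfminer", "pdf"), ("docx", "docx"), ("ebooklib", "epub")]:
        print(module, "available" if has_package(module) else f"missing (extra: {extra})")
```

Probing with `find_spec` avoids paying the import cost for formats a given run never touches.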

