Metadata-Version: 2.4
Name: trollfab-data-cleaner
Version: 1.0.0
Summary: Production-quality data cleaning, validation, similarity, anonymization, and Swedish-specific utilities — pure stdlib core
Author-email: Trollfabriken AITrix AB <dev@trollfabriken.se>
License: MIT
Keywords: data-cleaning,validation,text,swedish,nlp,anonymization,deduplication,html,json
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: similarity
Requires-Dist: rapidfuzz>=3.0; extra == "similarity"
Provides-Extra: html
Requires-Dist: beautifulsoup4>=4.12; extra == "html"
Provides-Extra: all
Requires-Dist: rapidfuzz>=3.0; extra == "all"
Requires-Dist: beautifulsoup4>=4.12; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.9; extra == "dev"
Dynamic: license-file

# trollfab-data-cleaner

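Production-quality utilities for data cleaning, validation, text similarity,
anonymization, and Swedish-specific normalization, built for
[Trollfabriken AITrix AB](https://trollfabriken.se) document processing pipelines.
The core is pure Python with no required dependencies; the optional extras add
`rapidfuzz` (faster fuzzy matching) and `beautifulsoup4` (HTML parsing).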

---

## Modules

| Module | Purpose |
|---|---|
| `TextCleaner` | HTML strip, Unicode normalize, whitespace, Swedish mojibake |
| `ContentCleaner` | OCR/web content cleaning with 88+ Swedish legal patterns |
| `HTMLConverter` | Parse, sanitize, extract from HTML |
| `MarkdownConverter` | HTML→Markdown with LLM-optimized output |
| `TextSimilarity` | Levenshtein, n-gram, cosine, deduplication |
| `DuplicateDetector` | Weighted similarity, Union-Find clustering |
| `Schema` + validators | 30+ composable validators (Email, URL, Range, UUID…) |
| `Anonymizer` | GDPR-compliant, role-aware, 6 strategies |
| `repair_json` | Fix truncated/malformed LLM JSON output |
| `swedish.*` | Personnummer, org numbers, phone, banking normalization |

---

## Installation

```bash
# Core (no external deps)
pip install trollfab-data-cleaner

# With fuzzy matching acceleration
pip install "trollfab-data-cleaner[similarity]"

# With HTML parser (better HTML→Markdown conversion)
pip install "trollfab-data-cleaner[html]"

# Everything
pip install "trollfab-data-cleaner[all]"
```
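
Since the extras are optional, it can be useful to check at runtime which backends are importable. The following is a minimal sketch using only the standard library; the `available_extras` helper and the `OPTIONAL_MODULES` mapping are illustrative names, not part of this package's API.

```python
from importlib.util import find_spec

# Import names of the optional dependencies declared by the extras.
# Note that the beautifulsoup4 distribution is imported as "bs4".
OPTIONAL_MODULES = {"similarity": "rapidfuzz", "html": "bs4"}


def available_extras() -> dict[str, bool]:
    """Report which optional backends can be imported in this environment."""
    return {
        extra: find_spec(module) is not None
        for extra, module in OPTIONAL_MODULES.items()
    }


print(available_extras())
```

The result depends on the environment, so a fresh core-only install would be expected to report both extras as unavailable.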

---

## Quick start

### Text cleaning

```python
from data_cleaner import TextCleaner, CleanerConfig

cleaner = TextCleaner()
result = cleaner.clean("<p>Göteborg &amp; Stockholm  </p>")
print(result.text)  # "Göteborg & Stockholm"

# Custom config
cfg = CleanerConfig(strip_html=True, fix_swedish_mojibake=True, normalize_unicode=True)
cleaner = TextCleaner(cfg)
```
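
To illustrate what "Swedish mojibake" typically means, here is a small stdlib sketch of one common repair: UTF-8 bytes that were mistakenly decoded as Latin-1. The `fix_mojibake` function is a hypothetical helper written for this example; how `TextCleaner` handles this internally (and whether it also covers Windows-1252) is not shown here.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already clean): leave it untouched.
        return text


# Simulate the damage, then repair it.
garbled = "Göteborg".encode("utf-8").decode("latin-1")
print(garbled)                   # GÃ¶teborg
print(fix_mojibake(garbled))     # Göteborg
print(fix_mojibake("Göteborg"))  # Göteborg (clean input is returned as-is)
```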

### Validation framework

```python
from data_cleaner.validation import Schema, Required, Email, MinLength, Range

schema = Schema({
    "name":  Required() | MinLength(2),
    "email": Required() | Email(),
    "age":   Required() | Range(0, 150),
})
result = schema.validate({"name": "Anna", "email": "anna@example.com", "age": 30})
print(result.valid, result.errors)
```
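
For readers curious how validators can be chained with `|`, the sketch below shows one way to build such a chain with operator overloading. It is an illustration only: `Check`, `required`, `min_length`, `in_range`, and `validate` are hypothetical names, and the package's own `Schema` may work differently.

```python
from typing import Any, Callable, Optional

Rule = Callable[[Any], Optional[str]]  # returns an error message, or None


class Check:
    """A chain of rules; `a | b` builds a new chain that runs a, then b."""

    def __init__(self, *rules: Rule) -> None:
        self.rules = rules

    def __or__(self, other: "Check") -> "Check":
        return Check(*self.rules, *other.rules)

    def errors(self, value: Any) -> list[str]:
        for rule in self.rules:
            message = rule(value)
            if message is not None:
                return [message]  # stop at the first failing rule
        return []


def required() -> Check:
    return Check(lambda v: "value is required" if v in (None, "") else None)


def min_length(n: int) -> Check:
    return Check(lambda v: f"must be at least {n} characters" if len(v) < n else None)


def in_range(lo: float, hi: float) -> Check:
    return Check(lambda v: None if lo <= v <= hi else f"must be between {lo} and {hi}")


def validate(schema: dict[str, Check], data: dict[str, Any]) -> dict[str, list[str]]:
    """Map each failing field name to its error messages (empty when valid)."""
    found = {name: check.errors(data.get(name)) for name, check in schema.items()}
    return {name: errs for name, errs in found.items() if errs}


schema = {"name": required() | min_length(2), "age": required() | in_range(0, 150)}
print(validate(schema, {"name": "Anna", "age": 30}))  # {}
print(validate(schema, {"name": "A", "age": 200}))
```

Stopping at the first failing rule means `required()` guards the later rules, so `min_length` never sees a missing value.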

### Text similarity & deduplication

```python
from data_cleaner import TextSimilarity, DuplicateDetector

sim = TextSimilarity()
print(sim.levenshtein_ratio("Göteborg", "Goteborg"))   # ~0.88
print(sim.cosine_tfidf(["doc one", "doc two"], "doc one"))

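# Dedup a list of strings
texts = ["Göteborg kommun", "Goteborg kommun", "Stockholms stad"]
deduped = sim.deduplicate(texts, threshold=0.85)

# Cluster similar documents (`documents` is your own collection of records)
detector = DuplicateDetector()
groups = detector.find_duplicates(documents)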
```
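
The two ideas named above, a Levenshtein ratio and Union-Find clustering, can be sketched in plain Python as follows. This is a self-contained illustration, not the package's implementation: the function names are made up for the example, and the ratio is assumed to be one minus the edit distance divided by the longer length.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster(texts: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Group texts whose pairwise ratio meets the threshold (Union-Find)."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if ratio(texts[i], texts[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups: dict[int, list[str]] = {}
    for i, text in enumerate(texts):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


print(ratio("Göteborg", "Goteborg"))  # 0.875
print(cluster(["Göteborg", "Goteborg", "Stockholm"]))
# [['Göteborg', 'Goteborg'], ['Stockholm']]
```

The pairwise loop is quadratic in the number of texts, which is fine for a sketch but is the part a real deduplicator would need to optimize.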

### HTML to Markdown (LLM-optimized)

```python
from data_cleaner import MarkdownConverter

html_string = "<h1>Title</h1><p>Hello <b>world</b></p>"

conv = MarkdownConverter()
result = conv.convert(html_string)
print(result.markdown, result.compression_ratio)
```
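
As a rough picture of what an HTML to Markdown pass involves, the sketch below uses the standard library's `html.parser` to handle a few tags and drop `<script>`/`<style>` content. `TinyMarkdown` and `to_markdown` are hypothetical names for this example; `MarkdownConverter` presumably covers far more cases.

```python
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Convert a handful of HTML tags to Markdown; ignore everything else."""

    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}
    BLOCKS = {"p", "h1", "h2", "h3", "li"}
    SKIP = {"script", "style"}
    BOLD = {"b", "strong"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.skipping = 0  # depth inside script/style elements

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag in self.SKIP:
            self.skipping += 1
        elif tag in self.PREFIX:
            self.parts.append(self.PREFIX[tag])
        elif tag in self.BOLD:
            self.parts.append("**")

    def handle_endtag(self, tag: str) -> None:
        if tag in self.SKIP:
            self.skipping = max(0, self.skipping - 1)
        elif tag in self.BLOCKS:
            self.parts.append("\n\n")  # blank line after each block
        elif tag in self.BOLD:
            self.parts.append("**")

    def handle_data(self, data: str) -> None:
        if not self.skipping:
            self.parts.append(data)


def to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(to_markdown("<h1>Title</h1><p>Göteborg &amp; <b>Stockholm</b></p><script>x()</script>"))
```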

### JSON repair

```python
from data_cleaner import repair_json, safe_parse

# Fix truncated LLM output
fixed = repair_json('{"name": "Anna", "items": [1, 2')

# Parse a raw model response safely
llm_response_text = '{"status": "ok", "items": [1, 2'
data = safe_parse(llm_response_text)  # returns None on failure
```
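
One repair strategy for truncated output is to track open strings and brackets and append the missing closers. The sketch below shows only that single idea under simplifying assumptions; `close_truncated` and `parse_or_none` are illustrative names, and the package's `repair_json` is described as combining several strategies that are not reproduced here.

```python
import json
from typing import Any, Optional


def close_truncated(text: str) -> str:
    """Append whatever closers a truncated JSON document is missing."""
    stack: list[str] = []  # expected closing characters, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()

    if in_string:
        text += '"'  # close the string that was cut off
    else:
        text = text.rstrip().rstrip(",")  # drop a dangling comma
    return text + "".join(reversed(stack))


def parse_or_none(text: str) -> Optional[Any]:
    """Parse after repair; return None when the result is still invalid."""
    try:
        return json.loads(close_truncated(text))
    except json.JSONDecodeError:
        return None


print(parse_or_none('{"name": "Anna", "items": [1, 2'))
# {'name': 'Anna', 'items': [1, 2]}
```

Input cut off right after a key or a colon (for example `{"a": `) is still unparseable after this step, so `parse_or_none` returns `None` for it.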

### GDPR anonymization

```python
from data_cleaner import Anonymizer

text = "Kontakta Anna Svensson på anna@example.com eller 031-123 456."

anon = Anonymizer()
result = anon.anonymize(text)
print(result.anonymized_text, result.pii_found)
```
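
To make the idea concrete, here is a minimal regex-based sketch of one anonymization strategy: replacing detected PII with a typed placeholder. The patterns and the `redact` function are assumptions for illustration; they are deliberately simple and do not reflect the `Anonymizer`'s actual strategies or role handling.

```python
import re

# Deliberately simple patterns, good enough for a demonstration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PERSONNUMMER = re.compile(r"\b(?:\d{2})?\d{6}[-+]?\d{4}\b")


def redact(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Replace matches with [LABEL] and return the findings alongside."""
    found: list[tuple[str, str]] = []

    def replacer(label: str):
        def _sub(match: re.Match) -> str:
            found.append((label, match.group()))
            return f"[{label}]"
        return _sub

    text = EMAIL.sub(replacer("EMAIL"), text)
    text = PERSONNUMMER.sub(replacer("PERSONNUMMER"), text)
    return text, found


clean, findings = redact("Kontakta anna@example.com, pnr 19811218-9876.")
print(clean)     # Kontakta [EMAIL], pnr [PERSONNUMMER].
print(findings)
```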

### Swedish utilities

```python
from data_cleaner.swedish import (
    validate_personnummer,
    mask_personnummer,
    extract_personnummer,
    validate_org_number,
    classify_org_number,
    validate_phone_se,
    parse_swedish_amount,
    normalize_vendor,
)

# Personnummer (the last digit is a Luhn check digit)
r = validate_personnummer("19811218-9876")
print(r.valid, r.birth_date, r.age, r.gender)
print(mask_personnummer("19811218-9876"))  # "19811218-XXXX"

# Org number (the last digit is a Luhn check digit)
print(validate_org_number("556703-7683"))   # True
info = classify_org_number("556703-7683")
print(info.type_name)  # "Aktiebolag"

# Swedish phone
result = validate_phone_se("+46 31 123 456")
print(result.valid, result.normalized)

# Swedish amounts
print(parse_swedish_amount("1 234 567,89"))  # 1234567.89
print(normalize_vendor("PAYPAL *ADOBE"))     # "ADOBE"
```
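
The checksum behind personnummer and org number validation is the Luhn algorithm, applied to the ten-digit form. The sketch below shows that check together with a simple amount parser; the function names are hypothetical, and the sketch skips everything else a full validator would do (date validity, the `+` separator for people over 100, coordination numbers).

```python
def luhn_valid(digits: str) -> bool:
    """Luhn check over a string of digits, check digit included."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # every second digit, counted from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def personnummer_checksum_ok(value: str) -> bool:
    """Check only the checksum of a 10- or 12-digit personnummer."""
    digits = "".join(c for c in value if c.isdigit())
    if len(digits) == 12:  # drop the century: the checksum covers 10 digits
        digits = digits[2:]
    return len(digits) == 10 and luhn_valid(digits)


def parse_amount(text: str) -> float:
    """Parse '1 234 567,89' style amounts: space thousands, comma decimals."""
    cleaned = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return float(cleaned)


print(personnummer_checksum_ok("19811218-9876"))  # True
print(personnummer_checksum_ok("19811218-9875"))  # False
print(parse_amount("1 234 567,89"))               # 1234567.89
```

Using `float` mirrors the README output above; for money handling where rounding matters, `decimal.Decimal` would be the safer choice.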

---

## Package structure

```
data_cleaner/
├── __init__.py             ← Public API
├── py.typed                ← PEP 561
├── text_cleaner.py         ← TextCleaner (HTML, unicode, whitespace, mojibake)
├── content_cleaner.py      ← OCR/web content cleaner (88+ Swedish legal patterns)
├── html_converter.py       ← HTMLConverter (sanitize, extract, XSS prevention)
├── html_to_markdown.py     ← MarkdownConverter (LLM-optimized, ~80% token reduction)
├── text_similarity.py      ← TextSimilarity (levenshtein, n-gram, cosine, dedup)
├── duplicate_detector.py   ← DuplicateDetector (weighted, Union-Find clustering)
├── validation.py           ← Schema + 30+ validators (Email, URL, Range, UUID…)
├── validators.py           ← Functional validators (email, url, json, credit card)
├── anonymizer.py           ← Anonymizer (GDPR, 6 strategies, role-aware)
├── json_repair.py          ← repair_json / safe_parse (5-strategy LLM JSON fixer)
└── swedish/
    ├── __init__.py
    ├── personnummer.py      ← validate/mask/extract/format (Luhn, age, gender)
    ├── org_number.py        ← validate/classify org numbers + public record URLs
    ├── validators.py        ← validate_phone_se, validate_personnummer, sanitize_text
    ├── anonymizer.py        ← Swedish PII anonymizer (role-aware, 6 levels)
    └── banking.py           ← safe_str/float/int, parse_swedish_amount, normalize_vendor
```

---

© 2025 Trollfabriken AITrix AB — MIT License
