Metadata-Version: 2.4
Name: SqueakyCleanText
Version: 0.6.1
Summary: Text preprocessing & PII anonymization pipeline for NLP/ML: ONNX NER ensemble, language detection, stopword removal, and configurable token replacement.
Author: Rehan Fazal
License: MIT
Project-URL: Homepage, https://github.com/rhnfzl/SqueakyCleanText
Project-URL: Repository, https://github.com/rhnfzl/SqueakyCleanText
Project-URL: Issues, https://github.com/rhnfzl/SqueakyCleanText/issues
Project-URL: Changelog, https://github.com/rhnfzl/SqueakyCleanText/releases
Project-URL: llms.txt, https://github.com/rhnfzl/SqueakyCleanText/blob/main/llms.txt
Keywords: text cleaning,text preprocessing,NLP,natural language processing,named entity recognition,NER,anonymization,PII removal,data cleaning,machine learning,ONNX,language detection
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lingua-language-detector>=2.1.0
Requires-Dist: stop-words>=2025.11.4
Requires-Dist: emoji>=2.14
Requires-Dist: ftfy>=6.3
Requires-Dist: Unidecode>=1.4
Requires-Dist: beautifulsoup4>=4.14
Requires-Dist: onnxruntime>=1.24.2
Requires-Dist: tokenizers>=0.22.2
Requires-Dist: huggingface-hub>=1.5.0
Requires-Dist: numpy>=2.2.0
Requires-Dist: presidio_anonymizer>=2.2.360
Requires-Dist: regex>=2024.11.6
Provides-Extra: gpu
Requires-Dist: onnxruntime-gpu>=1.24.2; extra == "gpu"
Provides-Extra: fuzzy
Requires-Dist: rapidfuzz>=3.12; extra == "fuzzy"
Provides-Extra: torch
Requires-Dist: torch>=2.6.0; extra == "torch"
Requires-Dist: transformers>=4.48; extra == "torch"
Provides-Extra: gliner
Requires-Dist: gliner<1.0,>=0.2.25; extra == "gliner"
Provides-Extra: gliner2
Requires-Dist: gliner2<2.0,>=1.0; extra == "gliner2"
Provides-Extra: synthetic
Requires-Dist: faker>=33.0; extra == "synthetic"
Provides-Extra: presidio
Requires-Dist: presidio-analyzer>=2.2.360; extra == "presidio"
Provides-Extra: classify
Requires-Dist: gliclass>=0.1.16; extra == "classify"
Provides-Extra: classify-onnx
Requires-Dist: onnxruntime>=1.24.2; extra == "classify-onnx"
Provides-Extra: all-ner
Requires-Dist: torch>=2.6.0; extra == "all-ner"
Requires-Dist: transformers>=4.48; extra == "all-ner"
Requires-Dist: gliner<1.0,>=0.2.25; extra == "all-ner"
Requires-Dist: gliner2<2.0,>=1.0; extra == "all-ner"
Provides-Extra: dev
Requires-Dist: hypothesis>=6.130; extra == "dev"
Requires-Dist: faker>=33.0; extra == "dev"
Requires-Dist: ruff>=0.11; extra == "dev"
Requires-Dist: pytest>=8.3; extra == "dev"
Requires-Dist: pytest-timeout>=2.3; extra == "dev"
Requires-Dist: rapidfuzz>=3.12; extra == "dev"
Provides-Extra: test
Requires-Dist: coverage>=7.8; extra == "test"
Requires-Dist: pytest-cov>=6.0; extra == "test"
Dynamic: license-file

<div align="center">

# SqueakyCleanText

[![PyPI](https://img.shields.io/pypi/v/squeakycleantext.svg)](https://pypi.org/project/squeakycleantext/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/squeakycleantext)](https://pypistats.org/packages/squeakycleantext)
[![Python package](https://github.com/rhnfzl/SqueakyCleanText/actions/workflows/python-package.yml/badge.svg)](https://github.com/rhnfzl/SqueakyCleanText/actions/workflows/python-package.yml)
[![Python Versions](https://img.shields.io/badge/Python-3.11%20|%203.12%20|%203.13-blue)](https://pypi.org/project/squeakycleantext/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.
</div>

> **Using an AI coding assistant?** This repo includes an [`llms.txt`](https://github.com/rhnfzl/SqueakyCleanText/blob/main/llms.txt) with the full API surface, config reference, and Q&A - optimised for Claude, Cursor, Copilot, and ChatGPT.

In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.

SqueakyCleanText handles the common issues automatically - removing PII, anonymizing named entities (persons, organisations, locations), and leaving your data clean and well-structured for language models and classical ML pipelines - with minimal effort on your part.

### Key Features

- **Named Entity Recognition (NER)**:
  - Multi-backend: ONNX (default, torch-free), PyTorch, GLiNER, and ensemble modes
  - Zero-shot custom entities via GLiNER (e.g., PRODUCT, EVENT, SKILL)
  - Multi-language support (English, Dutch, German, Spanish, French, Portuguese, Italian)
  - Ensemble voting across backends for improved accuracy
  - Configurable confidence thresholds
  - Lazy model loading (models load on demand per language)
  - Shared ONNX sessions across same-model languages (~600 MB RAM saved)
  - Automatic text chunking for long documents (CJK/Arabic safe)
  - GPU acceleration support (CUDA for ONNX and PyTorch)
  - Model warm-up API to pre-load on startup
- **Text Normalization**:
  - Corrects text encoding problems and handles bad Unicode characters
  - Removes or replaces HTML tags and URLs with configurable tokens
  - Handles emails, phone numbers, and other contact details
  - Multilingual date detection and replacement (ISO 8601, month names, common formats)
  - Fuzzy date matching for misspelled months (requires `[fuzzy]` extra)
  - Year and number standardization
  - Configurable emoji removal
  - Configurable bracket/brace content removal
  - Removes isolated letters and symbols
  - Normalizes whitespace and handles currency symbols
  - Smart case folding (preserves NER tokens like `<PERSON>`)
- **Language Support**:
  - Automatic language detection (English, Dutch, German, Spanish)
  - Language-specific NER models; French, Portuguese, Italian via multilingual model
  - Language-aware stopword removal
  - Extensible: add custom languages with stopwords, month names, and NER models
- **Dual Output Formats**:
  - Language Model format (preserves structure with tokens)
  - Statistical Model format (optimized for classical ML)
- **Performance**:
  - ONNX Runtime inference (torch-free base install, ~3-5x faster than PyTorch)
  - Thread-parallel batch processing via `ThreadPoolExecutor`
  - Async batch processing (`aprocess_batch`) for FastAPI / aiohttp
  - Lazy model loading (only loads models as needed)
  - Shared ONNX sessions for same-model languages (saves ~600 MB for FR/PT/IT)
  - Memory-efficient processing of large texts
  - GPU acceleration (CUDA) for both ONNX and PyTorch backends

![Default Flow of cleaning Text](resources/sct_flow.png)

### Benefits

#### For Language Models
- Maintains text structure while anonymizing sensitive information
- Configurable token replacements
- Preserves context while removing noise
- Handles long documents through intelligent chunking

#### For Statistical Models
- Removes stopwords and punctuation
- Case normalization
- Special symbol removal
- Optimized for classification tasks

#### Advanced NER Processing
- Ensemble approach reduces missed entities
- Language-specific models improve accuracy
- Confidence thresholds for precision control
- Efficient batch processing for large datasets
- Automatic handling of long documents

## Installation

```sh
pip install SqueakyCleanText
```

The base install uses **ONNX Runtime** for NER inference - no PyTorch or Transformers required.

### Optional Extras

| Extra | Command | What it adds |
|-------|---------|--------------|
| GPU | `pip install SqueakyCleanText[gpu]` | CUDA-accelerated ONNX inference |
| Fuzzy dates | `pip install SqueakyCleanText[fuzzy]` | Fuzzy month name matching ([rapidfuzz](https://github.com/rapidfuzz/RapidFuzz)) |
| PyTorch NER | `pip install SqueakyCleanText[torch]` | PyTorch/Transformers NER backend |
| GLiNER | `pip install SqueakyCleanText[gliner]` | [GLiNER](https://github.com/urchade/GLiNER) zero-shot NER |
| GLiNER2 | `pip install SqueakyCleanText[gliner2]` | [GLiNER2](https://github.com/Knowledgator/GLiNER) (knowledgator) backend |
| Synthetic | `pip install SqueakyCleanText[synthetic]` | Faker-based synthetic replacement (realistic fake values instead of `<TAG>` tokens) |
| Presidio | `pip install SqueakyCleanText[presidio]` | Presidio-analyzer for `presidio_gliner` backend |
| Classify | `pip install SqueakyCleanText[classify]` | GLiClass document-level pre-classification |
| All NER | `pip install SqueakyCleanText[all-ner]` | All NER backends combined |
| Development | `pip install SqueakyCleanText[dev]` | Testing and linting tools |

You can combine extras: `pip install SqueakyCleanText[gpu,fuzzy,gliner]`

## Usage

### Basic Usage

```python
from sct import TextCleaner

# Initialize the TextCleaner
cleaner = TextCleaner()

# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."

# Process the text
lm_text, stat_text, lang = cleaner.process(text)

print(f"Language Model format:    {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."

print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"

print(f"Detected Language: {lang}")
# Output: "ENGLISH"
```

### Using TextCleanerConfig

```python
from sct import TextCleaner, TextCleanerConfig

# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="en",  # Pin to English (also accepts 'ENGLISH', 'eng')
)

# Initialize with config
cleaner = TextCleaner(cfg=cfg)
```

#### Language Specification

All language parameters accept Lingua names (`'ENGLISH'`), ISO 639-1 (`'en'`), or ISO 639-3 (`'eng'`) codes:

```python
# Pin to one language (skip auto-detection)
cfg = TextCleanerConfig(language='de', check_ner_process=False)

# Restrict detection to specific languages (auto-detect among them)
cfg = TextCleanerConfig(language=('en', 'nl', 'de'), check_ner_process=False)

# Add extra languages for detection
cfg = TextCleanerConfig(extra_languages=('fr', 'pt'), check_ner_process=False)
```

### GLiNER: Zero-Shot Custom NER

Use [GLiNER](https://github.com/urchade/GLiNER) to recognize any entity type without retraining:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location', 'product', 'event'),
    gliner_label_map={
        'person': 'PER', 'organization': 'ORG', 'location': 'LOC',
        # 'product' and 'event' are unmapped - they become <PRODUCT>, <EVENT> tokens
    },
    gliner_threshold=0.4,
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "John bought an iPhone at the Apple Store in Berlin during CES 2025."
)
# lm_text: "<PERSON> bought an <PRODUCT> at the <ORGANISATION> in <LOCATION> during <EVENT>."
```

### Ensemble NER

Combine ONNX/Torch models with GLiNER for improved recall via ensemble voting:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='ensemble_onnx',  # or 'ensemble_torch'
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location'),
    gliner_label_map={'person': 'PER', 'organization': 'ORG', 'location': 'LOC'},
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process("Angela Merkel visited the Bundestag in Berlin.")
```

### PII Detection Mode

Automatically configure GLiNER for comprehensive PII detection with 60+ entity types (personal, financial, healthcare, identity, digital):

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(ner_mode='pii')

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "John Smith's SSN is 123-45-6789, email john@example.com, DOB 1990-01-15"
)
# Entities are anonymized: names, SSNs, emails, dates of birth, and 50+ more PII types
```

PII mode auto-configures `ner_backend='gliner'`, selects [`knowledgator/gliner-pii-base-v1.0`](https://huggingface.co/knowledgator/gliner-pii-base-v1.0) as the model, lowers the threshold to 0.3 (recall-focused), and expands the positional tags. User-provided values always take priority.
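The precedence rule can be pictured as a simple merge where the PII defaults fill only the fields you left unset. The sketch below is illustrative - the dictionary keys mirror config field names, but the merge function is hypothetical, not sct internals:

```python
# Illustrative sketch of PII-mode precedence: defaults apply only where the
# user did not provide a value. resolve() is hypothetical, not sct's code.
PII_DEFAULTS = {
    "ner_backend": "gliner",
    "gliner_model": "knowledgator/gliner-pii-base-v1.0",
    "gliner_threshold": 0.3,
}

def resolve(user_settings: dict) -> dict:
    """User-provided values always win; defaults fill the gaps."""
    return {**PII_DEFAULTS, **user_settings}

# Keep PII mode but swap in an alternative PII model:
effective = resolve({"gliner_model": "nvidia/gliner-PII"})
# effective["gliner_model"] == "nvidia/gliner-PII"
# effective["ner_backend"] == "gliner"  (default retained)
```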

**Alternative PII models** (pass as `gliner_model`):

| Model | Type | Size | Labels | F1 |
|-------|------|------|--------|-----|
| [`knowledgator/gliner-pii-base-v1.0`](https://huggingface.co/knowledgator/gliner-pii-base-v1.0) | Uni-encoder | 330MB (ONNX FP16) | 60+ | 80.99% |
| [`nvidia/gliner-PII`](https://huggingface.co/nvidia/gliner-PII) | Bi-encoder | 570MB | 55+ | — |
| [`gretelai/gretel-gliner-bi-base-v1.0`](https://huggingface.co/gretelai/gretel-gliner-bi-base-v1.0) | Bi-encoder | ~800MB | 40+ | 95% |
| [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) | Multilingual | — | — | — |

### Synthetic Replacement

Replace detected entities with realistic fake values (via [Faker](https://faker.readthedocs.io/)) instead of `<TAG>` placeholder tokens:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_mode='pii',
    replacement_mode='synthetic',  # pip install squeakycleantext[synthetic]
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "Contact John Smith at john.smith@company.com or +1-555-0123"
)
# Output: "Contact Jennifer Williams at lisa45@example.net or +1-555-0198"
# Same entity always maps to same fake value within a document
```

> **Note**: Synthetic replacement preserves data utility for downstream ML tasks but is NOT GDPR-compliant anonymization. Same-document consistency is maintained (same entity text always maps to the same fake value).
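The same-document consistency guarantee can be sketched with a small cache: the first time an entity string is seen it is assigned a fake value, and every repeat reuses the cached one. This is an illustration of the behaviour, not sct's implementation - the real pipeline draws values from Faker, while this stand-in uses a canned list:

```python
# Sketch of per-document consistency: first sighting assigns a fake value,
# repeats reuse it. The canned fake_values list stands in for Faker here.
def make_replacer(fake_values: list[str]):
    cache: dict[str, str] = {}

    def replace(entity_text: str) -> str:
        if entity_text not in cache:
            cache[entity_text] = fake_values[len(cache) % len(fake_values)]
        return cache[entity_text]

    return replace

replace = make_replacer(["Jennifer Williams", "Marcus Lee"])
a = replace("John Smith")   # 'Jennifer Williams'
b = replace("Jane Doe")     # 'Marcus Lee'
c = replace("John Smith")   # 'Jennifer Williams' again: same entity, same value
```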

### Reversible Anonymization

Replace entities with indexed placeholders (`<PERSON_0>`, `<LOCATION_1>`) and get a mapping for round-trip deanonymization:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_mode='pii',
    replacement_mode='reversible',
)

cleaner = TextCleaner(cfg=cfg)
result = cleaner.process("John Smith works at Google in London.")

print(result.lm_text)
# "<PERSON_0> works at <ORGANISATION_0> in <LOCATION_0>."

# Access the anonymization map via metadata
anon_map = result.metadata['anon_map']
restored = anon_map.deanonymize(result.lm_text)
# "John Smith works at Google in London."

# Serialize the map for storage
import json
json.dumps(anon_map.to_dict())
```

> **Note**: `ProcessResult` from `process()` unpacks as a 3-tuple (`lm_text, stat_text, language`) for backward compatibility, but also exposes `.metadata` for reversible maps and document classification.
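The dual behaviour - unpacking as a 3-tuple while still carrying `.metadata` - can be sketched with a `tuple` subclass. This is a minimal illustration of the pattern, not the actual `ProcessResult` class:

```python
# Sketch of a result type that unpacks as a 3-tuple but also exposes
# .metadata via attribute access. Not sct's actual ProcessResult.
from typing import Any

class Result(tuple):
    metadata: dict[str, Any]

    def __new__(cls, lm_text, stat_text, language, metadata=None):
        self = super().__new__(cls, (lm_text, stat_text, language))
        self.metadata = metadata or {}
        return self

r = Result("<PERSON_0> works.", "works", "ENGLISH",
           {"anon_map": {"<PERSON_0>": "John Smith"}})
lm_text, stat_text, lang = r    # 3-tuple unpacking still works
extra = r.metadata["anon_map"]  # extra payload via attribute access
```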

### Document Classification (GLiClass)

Classify documents before processing using zero-shot classification with [GLiClass](https://github.com/Knowledgator/GLiClass):

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    check_classify_document=True,
    gliclass_labels=('email', 'code', 'legal', 'medical'),
    # gliclass_model defaults to 'knowledgator/gliclass-edge-v3.0' (32.7M params)
)

cleaner = TextCleaner(cfg=cfg)  # pip install squeakycleantext[classify]
result = cleaner.process("Dear Sir, please find attached the contract...")

# Classification results in metadata
print(result.metadata['classes'])
# [{"label": "email", "score": 0.92}, {"label": "legal", "score": 0.78}]
```

### Bi-Encoder GLiNER Models

Bi-encoder models (ModernBERT, etc.) are auto-detected and leverage pre-computed label embeddings for faster inference with larger context windows:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='knowledgator/gliner-bi-base-v2.0',
    gliner_labels=('person', 'organization', 'location'),
)

cleaner = TextCleaner(cfg=cfg)
# Auto-detects bi-encoder → caches label embeddings → uses 2048+ token context window
```

### Entity Description Labels (ZERONER-Style)

Provide natural-language descriptions for labels to improve zero-shot recognition accuracy:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='knowledgator/gliner-bi-base-v2.0',
    gliner_label_descriptions={
        'person': "a person's full legal name",
        'location': "a geographical place or address",
        'organization': "a company, institution, or government body",
    },
)

cleaner = TextCleaner(cfg=cfg)
# Descriptions are used for inference, results are mapped back to original label names
```

### Batch Processing

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    check_remove_stopwords=True,
    check_remove_punctuation=True,
    check_ner_process=True,
    positional_tags=('PER', 'ORG', 'LOC'),
    ner_confidence_threshold=0.90,
)

cleaner = TextCleaner(cfg=cfg)

# Sample texts
texts = [
    "Email maria.garcia@example.es for more info.",  # Spanish
    "Besuchen Sie uns im Büro in Berlin.",           # German
    "Voor vragen, bel +31 20 123 4567.",             # Dutch
]

# Process texts in batch (uses ThreadPoolExecutor for parallel processing)
results = cleaner.process_batch(texts, batch_size=2)

for lm_text, stat_text, lang in results:
    print(f"Language: {lang}")
    print(f"LM Format:    {lm_text}")
    print(f"Stat Format:  {stat_text}")
    print("-" * 40)
```

<details>
<summary>Legacy Configuration (backward compatible)</summary>

```python
from sct import sct, config

# Customize settings via module-level variables
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.LANGUAGE = "ENGLISH"

# Initialize (reads from module-level config)
cleaner = sct.TextCleaner()
```

> **Note**: The legacy module-level configuration is not thread-safe. For concurrent processing, use `TextCleanerConfig` instead.

</details>

## NER Backends

SqueakyCleanText supports six NER backends, selectable via the `ner_backend` config field:

| Backend | Description | Dependencies | Best for |
|---------|-------------|-------------|----------|
| `onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production: fast, torch-free |
| `torch` | PyTorch/Transformers pipeline with full XLM-RoBERTa models | `[torch]` extra | Compatibility with existing PyTorch workflows |
| `gliner` | GLiNER zero-shot NER with custom entity labels | `[gliner]` or `[gliner2]` extra | Custom entity types, PII detection, bi-encoder models |
| `ensemble_onnx` | ONNX + GLiNER ensemble voting | `[gliner]` extra | Maximum recall with custom entities |
| `ensemble_torch` | Torch + GLiNER ensemble voting | `[torch,gliner]` extra | Maximum recall with PyTorch |
| `presidio_gliner` | Presidio + GLiNER recognizer (beta) | `presidio-analyzer`, `[gliner]` | Context-aware NER via Presidio's pipeline |

### Default NER Models (ONNX)

| Language | Model |
|----------|-------|
| English | [`rhnfzl/xlm-roberta-large-conll03-english-onnx`](https://huggingface.co/rhnfzl/xlm-roberta-large-conll03-english-onnx) |
| Dutch | [`rhnfzl/xlm-roberta-large-conll02-dutch-onnx`](https://huggingface.co/rhnfzl/xlm-roberta-large-conll02-dutch-onnx) |
| German | [`rhnfzl/xlm-roberta-large-conll03-german-onnx`](https://huggingface.co/rhnfzl/xlm-roberta-large-conll03-german-onnx) |
| Spanish | [`rhnfzl/xlm-roberta-large-conll02-spanish-onnx`](https://huggingface.co/rhnfzl/xlm-roberta-large-conll02-spanish-onnx) |
| French / Portuguese / Italian | [`rhnfzl/wikineural-multilingual-ner-onnx`](https://huggingface.co/rhnfzl/wikineural-multilingual-ner-onnx) (shared session) |
| Multilingual (fallback) | [`rhnfzl/wikineural-multilingual-ner-onnx`](https://huggingface.co/rhnfzl/wikineural-multilingual-ner-onnx) |

### GLiNER Model Recommendations

| Model | Architecture | Context | Languages | Best for |
|-------|-------------|---------|-----------|----------|
| `knowledgator/gliner-bi-base-v2.0` | Bi-encoder (ModernBERT) | 2048 | Multi | General NER, long documents |
| `knowledgator/gliner-pii-base-v1.0` | Bi-encoder | 2048 | Multi | PII detection (60+ entity types) |
| `urchade/gliner_large-v2.1` | Uni-encoder (DeBERTa) | 512 | Multi | Legacy, high accuracy on short texts |
| `MatteoFasulo/ModernBERT-base-NER` | ModernBERT | 8192 | English | English-only, very long context |

> **GLiNER2 note**: `pip install squeakycleantext[gliner2]` installs [Knowledgator's gliner2 package](https://github.com/Knowledgator/GLiNER), not Fastino AI's GLiNER2 from EMNLP 2025 (different API).

### GLiNER Label Mapping

GLiNER uses lowercase free-text labels (e.g., `'person'`, `'product'`). To map them to standard NER tags used by the anonymizer, use `gliner_label_map`:

```python
gliner_label_map={
    'person': 'PER',          # → <PERSON>
    'organization': 'ORG',    # → <ORGANISATION>
    'location': 'LOC',        # → <LOCATION>
}
# Unmapped labels are uppercased automatically:
# 'product' → <PRODUCT>, 'event' → <EVENT>, 'skill' → <SKILL>
```
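The mapping rule can be written out in plain Python. This is an illustrative sketch of the rule described above (mapped labels go through their NER tag, unmapped labels are uppercased verbatim), not sct's internals; the `tag_names` dict is inferred from the token comments in the snippet above:

```python
# Sketch of the label-to-token rule: mapped labels use their NER tag,
# unmapped labels are uppercased as-is. tag_names is illustrative.
label_map = {"person": "PER", "organization": "ORG", "location": "LOC"}
tag_names = {"PER": "PERSON", "ORG": "ORGANISATION", "LOC": "LOCATION"}

def to_token(label: str) -> str:
    tag = label_map.get(label, label.upper())
    return f"<{tag_names.get(tag, tag)}>"

to_token("person")   # '<PERSON>'
to_token("product")  # '<PRODUCT>'  (unmapped -> uppercased)
```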

## API

### `TextCleaner`

#### `process(text: str) -> Tuple[str, Optional[str], Optional[str]]`

Processes the input text and returns a tuple containing:
  - Cleaned text formatted for language models.
  - Cleaned text formatted for statistical models (`None` if `check_statistical_model_processing` is `False`).
  - Detected language of the text (`None` if language detection is disabled).

#### `process_batch(texts: List[str], batch_size: Optional[int] = None) -> List[Tuple[str, Optional[str], Optional[str]]]`

Processes multiple texts using thread-parallel execution. Each result follows the same format as `process()`.

#### `aprocess_batch(texts: List[str], batch_size: Optional[int] = None) -> List[Tuple[str, Optional[str], Optional[str]]]`

Async version of `process_batch` for use with asyncio-based frameworks (FastAPI, aiohttp). Runs the batch in a thread-pool executor so it does not block the event loop:

```python
from sct import TextCleaner

cleaner = TextCleaner()

# In an async context (FastAPI route, aiohttp handler, etc.)
results = await cleaner.aprocess_batch(texts)
```

#### `warmup(languages: Optional[List[str]] = None) -> None`

Pre-loads NER models to avoid first-request latency. Call once during application startup:

```python
cleaner = TextCleaner()
cleaner.warmup(['ENGLISH', 'DUTCH'])  # or warmup() for all supported languages
```

### `TextCleanerConfig`

Immutable (frozen) dataclass. Create modified copies with `dataclasses.replace()`:

```python
import dataclasses
new_cfg = dataclasses.replace(cfg, check_ner_process=False)
```

<details>
<summary>Full configuration reference</summary>

**Pipeline toggles** (all `bool`, default shown):

| Field | Default | Description |
|-------|---------|-------------|
| `check_detect_language` | `True` | Auto-detect language |
| `check_fix_bad_unicode` | `True` | Fix encoding issues via ftfy |
| `check_to_ascii_unicode` | `True` | Transliterate to ASCII |
| `check_replace_html` | `True` | Strip/replace HTML tags |
| `check_replace_urls` | `True` | Replace URLs with token |
| `check_replace_emails` | `True` | Replace emails with token |
| `check_replace_years` | `True` | Replace years (1900-2099) |
| `check_replace_dates` | `False` | Replace full dates (ISO 8601, month names) |
| `check_fuzzy_replace_dates` | `False` | Fuzzy match misspelled months (requires `[fuzzy]`) |
| `check_replace_phone_numbers` | `True` | Replace phone numbers |
| `check_replace_numbers` | `True` | Replace standalone numbers |
| `check_replace_currency_symbols` | `True` | Replace currency symbols |
| `check_ner_process` | `True` | Run NER entity recognition |
| `check_remove_isolated_letters` | `True` | Remove single letters |
| `check_remove_isolated_special_symbols` | `True` | Remove isolated symbols |
| `check_remove_bracket_content` | `True` | Remove `[...]` content |
| `check_remove_brace_content` | `True` | Remove `{...}` content |
| `check_normalize_whitespace` | `True` | Normalize whitespace |
| `check_statistical_model_processing` | `True` | Generate stat model output |
| `check_casefold` | `True` | Lowercase stat output |
| `check_smart_casefold` | `False` | Lowercase but preserve NER tokens |
| `check_remove_stopwords` | `True` | Remove stopwords from stat output |
| `check_remove_punctuation` | `True` | Remove punctuation from stat output |
| `check_remove_stext_custom_stop_words` | `True` | Remove custom stop words from stat output |
| `check_remove_emoji` | `False` | Remove emoji characters |

**Replacement tokens** (all `str`):

| Field | Default |
|-------|---------|
| `replace_with_url` | `"<URL>"` |
| `replace_with_html` | `"<HTML>"` |
| `replace_with_email` | `"<EMAIL>"` |
| `replace_with_years` | `"<YEAR>"` |
| `replace_with_dates` | `"<DATE>"` |
| `replace_with_phone_numbers` | `"<PHONE>"` |
| `replace_with_numbers` | `"<NUMBER>"` |
| `replace_with_currency_symbols` | `None` |

**NER settings**:

| Field | Default | Description |
|-------|---------|-------------|
| `ner_backend` | `'onnx'` | Backend: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`, `presidio_gliner` |
| `ner_mode` | `'standard'` | `'standard'` or `'pii'` (auto-configures GLiNER for PII detection) |
| `replacement_mode` | `'placeholder'` | `'placeholder'`, `'synthetic'` (Faker), or `'reversible'` (indexed placeholders + deanonymize map) |
| `positional_tags` | `('PER', 'LOC', 'ORG', 'MISC')` | Entity types to recognize |
| `ner_confidence_threshold` | `0.85` | Minimum confidence score |
| `ner_batch_size` | `8` | Inference batch size (must be >= 1) |
| `ner_models` | `None` | Language-keyed dict of ONNX model repo IDs |
| `torch_ner_models` | `None` | Language-keyed dict of PyTorch model repo IDs |
| `gliner_model` | `None` | GLiNER model ID (required for gliner/ensemble backends) |
| `gliner_variant` | `'gliner'` | `'gliner'` or `'gliner2'` |
| `gliner_labels` | `('person', 'organization', 'location')` | GLiNER entity labels |
| `gliner_label_map` | `None` | Maps GLiNER labels to NER tags |
| `gliner_threshold` | `0.4` | GLiNER confidence threshold |
| `gliner_label_descriptions` | `None` | ZERONER-style: `{label: "description"}` for improved zero-shot accuracy |
| `fuzzy_date_score_cutoff` | `85` | Fuzzy matching threshold (0-100) for misspelled months |
| `custom_pipeline_steps` | `()` | Tuple of `(text: str) -> str` callables appended after all built-in steps |

**Language settings**:

| Field | Default | Description |
|-------|---------|-------------|
| `language` | `None` | Pin language (`'en'`), restrict detection to a set (`('en','nl')`), or `None` for auto-detect. Accepts Lingua names, ISO 639-1, ISO 639-3 codes. |
| `extra_languages` | `()` | Additional language names/codes for detection |
| `custom_stopwords` | `None` | `{LANG: frozenset({...})}` custom stopword sets |
| `custom_month_names` | `None` | `{LANG: ('Jan', 'Feb', ...)}` for date detection |

</details>
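A custom pipeline step is any `(text: str) -> str` callable; steps run in order after all built-in steps. The functions below are standalone illustrations (the ticket-ID pattern is hypothetical) - in sct they would be passed as `TextCleanerConfig(custom_pipeline_steps=(collapse_dashes, redact_tickets))`:

```python
# Custom steps are plain (text: str) -> str callables, applied in order.
# Shown standalone here; pass them via custom_pipeline_steps in sct.
import re

def collapse_dashes(text: str) -> str:
    return re.sub(r"-{2,}", "-", text)

def redact_tickets(text: str) -> str:
    # Hypothetical ticket-ID pattern, for illustration only.
    return re.sub(r"\bTICKET-\d+\b", "<TICKET>", text)

text = "See TICKET-42 --- details attached."
for step in (collapse_dashes, redact_tickets):
    text = step(text)
# text == "See <TICKET> - details attached."
```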

## Architecture

SqueakyCleanText processes text through a configurable pipeline of sequential steps:

```
Input Text
  │
  ├─ Fix Unicode (ftfy)
  ├─ ASCII transliteration (unidecode)
  ├─ Emoji removal
  ├─ HTML replacement
  ├─ URL / Email / Phone replacement
  ├─ Date & Year replacement
  ├─ Number & Currency replacement
  ├─ Isolated letter/symbol removal
  ├─ Whitespace normalization
  │
  ├─ NER Processing (ONNX / Torch / GLiNER / Ensemble)
  │   ├─ Language detection (Lingua)
  │   ├─ Text chunking (token-bounded)
  │   ├─ Entity recognition (per-chunk)
  │   ├─ Ensemble voting (cross-model)
  │   └─ Entity anonymization (Presidio)
  │
  └─ Statistical Model Output
      ├─ Case folding
      ├─ Stopword removal
      └─ Punctuation removal

  ▼
(lm_text, stat_text, language)
```

Each step is toggled by a `TextCleanerConfig` field. The pipeline is built once at initialization; disabled steps are skipped entirely (zero overhead).
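The build-once pattern can be sketched as follows - enabled steps are collected into a list at construction, so a disabled step never appears in the hot path. Function and flag handling here are illustrative, not sct's actual implementation:

```python
# Illustrative build-once pipeline: steps are gathered from config flags at
# construction time, so disabled steps cost nothing per call.
from typing import Callable

def build_pipeline(cfg: dict) -> list[Callable[[str], str]]:
    steps: list[Callable[[str], str]] = []
    if cfg.get("check_normalize_whitespace", True):
        steps.append(lambda t: " ".join(t.split()))
    if cfg.get("check_casefold", True):
        steps.append(str.lower)
    return steps

def run(steps: list[Callable[[str], str]], text: str) -> str:
    for step in steps:
        text = step(text)
    return text

pipeline = build_pipeline({"check_casefold": False})
run(pipeline, "  Hello   World  ")  # 'Hello World' (casefold skipped)
```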

## What's New

**v0.6.0**
- **PII detection mode** (`ner_mode='pii'`): auto-configures GLiNER with 60+ PII entity labels (personal, financial, healthcare, identity, digital)
- **Synthetic replacement** (`replacement_mode='synthetic'`): Faker-generated realistic values instead of `<TAG>` placeholders, with per-document consistency
- **Reversible anonymization** (`replacement_mode='reversible'`): indexed placeholders (`<PERSON_0>`) with `AnonymizationMap` for round-trip deanonymization
- **Document classification** (`check_classify_document=True`): zero-shot GLiClass pre-classification before text processing
- **ProcessResult**: `process()` returns `ProcessResult` (backward-compatible 3-tuple) with `.metadata` for anonymization maps and classification results
- **GLiNER ONNX mode** (`gliner_onnx=True`): load GLiNER with pre-built ONNX weights from HuggingFace Hub (auto-set for PII + ONNX backend)
- **Bi-encoder support**: auto-detects ModernBERT and other bi-encoder GLiNER models, caches label embeddings, dynamic context windows (2048-8192 tokens)
- **Entity description labels**: ZERONER-style natural-language descriptions for improved zero-shot accuracy
- **Presidio GLiNER backend** (beta): opt-in `ner_backend='presidio_gliner'` for Presidio's context-aware recognition pipeline
- **ModernBERT ONNX export**: updated export script with ModernBERT support (English, 8192 token context)
- **Dynamic chunk sizing**: GLiNER chunk size adapts to model's actual context window instead of hardcoded 384

**v0.5.x**
- `aprocess_batch()`: async batch processing for FastAPI / aiohttp integrations
- `warmup(languages)`: pre-load NER models at startup to eliminate first-request latency
- `custom_pipeline_steps`: attach arbitrary `(text: str) -> str` callables after the built-in pipeline
- French, Portuguese, and Italian NER support via a shared multilingual ONNX session
- Improved NER sentence boundary detection with abbreviation guard

**v0.4.5**
- Frozen `TextCleanerConfig` dataclass: immutable, thread-safe, per-instance configuration
- ONNX-first NER inference: torch-free base install (~400 MB models vs ~7 GB)
- Thread-parallel batch processing via `ThreadPoolExecutor`
- Five NER backends: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`
- GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.)
- Ensemble voting across backends for improved recall
- Lazy per-language model loading
- Multilingual date detection and fuzzy date matching
- Configurable emoji removal, bracket/brace content removal, and smart case folding
- `stop-words` replaces NLTK (50 KB bundled vs 30 MB download)
- PyTorch and Transformers moved to optional extras
- Migrated to `pyproject.toml` (PEP 517), Python 3.11-3.13, ruff linter

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgements

This package took inspiration from the following repository:

- [clean-text](https://github.com/jfilter/clean-text)
