Metadata-Version: 2.4
Name: ansemo
Version: 0.1.0
Summary: Romanian PII detection and anonymization
License: MIT License
        
        Copyright (c) 2026 Sigmoid
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: presidio-analyzer>=2.2
Requires-Dist: spacy>=3.5
Requires-Dist: phonenumbers>=8.13
Requires-Dist: schwifty>=2024.1
Requires-Dist: gliner>=0.2
Requires-Dist: openai>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Dynamic: license-file

# ansemo - Usage Guide

Romanian PII detection and anonymization library. Detects and anonymizes personal data in Romanian and Moldovan documents — names, addresses, phone numbers, IBANs, organizations, codes, usernames, passwords, and more.

## Table of Contents

- [ansemo - Usage Guide](#ansemo---usage-guide)
  - [Table of Contents](#table-of-contents)
  - [System Requirements](#system-requirements)
  - [Installation](#installation)
  - [Quick Start](#quick-start)
  - [Python API](#python-api)
    - [Convenience Function](#convenience-function)
    - [Reusable Pipeline](#reusable-pipeline)
    - [Selective Entity Detection](#selective-entity-detection)
    - [De-anonymization](#de-anonymization)
    - [Custom Denylist](#custom-denylist)
    - [Adjusting Detection Sensitivity](#adjusting-detection-sensitivity)
    - [Low-Level Access](#low-level-access)
  - [SLM Fallback (AI Verification)](#slm-fallback-ai-verification)
    - [Setting Up Ollama](#setting-up-ollama)
      - [GPU acceleration](#gpu-acceleration)
    - [SLM Configuration](#slm-configuration)
    - [Using a Different LLM Server](#using-a-different-llm-server)
    - [Advanced: Per-Entity Thresholds](#advanced-per-entity-thresholds)
    - [Overriding Request Parameters](#overriding-request-parameters)
  - [CLI Usage](#cli-usage)
    - [CLI Flags](#cli-flags)
    - [CLI Output](#cli-output)
  - [Supported Entity Types](#supported-entity-types)
  - [Pipeline Architecture](#pipeline-architecture)
  - [Output Format](#output-format)
    - [Anonymized Text](#anonymized-text)
    - [Entity Mapping](#entity-mapping)
  - [Environment Variables](#environment-variables)
  - [Logging](#logging)
  - [Troubleshooting](#troubleshooting)
    - [GLiNER model download fails](#gliner-model-download-fails)
    - [SLM server connection error](#slm-server-connection-error)
    - [High memory usage](#high-memory-usage)
    - [Wrong wheel platform](#wrong-wheel-platform)
    - [Entity not detected](#entity-not-detected)

---

## System Requirements

| Requirement | Details |
|---|---|
| Python | 3.10 or higher |
| RAM | ~2-3 GB (for model loading) |
| Disk | ~1.5 GB (for downloaded models) |
| OS | Windows (x64), Linux (x86_64), macOS (ARM / Apple Silicon) |
| Network | Required on first run to download the GLiNER model (~1 GB) |

## Installation

```bash
pip install ansemo
python -m spacy download ro_core_news_sm
```

All Python dependencies (Presidio, spaCy, GLiNER, etc.) are installed automatically. The spaCy language model must be installed separately (see above).

Alternatively, install from a wheel file directly:

```bash
pip install ansemo-<version>-<platform>.whl
python -m spacy download ro_core_news_sm
```

On first use, the GLiNER NER model (~1 GB) is downloaded from HuggingFace. Set the `HF_HOME` environment variable to control where it is cached (see [Environment Variables](#environment-variables)).

## Quick Start

```python
from ansemo import anonymize

text, mapping = anonymize(
    "Dl. Ion Popescu (CNP 1850612345674) locuiește pe "
    "Str. Eminescu nr. 45, București. Tel: 0745 123 456."
)

print(text)
# Dl. [PERSON_NAME_1] (CNP [CODE_1]) locuiește pe
# [STREET_ADDRESS_1], București. Tel: [PHONE_NUMBER_1].

print(mapping)
# {"PERSON_NAME": {"Ion Popescu": "[PERSON_NAME_1]"}, ...}
```

> **Note:** The `anonymize()` convenience function runs **without SLM fallback**. For AI-verified anonymization, use `build_pipeline()` — see [Reusable Pipeline](#reusable-pipeline).

## Python API

### Convenience Function

```python
from ansemo import anonymize

text, mapping = anonymize("some text with PII")
```

- Creates a singleton pipeline on first call (loads models once, reuses them).
- **SLM fallback is disabled** — for SLM verification, use `build_pipeline()`.
- Thread-safe.

### Reusable Pipeline

For explicit control and SLM support, build your own pipeline:

```python
from ansemo import build_pipeline

pipeline = build_pipeline()
text, mapping = pipeline.anonymize("some text with PII")
text2, mapping2 = pipeline.anonymize("another document")
```

- Models are loaded once and reused across calls.
- **SLM fallback is enabled by default** — requires an Ollama server (see [SLM Fallback](#slm-fallback-ai-verification)).
- To disable SLM: `build_pipeline(slm_fallback=None)`

### Selective Entity Detection

Anonymize only specific entity types:

```python
from ansemo import build_pipeline

pipeline = build_pipeline(slm_fallback=None)
text, mapping = pipeline.anonymize(
    "Ion Popescu, ion@firma.ro, IBAN RO89BCRL0000000123456789",
    entities=["PERSON_NAME", "EMAIL_ADDRESS"],  # only these two
)
# IBAN is left untouched
```

See [Supported Entity Types](#supported-entity-types) for the full list.

### De-anonymization

Reverse the anonymization using the returned mapping:

```python
from ansemo import build_pipeline

pipeline = build_pipeline(slm_fallback=None)
anonymized, mapping = pipeline.anonymize("Dl. Ion Popescu, tel 0745 123 456")
original = pipeline.deanonymize(anonymized, mapping)
# "Dl. Ion Popescu, tel 0745 123 456"
```

A standalone function is also available:

```python
from ansemo import anonymize, deanonymize_text

anonymized, mapping = anonymize("Dl. Ion Popescu, tel 0745 123 456")
original = deanonymize_text(anonymized, mapping)
```

### Custom Denylist

Suppress false positives with domain-specific terms that should not be anonymized:

```python
from ansemo import build_pipeline

pipeline = build_pipeline(
    slm_fallback=None,
    extra_denylist={
        "PERSON_NAME": {"fidejusor", "cesionar", "mandatar"},
        "ORGANIZATION": {"filială", "agenție"},
    },
)

text, mapping = pipeline.anonymize("Fidejusor: Ion Popescu")
# "fidejusor" won't be detected as a person name
```

- Entries are **merged** with the built-in denylists (they add, never replace).
- Matching is case-insensitive and diacritic-insensitive.
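
To illustrate what case-insensitive and diacritic-insensitive matching means in practice, here is a minimal standalone sketch. It does not use ansemo's internals; `normalize_term` and `is_denylisted` are hypothetical helpers, and the normalization strategy (Unicode NFD decomposition plus removal of combining marks) is an assumption about one reasonable way to get this behaviour.

```python
import unicodedata


def normalize_term(term: str) -> str:
    """Casefold a term and strip combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFD", term.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical denylist, stored in normalized form.
denylist = {normalize_term(t) for t in {"filială", "agenție"}}


def is_denylisted(candidate: str) -> bool:
    """Return True if the candidate matches a denylist entry after normalization."""
    return normalize_term(candidate) in denylist


print(is_denylisted("FILIALA"))  # True
print(is_denylisted("Agentie"))  # True
print(is_denylisted("Popescu"))  # False
```

With this kind of normalization, a single denylist entry such as `"filială"` covers `Filiala`, `FILIALĂ`, and similar variants.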

### Adjusting Detection Sensitivity

Override the default confidence thresholds per entity type. Lower values detect more entities (but may increase false positives); higher values are stricter.

Default thresholds:

| Entity Type | Default Threshold |
|---|---|
| `PERSON_NAME` | 0.20 |
| `ORGANIZATION` | 0.20 |
| `STREET_ADDRESS` | 0.50 |
| `USERNAME` | 0.20 |
| `PASSWORD` | 0.45 |

```python
from ansemo import build_pipeline

# Override specific thresholds (others keep their defaults)
pipeline = build_pipeline(
    slm_fallback=None,
    entity_thresholds={"PERSON_NAME": 0.35, "STREET_ADDRESS": 0.40},
)
```

> **Note:** Thresholds only apply to GLiNER-detected entities (PERSON_NAME, ORGANIZATION, STREET_ADDRESS, USERNAME, PASSWORD). Structured-data entities (EMAIL_ADDRESS, PHONE_NUMBER, IBAN_CODE, etc.) use deterministic pattern matching and are not affected by thresholds.
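
The following standalone sketch shows how overrides could combine with the defaults from the table above. It is not ansemo's implementation: `effective_thresholds` and `keep_detection` are hypothetical helpers, and treating a score equal to the threshold as accepted is an assumption.

```python
# Defaults copied from the table above.
DEFAULT_THRESHOLDS = {
    "PERSON_NAME": 0.20,
    "ORGANIZATION": 0.20,
    "STREET_ADDRESS": 0.50,
    "USERNAME": 0.20,
    "PASSWORD": 0.45,
}


def effective_thresholds(overrides: dict[str, float]) -> dict[str, float]:
    """Overrides win; entity types that are not overridden keep their defaults."""
    return {**DEFAULT_THRESHOLDS, **overrides}


def keep_detection(entity_type: str, score: float, thresholds: dict[str, float]) -> bool:
    """Keep a detection if it has no threshold (pattern-based) or meets its threshold."""
    threshold = thresholds.get(entity_type)
    # Assumption: a score exactly equal to the threshold is accepted.
    return threshold is None or score >= threshold


thresholds = effective_thresholds({"PERSON_NAME": 0.35, "STREET_ADDRESS": 0.40})
print(thresholds["ORGANIZATION"])                          # 0.2 (default kept)
print(keep_detection("PERSON_NAME", 0.30, thresholds))     # False (below 0.35)
print(keep_detection("STREET_ADDRESS", 0.45, thresholds))  # True (above 0.40)
print(keep_detection("EMAIL_ADDRESS", 0.10, thresholds))   # True (no threshold applies)
```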

### Low-Level Access

For full control over the detection and anonymization steps:

```python
from ansemo import build_analyzer, ENTITIES, GLINER_ENTITY_MAPPING, GLINER_ENTITY_THRESHOLDS
from ansemo.processing import filter_entity_denylist, resolve_overlaps, anonymize_text, postprocess_anonymized

# Sample input; replace with your own document text
text = "Dl. Ion Popescu, tel 0745 123 456"

analyzer = build_analyzer(
    gliner_entity_mapping=GLINER_ENTITY_MAPPING,
    entity_thresholds=GLINER_ENTITY_THRESHOLDS,
)

results = analyzer.analyze(text=text, language="ro", entities=ENTITIES, score_threshold=0.2)
results = filter_entity_denylist(text, results)
results = resolve_overlaps(results, text)
anonymized, mapping = anonymize_text(text, results)
anonymized = postprocess_anonymized(anonymized, mapping)
```

---

## SLM Fallback (AI Verification)

The SLM (Small Language Model) fallback routes low-confidence detections through a local LLM for context-aware verification. This improves accuracy by filtering out false positives that pattern-based detection alone cannot resolve.

- **Enabled by default** when using `build_pipeline()`.
- **Disabled** when using the `anonymize()` convenience function.
- Applies by default to: `CODE`, `USERNAME`, `PASSWORD`, `ORGANIZATION`.

### Setting Up Ollama

1. Install Ollama from https://ollama.com/download
2. Pull the required model:
   ```bash
   ollama pull qwen3.5:9b
   ```
3. Ollama runs as a background service automatically — no manual start needed.

The pipeline connects to Ollama on its default port (`http://127.0.0.1:11434/v1`).

#### GPU acceleration

For best performance, the model should be loaded on the GPU. Verify with `ollama ps`: the `PROCESSOR` column should report `100% GPU`. A value of `100% CPU`, or a CPU/GPU split, means the model is not fully loaded on the GPU.

If the model doesn't fit entirely in GPU memory, create a custom model configuration with a reduced context window and partial GPU offloading:

1. Create a file named `Modelfile`:
   ```
   FROM qwen3.5:9b

   PARAMETER num_ctx 4096
   PARAMETER num_gpu 28
   ```
   - `num_ctx 4096` — context window in tokens (4096 is sufficient for ansemo)
   - `num_gpu 28` — number of model layers offloaded to GPU (reduce if you run out of VRAM)

2. Build the custom model:
   ```bash
   ollama create qwen3.5-ansemo -f Modelfile
   ```

3. Use it in the pipeline:
   ```python
   from ansemo import build_pipeline

   pipeline = build_pipeline(slm_model="qwen3.5-ansemo")
   ```

### SLM Configuration

```python
from ansemo import build_pipeline

# Disable SLM fallback entirely
pipeline = build_pipeline(slm_fallback=None)

# Allow pipeline to start without SLM (graceful degradation)
pipeline = build_pipeline(slm_required=False, slm_on_failure="accept")

# SLM for all entity types (not just the default 4)
pipeline = build_pipeline(slm_entity_types=None)

# SLM for specific entity types only
pipeline = build_pipeline(slm_entity_types={"PERSON_NAME", "CODE"})
```

**`slm_required`** (default: `True`):
- `True` — raises an error if the SLM server is unreachable at startup.
- `False` — allows the pipeline to work without SLM. Ambiguous entities are handled based on `slm_on_failure`.

**`slm_on_failure`** (default: `"accept"`):
- `"accept"` — keep ambiguous entities (fewer missed detections, more false positives).
- `"reject"` — discard ambiguous entities (fewer false positives, more missed detections).
- `"error"` — raise an error when the SLM server is unavailable.

A circuit breaker automatically disables SLM after 2 consecutive batch failures to avoid timeout delays.
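
The circuit-breaker idea can be sketched in a few lines. This is an illustrative model only, not ansemo's code: the `CircuitBreaker` class, `verify_batches`, and the `"failed"` / `"skipped"` markers are all hypothetical, and only the limit of 2 consecutive failures comes from the description above.

```python
class CircuitBreaker:
    """Opens (stops further calls) after N consecutive failures."""

    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def record_success(self) -> None:
        # Any success resets the counter, so only consecutive failures count.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1


def verify_batches(batches, call_slm, breaker):
    """Call the SLM per batch; once the breaker opens, skip the remaining batches."""
    verdicts = []
    for batch in batches:
        if breaker.is_open:
            verdicts.append("skipped")  # no request is sent, so no timeout is paid
            continue
        try:
            verdicts.append(call_slm(batch))
            breaker.record_success()
        except ConnectionError:
            breaker.record_failure()
            verdicts.append("failed")
    return verdicts


def always_down(batch):
    """Stand-in for an unreachable SLM server."""
    raise ConnectionError("SLM server unreachable")


print(verify_batches(["b1", "b2", "b3", "b4"], always_down, CircuitBreaker()))
# ['failed', 'failed', 'skipped', 'skipped']
```

In this sketch, how the skipped batches are ultimately treated would correspond to the `slm_on_failure` setting described above.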

### Using a Different LLM Server

Any OpenAI-compatible server works (vLLM, llama.cpp, etc.):

```python
from ansemo import build_pipeline

pipeline = build_pipeline(
    slm_base_url="http://custom-host:8080/v1",
    slm_model="my-model",
    slm_api_key="sk-...",  # optional
)
```

> **Context window:** A context window of **4096 tokens** is sufficient. The SLM does not process the full document — it only receives short context snippets around each detected entity.
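
To give a feel for why a small context window is enough, the sketch below cuts a short snippet around a detected span. It is purely illustrative: `context_snippet` and the 40-character window are assumptions, not ansemo's actual prompt construction.

```python
def context_snippet(text: str, start: int, end: int, window: int = 40) -> str:
    """Return the entity plus up to `window` characters of context on each side."""
    left = max(0, start - window)
    right = min(len(text), end + window)
    return text[left:right]


document = (
    "Contractul a fost semnat luna trecută. "
    "Contul a fost creat de utilizatorul ion_p pe platforma internă, "
    "iar accesul a fost acordat în aceeași zi."
)
entity = "ion_p"
start = document.index(entity)
snippet = context_snippet(document, start, start + len(entity))
print(snippet)
```

However long the document is, each snippet stays bounded by the entity length plus twice the window.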

### Advanced: Per-Entity Thresholds

For fine-grained control, construct `SLMFallback` directly:

```python
from ansemo import build_pipeline
from ansemo.slm import SLMFallback

slm = SLMFallback(
    per_entity_config={
        "PERSON_NAME": {"slm_threshold": 0.6},
        "ORGANIZATION": {"slm_threshold": 0.9},
    },
    required=True,
    on_failure="accept",
)

pipeline = build_pipeline(slm_fallback=slm)
```

Detections with scores below `slm_threshold` are sent to the SLM for verification. Scores at or above `slm_threshold` are accepted without SLM.

Default `slm_threshold` values:

| Entity Type | slm_threshold |
|---|---|
| `PERSON_NAME` | 0.70 |
| `CODE` | 0.65 |
| `ORGANIZATION` | 0.90 |
| `USERNAME` | 1.00 (all sent to SLM) |
| `STREET_ADDRESS` | 0.65 |
| `PASSWORD` | 1.00 (all sent to SLM) |
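
The routing rule can be sketched as follows. This is a standalone illustration rather than ansemo's implementation: `needs_slm` is a hypothetical helper, and the way `slm_entity_types` interacts with the thresholds (type filter first, then score comparison) is an assumption.

```python
# Values copied from the table above.
DEFAULT_SLM_THRESHOLDS = {
    "PERSON_NAME": 0.70,
    "CODE": 0.65,
    "ORGANIZATION": 0.90,
    "USERNAME": 1.00,
    "STREET_ADDRESS": 0.65,
    "PASSWORD": 1.00,
}
# Entity types verified by default, as listed at the top of this section.
DEFAULT_SLM_ENTITY_TYPES = frozenset({"CODE", "USERNAME", "PASSWORD", "ORGANIZATION"})


def needs_slm(entity_type, score, slm_entity_types=DEFAULT_SLM_ENTITY_TYPES,
              thresholds=DEFAULT_SLM_THRESHOLDS):
    """Return True if a detection should be sent to the SLM for verification."""
    # Assumption: None means "all entity types", mirroring slm_entity_types=None.
    if slm_entity_types is not None and entity_type not in slm_entity_types:
        return False
    threshold = thresholds.get(entity_type)
    if threshold is None:
        return False  # no slm_threshold configured for this type
    # Scores at or above the threshold are accepted without SLM.
    return score < threshold


print(needs_slm("ORGANIZATION", 0.85))                       # True
print(needs_slm("ORGANIZATION", 0.95))                       # False
print(needs_slm("PERSON_NAME", 0.50))                        # False (not in default types)
print(needs_slm("PERSON_NAME", 0.50, slm_entity_types=None)) # True
```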

### Overriding Request Parameters

By default, reasoning/thinking is disabled via `reasoning_effort: "none"` in the request body. If your LLM provider uses a different parameter to control reasoning, override it with `extra_body`:

```python
from ansemo import build_pipeline
from ansemo.slm import SLMFallback

# Example: provider that uses a different reasoning control
slm = SLMFallback(
    extra_body={"reasoning": {"enabled": False}},
)

pipeline = build_pipeline(slm_fallback=slm)
```

---

## CLI Usage

The `ansemo` command is installed automatically with the package:

```bash
# Anonymize a file (SLM enabled by default)
ansemo document.txt

# Multiple files
ansemo file1.txt file2.txt

# Inline text
ansemo --text "Dl. Ion Popescu, email ion@firma.ro"

# Only specific entities
ansemo --entities PERSON_NAME,EMAIL_ADDRESS document.txt

# JSON output to stdout
ansemo --json --text "Dl. Ion Popescu, tel 0745 123 456"

# Without SLM fallback
ansemo --no-slm document.txt

# Custom SLM server
ansemo --slm-url http://localhost:8080/v1 --slm-model my-model document.txt

# Debug logging (verbose output)
ansemo --debug document.txt
```

### CLI Flags

| Flag | Description |
|---|---|
| `files` | One or more file paths to anonymize |
| `--text` | Anonymize inline text instead of files |
| `--entities` | Comma-separated entity types to detect (default: all) |
| `--json` | Output results as JSON to stdout |
| `--slm-url` | SLM server URL (default: `http://127.0.0.1:11434/v1`) |
| `--slm-model` | SLM model name (default: `qwen3.5:9b`) |
| `--no-slm` | Disable SLM fallback |
| `--debug` | Enable debug logging |
| `--quiet`, `-q` | Suppress info logs (warnings and errors only) |

### CLI Output

Results are saved to `results/anonymization/{filename}_{timestamp}/`:

```
results/anonymization/document_20250523_143021/
  anonymized.txt              # Anonymized text
  mapping.json                # Entity mapping (original -> placeholder)
  slm_verdicts.json           # SLM decisions summary (if SLM enabled)
  slm_verdicts_detailed.json  # Detailed SLM decisions (if SLM enabled)
```

---

## Supported Entity Types

| Entity Type | Source | Description | Example |
|---|---|---|---|
| `PERSON_NAME` | GLiNER | Full and partial person names | Ion Popescu, Maria |
| `ORGANIZATION` | GLiNER | Companies, institutions, NGOs | SC Alfa Solutions SRL |
| `STREET_ADDRESS` | GLiNER + regex | Romanian/Moldovan addresses | Str. Eminescu nr. 45, Bl. A3 |
| `USERNAME` | GLiNER | Social media handles, login names | @ion_popescu |
| `PASSWORD` | GLiNER | Credential strings | Secure@2025! |
| `EMAIL_ADDRESS` | Presidio | Email addresses | ion@firma.ro |
| `PHONE_NUMBER` | phonenumbers | RO/MD phone numbers | 0745 123 456, +40 745 123 456 |
| `IBAN_CODE` | Presidio + schwifty | IBAN codes (compact and spaced) | RO89BCRL0000000123456789 |
| `CREDIT_CARD` | Presidio | Credit card numbers | 4532 1111 2222 3336 |
| `BIC` | Registry lookup | BIC/SWIFT codes | BTRLRO22 |
| `IP_ADDRESS` | Presidio | IP addresses | 192.168.1.1 |
| `URL` | Presidio | URLs | https://example.com |
| `CODE` | Regex | CUI, license plates, case numbers, etc. | J40/1234/2020, B 123 ABC |
| `CNP` | Regex + checksum | Romanian personal numeric code (13 digits) | 1850612345674 |

> **SLM-dependent entities:** `USERNAME`, `PASSWORD`, and `ORGANIZATION` are excluded by default when SLM fallback is disabled (i.e., when using `anonymize()` or `build_pipeline(slm_fallback=None)`). These entity types produce too many false positives without AI verification. To include them without SLM, pass them explicitly: `entities=["PERSON_NAME", "ORGANIZATION", ...]`.
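
As background for the `CNP` row in the table above, the sketch below applies the standard CNP control-digit rule (weights `279146358279`, sum modulo 11, with a remainder of 10 mapped to 1). It is a standalone illustration, not ansemo's recognizer: `cnp_checksum_ok` is a hypothetical helper and it checks only the control digit, not the date or county fields.

```python
CNP_WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)


def cnp_checksum_ok(cnp: str) -> bool:
    """Check the control digit (13th digit) of a Romanian CNP."""
    if len(cnp) != 13 or not (cnp.isascii() and cnp.isdigit()):
        return False
    # zip stops after the 12 weighted digits; the 13th is the control digit.
    total = sum(int(digit) * weight for digit, weight in zip(cnp, CNP_WEIGHTS))
    control = total % 11
    if control == 10:
        control = 1
    return control == int(cnp[12])


print(cnp_checksum_ok("1850612345674"))  # True  (example value from the table)
print(cnp_checksum_ok("1850612345675"))  # False (wrong control digit)
print(cnp_checksum_ok("185061234567"))   # False (only 12 digits)
```

A checksum of this kind lets a regex-based recognizer discard most random 13-digit numbers that merely look like a CNP.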

---

## Pipeline Architecture

```
Text
 │
 ├─ 1. Presidio Analyzer (GLiNER + regex recognizers)
 ├─ 2. Denylist filter (remove known false positives)
 ├─ 3. Overlap resolution (greedy, score-based)
 ├─ 4. SLM fallback (enabled by default — verify low-confidence detections)
 ├─ 5. Anonymize (replace entities with [TYPE_N] placeholders)
 └─ 6. Post-process (replace remaining occurrences of detected entities via regex)
         │
         ▼
  (anonymized_text, entity_map)
```
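
Step 3 (greedy, score-based overlap resolution) can be pictured with the sketch below. It is an independent illustration, not the library's `resolve_overlaps`: the `Span` dataclass and `resolve_overlaps_greedy` are hypothetical, and breaking score ties in favour of the longer span is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int          # inclusive character offset
    end: int            # exclusive character offset
    entity_type: str
    score: float


def resolve_overlaps_greedy(spans: list[Span]) -> list[Span]:
    """Keep the highest-scoring spans, dropping any span that overlaps a kept one."""
    kept: list[Span] = []
    # Highest score first; on ties, prefer the longer span (assumption).
    for span in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start))):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


spans = [
    Span(4, 15, "PERSON_NAME", 0.92),
    Span(8, 15, "ORGANIZATION", 0.40),   # overlaps the person name, lower score
    Span(21, 33, "PHONE_NUMBER", 1.0),
]
for span in resolve_overlaps_greedy(spans):
    print(span.entity_type, span.start, span.end)
# PERSON_NAME 4 15
# PHONE_NUMBER 21 33
```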

---

## Output Format

### Anonymized Text

Detected entities are replaced with numbered placeholders in the format `[ENTITY_TYPE_N]`:

```
Input:  Dl. Ion Popescu, tel 0745 123 456, email ion@firma.ro
Output: Dl. [PERSON_NAME_1], tel [PHONE_NUMBER_1], email [EMAIL_ADDRESS_1]
```

If the same entity appears multiple times in the text, all occurrences receive the same placeholder.
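
The numbering scheme can be sketched as below. This is an illustration of the described behaviour, not ansemo's code: `assign_placeholders` is a hypothetical helper, and numbering entities in order of first appearance is an assumption.

```python
def assign_placeholders(detections: list[tuple[str, str]]) -> dict[str, dict[str, str]]:
    """Build {entity_type: {original_text: placeholder}} from detections in document order."""
    mapping: dict[str, dict[str, str]] = {}
    for entity_type, original in detections:
        by_type = mapping.setdefault(entity_type, {})
        # A repeated entity reuses its existing placeholder.
        if original not in by_type:
            by_type[original] = f"[{entity_type}_{len(by_type) + 1}]"
    return mapping


detections = [
    ("PERSON_NAME", "Ion Popescu"),
    ("PHONE_NUMBER", "0745 123 456"),
    ("PERSON_NAME", "Maria Ionescu"),
    ("PERSON_NAME", "Ion Popescu"),   # second occurrence of the same name
]
print(assign_placeholders(detections))
# {'PERSON_NAME': {'Ion Popescu': '[PERSON_NAME_1]', 'Maria Ionescu': '[PERSON_NAME_2]'},
#  'PHONE_NUMBER': {'0745 123 456': '[PHONE_NUMBER_1]'}}
```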

### Entity Mapping

The mapping is a nested dictionary: `{entity_type: {original_text: placeholder}}`:

```json
{
  "PERSON_NAME": {
    "Ion Popescu": "[PERSON_NAME_1]"
  },
  "PHONE_NUMBER": {
    "0745 123 456": "[PHONE_NUMBER_1]"
  },
  "EMAIL_ADDRESS": {
    "ion@firma.ro": "[EMAIL_ADDRESS_1]"
  }
}
```

This mapping can be used for:
- **Auditing** which entities were detected and replaced.
- **De-anonymization** to restore the original text (see [De-anonymization](#de-anonymization)).
- **Downstream processing** that needs to reference anonymized entities.
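
Because the mapping is a plain nested dictionary, auditing and restoring can also be done with a few lines of standard Python. The sketch below is independent of the library's own `deanonymize` methods: `summarize` and `restore` are hypothetical helpers, and the placeholder regex assumes the `[ENTITY_TYPE_N]` format described above.

```python
import re

mapping = {
    "PERSON_NAME": {"Ion Popescu": "[PERSON_NAME_1]"},
    "PHONE_NUMBER": {"0745 123 456": "[PHONE_NUMBER_1]"},
    "EMAIL_ADDRESS": {"ion@firma.ro": "[EMAIL_ADDRESS_1]"},
}

PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")


def summarize(mapping: dict[str, dict[str, str]]) -> dict[str, int]:
    """Audit helper: number of distinct entities replaced per type."""
    return {entity_type: len(entries) for entity_type, entries in mapping.items()}


def restore(anonymized: str, mapping: dict[str, dict[str, str]]) -> str:
    """Replace known placeholders with their originals in a single pass."""
    reverse = {
        placeholder: original
        for entries in mapping.values()
        for original, placeholder in entries.items()
    }
    # Unknown placeholders are left untouched.
    return PLACEHOLDER.sub(lambda m: reverse.get(m.group(0), m.group(0)), anonymized)


print(summarize(mapping))
# {'PERSON_NAME': 1, 'PHONE_NUMBER': 1, 'EMAIL_ADDRESS': 1}
print(restore("Dl. [PERSON_NAME_1], tel [PHONE_NUMBER_1], email [EMAIL_ADDRESS_1]", mapping))
# Dl. Ion Popescu, tel 0745 123 456, email ion@firma.ro
```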

---

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `HF_HOME` | Directory for HuggingFace model cache (GLiNER, ~1 GB) | `~/.cache/huggingface` |
| `SLM_BASE_URL` | SLM server URL (used by CLI) | `http://127.0.0.1:11434/v1` |
| `SLM_API_KEY` | SLM API key (used by CLI) | empty |

---

## Logging

Enable logging to see pipeline step details:

```python
import logging
logging.basicConfig(level=logging.INFO)
```

The pipeline logs each step — detection counts, denylist filtering, overlap resolution, SLM verdicts, and final entity counts.

For verbose output:

```python
logging.basicConfig(level=logging.DEBUG)
```

In CLI mode, use the `--debug` flag. To suppress info logs, use `--quiet` (`-q`) — only warnings and errors will be shown.
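
To get verbose output from the library without also enabling debug logs from every dependency, the level can be raised on a single logger. This sketch assumes the library logs under a logger named `ansemo` (the usual convention for `logging.getLogger(__name__)`); that name is an assumption, not something confirmed by this guide.

```python
import logging

# Root logger stays at WARNING, with timestamps and logger names in the output.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Assumed logger name; child loggers such as "ansemo.slm" inherit this level.
logging.getLogger("ansemo").setLevel(logging.DEBUG)
```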

---

## Troubleshooting

### GLiNER model download fails

**Symptom:** Error on first run about failing to download from HuggingFace.

**Solutions:**
- Ensure internet access is available on first run.
- If behind a proxy, configure `HTTP_PROXY` / `HTTPS_PROXY` environment variables.
- Set `HF_HOME` to a writable directory with sufficient disk space (~1 GB).
- To pre-download the model on a machine with internet access, run:
  ```python
  from gliner import GLiNER
  GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
  ```
  Then copy the `HF_HOME` cache directory to the target machine.

### SLM server connection error

**Symptom:** `ConnectionError` or `SLMUnavailableError` when using `build_pipeline()`.

**Solutions:**
- Ensure Ollama is installed and running: `ollama list` should show available models.
- Verify the model is pulled: `ollama pull qwen3.5:9b`
- If not using SLM, disable it: `build_pipeline(slm_fallback=None)`
- For non-critical use, allow graceful degradation:
  ```python
  pipeline = build_pipeline(slm_required=False, slm_on_failure="accept")
  ```

### High memory usage

**Symptom:** Process uses excessive RAM or runs out of memory.

**Solutions:**
- The GLiNER model requires ~2 GB RAM. Ensure the machine has at least 4 GB available.
- Reuse the pipeline object across calls instead of creating new ones.
- For the `anonymize()` convenience function, the pipeline is automatically reused (singleton).

### Wrong wheel platform

**Symptom:** `pip install` fails with a platform compatibility error.

**Solution:** Ensure the wheel's platform tag matches your Python version and OS. See [System Requirements](#system-requirements) for the supported platforms and [Installation](#installation) for the install commands.

### Entity not detected

**Possible causes:**
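- The entity type may not be in the default set. With SLM fallback disabled, `USERNAME`, `PASSWORD`, and `ORGANIZATION` are excluded unless passed explicitly via `entities`. Check [Supported Entity Types](#supported-entity-types).
- The detection score may fall below the confidence threshold for that entity type. See [Adjusting Detection Sensitivity](#adjusting-detection-sensitivity).
- The text may match a denylist entry (a known false positive). Check whether the term is a common Romanian word.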
- For low-confidence detections with SLM enabled, the SLM may have rejected it. Try with `--debug` or `logging.DEBUG` to see SLM verdicts.
