Metadata-Version: 2.4
Name: kvkk-pii
Version: 0.1.0
Summary: KVKK-compliant Turkish PII detection library
License: MIT
Keywords: pii,kvkk,gdpr,turkish,nlp,privacy,data-protection
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: ner
Requires-Dist: transformers>=4.40; extra == "ner"
Requires-Dist: torch>=2.0; extra == "ner"
Requires-Dist: huggingface-hub>=0.20; extra == "ner"
Provides-Extra: full
Requires-Dist: transformers>=4.40; extra == "full"
Requires-Dist: torch>=2.0; extra == "full"
Requires-Dist: huggingface-hub>=0.20; extra == "full"
Requires-Dist: gliner>=0.2; extra == "full"
Provides-Extra: server
Requires-Dist: fastapi>=0.110; extra == "server"
Requires-Dist: uvicorn>=0.29; extra == "server"
Provides-Extra: ui
Requires-Dist: fastapi>=0.110; extra == "ui"
Requires-Dist: uvicorn>=0.29; extra == "ui"
Requires-Dist: jinja2>=3.1; extra == "ui"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"

# kvkk-pii

**KVKK-compliant Turkish PII detection library — fully on-premise, no cloud.**

Detect, anonymize, and protect personally identifiable information in Turkish text. Built for [KVKK](https://www.kvkk.gov.tr/) (Turkish data protection law) compliance, with a 3-layer architecture that combines regex, NER, and zero-shot classification.

```python
from kvkk_pii import PiiDetector

detector = PiiDetector()
result = detector.analyze("Ali Veli, TC: 10000000146, tel: 0532 123 45 67")

for e in result.entities:
    print(e)
# PiiEntity(type='TC_KIMLIK', text='10000000146', start=14, end=25, score=1.00, layer='regex')
# PiiEntity(type='TELEFON_TR', text='0532 123 45 67', start=32, end=46, score=1.00, layer='regex')
```

---

## Features

- **Zero cloud** — all models run locally, no data leaves your machine
- **3-layer detection**: Regex + checksum → XLM-RoBERTa NER → GLiNER zero-shot
- **KVKK Madde 6** support — special categories: health, religion, biometrics, political opinion
- **LLM proxy** — mask PII before sending to AI, restore in the response, detect leakage
- **Compliance report** — maps detected entities to KVKK articles and risk levels
- **Pluggable** — add custom recognizers, tune thresholds per entity type
- **Async** — `AsyncPiiDetector` for FastAPI / async applications
- **CLI** — `kvkk-pii scan`, `kvkk-pii anonymize`

---

## Installation

```bash
# Layer 1 only — regex + checksum (no dependencies)
pip install kvkk-pii

# + Layer 2 — XLM-RoBERTa NER (~450 MB, Turkish NER)
pip install kvkk-pii[ner]

# + Layer 3 — GLiNER zero-shot (~180 MB, KVKK Madde 6)
pip install kvkk-pii[full]
```

Models are downloaded from Hugging Face on first use and cached at `~/.cache/huggingface/hub`.

---

## Quickstart

### Detect & Anonymize

```python
from kvkk_pii import PiiDetector

detector = PiiDetector()  # regex only (Layer 1)

text = "Müşteri Ali Veli, IBAN: TR33 0006 1005 1978 6457 8413 26, e-posta: ali@example.com"
result = detector.analyze(text)

print(result.entities)
# [PiiEntity(type='IBAN_TR', ...), PiiEntity(type='EMAIL', ...)]

print(detector.anonymize(text))
# "Müşteri Ali Veli, IBAN: [IBAN_TR], e-posta: [EMAIL]"
```
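
Anonymization of this kind comes down to replacing detected character spans with type labels. The sketch below is not the library's implementation: `Span`, `span_of`, and `replace_spans` are hypothetical names, and it only assumes that each entity carries a type plus `start`/`end` offsets, as in the output shown earlier. Replacing from right to left keeps the offsets of earlier spans valid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detected entity (not the library's PiiEntity)."""
    type: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def span_of(text: str, value: str, entity_type: str) -> Span:
    """Build a Span for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return Span(entity_type, start, start + len(value))


def replace_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with a [TYPE] label, working right to left."""
    out = text
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:s.start] + f"[{s.type}]" + out[s.end:]
    return out


text = "e-posta: ali@example.com, TC: 10000000146"
spans = [
    span_of(text, "ali@example.com", "EMAIL"),
    span_of(text, "10000000146", "TC_KIMLIK"),
]
print(replace_spans(text, spans))
# e-posta: [EMAIL], TC: [TC_KIMLIK]
```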

### With NER (Person, Location, Organization)

```python
detector = PiiDetector(layers=["regex", "ner"])
# First run: prompts to download akdeniz27/xlm-roberta-base-turkish-ner (~450 MB)

result = detector.analyze("Ahmet Yılmaz, İstanbul'daki Türk Telekom şubesine gitti.")
# Detects: KISI_ADI (Ahmet Yılmaz), KONUM (İstanbul), KURUM (Türk Telekom)
```

### With GLiNER — KVKK Madde 6 Special Categories

```python
detector = PiiDetector(layers=["regex", "ner", "gliner"])

result = detector.analyze("Hasta diyabet tedavisi görüyor, Sünni mezhebine mensup.")
# Detects: SAGLIK_VERISI, DINI_INANC
```

### Ready-Made Presets

```python
from kvkk_pii import presets

detector = presets.turkish()      # Regex + NER (TR) + GLiNER — full KVKK coverage
detector = presets.german()       # Regex (DE) + GLiNER — DSGVO
detector = presets.french()       # Regex (FR) + GLiNER — RGPD
detector = presets.multilingual() # TR + DE + FR together
```

---

## Layer Architecture

| Layer | Method | Model | Speed | Detects |
|-------|--------|-------|-------|---------|
| 1 | Regex + checksum | — | <1ms | TC Kimlik, IBAN, VKN, phone, plate, email, passport |
| 2 | NER | `akdeniz27/xlm-roberta-base-turkish-ner` | ~30ms | Person, Location, Organization |
| 3 | Zero-shot NER | `urchade/gliner_multi-v2.1` | ~80ms | KVKK Madde 6 special categories |

Each layer only processes spans not already found by a previous layer, avoiding double-detection.
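
One way to picture this rule is as an overlap filter applied in layer order. The sketch below is a simplification under stated assumptions: spans are plain `(start, end, label)` tuples with half-open offsets, and `overlaps` and `merge_layers` are hypothetical helpers rather than part of the package. It filters results after the fact, whereas the library may skip already-covered text before running a layer.

```python
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if half-open ranges [a0, a1) and [b0, b1) share a character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_layers(
    layer_results: list[list[tuple[int, int, str]]],
) -> list[tuple[int, int, str]]:
    """Keep spans from earlier layers; drop later spans that overlap them."""
    accepted: list[tuple[int, int, str]] = []
    for spans in layer_results:  # layers in priority order
        for start, end, label in spans:
            if not any(overlaps((start, end), (s, e)) for s, e, _ in accepted):
                accepted.append((start, end, label))
    return sorted(accepted)


regex_spans = [(14, 25, "TC_KIMLIK")]
ner_spans = [(0, 8, "KISI_ADI"), (12, 20, "KONUM")]  # second span overlaps the TC number
print(merge_layers([regex_spans, ner_spans]))
# [(0, 8, 'KISI_ADI'), (14, 25, 'TC_KIMLIK')]
```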

---

## LLM Proxy

Protect PII when sending text to external AI services. Mask before sending, restore after, detect any leakage.

### Session-Based Masking

```python
detector = PiiDetector(layers=["regex", "ner"])

session = detector.create_session("Ali Veli TC: 10000000146 hakkında bilgi ver.")
masked = session.mask()
# → "[KISI_ADI_x7k] TC: [TC_KIMLIK_a3f] hakkında bilgi ver."

ai_response = call_openai(masked)  # your AI call

restored = session.restore(ai_response)
# Placeholders in AI response replaced back with originals
```
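
To make the mask and restore round trip concrete, here is a minimal sketch of how such a session could work. `MaskSession` is a hypothetical class written for illustration, not the object returned by `create_session`; the three-character random suffix simply mimics the placeholder format shown above, and the spans are supplied by hand instead of by a detector.

```python
import secrets
import string


class MaskSession:
    """Hypothetical sketch of a mask/restore session (not the library's class)."""

    def __init__(self, text: str, spans: list[tuple[int, int, str]]):
        self.text = text
        self.spans = spans                  # (start, end, entity_type)
        self.mapping: dict[str, str] = {}   # placeholder -> original value

    def _placeholder(self, entity_type: str) -> str:
        alphabet = string.ascii_lowercase + string.digits
        while True:  # retry on the rare suffix collision
            suffix = "".join(secrets.choice(alphabet) for _ in range(3))
            token = f"[{entity_type}_{suffix}]"
            if token not in self.mapping:
                return token

    def mask(self) -> str:
        out = self.text
        # Right to left so that earlier offsets stay valid.
        for start, end, entity_type in sorted(self.spans, reverse=True):
            token = self._placeholder(entity_type)
            self.mapping[token] = self.text[start:end]
            out = out[:start] + token + out[end:]
        return out

    def restore(self, response: str) -> str:
        for token, original in self.mapping.items():
            response = response.replace(token, original)
        return response


text = "Ali Veli TC: 10000000146 hakkında bilgi ver."
name, tc = "Ali Veli", "10000000146"
spans = [
    (text.index(name), text.index(name) + len(name), "KISI_ADI"),
    (text.index(tc), text.index(tc) + len(tc), "TC_KIMLIK"),
]
session = MaskSession(text, spans)
masked = session.mask()
print(masked)                   # placeholders differ on every run
print(session.restore(masked))  # original text again
```

Because the suffix is random, the same value gets a different placeholder in every session, so a placeholder seen by the external service cannot be correlated across requests.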

### Two-Way Proxy (mask → AI → leakage check → restore)

```python
import openai  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

result = detector.two_way(
    prompt="Ali Veli'nin TC numarası 10000000146, özet çıkar.",
    call_fn=lambda masked: openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": masked}]
    ).choices[0].message.content,
    on_leak="warn",  # "raise" | "warn" | "ignore"
)

print(result.output)          # restored AI response
print(result.report.safe)     # True if no PII leaked
print(result.report.summary()) # leakage summary
```

### Leakage Detection

```python
from kvkk_pii import LeakageAnalyzer

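# `session` comes from detector.create_session(...); `raw_ai_response` is the
# model output before session.restore() is applied.
analyzer = detector.leakage_analyzer()
report = analyzer.analyze(session, raw_ai_response)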

report.safe            # bool
report.leaked          # entities that leaked through placeholders
report.new_pii         # PII in AI output not present in input (hallucination?)
report.risk_score      # 0.0–1.0
print(report.summary())
```
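
The simplest form of a leakage check is a verbatim search: if an original value appears in the unrestored response, the model produced it despite the masking. The sketch below illustrates only that idea. `find_leaks` is a hypothetical helper, the `mapping` dictionary (placeholder to original value) is an assumed data shape, and the library's analyzer may well use other signals.

```python
def find_leaks(mapping: dict[str, str], raw_response: str) -> list[str]:
    """Return placeholders whose original value appears verbatim in the response."""
    haystack = raw_response.casefold()
    return [
        token
        for token, original in mapping.items()
        if original.casefold() in haystack
    ]


mapping = {
    "[KISI_ADI_x7k]": "Ali Veli",
    "[TC_KIMLIK_a3f]": "10000000146",
}

clean = "[KISI_ADI_x7k] için kayıt bulundu."
leaky = "Kayıt sahibi: ali veli ([TC_KIMLIK_a3f])"

print(find_leaks(mapping, clean))  # []
print(find_leaks(mapping, leaky))  # ['[KISI_ADI_x7k]']
```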

---

## Compliance Report

Maps detected entities to KVKK articles with risk levels and recommendations.

```python
report = detector.compliance_report(text)

print(report.summary())
# KVKK Uyum Raporu — 4 veri, genel risk: YÜKSEK
# KVKK Madde 6 (Özel Nitelikli Veri) tespit edildi!
#
#   [KRİTİK] SAGLIK_VERISI x 1
#     Dayanak: KVKK Madde 6 — Özel Nitelikli Kişisel Veri
#     Öneri  : Açık rıza zorunlu. Yetkili kurum olmadan işlenemez.
#   [YÜKSEK] TC_KIMLIK x 1
#     ...

report.has_madde6      # True if KVKK Article 6 data found
report.overall_risk    # "düşük" | "orta" | "yüksek" | "kritik"
report.to_dict()       # JSON-serializable
```

---

## Async

```python
import asyncio

from fastapi import FastAPI
from kvkk_pii import AsyncPiiDetector

app = FastAPI()
detector = AsyncPiiDetector(layers=["regex", "ner"])

# FastAPI example
@app.post("/scan")
async def scan(text: str):
    result = await detector.analyze(text)
    return [e.__dict__ for e in result.entities]

# The snippets below must run inside an async function.

# Parallel processing
results = await asyncio.gather(*[detector.analyze(t) for t in texts])

# Async two_way
result = await detector.two_way(prompt, async_call_fn)
```

---

## CLI

```bash
# Scan text
kvkk-pii scan "Ali Veli TC: 10000000146"

# Scan file
kvkk-pii scan belge.txt

# Pipe
cat belge.txt | kvkk-pii scan

# With NER layer
kvkk-pii scan --layer ner "Ahmet Yılmaz İstanbul'da"

# JSON output
kvkk-pii scan --format json "TC: 10000000146"

# Anonymize
kvkk-pii anonymize "Ali Veli TC: 10000000146"
# → "Ali Veli TC: [TC_KIMLIK]"

# Version
kvkk-pii version
```

---

## Custom Recognizers

```python
import re

from kvkk_pii import BaseRecognizer, PiiDetector, PiiEntity
from kvkk_pii.layers.regex_layer import DEFAULT_RECOGNIZERS


class SicilNoRecognizer(BaseRecognizer):
    """Matches IDs of the form SCL-123456 (six digits)."""

    entity_type = "SICIL_NO"

    def find(self, text: str) -> list[PiiEntity]:
        return [
            self._entity(m.group(), m.start(), m.end(), score=1.0)
            for m in re.finditer(r"\bSCL-\d{6}\b", text)
        ]


# Append the custom recognizer to the built-in Layer 1 recognizers
detector = PiiDetector(recognizers=DEFAULT_RECOGNIZERS + [SicilNoRecognizer()])
```

---

## Configuration

Fine-tune recognizer strictness via config dataclasses:

```python
from kvkk_pii import PiiDetector
from kvkk_pii.config import NerConfig, GlinerConfig, TcKimlikConfig
from kvkk_pii.recognizers.tc_kimlik import TcKimlikRecognizer
from kvkk_pii.layers.regex_layer import DEFAULT_RECOGNIZERS

detector = PiiDetector(
    layers=["regex", "ner", "gliner"],
    recognizers=DEFAULT_RECOGNIZERS + [
        TcKimlikRecognizer(TcKimlikConfig(allow_spaced=True, require_checksum=True))
    ],
    download_policy="auto",   # "confirm" (default) | "auto" | "never"
    ner_config=NerConfig(
        min_score=0.85,       # higher = fewer false positives
        chunk_size=400,       # chars per chunk for long texts
    ),
    gliner_config=GlinerConfig(
        threshold=0.5,
    ),
)
```
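
The `chunk_size` option implies that long inputs are split before they reach the NER model, and that entity offsets are mapped back to the full text afterwards. The following sketch shows one way to do this, assuming character-based chunks that break at a space where possible. `chunk_text` and `to_global` are hypothetical helpers, and the library's own splitting strategy may differ.

```python
def chunk_text(text: str, chunk_size: int = 400) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pairs of at most chunk_size characters,
    preferring to break after a space so that words are not cut in half."""
    chunks: list[tuple[int, str]] = []
    pos = 0
    while pos < len(text):
        end = min(pos + chunk_size, len(text))
        if end < len(text):
            cut = text.rfind(" ", pos, end)
            if cut > pos:
                end = cut + 1  # keep the space with the left-hand chunk
        chunks.append((pos, text[pos:end]))
        pos = end
    return chunks


def to_global(offset: int, local_start: int, local_end: int) -> tuple[int, int]:
    """Translate offsets found inside a chunk back to full-text offsets."""
    return offset + local_start, offset + local_end


sample = "kelime " * 200  # 1400 characters
chunks = chunk_text(sample, chunk_size=400)
print(len(chunks), max(len(c) for _, c in chunks))

# An entity found at [3, 9) inside the second chunk maps back like this:
offset, _ = chunks[1]
print(to_global(offset, 3, 9))
```

Breaking at whitespace avoids splitting a name across two chunks in the common case, although an entity that spans several words can still straddle a boundary; overlapping chunks are a common remedy that this sketch leaves out.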

---

## Detected Entity Types

### Layer 1 — Regex

| Entity | Description | Validation |
|--------|-------------|------------|
| `TC_KIMLIK` | Turkish national ID (11 digits) | Checksum |
| `VKN` | Tax ID (10 digits) | Checksum |
| `IBAN_TR` | IBAN (all country codes) | Mod97 |
| `KREDI_KARTI` | Credit card number | Luhn |
| `TELEFON_TR` | Turkish phone numbers | — |
| `EMAIL` | Email address | — |
| `IP_ADRESI` | IPv4 address | — |
| `PLAKA_TR` | Turkish license plate | — |
| `PASAPORT_TR` | Turkish passport | — |
| `SGK_NO` | Social security number | — |
| `ADRES` | Street address | — |
| `TARIH` | Date | — |
| `KISI_ADI` | Person name (title-based) | — |
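
The validation column refers to well-known public algorithms. As a reference, the sketch below implements three of them from their published definitions: the TC Kimlik check digits, the Luhn check, and the IBAN mod 97 check. The function names are hypothetical and this is not the package's code; the IBAN check also skips per-country length rules, and the VKN checksum is left out.

```python
def tc_kimlik_valid(value: str) -> bool:
    """Turkish national ID: 11 digits, no leading zero, two check digits."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    if value[0] == "0":
        return False
    d = [int(c) for c in value]
    # 10th digit: 7 * (1st+3rd+5th+7th+9th) - (2nd+4th+6th+8th), mod 10
    check10 = (7 * sum(d[0:9:2]) - sum(d[1:8:2])) % 10
    # 11th digit: sum of the first ten digits, mod 10
    check11 = sum(d[:10]) % 10
    return d[9] == check10 and d[10] == check11


def luhn_valid(value: str) -> bool:
    """Luhn check used for payment card numbers."""
    s = value.replace(" ", "").replace("-", "")
    if not (s.isascii() and s.isdigit()):
        return False
    total = 0
    for i, c in enumerate(reversed(s)):
        n = int(c)
        if i % 2 == 1:  # double every second digit from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0


def iban_valid(value: str) -> bool:
    """IBAN mod 97 check (ISO 13616); country-specific lengths are not checked."""
    s = value.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    # int(c, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the standard requires
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


print(tc_kimlik_valid("10000000146"))                   # True
print(luhn_valid("4111 1111 1111 1111"))                # True
print(iban_valid("TR33 0006 1005 1978 6457 8413 26"))   # True
```

Checksum validation is what lets a regex layer report a score of 1.00 with few false positives: a random 11-digit string passes the TC Kimlik check only about 1% of the time.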

### Layer 2 — NER (`akdeniz27/xlm-roberta-base-turkish-ner`)

| Entity | Description |
|--------|-------------|
| `KISI_ADI` | Person name |
| `KONUM` | Location |
| `KURUM` | Organization |

### Layer 3 — GLiNER (`urchade/gliner_multi-v2.1`, KVKK Madde 6)

| Entity | Special category (KVKK Madde 6) |
|--------|-------------|
| `SAGLIK_VERISI` | Health data |
| `DINI_INANC` | Religious belief |
| `SIYASI_GORUS` | Political opinion |
| `SENDIKA_UYELIGII` | Trade union membership |
| `BIYOMETRIK_VERI` | Biometric / genetic data |

---

## Requirements

- Python 3.10+
- `pip install kvkk-pii` — no dependencies (regex only)
- `pip install kvkk-pii[ner]` — `transformers`, `torch`, `huggingface-hub`
- `pip install kvkk-pii[full]` — above + `gliner`
- `pip install kvkk-pii[server]` — `fastapi`, `uvicorn`

---

## License

MIT
