Metadata-Version: 2.4
Name: apii
Version: 0.1.0rc1
Summary: Arabic/GCC PII detection, tokenization, and streaming-interception gateway.
Project-URL: Homepage, https://github.com/Aajil-Labs/arabic-pii-py
Project-URL: Repository, https://github.com/Aajil-Labs/arabic-pii-py
Project-URL: NER models, https://huggingface.co/aajil-labs-sa/arabic-pii-ner
Author-email: Aajil Labs <omar@aajil.sa>
License: MIT OR Apache-2.0
License-File: LICENSE-APACHE
License-File: LICENSE-MIT
Keywords: anonymization,arabic,gcc,llm,ner,pii,privacy,redaction,tokenization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Arabic
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Security
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: msgspec>=0.18
Requires-Dist: regex>=2024.0
Provides-Extra: cli
Requires-Dist: cryptography>=42.0; extra == 'cli'
Requires-Dist: typer>=0.12; extra == 'cli'
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Provides-Extra: documents
Requires-Dist: pypdf>=4.0; extra == 'documents'
Provides-Extra: ner
Requires-Dist: huggingface-hub>=0.20; extra == 'ner'
Requires-Dist: numpy>=1.24; extra == 'ner'
Requires-Dist: onnxruntime>=1.17; extra == 'ner'
Requires-Dist: tokenizers>=0.15; extra == 'ner'
Provides-Extra: perf
Requires-Dist: onnxruntime>=1.17; extra == 'perf'
Requires-Dist: pyahocorasick>=2.0; extra == 'perf'
Provides-Extra: proxy
Requires-Dist: fastapi>=0.110; extra == 'proxy'
Requires-Dist: httpx>=0.27; extra == 'proxy'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'proxy'
Description-Content-Type: text/markdown

# arabic-pii-py · `apii`

> **Keep Arabic & GCC personal data off the LLM — without changing how you work.**
> Names, IBANs, national IDs, phones, emails, addresses get swapped for
> reversible tokens **on your machine**, before anything reaches Claude / GPT.
> You keep seeing the real values. The model never does.

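Detection, tokenization, and restoration all run **locally**. No cloud service, no
account, and no personal data leaves your laptop. Apart from the tokenized
requests you choose to send to an LLM, the only network call is a one-time,
optional model download that you can replace with a local copy.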

---

## Why this exists

If you work with GCC customer data — banks, telcos, government, clinics — you
often legally **cannot** send that PII to a US-hosted LLM. `apii` lets you use
Claude / GPT on that data anyway: the personal data stays on your machine, the
model only ever sees placeholders like `EMAIL_C7E2…`, and the real values are
restored locally for *you*.

- 🇸🇦 **Built for Arabic & the GCC** — Saudi / Emirati / Qatari / Kuwaiti /
  Bahraini / Omani shapes, IBAN ISO-7064 (MOD-97), national-ID checksums, and
  on-device Arabic + English NER for names & organizations.
- 💻 **100% local** — no service to run, no API key required, nothing uploaded.
- 🪶 **Lightweight** — pure Python, **no PyTorch**; NER runs as int8 ONNX.
- 🔁 **Reversible & stable** — the same value always maps to the same token, and
  only *your* secret can turn it back.

---

## ⭐ The headline: transparent PII protection inside Claude Code

This is the part most tools can't do. With one command, `apii` wires two hooks
into Claude Code (redact-on-read and restore-on-write); `apii watch` then adds a
local side pane for you:

| | what happens | result |
|---|---|---|
| **redact-on-read** | when Claude reads a file, PII in it is tokenized *before the model sees it* | Claude only ever sees `EMAIL_…`, `IBAN_…`, `PERSON_…` |
| **restore-on-write** | when Claude writes/edits a file, the tokens are turned back into real values *before the bytes hit disk* | your **files come out correct**, the chat stays tokens |
| **`apii watch`** | a side pane restores Claude's tokenized replies locally | **you** read the real values; Anthropic still only got tokens |

It's **non-blocking** — you work normally; Claude just never receives the PII.

```bash
# in the project where you keep customer data:
apii install-claude-hook          # one time — wires both hooks

# open a fresh Claude Code session there (it loads the hook at startup)
claude

# and, in another terminal pane IN THE SAME PROJECT FOLDER, watch real values:
apii watch          # follows this folder's session; --once dumps it so far
```

Now ask Claude to work on a file with PII. In the chat you'll see tokens; in the
`apii watch` pane you'll see the real data; and your customer's information never
left your machine.

---

## Install (from source — it's yours, no PyPI needed)

Requires **Python 3.10+**.

```bash
git clone https://github.com/Aajil-Labs/arabic-pii-py.git
cd arabic-pii-py

python3 -m venv .venv && source .venv/bin/activate     # a 3.10+ interpreter
pip install -e ".[ner,cli,proxy,documents]"            # editable, all features
```

That gives you the `apii` command inside the venv. To call it from any shell,
either activate the venv first or symlink the command onto your `PATH`:

```bash
ln -sf "$PWD/.venv/bin/apii" ~/.local/bin/apii         # if ~/.local/bin is on PATH
```

**NER models** (names & organizations) auto-download once (~210 MB, int8 ONNX)
from Hugging Face and cache under `~/.cache/huggingface`. Without them, every
*structured* kind (email, phone, IBAN, ID, CR, VAT, address) still works — only
`PERSON` / `ORGANIZATION` need the models. Point at your own copy any time with
`APII_NER_MODEL` / `APII_NER_EN_MODEL`, or change the source repo with
`APII_NER_HF_REPO`.

Extras you can pick: `ner`, `cli`, `proxy` (streaming gateway), `documents`
(pdf/docx/xlsx), `perf` (`pyahocorasick` + `onnxruntime`), and `dev` (`pytest`).
Core (`pip install -e .`) is just regex + checksums.

---

## Ways to use it — pick what fits

### 1. Claude Code (above) — the transparent, zero-friction path.

### 2. CLI — text, files, and folders
```bash
# free text or a .txt file → tokens (mapping saved to a vault), then restore:
echo "call 0501234567, email omar@aajil.sa" | apii redact --vault demo.vault
apii restore answer.txt --vault demo.vault   # the model's tokens → real values

apii detect notes.txt                        # audit only: detections as JSON

# whole folders, format-aware (csv / json / docx / xlsx / pdf→txt):
apii scan-dir ./statements --ext csv --out audit.jsonl
apii redact-dir ./statements --out-dir ./masked --ext csv --vault s.vault
```
> `apii redact <file>` reads the file as text. For **documents**
> (pdf/docx/xlsx/json) use `redact-dir` or the UI — they preserve layout.

### 3. Local UI — paste-in / paste-out (+ file upload)
```bash
apii ui    # opens http://127.0.0.1:8765 — paste text or drop a CSV/Excel,
           # take the tokens to any LLM, paste the reply back to restore.
```

### 4. As a library — embed it in your own app
```python
from apii.anonymizer import Anonymizer

a = Anonymizer(secret="your-secret", tenant="acme")
r = a.anonymize("Email omar@aajil.sa, IBAN SA0380000000608010167519")
send_to_llm(r.text)                      # the model sees only tokens
show_user(a.deanonymize(model_reply))    # real values restored locally
```

### 5. Drop-in proxy — protect an app you can't modify
```bash
pip install "apii[proxy]"    # or, from a source checkout: pip install -e ".[proxy]"
apii serve    # local OpenAI-/Anthropic-compatible gateway on 127.0.0.1:8720
# point your client's base URL at it (e.g. ANTHROPIC_BASE_URL=http://127.0.0.1:8720);
# it anonymizes the request and de-anonymizes the streamed response, transparently.
```
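
De-anonymizing a *streamed* response has one wrinkle: a token can arrive split across two chunks. The sketch below shows one way to handle that, and is not `apii`'s actual implementation. It assumes a hypothetical token shape (`KIND_` plus 8 uppercase hex digits) and uses a plain dict in place of the vault; `StreamRestorer` and the sample token are made up for illustration.

```python
import re

# Assumed token shape for this sketch: KIND_ followed by 8 uppercase hex digits.
TOKEN_RE = re.compile(r"\b[A-Z_]+_[0-9A-F]{8}\b")
# Trailing run of characters that could still be the start of a token.
TAIL_RE = re.compile(r"[A-Z0-9_]*\Z")


class StreamRestorer:
    """Restore tokens in a chunked stream without splitting a token."""

    def __init__(self, vault: dict[str, str]) -> None:
        self.vault = vault  # token -> real value (stand-in for the real vault)
        self.buf = ""

    def _restore(self, text: str) -> str:
        # Unknown tokens are passed through unchanged.
        return TOKEN_RE.sub(lambda m: self.vault.get(m.group(0), m.group(0)), text)

    def feed(self, chunk: str) -> str:
        """Add a chunk; return the text that is safe to show so far."""
        self.buf += chunk
        cut = TAIL_RE.search(self.buf).start()  # hold back a possible partial token
        ready, self.buf = self.buf[:cut], self.buf[cut:]
        return self._restore(ready)

    def flush(self) -> str:
        """Call at end of stream to emit whatever is still held back."""
        ready, self.buf = self.buf, ""
        return self._restore(ready)


restorer = StreamRestorer({"EMAIL_0A1B2C3D": "omar@aajil.sa"})
chunks = ["Contact EMA", "IL_0A1B", "2C3D today", "."]
restored = "".join(restorer.feed(c) for c in chunks) + restorer.flush()
print(restored)
```

The hold-back rule is deliberately conservative: any trailing run of uppercase letters, digits, or underscores waits for the next chunk, so ordinary text may be displayed a chunk late but a token is never restored in pieces.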

---

## What it detects

| kind | how |
|---|---|
| `EMAIL` | format |
| `PHONE` | GCC country codes, Saudi 05X, intl shapes |
| `IBAN` | ISO 7064 **MOD 97-10** checksum (all 6 GCC countries) |
| `TAX_NUMBER` | 15-digit Saudi / GCC VAT |
| `COMMERCIAL_REGISTRATION` | 10-digit CR, label-cued |
| `NATIONAL_ID` | UAE-784 / Saudi-Iqama / GCC |
| `PERSON` | **on-device NER** (no name lists, no regex) |
| `ORGANIZATION` | **on-device NER** |
| `ADDRESS` | PO-box / street regex + NER locations |

Quality is measured against a **1,340-span corpus of real, publicly-sourced
values** (`tests/eval/`) — `pytest tests/python -q` runs it.
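
For a feel of what the checksum-backed rows do, here is a minimal sketch of the IBAN check (ISO 7064 MOD 97-10), restricted to the six GCC country codes. It is an illustration, not the code in `apii/recognizers/`; the function name `iban_is_valid` and the `GCC_IBAN_LENGTHS` table are assumptions of this sketch, with lengths taken from the public IBAN registry.

```python
# IBAN lengths per GCC country code (from the public IBAN registry).
GCC_IBAN_LENGTHS = {"SA": 24, "AE": 23, "QA": 29, "KW": 30, "BH": 22, "OM": 23}


def iban_is_valid(iban: str) -> bool:
    """Return True if `iban` is a GCC IBAN that passes the MOD 97-10 check."""
    s = "".join(iban.split()).upper()          # drop spaces, normalize case
    if not (s.isascii() and s.isalnum()):
        return False
    if GCC_IBAN_LENGTHS.get(s[:2]) != len(s):  # unknown country or wrong length
        return False
    rearranged = s[4:] + s[:4]                 # move country code + check digits to the end
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_is_valid("SA0380000000608010167519"))   # the IBAN used in the library example
print(iban_is_valid("SA0380000000608010167518"))   # same IBAN with the last digit changed
```

A checksum like this is what lets a detector reject most random digit strings that merely look like an IBAN.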

---

## Command reference

| command | what it does |
|---|---|
| `apii redact [file]` | Anonymize text (stdin or a text file) → stdout; save the token↔value map to `--vault`. |
| `apii restore [file] --vault V` | Reverse it: tokens → real values, using the vault. |
| `apii detect [file]` | Audit mode — list detections as JSON, redact nothing. |
| `apii scan-dir DIR --out F` | Detect across a folder; write per-file JSONL summaries + totals. |
| `apii redact-dir DIR --out-dir D` | Redact every matching file (format-aware) into `--out-dir`, merging records into one `--vault`. |
| `apii ui` | Local paste-in / paste-out web UI + file upload (`127.0.0.1:8765`). |
| `apii serve` | Local anonymizing LLM proxy — `/v1/messages`, `/v1/chat/completions`, `/v1/responses` (needs `[proxy]`). |
| `apii watch` | Side-viewer: tail the **current folder's** Claude session, restoring tokens for *your* screen. `--once` dumps the session so far. |
| `apii install-claude-hook` | Wire redact-on-read + restore-on-write into Claude Code in one command (`--global` for all projects). |
| `apii hook` | The per-event hook itself (stdin event JSON → response JSON); used by the installed hooks. |
| `apii daemon` | Long-lived local hook daemon (`POST /hook`) — avoids a process spawn per event. |
| `apii hook-client` | Thin bridge that relays a hook event to a running `daemon`. |

Common flags: `--secret` (or `$APII_SECRET`), `--tenant`, `--vault`,
`--policy strict|balanced|audit`, `--no-ner`. Run `apii <cmd> --help` for the rest.

---

## Environment variables

| var | purpose |
|---|---|
| `APII_SECRET` | Vault HMAC / encryption key. Falls back to the managed `~/.apii/secret` (auto-created, `chmod 600`). |
| `APII_HOME` | Config + vault directory (default `~/.apii`). |
| `APII_POLICY` | Default policy: `strict` (default) / `balanced` / `audit`. |
| `APII_NER_CASE_AUG` | Lowercase-name recovery: `auto` (default — fires on fully-lowercase input) / `always` (mixed-case too) / `off`. |
| `APII_NER_THRESHOLD` | NER minimum confidence (default `0.85`). |
| `APII_NER_MODEL` / `APII_NER_EN_MODEL` | Use your own local Arabic / English ONNX model dirs (override the auto-download). |
| `APII_NER_HF_REPO` | Hugging Face repo to fetch models from (default `aajil-labs-sa/arabic-pii-ner`). |
| `APII_NER_NO_DOWNLOAD` | Set to disable the model auto-download (fully offline). |
| `APII_ANTHROPIC_BASE` / `APII_OPENAI_BASE` | Upstream targets for `apii serve`. |
| `APII_SUPPRESS_PHRASES` | Path to a phrase file of structural vocabulary to never tokenize. |
| `APII_GEO_GAZETTEER` | Path to an optional geo gazetteer for address detection. |
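
If you wrap `apii` in your own tooling, it can help to resolve these variables in one place. The sketch below reads a few of them with the defaults documented in the table. `Settings` and `load_settings` are hypothetical names, not part of `apii`, and treating any non-empty `APII_NER_NO_DOWNLOAD` as "set" is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

POLICIES = {"strict", "balanced", "audit"}


@dataclass(frozen=True)
class Settings:
    home: Path
    policy: str
    ner_threshold: float
    ner_download: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve a subset of the APII_* variables, using the documented defaults."""
    policy = env.get("APII_POLICY", "strict")
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    return Settings(
        home=Path(env.get("APII_HOME", "~/.apii")).expanduser(),
        policy=policy,
        ner_threshold=float(env.get("APII_NER_THRESHOLD", "0.85")),
        # Assumption: any non-empty value disables the auto-download.
        ner_download=not env.get("APII_NER_NO_DOWNLOAD"),
    )


print(load_settings({"APII_POLICY": "audit", "APII_NER_NO_DOWNLOAD": "1"}))
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.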

---

## How it stays private (the privacy model)

Two separate boundaries — that's the whole trick:

- **Privacy boundary** = what the LLM receives → **only tokens**, always.
- **Display boundary** = what *you* see → real values, because it's your data on
  your machine.

The bridge is a local, encrypted **vault** (`~/.apii/default.vault`, ChaCha20)
plus a secret (`~/.apii/secret`, `chmod 600`). Tokens are derived as
`HMAC-SHA256(secret, value)`: deterministic and one-way, so a token can only be
turned back by looking it up in the vault, which needs your secret to decrypt.
Restoration is applied at the **last mile** (your screen, your files) and
**never re-enters the model's context**.
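
To make the token idea concrete, here is a minimal sketch of deterministic HMAC-SHA256 tokens with a lookup table standing in for the vault. Everything specific here is an assumption for illustration: the token layout (`KIND_` plus the first 8 hex digits), the way tenant and kind are mixed into the message, and the names `make_token` and `tokenize`. The real vault is encrypted on disk; a plain dict is used only to keep the example short.

```python
import hashlib
import hmac


def make_token(secret: bytes, tenant: str, kind: str, value: str) -> str:
    """Derive a stable token for `value`; the layout is an assumption of this sketch."""
    msg = f"{tenant}\x00{kind}\x00{value}".encode("utf-8")
    digest = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{kind}_{digest[:8].upper()}"


# Stand-in for the vault: token -> real value.
vault: dict[str, str] = {}


def tokenize(secret: bytes, tenant: str, kind: str, value: str) -> str:
    """Create the token and remember the mapping so it can be restored later."""
    token = make_token(secret, tenant, kind, value)
    vault[token] = value
    return token


first = tokenize(b"your-secret", "acme", "EMAIL", "omar@aajil.sa")
second = tokenize(b"your-secret", "acme", "EMAIL", "omar@aajil.sa")
print(first == second)   # same secret + same value gives the same token
print(vault[first])      # restoring is a vault lookup, not an inversion of the HMAC
```

Because the HMAC itself cannot be inverted, losing the vault means losing the ability to restore, even if you still hold the secret.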

---

## Make it your own

This is a normal, self-contained Python package — **it's yours to run, change,
and ship privately. You never have to publish it anywhere or run a server.**

- **Customize detection:** the recognizers live in `apii/recognizers/` — edit a
  regex, tune a checksum, add a country shape.
- **Swap the NER models:** point `APII_NER_MODEL` / `APII_NER_EN_MODEL` at your
  own ONNX models, or set `APII_NER_HF_REPO` to your own Hugging Face repo.
- **Change token formats, policy, vault location** (`APII_HOME`), tenants, etc.
- **Stay fully offline:** clone, `pip install -e .`, bring the NER models locally
  — no cloud, no PyPI, no service, ever.

It's built to be forked and made internal. Keep it private; it's yours.

---

## NER models & credit

The bundled models are int8-ONNX redistributions of two open models — **please
keep crediting the original authors**:

- **Arabic** — [`hatmimoha/arabic-ner`](https://huggingface.co/hatmimoha/arabic-ner)
  by Hatim Mohamed (on [`asafaya/bert-base-arabic`](https://huggingface.co/asafaya/bert-base-arabic) by Ali Safaya).
- **English** — [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER)
  by David S. Lim (MIT, CoNLL-2003).

Hosted (quantized) at
[`aajil-labs-sa/arabic-pii-ner`](https://huggingface.co/aajil-labs-sa/arabic-pii-ner)
with full provenance + SHAs.

---

## License

**© Aajil Labs.** Dual-licensed — your choice of **MIT** *or* **Apache-2.0**
(see `LICENSE-MIT` and `LICENSE-APACHE`).

You may use, modify, and redistribute this software (including privately and
commercially) under either license. **You must keep the copyright and license
notices** in copies and substantial portions. The bundled NER models are
redistributed under their original authors' terms — credit them as above.

This is yours to build on — just respect the license.
