Metadata-Version: 2.4
Name: apii
Version: 0.1.1
Summary: Arabic/GCC PII detection, tokenization, and streaming-interception gateway.
Project-URL: Homepage, https://github.com/Aajil-Labs/arabic-pii-py
Project-URL: Repository, https://github.com/Aajil-Labs/arabic-pii-py
Project-URL: NER models, https://huggingface.co/aajil-labs-sa/arabic-pii-ner
Author-email: Aajil Labs <omar@aajil.sa>
License: MIT OR Apache-2.0
License-File: LICENSE-APACHE
License-File: LICENSE-MIT
Keywords: anonymization,arabic,gcc,llm,ner,pii,privacy,redaction,tokenization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Arabic
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Security
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: msgspec>=0.18
Requires-Dist: regex>=2024.0
Requires-Dist: typer>=0.12
Provides-Extra: all
Requires-Dist: cryptography>=42.0; extra == 'all'
Requires-Dist: fastapi>=0.110; extra == 'all'
Requires-Dist: httpx>=0.27; extra == 'all'
Requires-Dist: huggingface-hub>=0.20; extra == 'all'
Requires-Dist: numpy>=1.24; extra == 'all'
Requires-Dist: onnxruntime>=1.17; extra == 'all'
Requires-Dist: pypdf>=4.0; extra == 'all'
Requires-Dist: tokenizers>=0.15; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'all'
Provides-Extra: cli
Requires-Dist: cryptography>=42.0; extra == 'cli'
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: documents
Requires-Dist: pypdf>=4.0; extra == 'documents'
Provides-Extra: ner
Requires-Dist: huggingface-hub>=0.20; extra == 'ner'
Requires-Dist: numpy>=1.24; extra == 'ner'
Requires-Dist: onnxruntime>=1.17; extra == 'ner'
Requires-Dist: tokenizers>=0.15; extra == 'ner'
Provides-Extra: perf
Requires-Dist: onnxruntime>=1.17; extra == 'perf'
Requires-Dist: pyahocorasick>=2.0; extra == 'perf'
Provides-Extra: proxy
Requires-Dist: fastapi>=0.110; extra == 'proxy'
Requires-Dist: httpx>=0.27; extra == 'proxy'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'proxy'
Description-Content-Type: text/markdown

# arabic-pii-py · `apii`

[![PyPI](https://img.shields.io/pypi/v/apii)](https://pypi.org/project/apii/) [![Python](https://img.shields.io/pypi/pyversions/apii)](https://pypi.org/project/apii/) [![License](https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue)](#license)

```bash
pip install apii
```

> **Keep Arabic & GCC personal data off the LLM — without changing how you work.**
> Names, IBANs, national IDs, phones, emails, addresses get swapped for
> reversible tokens **on your machine**, before anything reaches Claude / GPT.
> You keep seeing the real values. The model never does.

Everything runs **locally**. No cloud, no account, no data leaves your laptop
(the only network call is a one-time, optional model download you can replace).

---

## Why this exists

If you work with GCC customer data — banks, telcos, government, clinics — you
often legally **cannot** send that PII to a US-hosted LLM. `apii` lets you use
Claude / GPT on that data anyway: the personal data stays on your machine, the
model only ever sees placeholders like `EMAIL_C7E2…`, and the real values are
restored locally for *you*.

- 🇸🇦 **Built for Arabic & the GCC** — Saudi / Emirati / Qatari / Kuwaiti /
  Bahraini / Omani shapes, IBAN ISO-7064 (MOD-97), national-ID checksums, and
  on-device Arabic + English NER for names & organizations.
- 💻 **100% local** — no service to run, no API key required, nothing uploaded.
- 🪶 **Lightweight** — pure Python, **no PyTorch**; NER runs as int8 ONNX.
- 🔁 **Reversible & stable** — the same value always maps to the same token, and
  only *your* secret can turn it back.

---

## ⭐ The headline: transparent PII protection inside Claude Code

This is the part most tools can't do. With one command, `apii` wires two hooks
into Claude Code, and `apii watch` adds a local viewer alongside them:

| | what happens | result |
|---|---|---|
| **redact-on-read** | when Claude reads a file, PII in it is tokenized *before the model sees it* | Claude only ever sees `EMAIL_…`, `IBAN_…`, `PERSON_…` |
| **restore-on-write** | when Claude writes/edits a file, the tokens are turned back into real values *before the bytes hit disk* | your **files come out correct**, while the chat still shows only tokens |
| **`apii watch`** | a side pane restores Claude's tokenized replies locally | **you** read the real values; Anthropic still only got tokens |

It's **non-blocking** — you work normally; Claude just never receives the PII.

```bash
# in the project where you keep customer data:
apii install-claude-hook          # one time — wires both hooks

# open a fresh Claude Code session there (it loads the hook at startup)
claude

# and, in another terminal pane IN THE SAME PROJECT FOLDER, watch real values:
apii watch          # follows this folder's session; --once dumps it so far
```

Now ask Claude to work on a file with PII. In the chat you'll see tokens; in the
`apii watch` pane you'll see the real data; and your customer's information never
left your machine.

---

## 🤖 Or just hand it to your coding agent

Don't want to wire anything up? **Give your agent the skill** and it learns to
use `apii` on its own — redacting PII before it reads a file or calls a model.
[`skills/apii/SKILL.md`](skills/apii/SKILL.md) is the portable
[Agent Skills](https://agentskills.io) format, so the same file works in
**Claude Code, Codex, Cursor**, and any tool that reads it.

```bash
# Claude Code — every project:
cp -r skills/apii ~/.claude/skills/     # then just talk to it (auto-loads), or run /apii
# or only this project:
cp -r skills/apii .claude/skills/
# any other agent: paste skills/apii/SKILL.md into its context or skills folder.
```

**Which path do I pick?** They stack — use as many as you like:

| | what it is | best when |
|---|---|---|
| **Skill** `skills/apii/` | teaches the agent to *deliberately* redact / restore | the agent drives; any tool; zero setup |
| **Hook** `apii install-claude-hook` | automatic redact-on-read + restore-on-write | you want it invisible & enforced — Claude Code only |
| **Proxy** `apii serve` | a hard *transport* boundary — the provider literally can't see PII | you don't control the client, or want it provider-wide |

My take: **skill + hook** is the sweet spot for daily Claude Code work — the
hook guarantees protection even when the agent forgets, the skill makes the
agent *smart* about using it. Reach for the **proxy** when the guarantee has to
hold at the wire, for clients you don't control.

---

## Install

```bash
pip install "apii[all]"     # the whole tool: CLI + NER + proxy + documents
```

Requires **Python 3.10+**. `apii[all]` is what you want to *use* apii — the CLI,
the Claude Code hook, the proxy. **Embedding it as a library instead?** Stay lean
and add only what you touch:

| install | what you get |
|---|---|
| `pip install apii` | core detection (regex + checksums) + the `apii` CLI |
| `pip install "apii[ner]"` | + on-device **PERSON / ORGANIZATION** (names & orgs) |
| `pip install "apii[cli]"` | + encrypted-vault persistence (`--vault`) |
| `pip install "apii[proxy]"` | + the streaming `apii serve` gateway |
| `pip install "apii[documents]"` | + PDF text (docx / xlsx / csv / json are built in) |
| `pip install "apii[all]"` | **everything above** |

**NER models** (names & organizations) auto-download once (~210 MB, int8 ONNX)
from Hugging Face and cache under `~/.cache/huggingface`. Without them, every
*structured* kind (email, phone, IBAN, ID, CR, VAT, address) still works — only
`PERSON` / `ORGANIZATION` need the models. Point at your own copy any time with
`APII_NER_MODEL` / `APII_NER_EN_MODEL`, or change the source repo with
`APII_NER_HF_REPO`.

To hack on it instead, clone the repo and `pip install -e ".[all]"` — see
**Make it your own** below.

---

## Ways to use it — pick what fits

### 1. Claude Code (above) — the transparent, zero-friction path.

### 2. CLI — text, files, and folders
```bash
# free text or a .txt file → tokens (mapping saved to a vault), then restore:
echo "call 0501234567, email omar@aajil.sa" | apii redact --vault demo.vault
apii restore answer.txt --vault demo.vault   # the model's tokens → real values

apii detect notes.txt                        # audit only: detections as JSON

# whole folders, format-aware (csv / json / docx / xlsx / pdf→txt):
apii scan-dir ./statements --ext csv --out audit.jsonl
apii redact-dir ./statements --out-dir ./masked --ext csv --vault s.vault
```
> `apii redact <file>` reads the file as text. For **documents**
> (pdf/docx/xlsx/json) use `redact-dir` or the UI — they preserve layout.

### 3. Local UI — paste-in / paste-out (+ file upload)
```bash
apii ui    # opens http://127.0.0.1:8765 — paste text or drop a CSV/Excel,
           # take the tokens to any LLM, paste the reply back to restore.
```

### 4. As a library — embed it in your own app
```python
from apii.anonymizer import Anonymizer

a = Anonymizer(secret="your-secret", tenant="acme")
r = a.anonymize("Email omar@aajil.sa, IBAN SA0380000000608010167519")
# send_to_llm / show_user are placeholders for your own LLM call and display code
model_reply = send_to_llm(r.text)        # the model sees only tokens
show_user(a.deanonymize(model_reply))    # real values restored locally
```

### 5. Drop-in proxy — one local endpoint, every provider 🔌

Run one gateway and point **any** LLM client at it. `apii` tokenizes each
request, sends **only tokens** upstream, and restores the (streamed) reply — the
client never changes, the provider never sees PII, and your own API key just
passes through (`apii` never stores it).

```bash
pip install "apii[proxy]"
apii serve                 # → http://127.0.0.1:8720   (--host / --port to change)
```

One port speaks three wire formats, so the same gateway fronts OpenAI,
Anthropic, **Codex**, and anything OpenAI-compatible — **OpenRouter, LiteLLM**,
Together, vLLM, …:

| your client | point it at | upstream env |
|---|---|---|
| OpenAI SDK / chat apps | `base_url = http://127.0.0.1:8720/v1` | `APII_OPENAI_BASE` *(default `api.openai.com`)* |
| **Codex CLI** *(Responses API)* | a custom model-provider → `:8720/v1`, `wire_api = "responses"` | `APII_OPENAI_BASE` |
| Claude Code / Anthropic SDK | `ANTHROPIC_BASE_URL = http://127.0.0.1:8720` | `APII_ANTHROPIC_BASE` *(default `api.anthropic.com`)* |
| OpenRouter · LiteLLM · any OpenAI-compatible | the OpenAI `base_url` above | set `APII_OPENAI_BASE` to that provider |

```bash
# e.g. route OpenAI-style traffic through OpenRouter, PII-safe:
APII_OPENAI_BASE=https://openrouter.ai/api/v1 apii serve
# your app: base_url = http://127.0.0.1:8720/v1  + your OpenRouter key, as usual
```

```toml
# e.g. point the real Codex CLI at apii — ~/.codex/config.toml
model = "gpt-4o-mini"           # any model your upstream serves
model_provider = "apii"
[model_providers.apii]
base_url = "http://127.0.0.1:8720/v1"
wire_api = "responses"
env_key  = "OPENAI_API_KEY"     # OPENROUTER_API_KEY if APII_OPENAI_BASE → OpenRouter
```
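
Restoring a *streamed* reply has one wrinkle: a token can be split across two chunks. The sketch below shows one way to handle that with a small hold-back buffer. It is an illustration, not `apii`'s actual implementation: the token shape (`KIND_` plus 8 hex characters), the `restore_stream` name, and the plain-dict vault are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumed token shape for this sketch only: KIND_ followed by 8 hex characters.
TOKEN_RE = re.compile(r"[A-Z]+(?:_[A-Z]+)*_[0-9A-F]{8}")
# Trailing run of characters that could still grow into a token.
TAIL_RE = re.compile(r"[A-Z0-9_]*\Z")


def restore_stream(chunks: Iterable[str], vault: Dict[str, str]) -> Iterator[str]:
    """Yield restored text, holding back any possibly incomplete token."""

    def swap(text: str) -> str:
        # Unknown tokens pass through unchanged.
        return TOKEN_RE.sub(lambda m: vault.get(m.group(0), m.group(0)), text)

    buffer = ""
    for chunk in chunks:
        buffer += chunk
        cut = TAIL_RE.search(buffer).start()  # start of the held-back tail
        ready, buffer = buffer[:cut], buffer[cut:]
        if ready:
            yield swap(ready)
    if buffer:  # flush whatever is left when the stream ends
        yield swap(buffer)
```

Because every token character is an uppercase letter, digit, or underscore in this sketch, holding back the trailing run of such characters is enough to guarantee that no token is ever cut in half before it is looked up.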

> Every route is verified end-to-end — **streaming and non-streaming** — against
> a live provider, including the real Codex CLI: the provider gets only tokens,
> your client gets the real values back.

---

## What it detects

| kind | how |
|---|---|
| `EMAIL` | format |
| `PHONE` | GCC country codes, Saudi 05X, intl shapes |
| `IBAN` | ISO-7064 **MOD-97** checksum (all 6 GCC countries) |
| `TAX_NUMBER` | 15-digit Saudi / GCC VAT |
| `COMMERCIAL_REGISTRATION` | 10-digit CR, label-cued |
| `NATIONAL_ID` | UAE-784 / Saudi-Iqama / GCC |
| `PERSON` | **on-device NER** (no name lists, no regex) |
| `ORGANIZATION` | **on-device NER** |
| `ADDRESS` | PO-box / street regex + NER locations |

Quality is measured against a **1,340-span corpus of real, publicly sourced
values** (`tests/eval/`); run it with `pytest tests/python -q`.
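
To make the `IBAN` row concrete, here is a minimal, standard-library sketch of the ISO-7064 MOD-97 check restricted to the six GCC country codes. It is not the recognizer shipped in `apii/recognizers/`; the function name and the length table are written for this example (the lengths are the registered IBAN lengths for each country).

```python
import re

# Registered IBAN lengths for the six GCC countries.
GCC_IBAN_LENGTHS = {"SA": 24, "AE": 23, "QA": 29, "KW": 30, "BH": 22, "OM": 23}


def is_valid_gcc_iban(candidate: str) -> bool:
    """Return True if `candidate` is a GCC IBAN whose MOD-97 check passes."""
    iban = re.sub(r"[\s-]", "", candidate).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]+", iban):
        return False
    if GCC_IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False
    # Move country code + check digits to the end, then map A..Z to 10..35.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

The checksum is what lets a detector reject random 24-character strings that merely look like an IBAN: `is_valid_gcc_iban("SA0380000000608010167519")` passes, while changing any single digit makes the remainder differ from 1.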

---

## Command reference

| command | what it does |
|---|---|
| `apii redact [file]` | Anonymize text (stdin or a text file) → stdout; save the token↔value map to `--vault`. |
| `apii restore [file] --vault V` | Reverse it: tokens → real values, using the vault. |
| `apii detect [file]` | Audit mode — list detections as JSON, redact nothing. |
| `apii scan-dir DIR --out F` | Detect across a folder; write per-file JSONL summaries + totals. |
| `apii redact-dir DIR --out-dir D` | Redact every matching file (format-aware) into `--out-dir`, merging records into one `--vault`. |
| `apii ui` | Local paste-in / paste-out web UI + file upload (`127.0.0.1:8765`). |
| `apii serve` | Local anonymizing LLM proxy — `/v1/messages`, `/v1/chat/completions`, `/v1/responses` (needs `[proxy]`). |
| `apii watch` | Side-viewer: tail the **current folder's** Claude session, restoring tokens for *your* screen. `--once` dumps the session so far. |
| `apii install-claude-hook` | Wire redact-on-read + restore-on-write into Claude Code in one command (`--global` for all projects). |
| `apii hook` | The per-event hook itself (stdin event JSON → response JSON); used by the installed hooks. |
| `apii daemon` | Long-lived local hook daemon (`POST /hook`) — avoids a process spawn per event. |
| `apii hook-client` | Thin bridge that relays a hook event to a running `daemon`. |

Common flags: `--secret` (or `$APII_SECRET`), `--tenant`, `--vault`,
`--policy strict|balanced|audit`, `--no-ner`. Run `apii <cmd> --help` for the rest.
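
As a rough picture of how the secret could be resolved, the sketch below tries an explicit flag value first, then `$APII_SECRET`, then a managed file under `APII_HOME` (see **Environment variables**). The precedence order, the `resolve_secret` name, and the 32-byte random secret are assumptions for illustration; check `apii <cmd> --help` for the real behaviour.

```python
from __future__ import annotations

import os
import secrets
from collections.abc import Mapping
from pathlib import Path


def resolve_secret(flag: str | None = None, env: Mapping[str, str] | None = None) -> str:
    """Pick a secret: explicit flag, then $APII_SECRET, then a managed file."""
    env = os.environ if env is None else env
    if flag:
        return flag
    if env.get("APII_SECRET"):
        return env["APII_SECRET"]
    home = Path(env.get("APII_HOME") or Path.home() / ".apii")
    secret_file = home / "secret"
    if not secret_file.exists():
        home.mkdir(parents=True, exist_ok=True)
        # O_EXCL + mode 0o600 creates the file owner-only from the start.
        fd = os.open(secret_file, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(secrets.token_hex(32))
    return secret_file.read_text().strip()
```

Creating the file with mode `0o600` at open time, rather than writing it and running `chmod` afterwards, avoids a short window in which other users could read it.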

---

## Environment variables

| var | purpose |
|---|---|
| `APII_SECRET` | Vault HMAC / encryption key. Falls back to the managed `~/.apii/secret` (auto-created, `chmod 600`). |
| `APII_HOME` | Config + vault directory (default `~/.apii`). |
| `APII_POLICY` | Default policy: `strict` (default) / `balanced` / `audit`. |
| `APII_NER_CASE_AUG` | Lowercase-name recovery: `auto` (default — fires on fully-lowercase input) / `always` (mixed-case too) / `off`. |
| `APII_NER_THRESHOLD` | NER minimum confidence (default `0.85`). |
| `APII_NER_MODEL` / `APII_NER_EN_MODEL` | Use your own local Arabic / English ONNX model dirs (override the auto-download). |
| `APII_NER_HF_REPO` | Hugging Face repo to fetch models from (default `aajil-labs-sa/arabic-pii-ner`). |
| `APII_NER_NO_DOWNLOAD` | Set to disable the model auto-download (fully offline). |
| `APII_ANTHROPIC_BASE` / `APII_OPENAI_BASE` | Upstream targets for `apii serve`. |
| `APII_SUPPRESS_PHRASES` | Path to a phrase file of structural vocabulary to never tokenize. |
| `APII_GEO_GAZETTEER` | Path to an optional geo gazetteer for address detection. |

---

## How it stays private (the privacy model)

Two separate boundaries — that's the whole trick:

- **Privacy boundary** = what the LLM receives → **only tokens**, always.
- **Display boundary** = what *you* see → real values, because it's your data on
  your machine.

The bridge is a local, encrypted **vault** (`~/.apii/default.vault`, ChaCha20)
plus a secret (`~/.apii/secret`, `chmod 600`). Tokens are
`HMAC-SHA256(secret, value)`: deterministic, and one-way on their own, so
turning a token back into its value takes the vault's token↔value map, which
only your secret unlocks. Restoration is applied at the **last mile** (your
screen, your files) and **never re-enters the model's context**.
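
The token scheme can be sketched with the standard library alone. This is a simplified illustration, not `apii`'s code: the `KIND_` prefix with 8 uppercase hex characters, the helper names, and the in-memory dict standing in for the encrypted vault are assumptions for the example.

```python
import hashlib
import hmac


def make_token(secret: str, kind: str, value: str, hex_chars: int = 8) -> str:
    """Derive a stable placeholder such as EMAIL_1A2B3C4D (format is assumed)."""
    digest = hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return f"{kind}_{digest.hexdigest()[:hex_chars].upper()}"


# An HMAC cannot be inverted, so restoring needs a local token -> value map.
# A plain dict stands in for the encrypted vault here.
vault: dict[str, str] = {}


def tokenize(secret: str, kind: str, value: str) -> str:
    token = make_token(secret, kind, value)
    vault[token] = value
    return token


def restore(token: str) -> str:
    # Unknown tokens are returned unchanged.
    return vault.get(token, token)
```

The same secret and value always yield the same token, which is what keeps placeholders stable across files and sessions, while a different secret yields an unrelated token for the same value.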

---

## Make it your own

This is a normal, self-contained Python package — **it's yours to run, change,
and ship privately. You never have to publish it anywhere or run a server.**

- **Customize detection:** the recognizers live in `apii/recognizers/` — edit a
  regex, tune a checksum, add a country shape.
- **Swap the NER models:** point `APII_NER_MODEL` / `APII_NER_EN_MODEL` at your
  own ONNX models, or set `APII_NER_HF_REPO` to your own Hugging Face repo.
- **Change token formats, policy, vault location** (`APII_HOME`), tenants, etc.
- **Stay fully offline:** clone, `pip install -e .`, bring the NER models locally
  — no cloud, no PyPI, no service, ever.

It's built to be forked and made internal. Keep it private; it's yours.

---

## NER models & credit

The default NER models are int8-ONNX redistributions of two open models.
**Please keep crediting the original authors**:

- **Arabic** — [`hatmimoha/arabic-ner`](https://huggingface.co/hatmimoha/arabic-ner)
  by Hatim Mohamed (on [`asafaya/bert-base-arabic`](https://huggingface.co/asafaya/bert-base-arabic) by Ali Safaya).
- **English** — [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER)
  by David S. Lim (MIT, CoNLL-2003).

Hosted (quantized) at
[`aajil-labs-sa/arabic-pii-ner`](https://huggingface.co/aajil-labs-sa/arabic-pii-ner)
with full provenance + SHAs.

---

## License

**© Aajil Labs.** Dual-licensed — your choice of **MIT** *or* **Apache-2.0**
(see `LICENSE-MIT` and `LICENSE-APACHE`).

You may use, modify, and redistribute this software (including privately and
commercially) under either license. **You must keep the copyright and license
notices** in copies and substantial portions. The NER models are
redistributed under their original authors' terms; credit them as above.

This is yours to build on — just respect the license.
