Metadata-Version: 2.4
Name: claude-anonymizer
Version: 0.2.1
Summary: Drop-in redaction proxy for the Anthropic API — anonymize prompts, deanonymize responses, log the redacted form for compliance.
Author-email: Mikhail Shchegolev <mshegolev@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/mshegolev/claude-anonymizer
Project-URL: Documentation, https://github.com/mshegolev/claude-anonymizer#readme
Project-URL: Source, https://github.com/mshegolev/claude-anonymizer
Project-URL: Issues, https://github.com/mshegolev/claude-anonymizer/issues
Project-URL: Changelog, https://github.com/mshegolev/claude-anonymizer/blob/main/CHANGELOG.md
Project-URL: Releases, https://github.com/mshegolev/claude-anonymizer/releases
Keywords: anthropic,claude,anonymizer,gdpr,redaction,proxy,compliance,pii,mitmproxy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Internet :: Proxy Servers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: ruff>=0.6; extra == "dev"
Requires-Dist: aiohttp>=3.9; extra == "dev"
Requires-Dist: httpx[socks]>=0.27; extra == "dev"
Requires-Dist: cryptography>=42; extra == "dev"
Requires-Dist: jsonschema>=4; extra == "dev"
Requires-Dist: pyyaml>=6; extra == "dev"
Requires-Dist: prometheus_client>=0.20; extra == "dev"
Provides-Extra: proxy
Requires-Dist: cryptography>=42; extra == "proxy"
Requires-Dist: mitmproxy<12,>=11; extra == "proxy"
Requires-Dist: pyyaml>=6; extra == "proxy"
Dynamic: license-file

# claude-anonymizer

[![ci](https://github.com/mshegolev/claude-anonymizer/actions/workflows/ci.yml/badge.svg)](https://github.com/mshegolev/claude-anonymizer/actions/workflows/ci.yml)
[![python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue.svg)](https://pypi.org/project/claude-anonymizer)
[![license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/mshegolev/claude-anonymizer/blob/main/LICENSE)

Drop-in redaction for the Anthropic API (and OpenAI, Gemini, anything that
talks HTTPS). Anonymize prompts before they leave your perimeter,
deanonymize responses before they reach the user, and keep a tamper-evident
audit trail for compliance review.

```
user                proxy                       upstream LLM
 |  "fix mts auth"  →  |
 |                     |  "fix Acme auth"      →   |
 |                     |     (audit JSONL appended) |
 |                     | ←  "Acme uses OAuth"      |
 |                     |     (audit JSONL appended) |
 |  "mts uses OAuth" ← |                           |
```

Two surfaces:

1. **Library** — wrap any Python callable that talks to an LLM and the
   redaction round-trip happens inline.
2. **System proxy** — TLS-intercepting HTTPS proxy daemon you point your
   CLI tools at (`HTTPS_PROXY=http://127.0.0.1:8080`). Works with Claude
   Code, OpenCode, Codex CLI, Gemini CLI, plain `curl`, etc.

The base package is **pure stdlib** at runtime; the proxy daemon
opts in to `cryptography`, `mitmproxy`, and `pyyaml` via the `[proxy]` extra.

---

## Install

```bash
# Library only (pure stdlib, no extras)
pip install claude-anonymizer

# Library + proxy daemon
pip install 'claude-anonymizer[proxy]'

# Development
pip install -e '.[dev,proxy]'
```

For the full **zero → Claude-Code-via-proxy walkthrough** (install CA,
start daemon, configure HTTPS_PROXY, verify redaction, inspect, uninstall),
see [**INSTALL.md**](https://github.com/mshegolev/claude-anonymizer/blob/main/INSTALL.md). For a one-shot bash bootstrap that
automates every step, run [`./install.sh`](install.sh).

---

## Library quick start

### Wrap an existing function

```python
from dataclasses import dataclass

from claude_anonymizer import Anonymizer, wrap_callable

anon = Anonymizer()  # defaults: mts | MTS | МТС → Acme

@dataclass
class SomeResult:
    output: str  # the attribute the wrapper restores

def call_claude(prompt: str) -> SomeResult:
    # your existing function; must return an object with a `.output: str` attr
    ...

safe_call = wrap_callable(call_claude, anonymizer=anon)
result = safe_call("я из компании mts")  # "I'm from the company mts"
# result.output is deanonymized; logs show the anonymized round-trip
```

The wrapper handles sync **and** async callables, dataclasses (frozen or
mutable) and plain classes. If the prompt contains no sensitive tokens,
the wrapper returns the very object your function produced, so an `is`
check against it still holds.
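
To make the round-trip concrete, here is a stdlib-only sketch of the pattern for the synchronous, frozen-dataclass case. It is an illustration under assumptions, not the library's implementation: `wrap`, `Result`, `MAPPING`, and `fake_llm` are hypothetical names, and async callables are not covered.

```python
from dataclasses import dataclass, replace

# Hypothetical stand-ins: none of these names belong to claude_anonymizer.
MAPPING = {"mts": "Acme"}


@dataclass(frozen=True)
class Result:
    output: str


def wrap(fn, mapping):
    """Return a callable that redacts the prompt and restores the output."""
    reverse = {placeholder: original for original, placeholder in mapping.items()}

    def wrapped(prompt: str):
        redacted = prompt
        for original, placeholder in mapping.items():
            redacted = redacted.replace(original, placeholder)
        result = fn(redacted)
        if redacted == prompt:
            return result  # nothing was redacted: hand back the same object
        restored = result.output
        for placeholder, original in reverse.items():
            restored = restored.replace(placeholder, original)
        # replace() builds a new instance, so frozen dataclasses work too
        return replace(result, output=restored)

    return wrapped


def fake_llm(prompt: str) -> Result:
    # Stands in for the upstream call; it only ever sees the redacted prompt.
    return Result(output=f"echo: {prompt}")


safe = wrap(fake_llm, MAPPING)
print(safe("fix mts auth").output)
```

Building a new object with `dataclasses.replace` instead of mutating `result.output` is what lets the same code path serve frozen and mutable dataclasses.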

### Run the `claude` CLI through it

```python
from claude_anonymizer import AnonymizingClaudeRunner

runner = AnonymizingClaudeRunner(model="claude-opus-4-7")
result = runner.run_sync("я из компании mts, как называется?")  # "I'm from the company mts, what is it called?"
print(result.output)            # "...МТС..."
```

### Custom mappings + canonical form

```python
from claude_anonymizer import Anonymizer

anon = Anonymizer(
    company_mappings={
        "mts": "Acme",
        "МТС": "Acme",
        "MTS": "Acme",
        "Internal-Project-Aurora": "Project-Y",
    },
    canonical_form="МТС",     # always restore to Russian uppercase
)
```

`canonical_form` collapses every original variant onto a single
user-facing string when deanonymizing — useful when multiple inputs
map to one placeholder upstream.
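
The collapse can be pictured as building the reverse (placeholder to original) map. The sketch below is an assumption about the behaviour described above, not the library's code: `build_restore_map` is a hypothetical helper, and the "first variant wins" fallback when no canonical form is given is a guess.

```python
from typing import Dict, Optional


def build_restore_map(
    mappings: Dict[str, str], canonical_form: Optional[str] = None
) -> Dict[str, str]:
    """Map each placeholder back to exactly one original string."""
    restore: Dict[str, str] = {}
    for original, placeholder in mappings.items():
        # Assumed fallback: without a canonical form, the first variant wins.
        restore.setdefault(placeholder, original)
    if canonical_form is not None and canonical_form in mappings:
        # The canonical form overrides whichever variant was picked above.
        restore[mappings[canonical_form]] = canonical_form
    return restore


restore = build_restore_map(
    {
        "mts": "Acme",
        "МТС": "Acme",
        "MTS": "Acme",
        "Internal-Project-Aurora": "Project-Y",
    },
    canonical_form="МТС",
)
print(restore)
```

Three variants go out as `Acme`, but only one string can come back, which is why the choice has to be made explicit.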

---

## Proxy daemon

The system proxy intercepts HTTPS via a generated root CA, redacts
outbound JSON bodies, restores inbound responses (buffered **and**
streamed SSE), and writes a tamper-evident JSONL audit log.

### One-shot install

```bash
pip install 'claude-anonymizer[proxy]'
anonymizer-proxy install-ca                    # generates CA, installs into OS trust store
anonymizer-proxy run                           # listens on 127.0.0.1:8080
```

Point your tool at the proxy:

```bash
export HTTPS_PROXY=http://127.0.0.1:8080
export SSL_CERT_FILE=$HOME/.compliance-proxy/ca/cert.pem
```

That's it. The first run writes a starter `~/.compliance-proxy/config.yaml`
you can customize.
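
If you launch the tool from Python instead of a shell, the same two variables can be set on the child process. A minimal sketch: `proxied_env` is a hypothetical helper, not part of the package, and the host, port, and CA path simply mirror the defaults shown above.

```python
import os
from pathlib import Path


def proxied_env(base=None, host="127.0.0.1", port=8080):
    """Copy an environment and add the proxy and CA-bundle variables."""
    env = dict(os.environ if base is None else base)
    env["HTTPS_PROXY"] = f"http://{host}:{port}"
    env["SSL_CERT_FILE"] = str(Path.home() / ".compliance-proxy" / "ca" / "cert.pem")
    return env


# Example (not executed here):
#   import subprocess
#   subprocess.run(["curl", "-s", "https://api.anthropic.com"], env=proxied_env())
```

Copying the environment rather than mutating `os.environ` keeps the proxy settings scoped to the one child process.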

### Subcommands

| Command | What it does |
|---------|-------------|
| `anonymizer-proxy install-ca [--dry-run] [--force] [--name-constraints HOSTS]` | Generate root CA + register with OS trust store |
| `anonymizer-proxy uninstall-ca [--keep-files]` | Unregister + optionally delete the keypair |
| `anonymizer-proxy run [--config PATH] [--host HOST] [--port N] [--health-port N] [--fail-mode strict\|pass-through]` | Start the proxy daemon |
| `anonymizer-proxy reload [--sock PATH]` | Hot-reload config via UNIX socket (also accepts SIGHUP) |
| `anonymizer-proxy status [--config PATH] [--json]` | Show config + audit-log rollup + CA state |
| `anonymizer-proxy analyze [--audit PATH] [--config PATH] [--top N] [--json] [--include-redacted]` | Surface PII-shaped tokens the detector chain missed (audit-log discovery) |

### Observability

The proxy exposes two HTTP endpoints on the health port (default 8081):

```bash
curl -s http://127.0.0.1:8081/healthz                 # → {"status": "ok"}
curl -s http://127.0.0.1:8081/metrics                 # Prometheus exposition
```

Metric families:

* `compliance_proxy_redacted_total{category="..."}` — counter, per category
* `compliance_proxy_redaction_latency_seconds_*` — histogram (phase = `redact`)
* `compliance_proxy_active_flows` — gauge
* `compliance_proxy_failures_total{reason="..."}` — counter
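
For a quick check without a Prometheus server, the exposition text can be parsed directly. The sketch below is a deliberately small parser (no timestamps, no escaped label values); `parse_metrics` is a hypothetical helper, and the sample values and `category` labels in `SAMPLE` are made up for illustration.

```python
import re

# Illustrative payload: the metric names come from the list above,
# the values and label values are invented.
SAMPLE = """\
# HELP compliance_proxy_redacted_total Redactions per category
# TYPE compliance_proxy_redacted_total counter
compliance_proxy_redacted_total{category="company"} 12.0
compliance_proxy_redacted_total{category="email"} 3.0
compliance_proxy_active_flows 1.0
"""

LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)


def parse_metrics(text):
    """Return {(metric_name, raw_label_string): value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE.match(line)
        if m:
            samples[(m["name"], m["labels"] or "")] = float(m["value"])
    return samples


metrics = parse_metrics(SAMPLE)
print(metrics[("compliance_proxy_active_flows", "")])
```

In real use the text would come from `http://127.0.0.1:8081/metrics`, for example via `urllib.request.urlopen`.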

### Configuration

Full reference: [docs/CONFIG.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/CONFIG.md). Minimal `~/.compliance-proxy/config.yaml`:

```yaml
listen:
  host: 127.0.0.1
  port: 8080
upstreams:
  - host: api.anthropic.com
  - host: api.openai.com
  - host: generativelanguage.googleapis.com
detectors:
  static_mapper:
    enabled: true
    mappings:
      mts: Acme
      MTS: Acme
      МТС: Acme
    canonical_form: МТС
  regex_matcher:
    enabled: true
    patterns: {}   # empty = all built-in Tier 1/2/3 defaults
audit:
  path: ~/.compliance-proxy/audit.jsonl
  rotation: daily
  retention_days: 90
policy:
  fail_mode: strict
```

A failed reload (broken YAML, unknown keys, bad enum value) logs ERROR
and keeps the previously-loaded config — in-flight connections are never
dropped.
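
The "validate first, swap only on success" behaviour can be sketched as follows. This is an assumed shape, not the daemon's code: `ConfigHolder` is hypothetical, the allowed top-level keys are taken from the minimal config above, and the real validator checks far more than this.

```python
import logging

log = logging.getLogger("config_reload_sketch")  # hypothetical logger name


class ConfigHolder:
    ALLOWED_KEYS = {"listen", "upstreams", "detectors", "audit", "policy"}
    FAIL_MODES = {"strict", "pass-through"}

    def __init__(self, initial):
        self.current = self._validate(initial)

    def _validate(self, candidate):
        unknown = set(candidate) - self.ALLOWED_KEYS
        if unknown:
            raise ValueError(f"unknown keys: {sorted(unknown)}")
        mode = candidate.get("policy", {}).get("fail_mode")
        if mode is not None and mode not in self.FAIL_MODES:
            raise ValueError(f"bad fail_mode: {mode!r}")
        return candidate

    def reload(self, candidate):
        """Swap in the new config only if it validates; report success."""
        try:
            validated = self._validate(candidate)
        except ValueError as exc:
            log.error("reload rejected, keeping previous config: %s", exc)
            return False
        self.current = validated
        return True


holder = ConfigHolder({"policy": {"fail_mode": "strict"}})
holder.reload({"policy": {"fail_mode": "lenient"}})  # rejected, old config stays
print(holder.current)
```

Because `self.current` is reassigned only after validation succeeds, readers of the config never observe a half-applied state.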

### Audit log

Every completed request lands as exactly one line in
`~/.compliance-proxy/audit.YYYY-MM-DD.jsonl` with:

* `request.match_counts` — per-category counts only; **never** the original tokens
* `request.redacted_preview` / `response.raw_preview` — first 200 bytes (post-redaction / pre-restore)
* `prev_hash` + `entry_hash` — SHA-256 chain across records; tampering breaks the chain

Verify chain integrity offline:

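```python
from pathlib import Path

from claude_anonymizer.proxy_server.audit import AuditWriter

# expanduser() resolves "~"; pathlib does not expand it on its own
log_path = Path("~/.compliance-proxy/audit.2026-05-18.jsonl").expanduser()
AuditWriter.verify_chain(log_path)
# True | False
```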
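
For intuition about why tampering breaks the chain, here is a self-contained sketch of a SHA-256 hash chain. The details are assumptions, not the `AuditWriter` format: the all-zero genesis value, the canonical JSON serialisation, and the helper names `entry_hash`, `append`, and `verify` are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash of the very first record


def entry_hash(record):
    """Hash every field except entry_hash itself, in a canonical JSON form."""
    body = {k: v for k, v in record.items() if k != "entry_hash"}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append(chain, fields):
    """Link a new record to the previous record's entry_hash."""
    prev = chain[-1]["entry_hash"] if chain else GENESIS
    record = {**fields, "prev_hash": prev}
    record["entry_hash"] = entry_hash(record)
    chain.append(record)


def verify(chain):
    """True if every record hashes correctly and points at its predecessor."""
    prev = GENESIS
    for record in chain:
        if record.get("prev_hash") != prev:
            return False
        if record.get("entry_hash") != entry_hash(record):
            return False
        prev = record["entry_hash"]
    return True


chain = []
for count in (1, 2, 3):
    append(chain, {"request": {"match_counts": {"company": count}}})
print(verify(chain))
```

Editing any field of an earlier record changes its hash, which no longer matches the stored `entry_hash` or the `prev_hash` of the record after it.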

Files older than `retention_days` are deleted at file granularity (never
line-by-line) on startup and after each rotation.
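
File-granularity retention amounts to comparing the date in each file name against a cutoff. A sketch under assumptions: `expired` is a hypothetical helper, and treating a file that is exactly `retention_days` old as still retained is a guess at the boundary rule.

```python
import re
from datetime import date, timedelta

NAME = re.compile(r"^audit\.(\d{4})-(\d{2})-(\d{2})\.jsonl$")


def expired(filenames, today, retention_days):
    """Return the audit file names whose date is older than the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    stale = []
    for name in filenames:
        m = NAME.match(name)
        if not m:
            continue  # not a dated audit file: leave it alone
        try:
            day = date(int(m[1]), int(m[2]), int(m[3]))
        except ValueError:
            continue  # looks like a date but is not a valid one
        if day < cutoff:
            stale.append(name)
    return stale


files = ["audit.2026-02-16.jsonl", "audit.2026-02-17.jsonl", "audit.2026-05-18.jsonl", "notes.txt"]
print(expired(files, today=date(2026, 5, 18), retention_days=90))
```

Deleting whole files keeps the hash chain inside each surviving file intact, which line-level pruning would not.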

### Deploying as a service

User-mode templates ship in [`deploy/`](deploy/):

* `deploy/launchd/com.compliance-proxy.plist` — macOS `~/Library/LaunchAgents/`
* `deploy/systemd/compliance-proxy.service` — Linux `~/.config/systemd/user/`

See [deploy/README.md](https://github.com/mshegolev/claude-anonymizer/blob/main/deploy/README.md) for per-OS install and the
HTTPS_PROXY client setup.

### Built-in detector tiers

| Tier | Detector | Patterns / behaviour |
|------|----------|----------------------|
| 1 (ПДн, personal data) | `regex_matcher` | MSISDN, passport, SNILS, INN, bank card (Luhn-validated), RS account, email |
| 2 (КТ, trade secrets)  | `regex_matcher` | Bearer token, JWT, API key (sk/pk/ghp/glpat/xox), password-in-URL, AWS access key, TUZ service account |
| 3 (infra) | `regex_matcher` | `*.mts-corp.ru`, `*.mts.ru`, `10.*` / `11.*` IPs, Jira codes (EORD/CLBIZPL/EP/EINVY) |
| company | `static_mapper` | Exact-string substitution from YAML map |
| PII opt-in | `pii.RussianNameDetector` | Two/three-token Cyrillic name heuristic (disabled by default; ~12% FP rate; deny-list for known false-positives) |

Add your own by implementing the [`Detector`](claude_anonymizer/detectors/base.py)
protocol — `name`, `category`, `scan(text) -> list[Match]`.
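
As an illustration of that shape, here is a sketch of a detector that flags Luhn-valid card numbers. It does not import the package: the `Match` dataclass below is a hypothetical stand-in whose fields are assumptions (the real `Match` type may differ), and `CardDetector` is not one of the built-ins.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:  # hypothetical shape; check the real Match type before reuse
    start: int
    end: int
    text: str
    category: str


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


class CardDetector:
    name = "card_luhn"
    category = "bank_card"
    # 13 to 19 digits, optionally separated by single spaces or hyphens
    _pattern = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

    def scan(self, text: str) -> list:
        matches = []
        for m in self._pattern.finditer(text):
            digits = re.sub(r"\D", "", m.group())
            if luhn_ok(digits):  # drop digit runs that fail the checksum
                matches.append(Match(m.start(), m.end(), m.group(), self.category))
        return matches


found = CardDetector().scan("card 4111 1111 1111 1111 on file")
print(found)
```

The checksum step is what keeps arbitrary long digit runs, such as order numbers, from being redacted as cards.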

### Streaming (SSE)

Anthropic and OpenAI stream tokens via `text/event-stream`. The proxy
detects this in `responseheaders` and installs a per-flow rolling-buffer
rewriter — placeholders that straddle chunk boundaries are restored
without buffering the full response. Algorithm: ARCHITECTURE.md §3.2.
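
The rolling-buffer idea can be sketched on plain text as follows. This is not the algorithm from ARCHITECTURE.md, which is not reproduced here: `StreamRestorer` is a hypothetical class, SSE framing and JSON escaping are ignored, and placeholders are assumed not to overlap one another.

```python
import re


class StreamRestorer:
    """Restore placeholders in a chunked stream, holding back partial matches."""

    def __init__(self, restore_map):
        if not restore_map:
            raise ValueError("restore_map must not be empty")
        self._map = dict(restore_map)
        keys = sorted(self._map, key=len, reverse=True)  # prefer longer placeholders
        self._pattern = re.compile("|".join(re.escape(k) for k in keys))
        self._pending = ""

    def _held_back(self, tail):
        """Length of the longest suffix of tail that could start a placeholder."""
        longest = max(len(k) for k in self._map) - 1
        for size in range(min(longest, len(tail)), 0, -1):
            suffix = tail[-size:]
            if any(k.startswith(suffix) for k in self._map):
                return size
        return 0

    def feed(self, chunk):
        buffer = self._pending + chunk
        last_end = 0
        for m in self._pattern.finditer(buffer):
            last_end = m.end()
        # Only the text after the last complete placeholder can be a partial one.
        hold = self._held_back(buffer[last_end:])
        cut = len(buffer) - hold
        self._pending = buffer[cut:]
        return self._pattern.sub(lambda m: self._map[m.group()], buffer[:cut])

    def flush(self):
        """At end of stream, whatever is still held back is ordinary text."""
        out, self._pending = self._pending, ""
        return out


restorer = StreamRestorer({"Acme": "mts"})
chunks = ["Ac", "me uses OA", "uth; A", "cme"]
restored = "".join(restorer.feed(c) for c in chunks) + restorer.flush()
print(restored)  # mts uses OAuth; mts
```

The held-back tail is never longer than the longest placeholder minus one character, so memory use stays constant regardless of response length.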

---

## Logging contract

The library emits these four INFO lines on every call — they are the
GDPR audit artefact and **wording is stable**:

| Log message (`claude_anonymizer.proxy`) | What it proves |
|---|---|
| `prompt anonymized: N -> M byte(s)` | The transform ran. |
| `anonymized prompt sent to API: …` | Exact bytes that left the perimeter (first 200). |
| `anonymized response from API: …`  | Exact bytes that came back (first 200, pre-restore). |
| `response deanonymized: N -> M byte(s)` | The restore ran. |

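Together, the two `… sent to API` / `… from API` lines record what
actually crossed the wire (first 200 bytes in each direction), so a
reviewer can confirm it carried placeholders rather than the canonical
form.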
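
A reviewer could automate that check with a capturing log handler. In the sketch below the four messages are simulated so the example runs without the package installed; `Capture`, `wire_lines`, and `leaked` are hypothetical helpers, while the logger name and message prefixes are taken from the table above.

```python
import logging

WIRE_PREFIXES = ("anonymized prompt sent to API:", "anonymized response from API:")


class Capture(logging.Handler):
    """Collect formatted log messages in memory."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())


def wire_lines(lines):
    """Keep only the two lines that describe bytes on the wire."""
    return [line for line in lines if line.startswith(WIRE_PREFIXES)]


def leaked(lines, sensitive):
    """Return wire lines that still contain any sensitive token."""
    return [line for line in wire_lines(lines) if any(s in line for s in sensitive)]


logger = logging.getLogger("claude_anonymizer.proxy")
logger.setLevel(logging.INFO)
capture = Capture()
logger.addHandler(capture)

# Simulated here; in real use the library emits these lines itself.
logger.info("prompt anonymized: %d -> %d byte(s)", 12, 13)
logger.info("anonymized prompt sent to API: %s", "fix Acme auth")
logger.info("anonymized response from API: %s", "Acme uses OAuth")
logger.info("response deanonymized: %d -> %d byte(s)", 15, 14)

print(leaked(capture.lines, ["mts", "MTS", "МТС"]))
```

An empty result means neither wire line contained a sensitive token within the logged preview.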

---

## Performance

Local benchmark on the reference dev laptop (M-series Mac, Python 3.10):

| Prompt size | p50 | p95 | p99 | Target |
|-------------|-----|-----|-----|--------|
| 128 KB (~32k tokens), full detector chain | 41 ms | 43 ms | **44 ms** | ≤ 50 ms |

```bash
python bench/redactor_bench.py --iters 200
```
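
To reproduce the percentile columns against your own callable, something like the following works. It is a sketch, not the contents of `bench/redactor_bench.py`: `time_calls` and `percentiles` are hypothetical helpers, and the inclusive quantile method is an assumption about how the figures were computed.

```python
import statistics
import time


def time_calls(fn, arg, iters=200):
    """Return per-call wall-clock latencies in milliseconds."""
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentiles(samples_ms):
    """Compute p50, p95 and p99 from a list of latencies."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Stand-in workload: upper-casing a 128 KB string.
latencies = time_calls(str.upper, "x" * 128 * 1024, iters=50)
print(percentiles(latencies))
```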

---

## Tests

```bash
pytest -q                                          # full suite
pytest tests/proxy_server/test_audit.py            # one area
ruff check .                                       # lint
ruff format --check .                              # format
```

The proxy tests do **not** spawn the real `claude` CLI — they wire up
a fake shell script as `--claude-bin` and assert argv shape, env
discovery, and the full anonymize / spawn / deanonymize cycle. Streaming
tests use synthesised Anthropic/OpenAI SSE fixtures.

CI matrix runs lint → tests (3.10, 3.11, 3.12) → bench → package build
on every push and PR. See [`.github/workflows/ci.yml`](.github/workflows/ci.yml).

To run the same lint + format gates locally before every commit:

```bash
pip install pre-commit
pre-commit install      # one-time per clone
pre-commit run --all-files   # ad-hoc on the whole tree
```

The hooks pin the same `ruff` version as CI so a green pre-commit run
will not be re-flagged in CI.

---

## Documentation

| Doc | Audience |
|-----|----------|
| [docs/ARCHITECTURE.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/ARCHITECTURE.md) | Engineering — design decisions, threat model, streaming algorithm |
| [docs/CONFIG.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/CONFIG.md) | Operators — every `config.yaml` key with validation rules |
| [docs/PRD.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/PRD.md) | Product — problem statement, success metrics, scope |
| [docs/IMPLEMENTATION_PLAN.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/IMPLEMENTATION_PLAN.md) | Engineering — phase-by-phase delivery plan |
| [docs/VERIFICATION_PLAN.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/VERIFICATION_PLAN.md) | QA — test pyramid, CI gates, manual checklist |
| [deploy/README.md](https://github.com/mshegolev/claude-anonymizer/blob/main/deploy/README.md) | Operators — launchd / systemd install |
| [docs/PYPI_RELEASE.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/PYPI_RELEASE.md) | Maintainers — PyPI trusted-publisher setup + release workflow |

---

## History

Originally extracted from
[whilly-orchestrator](https://github.com/mshegolev/whilly-orchestrator)
(JIRA-EORD-9843) and refactored to be orchestrator-agnostic. The proxy
daemon was added in Phases 0–4 as documented in
[docs/IMPLEMENTATION_PLAN.md](https://github.com/mshegolev/claude-anonymizer/blob/main/docs/IMPLEMENTATION_PLAN.md). See
[CHANGELOG.md](https://github.com/mshegolev/claude-anonymizer/blob/main/CHANGELOG.md) for the per-release feature list.

## Contributing

See [CONTRIBUTING.md](https://github.com/mshegolev/claude-anonymizer/blob/main/CONTRIBUTING.md) for the dev setup, the local
gates contributors must run before pushing, and the architecture
decisions that are load-bearing across versions.

## Security

Please **do not** open a public issue for security problems. Follow
the disclosure policy in [SECURITY.md](https://github.com/mshegolev/claude-anonymizer/blob/main/SECURITY.md).

## License

[MIT](https://github.com/mshegolev/claude-anonymizer/blob/main/LICENSE).
