Metadata-Version: 2.4
Name: docs-anonymizer
Version: 0.2.29
Summary: Offline document anonymizer for legal teams
Project-URL: Homepage, https://anonymizer.site
License-Expression: AGPL-3.0
License-File: LICENSE
Keywords: anonymization,legal,pii,redaction
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business
Requires-Python: <3.13,>=3.11
Requires-Dist: babel>=2.14
Requires-Dist: dateparser>=1.2
Requires-Dist: en-core-web-lg
Requires-Dist: fastapi>=0.110
Requires-Dist: httpx>=0.27
Requires-Dist: jinja2>=3.1
Requires-Dist: lingua-language-detector>=2.0
Requires-Dist: lxml>=5.1
Requires-Dist: natasha>=1.6
Requires-Dist: openpyxl>=3.1
Requires-Dist: packaging>=24.0
Requires-Dist: phonenumbers>=8.13
Requires-Dist: pydantic>=2.6
Requires-Dist: pymorphy3>=2.0
Requires-Dist: pymupdf>=1.24
Requires-Dist: python-docx>=1.1
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: setuptools<81
Requires-Dist: spacy>=3.7
Requires-Dist: sse-starlette>=2.0
Requires-Dist: transliterate>=1.10
Requires-Dist: uvicorn>=0.27
Description-Content-Type: text/markdown

# anonymizer

Offline document anonymizer for legal teams. Replaces personally identifiable information (PII) in documents with structured tokens before sending them to external AI services.

**Status:** MVP-0 release candidate.

## What it does

Drag a file (`docx` / `xlsx` / `pdf`, including scanned PDFs when local OCR is available) into the local web UI and get an anonymized document where:

- Names, companies, financial details, addresses, emails, and phone numbers are replaced with structured tokens such as `[Person_1]`, `[Company_1]`, `[ADDRESS_1]`, ...
- Document metadata is cleared
- Processing makes no network calls and runs entirely on your machine

Then send the result to your AI tool of choice.
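
The token replacement described above can be illustrated with a minimal sketch. It is not the project's implementation: the two regular expressions, the `EMAIL` / `PHONE` labels, and the `anonymize` function are hypothetical and cover only trivial cases, whereas real detection relies on NER and many more detectors. The point it shows is that a repeated value maps to the same numbered token.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately simple patterns (assumption, not the project's rules).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+\d[\d\s()-]{7,}\d"),
}


def anonymize(text):
    """Replace each match with a numbered token; return (new_text, mapping)."""
    mapping = {}                  # original value -> token
    counters = defaultdict(int)   # label -> last number issued

    def make_sub(label):
        def sub(match):
            value = match.group(0)
            if value not in mapping:
                counters[label] += 1
                mapping[value] = f"[{label}_{counters[label]}]"
            # A value seen before reuses its existing token.
            return mapping[value]
        return sub

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(label), text)
    return text, mapping


if __name__ == "__main__":
    sample = "Write to anna@example.com or bob@example.org. Call +44 20 7946 0958."
    redacted, table = anonymize(sample)
    print(redacted)
    print(table)
```

Keeping the value-to-token mapping separate from the redacted text means the mapping can stay on the local machine while only the redacted text is shared.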

## MVP-0 scope

- Formats: `docx`, `xlsx`, `pdf` with text layer, scanned PDF, and hybrid PDF
- Languages: Russian, English (NER); language-agnostic detectors for emails, phones, IBAN, cards, IP/MAC/URL, dates, geocoordinates
- Platforms: Windows + macOS
- UI: local web app at `127.0.0.1` in your browser
- Install: a one-line installer script that runs `uv tool install docs-anonymizer`

Scanned and hybrid PDFs use local Tesseract OCR with English and Russian language packs. Password-protected files, additional languages, and editable recognized-DOCX export remain planned for later iterations.
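
Language-agnostic detectors for cards and IBANs usually pair a pattern match with a checksum to cut false positives. The sketch below shows the two standard checks (Luhn for card numbers, mod-97 for IBANs). The function names and the length limits are assumptions for illustration; the project's own detectors may work differently.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:          # assumption: shorter runs are not treated as cards
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:            # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map letters to 10..35."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

A checksum only confirms that a string is well-formed, not that the account or card exists, so it works as a filter on regex candidates and not as a detector on its own.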

## Installation

```bash
# macOS / Linux
curl -fsSL https://anonymizer.site/install.sh | sh

# Windows (PowerShell)
iwr -useb https://anonymizer.site/install.ps1 | iex
```

Then run `anonymize`; your browser opens at `http://127.0.0.1:<port>`.

### OCR setup for scanned PDFs

Scanned and hybrid PDFs require system Tesseract with English and Russian
language packs. The anonymizer installer offers to install Tesseract
interactively and shows an approximate download/install size before asking.
If you skip it, DOCX, XLSX, and PDFs with a text layer still work.

```bash
# macOS
brew install tesseract tesseract-lang

# Ubuntu / Debian
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-rus

# Windows (PowerShell)
winget install UB-Mannheim.TesseractOCR
```

On macOS, Homebrew's `tesseract-lang` package is large because it bundles all
extra languages; expect up to roughly 720 MB on disk. Ubuntu/Debian and Windows
downloads are usually smaller, and the package manager may show the exact
download size.

After installing Tesseract, run:

```bash
anonymize doctor --no-network
```

If OCR is unavailable, scanned PDF processing is rejected with installation
guidance instead of silently skipping scanned pages.
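
For readers who want to verify the OCR prerequisites by hand, the following sketch checks that `tesseract` is on `PATH` and that the `eng` and `rus` packs appear in the output of `tesseract --list-langs`. It is an independent illustration with hypothetical helper names, not a description of what `anonymize doctor` does internally.

```python
import shutil
import subprocess

REQUIRED_LANGS = {"eng", "rus"}


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    The header line contains spaces, language codes do not, so any
    non-empty line without a space is treated as a language code.
    """
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        if line and " " not in line:
            codes.add(line)
    return codes


def missing_ocr_langs(required=REQUIRED_LANGS):
    """Return None if tesseract is not found, else the set of missing packs."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Both streams are parsed in case a build writes the list to stderr.
    found = parse_langs(result.stdout) | parse_langs(result.stderr)
    return set(required) - found


if __name__ == "__main__":
    missing = missing_ocr_langs()
    if missing is None:
        print("tesseract not found on PATH")
    elif missing:
        print("missing language packs:", ", ".join(sorted(missing)))
    else:
        print("OCR prerequisites look fine")
```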

## Stack

Python 3.11 or 3.12, FastAPI + htmx, spaCy + Natasha, PyMuPDF, python-docx, openpyxl, lxml. Full details in the technical spec.

## Architecture

Three-layer design: `core` (headless Python library), `cli`, and `webapp` (FastAPI on loopback), plus `testkit` for synthetic test corpus generation and feedback-loop tooling. Detectors are pluggable, and language packs are drop-in. Manual masking is supported, and audit logs are written without leaking PII.
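
One common way to make detectors pluggable is a small interface plus a registry. The sketch below illustrates that pattern only. `Span`, `Detector`, `RegexDetector`, and `Registry` are hypothetical names and do not reflect the actual API of `core`.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


class Detector(Protocol):
    """Anything with a matching detect() method counts as a detector."""

    def detect(self, text: str) -> Iterable[Span]: ...


class RegexDetector:
    """Hypothetical detector backed by a single regular expression."""

    def __init__(self, label: str, pattern: str) -> None:
        self.label = label
        self.pattern = re.compile(pattern)

    def detect(self, text: str) -> Iterable[Span]:
        for m in self.pattern.finditer(text):
            yield Span(m.start(), m.end(), self.label)


class Registry:
    """Collects detectors and merges their spans in document order."""

    def __init__(self) -> None:
        self._detectors: list[Detector] = []

    def register(self, detector: Detector) -> None:
        self._detectors.append(detector)

    def run(self, text: str) -> list[Span]:
        spans = [s for d in self._detectors for s in d.detect(text)]
        return sorted(spans, key=lambda s: (s.start, s.end))


if __name__ == "__main__":
    registry = Registry()
    registry.register(RegexDetector("IPV4", r"\b(?:\d{1,3}\.){3}\d{1,3}\b"))
    registry.register(RegexDetector("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"))
    for span in registry.run("ping 10.0.0.1 or mail ops@example.com"):
        print(span)
```

Because `Detector` is a structural `Protocol`, an NER-based detector for a new language could be registered next to the regex ones without inheriting from a base class.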

## Licenses

The project is released under AGPL-3.0 because it depends on PyMuPDF (AGPL).
All other dependencies use permissive or weak-copyleft open-source licenses
(MIT / Apache 2.0 / BSD / MPL). The source distribution published with each
release contains the project source needed to satisfy AGPL source-availability
obligations.

A page in the application UI will list all bundled libraries and models with their individual licenses.
