Metadata-Version: 2.4
Name: docs-anonymizer
Version: 0.2.18
Summary: Offline document anonymizer for legal teams
Project-URL: Homepage, https://anonymizer.site
License-Expression: AGPL-3.0
License-File: LICENSE
Keywords: anonymization,legal,pii,redaction
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business
Requires-Python: <3.13,>=3.11
Requires-Dist: babel>=2.14
Requires-Dist: dateparser>=1.2
Requires-Dist: en-core-web-lg
Requires-Dist: fastapi>=0.110
Requires-Dist: httpx>=0.27
Requires-Dist: jinja2>=3.1
Requires-Dist: lingua-language-detector>=2.0
Requires-Dist: lxml>=5.1
Requires-Dist: natasha>=1.6
Requires-Dist: openpyxl>=3.1
Requires-Dist: packaging>=24.0
Requires-Dist: phonenumbers>=8.13
Requires-Dist: pydantic>=2.6
Requires-Dist: pymorphy3>=2.0
Requires-Dist: pymupdf>=1.24
Requires-Dist: python-docx>=1.1
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: setuptools<81
Requires-Dist: spacy>=3.7
Requires-Dist: sse-starlette>=2.0
Requires-Dist: transliterate>=1.10
Requires-Dist: uvicorn>=0.27
Description-Content-Type: text/markdown

# anonymizer

Offline document anonymizer for legal teams. Replaces personally identifiable information (PII) in documents with structured tokens before sending them to external AI services.

**Status:** MVP-0 release candidate.

## What it does

Drag a file (`docx` / `pdf` with text layer / `xlsx`) into the local web UI and get an anonymized document where:

- Names, companies, financial details, addresses, email addresses, and phone numbers are replaced with structured tokens like `[Person_1]`, `[Company_1]`, `[ADDRESS_1]`, ...
- Document metadata is cleared
- No network calls during processing — runs entirely on your machine

Then send the result to your AI tool of choice.
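
To illustrate what "structured tokens" means in practice, here is a minimal sketch of stable token substitution for one entity type. It is not the project's implementation: the function name `tokenize_emails`, the simplified regex, and the `[Email_N]` token spelling are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately simplified email pattern (not the project's detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenize_emails(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a numbered token; return text and mapping."""
    mapping: dict[str, str] = {}  # original value -> token

    def substitute(match: re.Match[str]) -> str:
        value = match.group(0)
        # Reuse the existing token when the same value appears again,
        # so references stay consistent across the document.
        if value not in mapping:
            mapping[value] = f"[Email_{len(mapping) + 1}]"
        return mapping[value]

    return EMAIL_RE.sub(substitute, text), mapping


if __name__ == "__main__":
    anonymized, table = tokenize_emails("Write to a@x.com or b@y.org; a@x.com is preferred.")
    print(anonymized)
    print(table)
```

Numbering tokens per distinct value (rather than per occurrence) keeps the anonymized text readable for a downstream AI tool, because repeated mentions of the same party still resolve to the same token.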

## MVP-0 scope

- Formats: `docx`, `pdf` with text layer, `xlsx`
- Languages: Russian, English (NER); language-agnostic detectors for emails, phones, IBAN, cards, IP/MAC/URL, dates, geocoordinates
- Platforms: Windows + macOS
- UI: local web app at `127.0.0.1` in your browser
- Install: single curl one-liner → `uv tool install docs-anonymizer`

OCR for scanned PDFs, support for password-protected files, and additional languages are planned for later iterations (MVP-1+).
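
As an illustration of a language-agnostic detector from the scope list above, the sketch below finds payment card candidates and filters them with the standard Luhn checksum. The names `CARD_RE`, `luhn_valid`, and `find_card_numbers` are hypothetical, and whether the project validates cards this way is an assumption.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True when a digit-only string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit-only card candidates in the text that pass the Luhn check."""
    found = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))
        if luhn_valid(digits):
            found.append(digits)
    return found


if __name__ == "__main__":
    print(find_card_numbers("Card 4111 1111 1111 1111, ref 1234567890123."))
```

The checksum step reduces false positives: long digit runs such as reference numbers match the pattern but usually fail the Luhn test.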

## Installation

```bash
# macOS / Linux
curl -fsSL https://anonymizer.site/install.sh | sh

# Windows (PowerShell)
iwr -useb https://anonymizer.site/install.ps1 | iex
```

Then run `anonymize` — your browser will open at `http://127.0.0.1:<port>`.

## Stack

Python 3.11 or 3.12, FastAPI + htmx, spaCy + Natasha, PyMuPDF, python-docx, openpyxl, lxml. Full details are in the technical spec.

## Architecture

Three-layer design: `core` (headless Python library), `cli`, and `webapp` (FastAPI on loopback), plus `testkit` for generating a synthetic test corpus and for feedback-loop tooling. Detectors are pluggable; language packs are drop-in. Manual masking is supported, and audit logging is designed not to leak PII.
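
One way a pluggable detector layer could look is sketched below. This is not the actual `core` API: `Span`, `Detector`, `RegexDetector`, and `run_detectors` are hypothetical names chosen for illustration, and the regex patterns are simplified.

```python
import re
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Span:
    """A detected entity: character offsets plus a label."""
    start: int
    end: int
    label: str


class Detector(Protocol):
    """Anything with a matching detect() method can be plugged in."""
    def detect(self, text: str) -> list[Span]: ...


class RegexDetector:
    """Hypothetical detector that labels every match of one pattern."""

    def __init__(self, label: str, pattern: str) -> None:
        self.label = label
        self._regex = re.compile(pattern)

    def detect(self, text: str) -> list[Span]:
        return [Span(m.start(), m.end(), self.label) for m in self._regex.finditer(text)]


def run_detectors(text: str, detectors: Iterable[Detector]) -> list[Span]:
    """Run every detector and return all spans ordered by position."""
    spans = [span for detector in detectors for span in detector.detect(text)]
    return sorted(spans, key=lambda s: (s.start, s.end))


if __name__ == "__main__":
    detectors = [
        RegexDetector("IPV4", r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
        RegexDetector("URL", r"https?://\S+"),
    ]
    sample = "Host 10.0.0.1 serves https://example.org"
    for span in run_detectors(sample, detectors):
        print(span.label, sample[span.start:span.end])
```

Using a structural `Protocol` means an NER-backed detector and a regex-backed one can share the same pipeline without a common base class. Resolving overlapping spans is left out of this sketch.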

## Licenses

The project is released under AGPL-3.0 because it depends on PyMuPDF (AGPL).
All other dependencies use permissive or weak-copyleft open-source licenses
(MIT / Apache 2.0 / BSD / MPL). The source distribution published with each
release contains the project source needed to satisfy AGPL source-availability
obligations.

A page in the application UI will list all bundled libraries and models with their individual licenses.
