Metadata-Version: 2.4
Name: anonypii
Version: 0.1.0
Summary: Production-grade PII detection, masking, and reversible anonymization library backed by fine-tuned DeBERTa models.
Project-URL: Homepage, https://github.com/pritesh-2711/anonypii
Project-URL: Repository, https://github.com/pritesh-2711/anonypii
Project-URL: Bug Tracker, https://github.com/pritesh-2711/anonypii/issues
Project-URL: Documentation, https://github.com/pritesh-2711/anonypii/blob/main/README.md
Author-email: Pritesh Jha <priteshjha2711@gmail.com>
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship made available under
              the License.
        
              "Derivative Works" shall mean any work that is based on the Work.
        
              "Contribution" shall mean any work of authorship submitted to
              the Licensor for inclusion in the Work.
        
              "Contributor" shall mean Licensor and any Legal Entity on behalf
              of whom a Contribution has been received by the Licensor.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              patent license to make, use, sell, and distribute the Work.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works in any medium, with or without modifications,
              provided that You meet the conditions stated in Section 4 (a)-(d).
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution submitted for inclusion in the Work shall be under
              the terms of this License.
        
           6. Trademarks. This License does not grant permission to use the trade
              names or trademarks of the Licensor.
        
           7. Disclaimer of Warranty. The Work is provided on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND.
        
           8. Limitation of Liability. In no event shall any Contributor be
              liable for damages arising out of this License or use of the Work.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work, You may offer acceptance of support or warranty obligations.
        
           Copyright 2026 Pritesh Jha
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
License-File: LICENSE
Keywords: anonymization,deberta,gdpr,named-entity-recognition,nlp,pii,privacy
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Security
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: pyyaml>=6.0
Requires-Dist: typing-extensions>=4.0; python_version < '3.11'
Provides-Extra: all
Requires-Dist: accelerate>=0.20; extra == 'all'
Requires-Dist: huggingface-hub>=0.20; extra == 'all'
Requires-Dist: numpy>=1.24.0; extra == 'all'
Requires-Dist: pandas>=2.0; extra == 'all'
Requires-Dist: protobuf>=3.20.0; extra == 'all'
Requires-Dist: sentencepiece>=0.1.99; extra == 'all'
Requires-Dist: torch>=2.0; extra == 'all'
Requires-Dist: transformers>=4.30; extra == 'all'
Provides-Extra: dev
Requires-Dist: accelerate>=0.20; extra == 'dev'
Requires-Dist: huggingface-hub>=0.20; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: numpy>=1.24.0; extra == 'dev'
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pre-commit>=3.6; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.12; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Requires-Dist: ruff>=0.3; extra == 'dev'
Requires-Dist: sentencepiece>=0.1.99; extra == 'dev'
Requires-Dist: torch>=2.0; extra == 'dev'
Requires-Dist: transformers>=4.30; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Provides-Extra: dev-core
Requires-Dist: mypy>=1.8; extra == 'dev-core'
Requires-Dist: pre-commit>=3.6; extra == 'dev-core'
Requires-Dist: pytest-cov>=4.0; extra == 'dev-core'
Requires-Dist: pytest-mock>=3.12; extra == 'dev-core'
Requires-Dist: pytest>=8.0; extra == 'dev-core'
Requires-Dist: pyyaml>=6.0; extra == 'dev-core'
Requires-Dist: ruff>=0.3; extra == 'dev-core'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev-core'
Provides-Extra: model
Requires-Dist: accelerate>=0.20; extra == 'model'
Requires-Dist: huggingface-hub>=0.20; extra == 'model'
Requires-Dist: numpy>=1.24.0; extra == 'model'
Requires-Dist: protobuf>=3.20.0; extra == 'model'
Requires-Dist: sentencepiece>=0.1.99; extra == 'model'
Requires-Dist: torch>=2.0; extra == 'model'
Requires-Dist: transformers>=4.30; extra == 'model'
Provides-Extra: models
Requires-Dist: accelerate>=0.20; extra == 'models'
Requires-Dist: huggingface-hub>=0.20; extra == 'models'
Requires-Dist: numpy>=1.24.0; extra == 'models'
Requires-Dist: protobuf>=3.20.0; extra == 'models'
Requires-Dist: sentencepiece>=0.1.99; extra == 'models'
Requires-Dist: torch>=2.0; extra == 'models'
Requires-Dist: transformers>=4.30; extra == 'models'
Provides-Extra: pandas
Requires-Dist: pandas>=2.0; extra == 'pandas'
Description-Content-Type: text/markdown

# anonypii

Production-grade PII detection, masking, and reversible anonymization for Python.

Backed by two fine-tuned DeBERTa-v3-base models from the PIIBench research:

| Model | Hugging Face | Full-test F1 | Entity types won |
|---|---|---:|---|
| `piibench-deberta-base` | [Pritesh-2711/piibench-deberta-base](https://huggingface.co/Pritesh-2711/piibench-deberta-base) | **0.6455** | 54/82 entity types |
| `piibench-deberta-sch` | [Pritesh-2711/piibench-deberta-sch](https://huggingface.co/Pritesh-2711/piibench-deberta-sch) | 0.5894 | 28/82 entity types (HTTP_COOKIE, DATE_TIME, ...) |

Both models cover all **82 entity types** across 10 coarse categories (CREDENTIAL, FINANCIAL_ID, CONTACT, NETWORK, LOCATION, PERSON_GROUP, ORG_ROLE, TEMPORAL, MISC, FINANCIAL_NER).

---

## Installation

```bash
# Core library only (regex detector, no model)
pip install anonypii

# With model support (torch, transformers, huggingface-hub)
pip install "anonypii[model]"

# Same dependencies as [model]; model weights are downloaded separately (see below)
pip install "anonypii[models]"

# With pandas DataFrame support
pip install "anonypii[pandas]"

# Model and pandas support together
pip install "anonypii[all]"
```

### Download models manually

```bash
anonypii download all                    # both models
anonypii download piibench-deberta-base  # recommended model only
anonypii download piibench-deberta-sch   # SC+H model only
```

Or at runtime:

```python
from anonypii.detectors.model import ModelPIIDetector
detector = ModelPIIDetector(model="piibench-deberta-base", download=True)
```

---

## Quick start

### Irreversible masking

```python
from anonypii import Anonymizer

anon = Anonymizer(model="piibench-deberta-base", download=True)

anon.mask("My email is john@example.com")
# "My email is <EMAIL>"

anon.mask("SSN: 123-45-6789 and card 4111-1111-1111-1111")
# "SSN: <SSN> and card <CREDIT_CARD>"
```

### Reversible anonymization

```python
from anonypii import Anonymizer

anon = Anonymizer(model="piibench-deberta-base", download=True)

result = anon.anonymize("My email is john@example.com")
print(result.text)     # "My email is {{EMAIL_001}}"
print(result.restore()) # "My email is john@example.com"
```

### Stateful reversible anonymizer

```python
from anonypii import ReversibleAnonymizer

ra = ReversibleAnonymizer(model="piibench-deberta-base", download=True)

r = ra.anonymize("Contact alice@corp.com or call 555-123-4567")
print(r.text)           # "Contact {{EMAIL_001}} or call {{PHONE_001}}"
print(ra.restore(r.text)) # original text restored
```
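
The placeholder-and-restore idea behind the examples above can be sketched in plain Python. This is a minimal illustration, not anonypii's implementation: the email regex stands in for a real detector, the `anonymize` and `restore` helpers and the `vault` dict are hypothetical, and reusing one token for a repeated value is an assumption.

```python
import re

# Hypothetical stand-in for a detector: matches email addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize(text: str, vault: dict) -> str:
    """Replace each email with a numbered placeholder and record it in vault."""
    # Reverse view of the vault (original -> token) so repeats reuse a token.
    seen = {original: token for token, original in vault.items()}

    def substitute(match: re.Match) -> str:
        original = match.group(0)
        if original not in seen:
            token = "{{EMAIL_%03d}}" % (len(seen) + 1)
            seen[original] = token
            vault[token] = original
        return seen[original]

    return EMAIL_RE.sub(substitute, text)


def restore(text: str, vault: dict) -> str:
    """Swap every known placeholder back to its original value."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text


vault = {}
masked = anonymize("Contact alice@corp.com or bob@corp.com, then alice@corp.com", vault)
print(masked)
print(restore(masked, vault))
```

Because the mapping lives outside the masked text, the masked string can be shared or logged while the vault stays private.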

### Using the regex detector (no download needed)

```python
from anonypii import Anonymizer
from anonypii.detectors.regex import RegexPIIDetector

anon = Anonymizer(detector=RegexPIIDetector())
print(anon.mask("john@example.com / 123-45-6789"))
# "<EMAIL> / <SSN>"
```

---

## Masking strategies

```python
from anonypii import Anonymizer
from anonypii.detectors.regex import RegexPIIDetector
from anonypii.masking.strategies import (
    TagMaskingStrategy,        # <EMAIL>          (default for mask())
    RedactedMaskingStrategy,   # [REDACTED]
    StarMaskingStrategy,       # j**************m
    TokenMaskingStrategy,      # {{EMAIL_001}}    (default for anonymize())
)

# Star masking: keep the first and last character of each detected value
anon = Anonymizer(
    detector=RegexPIIDetector(),  # any detector works here
    strategy=StarMaskingStrategy(keep_start=1, keep_end=1),
)
```
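
To make the `keep_start` / `keep_end` behaviour concrete, here is a minimal standalone sketch of star masking. The `star_mask` function is hypothetical and is not the library's `StarMaskingStrategy`; in particular, masking the whole value when it is too short is an assumption about sensible edge-case handling.

```python
def star_mask(value: str, keep_start: int = 1, keep_end: int = 1) -> str:
    """Replace the middle of value with '*', keeping the requested edges."""
    if keep_start < 0 or keep_end < 0:
        raise ValueError("keep_start and keep_end must be non-negative")
    # Value too short to hide anything: mask it completely (assumed behaviour).
    if keep_start + keep_end >= len(value):
        return "*" * len(value)
    hidden = len(value) - keep_start - keep_end
    return value[:keep_start] + "*" * hidden + value[len(value) - keep_end:]


print(star_mask("john@example.com"))
print(star_mask("4111-1111-1111-1111", keep_start=0, keep_end=4))
print(star_mask("ab"))
```

Keeping a few trailing characters (for example the last four digits of a card number) is a common compromise between readability and exposure, but any retained character is still leaked.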

---

## Entity configuration

Restrict detection to a subset of entities via a YAML or JSON config file:

```yaml
# my_config.yaml
schema_version: "1.0"

active_entity_types:
  - EMAIL
  - SSN
  - CREDIT_CARD

# Or activate entire coarse groups:
active_coarse_groups:
  - CREDENTIAL
  - FINANCIAL_ID
```

```python
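# Combine config_path with the model= or detector= argument shown in Quick start
anon = Anonymizer(model="piibench-deberta-base", config_path="my_config.yaml")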
```
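
One plausible reading of such a config is that the active set is the union of the listed entity types and every type in the listed coarse groups. The sketch below illustrates that reading with a JSON config and standard-library code only. The helper names, the partial `COARSE_GROUPS` map, the JSON key layout (mirroring the YAML above), and the union semantics are all assumptions, not anonypii's documented behaviour.

```python
import json

# Partial group map copied from the "Entity types" table; illustrative only.
COARSE_GROUPS = {
    "CREDENTIAL": {"SSN", "PASSWORD", "API_KEY", "PIN"},
    "FINANCIAL_ID": {"CREDIT_CARD", "IBAN", "CVV"},
    "CONTACT": {"EMAIL", "PHONE"},
}

# Assumed JSON equivalent of the YAML config, using the same keys.
CONFIG_JSON = """
{
  "schema_version": "1.0",
  "active_entity_types": ["EMAIL"],
  "active_coarse_groups": ["CREDENTIAL"]
}
"""


def active_types(config: dict) -> set:
    """Union of explicitly listed types and every type in the listed groups."""
    active = set(config.get("active_entity_types", []))
    for group in config.get("active_coarse_groups", []):
        active |= COARSE_GROUPS.get(group, set())
    return active


def filter_detections(detections: list, config: dict) -> list:
    """Keep only (entity_type, value) pairs whose type is active."""
    allowed = active_types(config)
    return [d for d in detections if d[0] in allowed]


config = json.loads(CONFIG_JSON)
found = [
    ("EMAIL", "john@example.com"),
    ("SSN", "123-45-6789"),
    ("CREDIT_CARD", "4111-1111-1111-1111"),
]
print(filter_detections(found, config))
```

Here the credit card is dropped because neither `CREDIT_CARD` nor its group `FINANCIAL_ID` is active.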

---

## Allowlist

Suppress known-safe values from detection results:

```python
import re
from anonypii.detectors.regex import RegexPIIDetector

detector = RegexPIIDetector(
    allowlist=[
        "noreply@company.com",              # exact literal
        re.compile(r".*@internal\.com$"),   # regex pattern
    ]
)
```
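
The two kinds of allowlist entry can be thought of as an exact string comparison and a regex search. The following is a rough standalone sketch of that matching logic. `is_allowlisted` is a hypothetical helper, and the choice of case-sensitive equality and `Pattern.search` is an assumption rather than the library's confirmed behaviour.

```python
import re
from typing import Iterable, Union

AllowEntry = Union[str, re.Pattern]


def is_allowlisted(value: str, allowlist: Iterable[AllowEntry]) -> bool:
    """Return True if value equals a literal entry or matches a pattern entry."""
    for entry in allowlist:
        if isinstance(entry, re.Pattern):
            if entry.search(value):
                return True
        elif entry == value:
            return True
    return False


ALLOWLIST = [
    "noreply@company.com",              # exact literal
    re.compile(r".*@internal\.com$"),   # regex pattern
]

candidates = ["noreply@company.com", "ops@internal.com", "alice@corp.com"]
kept = [v for v in candidates if not is_allowlisted(v, ALLOWLIST)]
print(kept)
```

Anchoring patterns with `$` (as in the example above) avoids accidentally allowlisting values such as `x@internal.com.evil.org`.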

---

## Vault options

```python
from anonypii import ReversibleAnonymizer
from anonypii.detectors.regex import RegexPIIDetector
from anonypii.vault.memory import InMemoryVault           # default, session-only
from anonypii.vault.memory import ThreadSafeInMemoryVault # thread-safe variant
from anonypii.vault.json_file import JsonFileVault        # persistent across sessions

ra = ReversibleAnonymizer(
    detector=RegexPIIDetector(),  # any detector works here
    vault=JsonFileVault("~/.anonypii/vault.json"),
)
```
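
What "persistent across sessions" means in practice is that the token-to-original mapping is written to disk and reloaded later. A minimal sketch of that idea follows. `TinyJsonVault` and its `put` / `get` methods are hypothetical and do not reflect the real `JsonFileVault` interface or file format.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class TinyJsonVault:
    """Hypothetical token -> original store persisted as one JSON file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path).expanduser()
        self._data = {}
        # Reload any mapping written by an earlier session.
        if self.path.exists():
            self._data = json.loads(self.path.read_text(encoding="utf-8"))

    def put(self, token: str, original: str) -> None:
        self._data[token] = original
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self._data, indent=2), encoding="utf-8")

    def get(self, token: str) -> Optional[str]:
        return self._data.get(token)


with tempfile.TemporaryDirectory() as tmp:
    vault_file = str(Path(tmp) / "vault.json")
    TinyJsonVault(vault_file).put("{{EMAIL_001}}", "alice@corp.com")
    # A fresh instance stands in for a later session reading the same file.
    reloaded = TinyJsonVault(vault_file).get("{{EMAIL_001}}")

print(reloaded)
```

A file-backed vault holds the original PII in clear text, so it needs the same access controls as the source data.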

---

## DataFrame processing

```python
import pandas as pd
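from anonypii import Anonymizer
from anonypii.detectors.regex import RegexPIIDetector
from anonypii.io.dataframe import process_dataframe

df = pd.DataFrame({"email": ["alice@x.com"], "notes": ["SSN 123-45-6789"]})
anon = Anonymizer(detector=RegexPIIDetector())  # any configured Anonymizer works
redacted_df, results = process_dataframe(df, anon)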
```

---

## CLI

```bash
anonypii detect  "My email is john@example.com"
anonypii mask    "My email is john@example.com"
anonypii anonymize "My email is john@example.com" --output-mapping mapping.json
anonypii restore   "My email is {{EMAIL_001}}"     --mapping mapping.json
anonypii info
anonypii download all
```

---

## Entity types (82 total)

| Coarse group | Entity types |
|---|---|
| CREDENTIAL | SSN, PASSWORD, API_KEY, PIN, PASSPORT_NUMBER, DRIVER_LICENSE, TAX_ID, NATIONAL_ID, ... |
| FINANCIAL_ID | CREDIT_CARD, IBAN, ACCOUNT_NUMBER, BANK_ROUTING_NUMBER, BIC, SWIFT_BIC, CVV, ... |
| CONTACT | EMAIL, PHONE, PHONE_NUMBER, FAX_NUMBER |
| NETWORK | IP_ADDRESS, IPV4, IPV6, MAC_ADDRESS, URL, USERNAME, HTTP_COOKIE, DEVICE_IDENTIFIER |
| PERSON_GROUP | PERSON, FIRST_NAME, LAST_NAME, NAME, AGE, GENDER |
| LOCATION | ADDRESS, CITY, STATE, COUNTRY, POSTCODE, COORDINATE, STREET_ADDRESS, ... |
| ORG_ROLE | ORG, COMPANY, COMPANY_NAME, JOB, OCCUPATION |
| TEMPORAL | DATE, TIME, DATE_TIME, DATE_OF_BIRTH |
| MISC | CRYPTO_ADDRESS, VEHICLE, CURRENCY, AMOUNT, BLOOD_TYPE, LICENSE_PLATE, ... |
| FINANCIAL_NER | FINANCIAL_ENTITY |

---

## Research

The underlying models are described in:

- **Dataset**: [PIIBench: A Unified Multi-Source Benchmark Corpus for PII Detection](https://arxiv.org/abs/2604.15776) — Jha (2026)
- **Models**: [Fine-Tuning Over Architectural Complexity: PII Detection on PIIBench with DeBERTa](https://arxiv.org/abs/2605.25816) — Jha (2026)

---

## License

Apache License 2.0 — see [LICENSE](LICENSE).
