Metadata-Version: 2.4
Name: indic-pii
Version: 0.1.1
Summary: Detection and redaction of Indian-specific Personally Identifiable Information (PII)
License: MIT
Keywords: pii,india,aadhaar,pan,privacy,redaction,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Security
Classifier: Topic :: Security :: Cryptography
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: regex>=2022.0.0
Provides-Extra: ner
Requires-Dist: spacy>=3.0; extra == "ner"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: spacy>=3.0; extra == "dev"

# indic-pii

`indic-pii` is a lightweight Python library for detecting and redacting Indian-specific personally identifiable information (PII) from free-form text.

It focuses on common identifiers used in India and exposes a small API for:

- Detecting PII spans with confidence scores
- Redacting matched values with default or custom labels
- Optionally using spaCy NER for names, locations, and organisations

## Supported PII Types

The library currently detects:

- Aadhaar numbers
- PAN numbers
- UPI IDs
- Indian mobile numbers
- Bank account numbers
- IFSC codes
- Passport numbers
- Voter IDs
- Driving licence numbers
- Email addresses
- Dates of birth

With optional NER support enabled, it can also detect:

- Person names
- Locations
- Organisations

## Installation

Install the base package:

```bash
pip install indic-pii
```

Install with optional NER support:

```bash
pip install "indic-pii[ner]"
```

NER detection also requires a spaCy model. For example:

```bash
python -m spacy download en_core_web_sm
```
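Before enabling NER, it can help to confirm that both pieces are installed. The sketch below uses only the standard library. The helper name `ner_ready` is hypothetical and not part of `indic-pii`, and it assumes the model was installed as a regular Python package, which is what `spacy download` does.

```python
from importlib.util import find_spec


def ner_ready(model: str = "en_core_web_sm") -> bool:
    """Return True if spaCy and the given model package are both importable.

    Hypothetical helper for illustration; not part of indic-pii.
    """
    # find_spec returns None for a top-level module that is not installed,
    # so nothing is actually imported or loaded here.
    return find_spec("spacy") is not None and find_spec(model) is not None


print("NER available:", ner_ready())
```

The result can then drive the flag directly, for example `PIIDetector(use_ner=ner_ready())`.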

## Quick Start

```python
from indic_pii import PIIDetector

text = (
    "Rahul's Aadhaar is 3043 3218 1964, PAN is ABCDE1234F, "
    "and phone number is +91 9876543210."
)

detector = PIIDetector(use_ner=False)

matches = detector.detect(text)
for match in matches:
    print(match.pii_type, match.value, match.confidence)

print(detector.redact(text))
```

The same operations are also available as module-level functions:

```python
import indic_pii

matches = indic_pii.detect("Send to rahul@upi")
redacted = indic_pii.redact("Send to rahul@upi")
```

## Example Output

```python
[
    PIIMatch(type='AADHAAR', value='3043 3218 1964', span=(19, 33), confidence=1.0),
    PIIMatch(type='PAN', value='ABCDE1234F', span=(42, 52), confidence=0.9),
    PIIMatch(type='PHONE', value='+91 9876543210', span=(74, 88), confidence=0.9),
]
```

Redacted text:

```text
Rahul's Aadhaar is [AADHAAR_REDACTED], PAN is [PAN_REDACTED], and phone number is [PHONE_REDACTED].
```

## Custom Redaction Labels

```python
from indic_pii import PIIDetector

detector = PIIDetector(use_ner=False)
text = "Email user@example.com or call 9876543210"

redacted = detector.redact(
    text,
    custom_labels={
        "EMAIL": "***",
        "PHONE": "<hidden-phone>",
    },
)
```
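To illustrate how span-based redaction with label overrides can work, here is a minimal standalone sketch. It is not the library's implementation: `redact_spans` and the `(pii_type, start, end)` tuples are assumptions made for the example, and the fallback label format simply mirrors the `[TYPE_REDACTED]` style shown in the example output.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical span shape for this sketch: (pii_type, start, end), end-exclusive.
Span = Tuple[str, int, int]


def redact_spans(
    text: str,
    spans: Iterable[Span],
    custom_labels: Optional[Dict[str, str]] = None,
) -> str:
    """Replace each span with its label, falling back to [TYPE_REDACTED].

    Assumes the spans do not overlap.
    """
    labels = custom_labels or {}
    # Replace from the end of the string so earlier offsets stay valid.
    for pii_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        label = labels.get(pii_type, f"[{pii_type}_REDACTED]")
        text = text[:start] + label + text[end:]
    return text


text = "Email user@example.com or call 9876543210"
spans = [("EMAIL", 6, 22), ("PHONE", 31, 41)]
print(redact_spans(text, spans, {"EMAIL": "***"}))
# Email *** or call [PHONE_REDACTED]
```

Types without an entry in `custom_labels` keep the default label, so overrides can be partial.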

## Notes

- Aadhaar detection can validate candidates using the Verhoeff checksum to reduce false positives.
- Bank account detection is context-aware: a number is reported only when account-related keywords appear nearby.
- NER support is optional and degrades gracefully if spaCy or a compatible model is unavailable.
- This library is regex-first and intended for practical text sanitisation workflows, not as a formal compliance guarantee.
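For readers unfamiliar with the Verhoeff check mentioned in the first note, here is a minimal standalone sketch of the standard algorithm applied to a 12-digit Aadhaar candidate. The function names are hypothetical, and this is not necessarily how `indic-pii` implements the check internally.

```python
# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P is the position-dependent permutation table.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(digits: str) -> bool:
    """Return True if an ASCII digit string passes the Verhoeff checksum."""
    c = 0
    # Walk the digits right to left; position i uses permutation row i mod 8.
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def looks_like_aadhaar(candidate: str) -> bool:
    """Hypothetical helper: strip spaces, require 12 digits, apply the checksum."""
    digits = candidate.replace(" ", "")
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        return False
    return verhoeff_valid(digits)


print(looks_like_aadhaar("3043 3218 1964"))  # sample value from Quick Start
print(looks_like_aadhaar("3043 3218 1965"))  # same value with the last digit altered
```

Because the checksum catches every single-digit error, a regex match that fails it is very unlikely to be a real Aadhaar number, which is why the check helps cut false positives from arbitrary 12-digit strings.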

## Public API

Main exports:

- `PIIDetector`
- `PIIMatch`
- `indic_pii.detect(...)`
- `indic_pii.redact(...)`
- `indic_pii.ner`

## Python Support

`indic-pii` supports Python 3.8 and newer.
