Metadata-Version: 2.4
Name: pii-masker-gpt
Version: 0.1.0
Summary: A hybrid PII detection, masking, and unmasking library using spaCy and regex.
Author-email: PII Masker GPT <support@example.com>
License: MIT
Keywords: pii,masking,privacy,nlp,spacy
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy>=3.0.0
Requires-Dist: regex>=2021.11.10
Dynamic: license-file

# pii-masker-gpt

A production-ready Python package for detecting, masking, and unmasking PII in text using a hybrid spaCy + regex approach.

## Features

- Detects PERSON, ADDRESS, EMAIL, PHONE
- Uses spaCy `en_core_web_sm` for PERSON detection and location heuristics
- Uses regex rules for EMAIL, PHONE, ZIP, and rich ADDRESS extraction
- Produces consistent placeholder mapping across repeated values
- Supports reversing masked text back to original content
- Includes a convenient CLI for local usage

## Installation

1. Create or activate your Python environment.
2. Install the package locally:

```bash
python -m pip install -e .
```

3. Install the spaCy model:

```bash
python -m spacy download en_core_web_sm
```

## Quick Start

```python
from pii_masker import mask, unmask

text = (
    "John Doe lives at 123 Main St, Boston, MA 02118. "
    "His email is john.doe@example.com and his phone is (415) 555-4321."
)
result = mask(text)
print(result["masked_text"])
print(result["mapping"])

unmasked = unmask(result["masked_text"], result["mapping"])
print(unmasked)
```

### Expected result

- `masked_text` will contain placeholders like `[PERSON1]`, `[ADDRESS1]`, `[EMAIL1]`, `[PHONE1]`
- `mapping` preserves the original values for unmasking

## LLM Prompt Privacy Use Case

This package is designed for privacy-conscious applications that need to pass user-generated prompts through an LLM without exposing sensitive personal data to third-party services.

Typical workflow:

1. Detect and mask PII in the user prompt before sending it to the LLM.
2. Send only masked text to the LLM to avoid exposing sensitive personal data.
3. Unmask the LLM response to restore the original context locally.
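
The three steps above can be wired together in a small helper. The sketch below is only an illustration and is not part of the package: `private_completion` and `call_llm` are hypothetical names, and the masking functions are passed in as arguments so the flow can be tried without a real LLM client. It assumes the masking function returns a dict with `masked_text` and `mapping` keys, as in the Quick Start.

```python
from typing import Callable


def private_completion(
    prompt: str,
    mask_fn: Callable[[str], dict],
    unmask_fn: Callable[[str, dict], str],
    call_llm: Callable[[str], str],
) -> str:
    """Mask a prompt, send only the masked text to the LLM, then unmask the reply.

    Hypothetical helper: `call_llm` stands in for whatever LLM client you use.
    """
    result = mask_fn(prompt)                    # step 1: detect and mask PII locally
    reply = call_llm(result["masked_text"])     # step 2: only masked text is sent out
    return unmask_fn(reply, result["mapping"])  # step 3: restore original values locally
```

In real use you would likely pass the package's `mask` and `unmask` as `mask_fn` and `unmask_fn`, and your own client wrapper as `call_llm`.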

### Complete Advanced Example Workflow (Multiple PII of Same Type)

**Step 1: Original User Prompt (with Multiple PII Types)**

```
Please help me coordinate between my two offices. I work at 123 Main Street, Boston, MA 02118 and also at 456 Oak Avenue, New York, NY 10001.
For the Boston office, contact me at john.doe@example.com or (415) 555-4321.
For the New York office, use jane.smith@example.com or (212) 555-6789.
I need to discuss relocation of our primary office from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001.
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is john.doe@example.com if primary line is unavailable.
Ensure all responses reference the correct office before implementation.
```

**Step 2: Mask the Prompt**

```python
from pii_masker import mask, unmask

original_prompt = """Please help me coordinate between my two offices. I work at 123 Main Street, Boston, MA 02118 and also at 456 Oak Avenue, New York, NY 10001.
For the Boston office, contact me at john.doe@example.com or (415) 555-4321.
For the New York office, use jane.smith@example.com or (212) 555-6789.
I need to discuss relocation of our primary office from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001.
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is john.doe@example.com if primary line is unavailable.
Ensure all responses reference the correct office before implementation."""

masked_result = mask(original_prompt)
masked_prompt = masked_result["masked_text"]
mapping = masked_result["mapping"]
```

**Step 3: Masked Prompt Sent to LLM (Multiple PII Replaced Consistently)**

```
Please help me coordinate between my two offices. I work at [ADDRESS1] and also at [ADDRESS2].
For the Boston office, contact me at [EMAIL1] or [PHONE1].
For the New York office, use [EMAIL2] or [PHONE2].
I need to discuss relocation of our primary office from [ADDRESS1] to [ADDRESS2].
Please include transition details and ensure all team members at both locations are notified.
Also, backup contact for Boston is [EMAIL1] if primary line is unavailable.
Ensure all responses reference the correct office before implementation.
```

Notice how consistent placeholders are used:
- `123 Main Street, Boston, MA 02118` → `[ADDRESS1]` (appears 2x)
- `456 Oak Avenue, New York, NY 10001` → `[ADDRESS2]` (appears 2x)
- `john.doe@example.com` → `[EMAIL1]` (appears 2x)
- `jane.smith@example.com` → `[EMAIL2]`
- `(415) 555-4321` → `[PHONE1]`
- `(212) 555-6789` → `[PHONE2]`

**Step 4: LLM Response (with masked placeholders)**

```
Certainly! Here's a coordinated transition plan:

For [ADDRESS1] (Boston):
- Notify all staff at [ADDRESS1] of the transition
- Primary contact: [EMAIL1] or [PHONE1]
- Backup contact: [EMAIL1]

For [ADDRESS2] (New York):
- Establish new operations at [ADDRESS2]
- Primary contact: [EMAIL2] or [PHONE2]
- Ensure infrastructure at [ADDRESS2] is ready

Transition Timeline:
- Week 1: Communicate with staff at [ADDRESS1] and [ADDRESS2]
- Week 2-3: Begin relocation from [ADDRESS1] to [ADDRESS2]
- Week 4: Finalize all operations at [ADDRESS2]

All confirmations should be sent to [EMAIL1] and [EMAIL2].
```

**Step 5: Unmask the Response Locally**

```python
# llm_response holds the text returned by your LLM client (the Step 4 output).
unmasked_response = unmask(llm_response, mapping)
print(unmasked_response)
```
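
For intuition about what this unmasking step involves, the sketch below restores placeholders in a single regex pass. It is an assumption-laden illustration, not the package's code: `unmask_sketch` is a hypothetical name, and the mapping is assumed to have the nested `{TYPE: {original: placeholder_name}}` shape shown in the sample CLI output.

```python
import re

# Matches placeholders such as [EMAIL1] or [ADDRESS2].
PLACEHOLDER_RE = re.compile(r"\[([A-Z]+\d+)\]")


def unmask_sketch(masked_text: str, mapping: dict) -> str:
    """Restore originals, assuming mapping is {TYPE: {original: placeholder_name}}."""
    # Invert the nested mapping into placeholder name -> original value.
    reverse = {
        name: original
        for by_type in mapping.values()
        for original, name in by_type.items()
    }
    # A single pass does not re-scan restored text, so an original value that
    # happens to look like a placeholder is not substituted again.
    # Placeholders missing from the mapping are left untouched.
    return PLACEHOLDER_RE.sub(
        lambda m: reverse.get(m.group(1), m.group(0)), masked_text
    )
```

Leaving unknown placeholders in place is a design choice for this sketch: if the LLM invents a token such as `[EMAIL3]`, it stays visible in the output instead of being silently dropped.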

**Step 6: Final Unmasked Response (All Original PII Restored)**

```
Certainly! Here's a coordinated transition plan:

For 123 Main Street, Boston, MA 02118 (Boston):
- Notify all staff at 123 Main Street, Boston, MA 02118 of the transition
- Primary contact: john.doe@example.com or (415) 555-4321
- Backup contact: john.doe@example.com

For 456 Oak Avenue, New York, NY 10001 (New York):
- Establish new operations at 456 Oak Avenue, New York, NY 10001
- Primary contact: jane.smith@example.com or (212) 555-6789
- Ensure infrastructure at 456 Oak Avenue, New York, NY 10001 is ready

Transition Timeline:
- Week 1: Communicate with staff at 123 Main Street, Boston, MA 02118 and 456 Oak Avenue, New York, NY 10001
- Week 2-3: Begin relocation from 123 Main Street, Boston, MA 02118 to 456 Oak Avenue, New York, NY 10001
- Week 4: Finalize all operations at 456 Oak Avenue, New York, NY 10001

All confirmations should be sent to john.doe@example.com and jane.smith@example.com.
```

**Key Benefits:**

- Detected PII is replaced with placeholders before the prompt leaves your system
- The LLM sees only placeholders in place of the detected values
- Responses are unmasked locally after receiving them
- Complete control over sensitive data

## API

- `mask(text: str) -> dict`
- `unmask(masked_text: str, mapping: dict) -> str`
- `batch_mask(texts: list[str]) -> list[dict]`

## CLI Usage

After installing the package, use the command-line interface:

```bash
pii-masker "John lives at 123 Main St, Boston, MA 02118 and contacts john@example.com"
```

To also print the mapping as formatted JSON, pass `--show-mapping`:

```bash
pii-masker "Jane Doe, 500 Elm Rd, Springfield, IL 62704" --show-mapping
```

### Sample CLI output

Output for the first command above when run with `--show-mapping`:

```text
[PERSON1] lives at [ADDRESS1] and contacts [EMAIL1]
{
  "PERSON": {
    "John": "PERSON1"
  },
  "ADDRESS": {
    "123 Main St, Boston, MA 02118": "ADDRESS1"
  },
  "EMAIL": {
    "john@example.com": "EMAIL1"
  }
}
```

## Testing

Run the test suite with pytest:

```bash
python -m pytest
```

## Notes

- The spaCy model is loaded only once and reused across calls, which avoids repeated load time.
- Address detection merges street, city, state, and ZIP heuristics to avoid address fragmentation.
- Mapping is case-insensitive, so `John` and `john` reuse the same placeholder.
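
One way to get the case-insensitive behavior described in the last note is to normalize each value before looking it up. The sketch below assumes that approach; `PlaceholderRegistry` is a hypothetical class for illustration and may not match how the package stores its mapping.

```python
class PlaceholderRegistry:
    """Hand out numbered placeholders, treating values that differ only in case as equal."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._by_key = {}  # casefolded value -> placeholder string

    def placeholder_for(self, value: str) -> str:
        key = value.casefold()  # "John" and "john" share one key
        if key not in self._by_key:
            self._by_key[key] = f"[{self.label}{len(self._by_key) + 1}]"
        return self._by_key[key]
```

With a registry created as `PlaceholderRegistry("PERSON")`, both `John` and `john` resolve to `[PERSON1]`, and the next new name receives `[PERSON2]`.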

## Publishing to GitHub and PyPI

### GitHub

This repository is ready to be initialized as a Git repository and pushed to GitHub.

```bash
cd path/to/pii-masker-gpt
git init
git add .
git commit -m "Initial package commit"
git remote add origin https://github.com/<your-username>/<your-repo>.git
git branch -M main
git push -u origin main
```

The included GitHub Actions workflow at `.github/workflows/python-publish.yml` can publish a release automatically when a tag is pushed.

### PyPI

To publish to PyPI manually:

```bash
python -m pip install --upgrade build twine
python -m build
python -m twine upload dist/*
```

To publish automatically from GitHub Actions, add a repository secret named `PYPI_API_TOKEN` with your PyPI API token.

Then create and push a tag:

```bash
git tag v0.1.0
git push origin v0.1.0
```
