Metadata-Version: 2.4
Name: ai-prompt-sanitizer
Version: 1.0.0
Summary: Lightweight, tiered, bidirectional PII sanitizer for LLM pipelines
Project-URL: Homepage, https://github.com/jeslor/prompt-sanitizer
Project-URL: Repository, https://github.com/jeslor/prompt-sanitizer
Project-URL: Issues, https://github.com/jeslor/prompt-sanitizer/issues
License: MIT
Keywords: anonymization,gdpr,hipaa,llm,pii,privacy,redaction,sanitizer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: all
Requires-Dist: faker>=24.0.0; extra == 'all'
Requires-Dist: fastapi>=0.100.0; extra == 'all'
Requires-Dist: langchain-experimental>=0.0.60; extra == 'all'
Requires-Dist: langchain>=0.2.0; extra == 'all'
Requires-Dist: llama-index-core>=0.10.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: tokenizers>=0.19.0; extra == 'all'
Requires-Dist: torch>=2.0.0; extra == 'all'
Requires-Dist: transformers>=4.40.0; extra == 'all'
Provides-Extra: integrations
Requires-Dist: fastapi>=0.100.0; extra == 'integrations'
Requires-Dist: langchain-experimental>=0.0.60; extra == 'integrations'
Requires-Dist: langchain>=0.2.0; extra == 'integrations'
Requires-Dist: llama-index-core>=0.10.0; extra == 'integrations'
Requires-Dist: openai>=1.0.0; extra == 'integrations'
Provides-Extra: nlp
Requires-Dist: tokenizers>=0.19.0; extra == 'nlp'
Requires-Dist: torch>=2.0.0; extra == 'nlp'
Requires-Dist: transformers>=4.40.0; extra == 'nlp'
Provides-Extra: synthetic
Requires-Dist: faker>=24.0.0; extra == 'synthetic'
Description-Content-Type: text/markdown

# prompt-sanitizer

PII and secret sanitization for Python LLM pipelines.

`prompt-sanitizer` provides a typed API for detecting, redacting, anonymizing, and restoring sensitive values before they reach a model, tool, middleware layer, log sink, or SDK wrapper. FAST mode has zero required dependencies. SMART and FULL add optional NLP, synthetic replacement, and audit logging.

## Install

Requires Python 3.10+. The distribution installs as `ai-prompt-sanitizer` and imports as `prompt_sanitizer`.

```bash
pip install ai-prompt-sanitizer
pip install "ai-prompt-sanitizer[nlp]"
pip install "ai-prompt-sanitizer[synthetic]"
pip install "ai-prompt-sanitizer[integrations]"
pip install "ai-prompt-sanitizer[all]"
```

### Optional extras

| Extra | Adds | Typical use |
| --- | --- | --- |
| `nlp` | `transformers` + `torch` + `tokenizers` | NER in SMART/FULL mode |
| `synthetic` | `faker` | realistic fake replacements |
| `integrations` | framework / SDK adapters | LangChain, LlamaIndex, OpenAI, FastAPI, Django |
| `all` | all extras | full feature set |

## Quick start

```python
from prompt_sanitizer import Sanitizer, Mode

s = Sanitizer(mode=Mode.FAST)
result = s.sanitize("Contact Jane Doe at jane@example.com or 415-555-0112.")

print(result.text)
print(result.has_pii)
print(result.risk_score)
print(result.tokens)
for entity in result.entities:
    print(entity.entity_type, entity.value, entity.replacement)
```

## Modes

| Mode | Pipeline | Dependencies | Notes |
| --- | --- | --- | --- |
| `Mode.FAST` | regex + secret detectors | none | sub-ms, stdlib only |
| `Mode.SMART` | FAST + Piiranha NER | `ai-prompt-sanitizer[nlp]` | lazy-loads on first call |
| `Mode.FULL` | SMART + synthetic replacement + audit log | usually `nlp` + `synthetic` | best for compliance-oriented flows |

### FAST mode

```python
from prompt_sanitizer import Sanitizer, Mode

s = Sanitizer(mode=Mode.FAST)
text = "SSN 078-05-1120, card 4111 1111 1111 1111, token sk-proj-xxxxxxxxxxxxxxxxxxxxxxxx"
result = s.sanitize(text)

print(result.text)
print(result.entities)
print(result.tokens)
```

Use FAST for prompt pre-processing, log scrubbing, middleware guards, CI checks, and zero-dependency CLI tooling.
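
To make the regex tier concrete, here is a minimal stdlib-only sketch of pattern-based redaction with numbered placeholders. It is not the library's implementation: `PATTERNS`, `redact`, and the three regexes are hypothetical and deliberately loose.

```python
import re

# Illustrative patterns only; production detectors are usually stricter
# (for example, card numbers are typically validated with a Luhn check).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a numbered placeholder such as [EMAIL_1]."""
    tokens: dict[str, str] = {}    # original value -> placeholder
    counters: dict[str, int] = {}  # label -> how many distinct values seen

    for label, pattern in PATTERNS.items():
        def substitute(match: re.Match[str], label: str = label) -> str:
            value = match.group(0)
            if value not in tokens:
                counters[label] = counters.get(label, 0) + 1
                tokens[value] = f"[{label}_{counters[label]}]"
            return tokens[value]

        text = pattern.sub(substitute, text)
    return text, tokens


clean, tokens = redact("SSN 078-05-1120, card 4111 1111 1111 1111, mail a@example.com")
print(clean)
print(tokens)
```

Reusing the same placeholder for a repeated value keeps the `{original_value: replacement}` map one-to-one, which is what makes later restoration possible.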

### SMART mode

```python
from prompt_sanitizer import Sanitizer, Mode

s = Sanitizer(mode=Mode.SMART)
result = s.sanitize(
    "Alice from Acme Corp met us in Berlin on 2025-02-14. Email alice@acme.example."
)

print(result.text)
for entity in result.entities:
    print(entity.entity_type, entity.value, entity.confidence)
```

Use SMART when prompts contain free-form prose with names, organizations, dates, or locations that regexes alone may miss.

### FULL mode

```python
from prompt_sanitizer import Sanitizer, Mode, SQLiteAuditLog

audit = SQLiteAuditLog("prompt_sanitizer_audit.db")
s = Sanitizer(mode=Mode.FULL, locale="en_US", on_detect="redact", audit_log=audit)

result = s.sanitize("Customer Jane Doe uses jane@example.com and 415-555-0112.")
print(result.text)
print(result.tokens)
print(s.audit.export(format="json"))
```

Use FULL when you want synthetic replacement plus an audit trail.

## Public API

### `Sanitizer`

```python
Sanitizer(
    mode: Mode = Mode.FAST,
    locale: str = "en_US",
    entities: list[EntityType] | None = None,
    on_detect: str = "redact",
    audit_log: BaseAuditLog | None = None,
)
```

| Parameter | Type | Description |
| --- | --- | --- |
| `mode` | `Mode` | detection pipeline |
| `locale` | `str` | locale for synthetic replacement generation |
| `entities` | `list[EntityType] \| None` | optional allowlist of entity types |
| `on_detect` | `str` | `"redact"`, `"warn"`, or `"block"` |
| `audit_log` | `BaseAuditLog \| None` | optional audit backend |

| Method | Signature | Description |
| --- | --- | --- |
| `sanitize` | `sanitize(text: str, session_id: str \| None = None) -> SanitizeResult` | sanitize one string |
| `sanitize_batch` | `sanitize_batch(texts: list[str]) -> list[SanitizeResult]` | sanitize multiple inputs |
| `session` | `session(session_id: str \| None = None) -> Session` | create a reusable anonymization session |
| `add_entity` | `add_entity(name: str, pattern: str, confidence: float = 0.85) -> None` | register a custom entity |
| `stream` | `stream(source: AsyncIterable, session: Session \| None) -> AsyncGenerator[str, None]` | restore streamed chunks |
| `guard` | `guard(on_detect: str) -> decorator` | decorate a function with sanitization logic |
| `audit` | `.audit -> BaseAuditLog \| None` | access the configured audit log |
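
Restoring streamed chunks has one subtle case: a placeholder such as `[EMAIL_1]` can arrive split across two chunks. The sketch below shows one way to handle that with a small buffer. It uses only the standard library and assumes bracketed placeholders; `restore_stream`, `fake_model`, and `mapping` are hypothetical names, and the library's `stream()` may work differently.

```python
import asyncio
from collections.abc import AsyncIterator


async def restore_stream(
    chunks: AsyncIterator[str], mapping: dict[str, str]
) -> AsyncIterator[str]:
    """Yield restored text, holding back any unfinished "[...]" placeholder."""
    buffer = ""
    async for chunk in chunks:
        buffer += chunk
        # Text before the last unclosed "[" is safe to emit now.
        cut = buffer.rfind("[")
        if cut != -1 and "]" not in buffer[cut:]:
            ready, buffer = buffer[:cut], buffer[cut:]
        else:
            ready, buffer = buffer, ""
        for placeholder, original in mapping.items():
            ready = ready.replace(placeholder, original)
        if ready:
            yield ready
    if buffer:
        yield buffer  # stream ended mid-placeholder; emit the remainder as-is


async def fake_model() -> AsyncIterator[str]:
    # Stand-in for a streaming LLM response; the placeholder is split in two.
    for piece in ["I will email [EM", "AIL_1] shortly."]:
        yield piece


async def main() -> str:
    mapping = {"[EMAIL_1]": "elena@company.com"}
    return "".join([part async for part in restore_stream(fake_model(), mapping)])


restored = asyncio.run(main())
print(restored)
```

A real implementation would likely cap the buffer size so that a stray `[` in ordinary text cannot delay output indefinitely.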

### Detection policy

| `on_detect` value | Behavior |
| --- | --- |
| `"redact"` | rewrite the returned text |
| `"warn"` | return original text, but populate entities and scores |
| `"block"` | raise instead of returning sanitized text |

```python
results = s.sanitize_batch(["Email a@example.com", "No sensitive data here"])

@s.guard(on_detect="redact")
def call_model(prompt: str) -> str:
    return prompt
```
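
The three policies can be pictured as a decorator that inspects the prompt before the wrapped function runs. This is a stdlib-only sketch, not the library's `guard`: the single `EMAIL` regex, the `[EMAIL]` placeholder, and `BlockedPromptError` are assumptions made for illustration, and the exception the library raises for `"block"` may have a different name.

```python
import functools
import re
from collections.abc import Callable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class BlockedPromptError(Exception):
    """Hypothetical error type used only in this sketch."""


def guard(on_detect: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    if on_detect not in {"redact", "warn", "block"}:
        raise ValueError(f"unknown on_detect value: {on_detect!r}")

    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        @functools.wraps(func)
        def wrapper(prompt: str) -> str:
            found = EMAIL.search(prompt) is not None
            if found and on_detect == "block":
                raise BlockedPromptError("sensitive data detected")
            if found and on_detect == "redact":
                prompt = EMAIL.sub("[EMAIL]", prompt)
            # "warn" passes the original prompt through unchanged.
            return func(prompt)

        return wrapper

    return decorator


@guard(on_detect="redact")
def call_model(prompt: str) -> str:
    return prompt  # stand-in for a real model call


print(call_model("Email a@example.com"))
```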

---

## `Mode`, `SanitizeResult`, and `DetectedEntity`

| `Mode` value | Meaning |
| --- | --- |
| `Mode.FAST` | regex + secrets, zero deps, sub-ms |
| `Mode.SMART` | FAST + Piiranha NER, lazy loads on first call |
| `Mode.FULL` | SMART + synthetic replacement + audit log |

| `SanitizeResult` attribute | Type | Description |
| --- | --- | --- |
| `text` | `str` | sanitized text |
| `entities` | `list[DetectedEntity]` | detected spans |
| `tokens` | `dict[str, str]` | `{original_value: replacement}` map |
| `risk_score` | `float` | composite score from `0.0` to `1.0` |
| `has_pii` | `bool` | whether sensitive data was found |

| `DetectedEntity` attribute | Type | Description |
| --- | --- | --- |
| `entity_type` | `EntityType` | entity classification |
| `value` | `str` | original matched value |
| `start` | `int` | inclusive start offset |
| `end` | `int` | exclusive end offset |
| `confidence` | `float` | detection confidence |
| `replacement` | `str \| None` | replacement value, if generated |

```python
result = s.sanitize("Contact me at sam@example.com")
assert result.has_pii is True
assert 0.0 <= result.risk_score <= 1.0
for entity in result.entities:
    print(entity.entity_type, entity.value, entity.replacement)
```
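
The `start`/`end` convention (inclusive start, exclusive end) matches Python slicing, so a span can be checked against its source text directly. The sketch below uses a hypothetical `Span` dataclass as a stand-in for `DetectedEntity` and assumes offsets refer to the original input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for a detected entity: [start, end) offsets into the input."""

    value: str
    start: int
    end: int


text = "Contact me at sam@example.com"
value = "sam@example.com"
start = text.index(value)
span = Span(value=value, start=start, end=start + len(value))

# With an exclusive end, slicing recovers the value with no off-by-one fix.
assert text[span.start:span.end] == span.value
assert span.end - span.start == len(span.value)
print(span)
```
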

## Sessions and vaults

Use sessions when the model should never see raw values, but the final response should restore them.

```python
from prompt_sanitizer import Sanitizer

s = Sanitizer()
session = s.session(session_id="support-chat-001")
clean_prompt = session.anonymize("My name is Elena Ruiz and my email is elena@company.com")
llm_reply = "Confirmed. I will email [EMAIL_1] shortly."
final_reply = session.deanonymize(llm_reply)

print(clean_prompt)
print(final_reply)
```

| `Session` API | Description |
| --- | --- |
| `session.anonymize(text: str) -> str` | replace PII with vault tokens |
| `session.deanonymize(text: str) -> str` | restore originals from the vault |
| `session.vault: Vault` | access the underlying vault |

| `Vault` API | Description |
| --- | --- |
| `vault.store(value: str, replacement: str) -> None` | store a mapping |
| `vault.lookup(replacement: str) -> str \| None` | resolve token to original |
| `vault.reverse(value: str) -> str \| None` | resolve original to replacement |
| `vault.clear() -> None` | clear all mappings |

```python
vault = session.vault
vault.store("alice@example.com", "[EMAIL_1]")
print(vault.lookup("[EMAIL_1]"))
print(vault.reverse("alice@example.com"))
vault.clear()
```
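
The four vault operations amount to a bidirectional map. A minimal sketch, assuming two plain dictionaries kept in sync (the class name `SketchVault` is hypothetical, and the real `Vault` may store more than this):

```python
from typing import Optional


class SketchVault:
    """Two dicts kept in sync so lookups are cheap in both directions."""

    def __init__(self) -> None:
        self._by_replacement: dict[str, str] = {}  # token -> original
        self._by_value: dict[str, str] = {}        # original -> token

    def store(self, value: str, replacement: str) -> None:
        self._by_replacement[replacement] = value
        self._by_value[value] = replacement

    def lookup(self, replacement: str) -> Optional[str]:
        return self._by_replacement.get(replacement)

    def reverse(self, value: str) -> Optional[str]:
        return self._by_value.get(value)

    def clear(self) -> None:
        self._by_replacement.clear()
        self._by_value.clear()


sketch = SketchVault()
sketch.store("alice@example.com", "[EMAIL_1]")
print(sketch.lookup("[EMAIL_1]"))
print(sketch.reverse("alice@example.com"))
sketch.clear()
print(sketch.lookup("[EMAIL_1]"))
```
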

## Custom entities

Use `add_entity()` for internal identifiers, tenant-specific secrets, or domain-specific formats.

```python
from prompt_sanitizer import Sanitizer

s = Sanitizer()
s.add_entity(name="customer_id", pattern=r"\bCUS-\d{8}\b", confidence=0.90)
s.add_entity(name="invoice_no", pattern=r"\bINV-\d{6}-[A-Z]{2}\b", confidence=0.88)

result = s.sanitize("Customer CUS-12345678 opened invoice INV-882211-US")
print(result.text)
print(result.entities)
```
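
Before registering a pattern, it can help to check it against sample text with the standard `re` module. The sketch below does that for the two patterns above. It assumes `add_entity()` accepts standard Python regex syntax, which the source does not state explicitly.

```python
import re

candidates = {
    "customer_id": r"\bCUS-\d{8}\b",
    "invoice_no": r"\bINV-\d{6}-[A-Z]{2}\b",
}
sample = "Customer CUS-12345678 opened invoice INV-882211-US"

matches: dict[str, list[str]] = {}
for name, pattern in candidates.items():
    compiled = re.compile(pattern)  # raises re.error on an invalid pattern
    matches[name] = compiled.findall(sample)

print(matches)

# A near miss should not match: the customer ID needs exactly eight digits.
assert re.search(candidates["customer_id"], "CUS-1234") is None
```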

## Filtering by entity type

```python
from prompt_sanitizer import Sanitizer, EntityType

s = Sanitizer(entities=[EntityType.EMAIL, EntityType.API_KEY])
result = s.sanitize("Email a@b.com and SSN 123-45-6789")
print(result.text)
```

## Audit logging

Audit backends are optional. Use them when you want structured records of detections.

### `MemoryAuditLog`

```python
from prompt_sanitizer import MemoryAuditLog, Mode, Sanitizer

audit = MemoryAuditLog()
s = Sanitizer(mode=Mode.FULL, audit_log=audit)
s.sanitize("Email finance@example.com")

print(audit.events())
print(audit.export(format="json"))
```

### `SQLiteAuditLog`

```python
from prompt_sanitizer import SQLiteAuditLog, Mode, Sanitizer

audit = SQLiteAuditLog("audit.db")
s = Sanitizer(mode=Mode.FULL, audit_log=audit)
s.sanitize("Call +1 415 555 0112", session_id="request-17")

print(audit.events())
print(audit.export(format="csv"))
```

### Audit API

| API | Description |
| --- | --- |
| `MemoryAuditLog()` | in-memory list of `AuditEvent` |
| `SQLiteAuditLog(path: str)` | SQLite-backed persisted log |
| `.events() -> list[AuditEvent]` | return recorded events |
| `.export(format: "json" \| "csv") -> str` | export audit records |

## Integrations

Install integration dependencies first:

```bash
pip install "ai-prompt-sanitizer[integrations]"
```

### LangChain

```python
from langchain_core.output_parsers import StrOutputParser
from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.langchain import PromptSanitizerRunnable, SanitizedLLM

s = Sanitizer()
# `llm` is assumed to be a LangChain model you have already constructed.
# As a runnable step in a chain
chain = PromptSanitizerRunnable(sanitizer=s) | llm | StrOutputParser()
result = chain.invoke("My email is dev@example.com")

# Or wrap the LLM directly
safe_llm = SanitizedLLM(llm, s)
reply = safe_llm.invoke("Contact alice@example.com with the summary.")
```

### LlamaIndex

```python
from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.llamaindex import PromptSanitizerPostprocessor

s = Sanitizer()
postprocessor = PromptSanitizerPostprocessor(sanitizer=s)
# `index` is assumed to be a LlamaIndex index you have already built.
query_engine = index.as_query_engine(node_postprocessors=[postprocessor])
response = query_engine.query("Summarize the contract for jane@example.com")
```

### OpenAI SDK wrapper

```python
import openai
from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.openai import wrap

s = Sanitizer()
client = wrap(openai.OpenAI(), sanitizer=s)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "My card is 4111 1111 1111 1111"}],
)
```

### FastAPI middleware

```python
from fastapi import FastAPI
from prompt_sanitizer import Sanitizer
from prompt_sanitizer.integrations.fastapi import SanitizerMiddleware

s = Sanitizer()
app = FastAPI()
app.add_middleware(SanitizerMiddleware, sanitizer=s, fields=["prompt", "message"])
```

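### Django middleware

Add the middleware and its configuration to your Django settings module. The `integrations` extra does not list Django, so install it separately.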

```python
MIDDLEWARE = ["prompt_sanitizer.integrations.django.SanitizerMiddleware"]
```

```python
from prompt_sanitizer import Sanitizer

PROMPT_SANITIZER = {
    "sanitizer": Sanitizer(),
    "fields": ["prompt", "message"],
}
```

## Entity types

| Group | Values |
| --- | --- |
| core PII | `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IBAN`, `IP_ADDRESS`, `URL`, `DATE` |
| identity / org | `PERSON_NAME`, `ORGANIZATION`, `LOCATION` |
| secrets | `API_KEY`, `JWT_TOKEN`, `SECRET_KEY`, `AWS_KEY`, `GITHUB_TOKEN`, `OPENAI_KEY`, `ANTHROPIC_KEY` |
| extension | `CUSTOM` |

---

## Operational notes

- FAST mode is stdlib-only.
- SMART lazy-loads NER on first use.
- FULL is the best fit for synthetic replacement plus audit.
- `sanitize()` is for one-shot calls.
- `session()` is for reversible multi-turn workflows.
- `sanitize_batch()` treats each input independently.

## License

MIT.
