Metadata-Version: 2.4
Name: sovereign-vault
Version: 1.0.1
Summary: Reversible PII tokenization for LLM pipelines — send documents to cloud AI without exposing real data
License: MIT
Project-URL: Homepage, https://lexipro.online
Project-URL: Documentation, https://github.com/brokenbartender/sovereign-vault#readme
Project-URL: Repository, https://github.com/brokenbartender/sovereign-vault
Project-URL: Issues, https://github.com/brokenbartender/sovereign-vault/issues
Project-URL: Part of, https://lexipro.online
Keywords: privacy,pii,llm,tokenization,gdpr,hipaa,ai,redaction,nlp,ner,anthropic,gemini,ollama,gliner,mcp,agentic-ai,data-privacy,compliance,forensics,sovereign-ai
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Legal Industry
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Security :: Cryptography
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: ner
Requires-Dist: gliner>=0.1.0; extra == "ner"
Provides-Extra: llm
Requires-Dist: ollama>=0.1.0; extra == "llm"
Provides-Extra: all
Requires-Dist: gliner>=0.1.0; extra == "all"
Requires-Dist: ollama>=0.1.0; extra == "all"
Dynamic: license-file

# sovereign-vault

**Reversible PII tokenization for LLM pipelines.**

Send documents containing real names, SSNs, emails, and account numbers to any cloud AI — Claude, Gemini, GPT — without exposing the actual values. The AI reasons about relationships and patterns on placeholder tokens. You reconstruct the real values locally after the response comes back.

```bash
pip install sovereign-vault
```

[![PyPI version](https://badge.fury.io/py/sovereign-vault.svg)](https://pypi.org/project/sovereign-vault/)
[![Downloads](https://img.shields.io/pypi/dm/sovereign-vault.svg)](https://pypi.org/project/sovereign-vault/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![CI](https://github.com/brokenbartender/sovereign-vault/actions/workflows/ci.yml/badge.svg)](https://github.com/brokenbartender/sovereign-vault/actions)

---

## The problem

You have documents with names, SSNs, emails, and account numbers. You need a cloud AI to analyze patterns, identify anomalies, or summarize findings. But you can't send the raw PII — compliance, legal, or common sense says no.

Standard redaction destroys the data permanently. The AI then can't reason about cross-entity relationships — *"the same person appears in both transactions"* becomes impossible once everything is `[REDACTED]`.

## The solution

Sovereign Vault replaces PII with **stable, HMAC-bound tokens** scoped to a session. Within a session, the same value always maps to the same token, so the AI can track relationships across a document. You reconstruct the real values locally after the cloud call.

```python
from sovereign_vault import VaultSession

with VaultSession() as vault:
    abstract = vault.tokenize(
        "John Doe (SSN: 123-45-6789) transferred funds to "
        "Jane Smith (SSN: 987-65-4321) via john@firm.com on 2024-01-15."
    )
    # abstract:
    # "[[PERSON_A1B2C3D4_e5f6a7]] (SSN: [[SSN_B8C9D0E1_f2a3b4]]) transferred
    #  funds to [[PERSON_F5G6H7I8_j9k0l1]] (SSN: [[SSN_J2K3L4M5_n6o7p8]])
    #  via [[EMAIL_N9O0P1Q2_r3s4t5]] on 2024-01-15."

    response = your_llm_client.complete(abstract)  # cloud sees only tokens

    result = vault.reconstruct(response)  # real values restored locally
    # VaultSession.destroy() called automatically on context exit
```

No disk writes. No persistence between sessions. The mapping lives in RAM and is wiped on `destroy()`.
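
To make "stable, HMAC-bound" concrete, here is a minimal sketch of how such a token could be derived with the standard library. It is not the library's actual implementation: the `make_token` helper, the `id:` / `tag:` prefixes, and the truncation lengths are assumptions chosen to resemble the token shape shown above.

```python
import hashlib
import hmac
import secrets


def make_token(secret: bytes, label: str, value: str) -> str:
    """Derive a stable, tagged placeholder for `value` (illustrative format only)."""
    # Stable ID: the same (label, value) pair under the same secret always hashes alike.
    ident = hmac.new(
        secret, f"id:{label}:{value}".encode(), hashlib.sha256
    ).hexdigest()[:8].upper()
    body = f"{label}_{ident}"
    # Integrity tag: binds the token body to the session secret.
    tag = hmac.new(secret, f"tag:{body}".encode(), hashlib.sha256).hexdigest()[:6]
    return f"[[{body}_{tag}]]"


session_secret = secrets.token_bytes(32)  # per-session secret, held in RAM only

first = make_token(session_secret, "PERSON", "John Doe")
again = make_token(session_secret, "PERSON", "John Doe")
other_session = make_token(secrets.token_bytes(32), "PERSON", "John Doe")

print(first == again)          # same session, same value: same token
print(first == other_session)  # a new session secret yields a different token
```

Because the ID depends on the session secret, a token reveals nothing about the value to anyone without that secret, and tokens from one session cannot be replayed into another.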

---

## Detection layers

Three layers run in sequence. Layers 2 and 3 are optional; Layer 1 is always active, so detection never falls below regex-level reliability.

| Layer | Method | Confidence | Requires |
|-------|--------|-----------|---------|
| 1 — Regex | Deterministic structural patterns | 1.0 | Nothing (always active) |
| 2 — GLiNER | Probabilistic NLP NER | 0.85× model score | `pip install sovereign-vault[ner]` |
| 3 — Ollama | Contextual LLM sweep | 0.65 | Local Ollama + `pip install sovereign-vault[llm]` |

Layer 3 runs only when GLiNER finds fewer than 3 entities. It handles implicit identifiers and role references that regex and NER miss.

**Regex catches:** SSN, phone, email, IP address, credit card, passport, Michigan DL, court case numbers

**GLiNER catches:** person names, organizations, locations, addresses, DOB, financial accounts, government IDs, medical record numbers

**Ollama catches:** contextual identifiers — "the defendant", "Account #XYZ", implicit role-based references
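
The cascade described above can be sketched as follows. This is an illustration under stated assumptions, not the package's internals: `Entity`, `detect`, `regex_layer`, and the two stub detectors are hypothetical names, only two regex patterns are shown, and the sketch assumes the Layer 3 sweep also runs when no NER layer is installed.

```python
import re
from typing import Callable, NamedTuple, Optional


class Entity(NamedTuple):
    label: str
    text: str
    source_layer: str
    confidence: float


Detector = Callable[[str], list[Entity]]

# Two illustrative Layer 1 patterns; a real pattern table would be broader.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def regex_layer(text: str) -> list[Entity]:
    """Layer 1: deterministic matches, confidence fixed at 1.0."""
    return [
        Entity(label, match.group(), "regex", 1.0)
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def detect(
    text: str,
    ner_layer: Optional[Detector] = None,
    llm_layer: Optional[Detector] = None,
    llm_threshold: int = 3,
) -> list[Entity]:
    found = regex_layer(text)  # always runs
    ner_hits = ner_layer(text) if ner_layer else []
    found.extend(ner_hits)
    # The contextual sweep runs only when NER found fewer than `llm_threshold` entities.
    if llm_layer and len(ner_hits) < llm_threshold:
        found.extend(llm_layer(text))
    return found


# Stubs standing in for GLiNER and Ollama so the sketch runs without either installed.
def fake_ner(text: str) -> list[Entity]:
    return [Entity("PERSON", "Alice", "gliner", 0.85 * 0.9)]


def fake_llm(text: str) -> list[Entity]:
    return [Entity("ROLE", "the defendant", "ollama", 0.65)]


sample = "Alice (alice@corp.com, SSN 123-45-6789) is the defendant."
entities = detect(sample, ner_layer=fake_ner, llm_layer=fake_llm)
for entity in entities:
    print(entity)
```

Passing no optional layers reduces `detect` to the regex pass, which mirrors the core install with no extras.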

---

## Installation

```bash
# Core (regex only — no dependencies)
pip install sovereign-vault

# With NLP entity recognition
pip install sovereign-vault[ner]

# With local LLM sweep (requires Ollama running locally)
pip install sovereign-vault[llm]

# Everything
pip install sovereign-vault[all]
```

---

## Usage

### Basic round-trip

```python
from sovereign_vault import VaultSession

raw = "Alice (alice@corp.com, SSN 123-45-6789) authorized the transfer."

with VaultSession(use_gliner=False, use_ollama=False) as vault:
    abstract = vault.tokenize(raw)
    # Send `abstract` to cloud AI
    cloud_response = call_your_cloud_ai(abstract)
    restored = vault.reconstruct(cloud_response)
```

### LENIENT mode — cloud paraphrased some tokens

```python
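# Assumes ReconMode is exported from the package top level, like VaultSession.
from sovereign_vault import ReconMode, VaultSession

with VaultSession(recon_mode=ReconMode.LENIENT) as vault: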
    abstract = vault.tokenize(raw)
    cloud_response = call_cloud(abstract)
    # Won't raise even if cloud dropped or paraphrased some tokens
    restored = vault.reconstruct(cloud_response)
```

### SEALED mode — abstract output only, no reconstruction

```python
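# Assumes SealMode is exported from the package top level, like VaultSession.
from sovereign_vault import SealMode, VaultSession

with VaultSession(seal_mode=SealMode.SEALED) as vault: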
    abstract = vault.tokenize(raw)
    # Reconstruction is intentionally disabled
    # Use when the abstract output IS the final product
```

### Audit log — chain of custody, no real values

```python
vault = VaultSession()
try:
    vault.tokenize(raw)
    for entry in vault.audit_log():
        print(entry["label"], entry["source_layer"], entry["confidence"])
finally:
    vault.destroy()  # wipe the mapping even if tokenize() raises
```

### Multi-session / server use

```python
from sovereign_vault import new_session, get_session, drop_session

sid, vault = new_session()
abstract = vault.tokenize(raw)
# ... pass sid to the next step in your pipeline ...
vault2 = get_session(sid)
restored = vault2.reconstruct(cloud_output)
drop_session(sid)  # destroys and deregisters
```

---

## Security model

- **RAM-only, session-scoped** — no disk writes, no persistence between sessions
- **HMAC-bound tokens** — each token carries an HMAC tag derived from a 32-byte session secret; tampered or injected tokens raise `VaultSealBreach`
- **Injection prevention** — input containing pre-existing `[[...]]` vault token format is rejected immediately
- **Entropy leak detection** — `reconstruct()` flags high-entropy tokens in cloud output that may be inferred identifiers
- **Best-effort memory wipe** — `destroy()` overwrites real values with random bytes before clearing
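
As a rough illustration of the tag check and the injection guard listed above, the sketch below verifies tokens found in cloud output and rejects input that already contains `[[...]]` syntax. Everything here is an assumption made for the example: the token grammar in `TOKEN_RE`, the 6-character truncated tag, and the `SealBreach` class, which only stands in for the library's `VaultSealBreach`.

```python
import hashlib
import hmac
import re
import secrets

# Assumed token shape: [[LABEL_XXXXXXXX_tttttt]] with a hex ID and a hex tag.
TOKEN_RE = re.compile(r"\[\[([A-Z]+_[0-9A-F]{8})_([0-9a-f]{6})\]\]")
ANY_BRACKETS_RE = re.compile(r"\[\[.*?\]\]")


class SealBreach(Exception):
    """Hypothetical stand-in for the library's VaultSealBreach."""


def _tag(secret: bytes, body: str) -> str:
    return hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()[:6]


def seal(secret: bytes, body: str) -> str:
    """Attach an HMAC tag to a token body such as 'PERSON_A1B2C3D4'."""
    return f"[[{body}_{_tag(secret, body)}]]"


def reject_preexisting_tokens(raw_input: str) -> None:
    """Refuse input that already contains vault-style brackets."""
    if ANY_BRACKETS_RE.search(raw_input):
        raise SealBreach("input already contains [[...]] token syntax")


def verify_tokens(secret: bytes, cloud_output: str) -> list[str]:
    """Return well-formed tokens whose tags verify; raise on any bad tag."""
    verified = []
    for match in TOKEN_RE.finditer(cloud_output):
        body, tag = match.groups()
        # compare_digest avoids leaking the mismatch position through timing.
        if not hmac.compare_digest(tag, _tag(secret, body)):
            raise SealBreach(f"bad tag on token {match.group()}")
        verified.append(match.group())
    return verified


secret = secrets.token_bytes(32)
genuine = seal(secret, "PERSON_A1B2C3D4")
# Flip the final tag character to simulate tampering.
forged = genuine[:-3] + ("0" if genuine[-3] != "0" else "1") + "]]"

print(verify_tokens(secret, f"Summary: {genuine} moved funds."))
try:
    verify_tokens(secret, f"Summary: {forged} moved funds.")
except SealBreach as exc:
    print("rejected:", exc)
```

A 6-hex-character tag is short, so this only shows the mechanism; a real deployment would weigh tag length against token size.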

---

## Reconstruction modes

| Mode | Behavior |
|------|----------|
| `ReconMode.STRICT` (default) | Raises `VaultReconstructionDegraded` if the cloud response dropped any vault token |
| `ReconMode.LENIENT` | Allows partial reconstruction — logs missing tokens as warnings |
| `SealMode.SEALED` | Disables reconstruction entirely — raises `VaultSealBreach` if attempted |
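
The difference between strict and lenient reconstruction can be sketched with a plain token-to-value dictionary. This is a simplified model, not the library's code: `Mode`, `ReconstructionDegraded`, and `reconstruct` are hypothetical names, and tag verification is left out to keep the focus on missing tokens.

```python
import enum
import logging

log = logging.getLogger("vault-sketch")


class Mode(enum.Enum):
    STRICT = "strict"
    LENIENT = "lenient"


class ReconstructionDegraded(Exception):
    """Hypothetical stand-in for VaultReconstructionDegraded."""


def reconstruct(cloud_output: str, mapping: dict[str, str], mode: Mode = Mode.STRICT) -> str:
    # A token is "missing" if the cloud response no longer contains it verbatim.
    missing = [token for token in mapping if token not in cloud_output]
    if missing:
        if mode is Mode.STRICT:
            raise ReconstructionDegraded(f"{len(missing)} token(s) missing from cloud output")
        log.warning("lenient reconstruction: %d token(s) missing", len(missing))
    restored = cloud_output
    for token, real_value in mapping.items():
        restored = restored.replace(token, real_value)
    return restored


mapping = {
    "[[PERSON_A1B2C3D4_e5f6a7]]": "Alice",
    "[[EMAIL_N9O0P1Q2_r3s4t5]]": "alice@corp.com",
}
# The cloud kept the person token but dropped the email token.
response = "[[PERSON_A1B2C3D4_e5f6a7]] authorized the transfer."

print(reconstruct(response, mapping, Mode.LENIENT))  # Alice authorized the transfer.
try:
    reconstruct(response, mapping, Mode.STRICT)
except ReconstructionDegraded as exc:
    print("strict mode refused:", exc)
```

Strict mode suits pipelines where a dropped token means the answer can no longer be trusted; lenient mode suits summaries where the model is expected to omit some entities.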

---

## Use cases

- **Forensic e-discovery** — send document patterns to cloud AI without exposing real names or case numbers
- **HIPAA pipelines** — analyze medical records cross-entity without raw patient identifiers leaving your perimeter
- **Financial fraud detection** — transaction pattern analysis without raw account numbers
- **Gov/defense document processing** — reason about relationships in sensitive case files
- **Cross-agent PII passing** — sanitize data moving between local and cloud agents in an agentic pipeline

---

## Part of the LexiPro Sovereign OS

Sovereign Vault is a component of **[LexiPro](https://lexipro.online)** — a local-first agentic OS running 15 MCP servers, 228 tools, and 20 agent personas on sovereign hardware. In the full OS, it powers **Workflow O (Privacy Bridge)**: tokenize before any cloud call, reconstruct locally after, audit trail preserved.

Powered by:
- **[Anthropic Claude](https://anthropic.com)** — Tier 5 reasoning backbone for multi-file analysis
- **[Google Gemini](https://deepmind.google/technologies/gemini/)** — OSINT, research, and long-context processing
- **[Ollama](https://ollama.ai)** — Layer 3 local LLM sweep (Gemma, Llama) for contextual entity detection
- **[GLiNER](https://github.com/urchade/GLiNER)** — Layer 2 NLP NER for named entity recognition

---

## Contributing

Issues and PRs welcome. The detection layer system is designed for extension: add new regex patterns to `REGEX_PATTERNS`, add new GLiNER entity types to `_GLINER_TYPES`, or swap the Ollama model via the `ollama_model` parameter.

---

## Known Limitations

| Limitation | Impact | Mitigation |
|------------|--------|------------|
| RAM-only storage | Vault lost if process crashes mid-pipeline | Call `vault.destroy()` in a `finally` block; checkpoint vault keys externally if needed |
| Probabilistic NER (GLiNER/Ollama) | Novel PII formats may not be detected | Use `coverage_report()` after tokenize to assess detection quality |
| Regex layer matches plain text only | PII written with HTML entities or other encodings may slip through | Pre-normalize input with `html.unescape()` before tokenizing |
| Session-scoped tokens | Same real value gets a different token in different sessions | Process each document in a single session; do not open a new session per chunk |
| Not a legal compliance layer | Sovereign Vault assists compliance; it cannot replace legal review | Combine with your organization's data classification policy |
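
The pre-normalization mitigation from the table can be as small as the helper below. `normalize_for_tokenizing` is a hypothetical name, and the NFKC step goes beyond the `html.unescape()` suggestion above: it is an optional extra that folds compatibility forms such as full-width digits, and it may alter other characters you want preserved.

```python
import html
import unicodedata


def normalize_for_tokenizing(text: str) -> str:
    # Decode HTML entities such as &#64; or &amp; into literal characters.
    decoded = html.unescape(text)
    # Optional: NFKC folds compatibility forms (full-width digits, etc.) into plain ones.
    return unicodedata.normalize("NFKC", decoded)


raw = "Contact john&#64;firm.com, SSN 123&#45;45&#45;6789"
clean = normalize_for_tokenizing(raw)
print(clean)  # Contact john@firm.com, SSN 123-45-6789
```

Run the helper before `tokenize()` so the regex layer sees `@` and `-` as literal characters.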

---

## Comparison: Sovereign Vault vs. alternatives

| Feature | sovereign-vault | Microsoft Presidio | AWS Comprehend PII | Simple regex redaction |
|---------|----------------|-------------------|-------------------|----------------------|
| Reversible tokenization | Yes | Partial (encrypt/decrypt operators) | No | No |
| HMAC integrity on tokens | Yes | No | No | No |
| Offline capable | Yes (regex layer) | Partial | No (API) | Yes |
| Named entity detection | Yes (GLiNER + Ollama) | Yes (spaCy) | Yes (cloud) | No |
| STRICT mode audit trail | Yes | No | No | No |
| Cloud cost | $0 (local) | $0 (local) | Per-call | $0 |
| Setup complexity | pip install | pip + models + server | AWS credentials | None |


---

## Compliance Disclaimer

Sovereign Vault is a **technical tool** that assists with PII handling in LLM
pipelines. It is not a legal compliance product and does not constitute legal
advice.

**GDPR / HIPAA / CCPA:** Tokenizing PII before sending it to a cloud model
reduces exposure but does not by itself satisfy the requirements of any data
protection regulation. Your compliance obligations depend on your specific use
case, data classification, and organizational policies. Consult qualified legal
counsel before deploying in regulated environments.

**What sovereign-vault does:**
- Replaces PII with HMAC-bound tokens so cloud AI never receives raw values
- Provides an audit trail of all vaulted entities (no real values in log)
- Wipes vault from RAM on `destroy()` call

**What sovereign-vault does NOT do:**
- Guarantee detection of all PII in all languages and formats
- Provide legal indemnification or certification
- Replace a data classification policy or DPO review
- Encrypt data at rest (vault is RAM-only by design)


---

## License

MIT — see [LICENSE](LICENSE).

Built by [Broken Arrow Entertainment LLC](https://lexipro.online) · Sovereign Intelligence Systems Group
