Metadata-Version: 2.4
Name: redactai
Version: 0.1.1
Summary: Make data safe before feeding it to AI
Project-URL: Homepage, https://github.com/jagreehal/redactai
Project-URL: Repository, https://github.com/jagreehal/redactai.git
Project-URL: Issues, https://github.com/jagreehal/redactai/issues
Author: Jag Reehal
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: anonymization,llm,pii,presidio,privacy,redaction
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Requires-Dist: asyncpg>=0.29.0
Requires-Dist: faker>=33.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: httpx>=0.28.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: presidio-analyzer>=2.2.0
Requires-Dist: presidio-anonymizer>=2.2.0
Requires-Dist: pydantic-settings>=2.7.0
Requires-Dist: pydantic>=2.10.0
Requires-Dist: pypdf>=5.0.0
Requires-Dist: python-docx>=1.1.0
Requires-Dist: python-multipart>=0.0.20
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: spacy>=3.8.0
Requires-Dist: typer>=0.15.0
Requires-Dist: uvicorn[standard]>=0.34.0
Requires-Dist: watchfiles>=1.0.0
Provides-Extra: auth
Requires-Dist: workos>=5.0.0; extra == 'auth'
Provides-Extra: dev
Requires-Dist: mypy>=1.13.0; extra == 'dev'
Requires-Dist: pandas>=2.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
Requires-Dist: pytest-cov>=6.0.0; extra == 'dev'
Requires-Dist: pytest>=8.3.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: scalar-fastapi>=1.0.0; extra == 'docs'
Provides-Extra: image
Requires-Dist: presidio-image-redactor>=0.0.50; extra == 'image'
Provides-Extra: mcp
Requires-Dist: fastmcp>=2.0.0; extra == 'mcp'
Provides-Extra: structured
Requires-Dist: pandas>=2.0.0; extra == 'structured'
Requires-Dist: presidio-structured>=0.0.3; extra == 'structured'
Provides-Extra: telemetry
Requires-Dist: opentelemetry-api>=1.29.0; extra == 'telemetry'
Requires-Dist: opentelemetry-exporter-otlp>=1.29.0; extra == 'telemetry'
Requires-Dist: opentelemetry-instrumentation-fastapi>=0.50b0; extra == 'telemetry'
Requires-Dist: opentelemetry-sdk>=1.29.0; extra == 'telemetry'
Description-Content-Type: text/markdown

# RedactAI

Strip PII from text, files, and pipelines before it reaches your AI.

## Install

```bash
pip install redactai
python -m spacy download en_core_web_sm
```

## Quick Start

**Clean a file:**

```bash
redactai clean data.csv -o data.clean.csv
```

**Clean text in Python:**

```python
from redactai import clean

safe = clean("Call John Smith at 555-0123")
# e.g. "Call Marcia Wells at 555-8912"  (Faker replacements by default, so the output varies)
```

**Scan for PII in CI:**

```bash
redactai scan ./data --ci  # exits 1 if PII detected
```

## CLI Commands

| Command | Description |
|---------|-------------|
| `redactai clean [PATH]` | Anonymize a file, folder, or stdin |
| `redactai scan PATH` | Detect PII and report findings |
| `redactai analyze` | Analyze text or file and return entity details |
| `redactai decrypt` | Decrypt previously encrypted output |
| `redactai watch PATH` | Watch a folder and clean files on change |
| `redactai init` | Generate a `.redactai.yml` config file |
| `redactai entities` | List supported PII entity types |
| `redactai profiles` | List built-in profiles |
| `redactai profiles show NAME` | Show profile details |
| `redactai mcp` | Start MCP tool server for AI agents |
| `redactai server start` | Start the local API daemon |
| `redactai server stop` | Stop the daemon |
| `redactai server status` | Show daemon status |
| `redactai server restart` | Restart the daemon |
| `redactai login` | Authenticate with a remote API |
| `redactai logout` | Remove stored credentials |
| `redactai whoami` | Show current auth status |
| `redactai redact-image PATH` | Redact PII from images using OCR |
| `redactai structured FILE` | Anonymize CSV and JSON files with column-aware detection |
| `redactai pseudonymize FILE` | Replace PII with consistent, restorable fakes |
| `redactai languages` | List supported languages |
| `redactai add-recognizer TYPE` | Add a Transformers, GLiNER, or Flair recognizer |
| `redactai annotate-pdf FILE` | Highlight PII in a PDF without altering the original |
| `redactai evaluate FILE` | Benchmark detection quality against ground truth labels |
| `redactai trace` | Explain why each PII entity was detected |
| `redactai redact-dicom PATH` | Redact pixel text and metadata tags in DICOM files |
| `redactai k-anonymity FILE COLUMNS` | Apply k-anonymity over the given columns |
| `redactai stream [FILE]` | Mask PII in a log file or stdin stream |
| `redactai synthetic INPUT` | Generate synthetic data from real data or text |

## Python API

```python
from redactai import clean, scan
```

### `clean(text, *, profile, threshold, language, entities, operators) -> str`

```python
# Use a built-in profile
clean("Email me at john@acme.com", profile="llm_guardrail")
# "Email me at <EMAIL_ADDRESS>"

# Override operator for a specific entity
clean("Call 555-0123", operators={"PHONE_NUMBER": {"type": "mask", "masking_char": "*", "chars_to_mask": 6}})
# "Call ***-****"
```

### `scan(text, *, threshold, language, entities) -> list[dict]`

```python
hits = scan("My SSN is 123-45-6789")
# [{"entity_type": "US_SSN", "start": 10, "end": 21, "score": 0.85, "text": "123-45-6789"}]
```
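
Because each hit carries `start` and `end` offsets, the result can be post-processed without calling RedactAI again. A minimal sketch, assuming hits have exactly the shape shown above; `summarize` and `replace_spans` are hypothetical helpers, not part of the package, and overlapping spans are not handled:

```python
from collections import Counter


def summarize(hits: list[dict]) -> dict[str, int]:
    """Count hits per entity type."""
    return dict(Counter(hit["entity_type"] for hit in hits))


def replace_spans(text: str, hits: list[dict]) -> str:
    """Replace each detected span with an <ENTITY_TYPE> placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for hit in sorted(hits, key=lambda h: h["start"], reverse=True):
        text = text[: hit["start"]] + f"<{hit['entity_type']}>" + text[hit["end"]:]
    return text


# Sample hit copied from the scan() example above.
text = "My SSN is 123-45-6789"
hits = [{"entity_type": "US_SSN", "start": 10, "end": 21, "score": 0.85, "text": "123-45-6789"}]

print(summarize(hits))            # {'US_SSN': 1}
print(replace_spans(text, hits))  # My SSN is <US_SSN>
```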

### Other exports

```python
from redactai import entities, profiles, profile_detail

entities()          # ["CREDIT_CARD", "EMAIL_ADDRESS", "PERSON", ...]
profiles()          # [{"id": "llm_guardrail@1", "name": "llm_guardrail", ...}, ...]
profile_detail("llm_guardrail")  # full config including operators
```

## Profiles

| Profile | Description | Threshold |
|---------|-------------|-----------|
| `llm_guardrail` | Redact all PII before sending to LLMs | 0.3 |
| `app_logs_safe` | Mask PII in logs, keep structure for debugging | 0.7 |
| `analytics_pseudonymized` | Replace PII with consistent fakes for analytics | 0.5 |
| `customer_support_shareable` | Redact sensitive PII, keep names/locations for context | 0.5 |
| `strict_compliance_export` | Maximum redaction for GDPR/HIPAA/CCPA compliance | 0.3 |
| `dev_demo_readable` | Replace PII with realistic Faker data for demos | 0.5 |
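
The threshold is a minimum confidence score, so a lower value keeps more candidate detections at the cost of more false positives. The sketch below illustrates that trade-off on hand-written sample hits (the scores are invented for illustration). It assumes the threshold is applied as a plain `score >= threshold` filter, and `filter_hits` is a hypothetical helper rather than a RedactAI function:

```python
def filter_hits(hits: list[dict], threshold: float) -> list[dict]:
    """Keep only hits whose confidence score meets the threshold."""
    return [hit for hit in hits if hit["score"] >= threshold]


# Illustrative hits; real scores come from the recognizers.
sample = [
    {"entity_type": "EMAIL_ADDRESS", "score": 1.0},
    {"entity_type": "PERSON", "score": 0.6},
    {"entity_type": "DATE_TIME", "score": 0.35},
]

# Thresholds taken from the profile table above.
for profile, threshold in [("llm_guardrail", 0.3), ("app_logs_safe", 0.7)]:
    kept = [hit["entity_type"] for hit in filter_hits(sample, threshold)]
    print(profile, kept)
```

With these sample scores, `llm_guardrail` keeps all three hits while `app_logs_safe` keeps only the email address.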

## CI/CD

### Exit code

```bash
redactai scan ./data --ci  # exit 0 = clean, exit 1 = PII found
```
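
When the scan is driven from a Python build script rather than a shell step, the same exit-code contract applies. A minimal sketch using only the standard library; `run_gate` is a hypothetical helper, and it treats any non-zero exit (including a crash) as a failure:

```python
import subprocess
import sys


def run_gate(command: list[str]) -> bool:
    """Run a command and return True only when it exits with code 0."""
    result = subprocess.run(command, check=False)
    return result.returncode == 0


# Demonstrate the contract with stand-in commands that exit 0 and 1.
print(run_gate([sys.executable, "-c", "raise SystemExit(0)"]))  # True
print(run_gate([sys.executable, "-c", "raise SystemExit(1)"]))  # False

# In a real build script, the command from the block above would be used:
#   if not run_gate(["redactai", "scan", "./data", "--ci"]):
#       sys.exit(1)
```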

### GitHub Actions

```yaml
- uses: actions/setup-python@v5
  with:
    python-version: "3.12"
- run: pip install redactai && python -m spacy download en_core_web_sm
- run: redactai scan ./data --ci
```

### pre-commit

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/jagreehal/redactai
    rev: v0.1.0
    hooks:
      - id: redactai-scan
```

## Config (.redactai.yml)

Generate a starter config:

```bash
redactai init
```

Minimal example:

```yaml
profile: llm_guardrail
threshold: 0.4
entities:
  - PERSON
  - EMAIL_ADDRESS
  - CREDIT_CARD

operators:
  PERSON:
    type: faker
    locale: en_US
  EMAIL_ADDRESS:
    type: redact

allow_list:
  - "Acme Corp"

files:
  include:
    - "**/*.csv"
    - "**/*.txt"
  exclude:
    - "**/node_modules/**"
  output_dir: ./clean
```

## Hooks

Three hook layers (shell commands, Python plugins, and webhooks) can fire on four events: `pre_scan`, `on_pii_detected`, `post_clean`, and `on_error`.

### Shell hooks (in .redactai.yml)

```yaml
hooks:
  post_clean:
    - shell: "echo 'Cleaned {{file}} -> {{output_file}}'"
  on_pii_detected:
    - shell: "notify-send 'PII found: {{entity_count}} entities in {{file}}'"
```

### Python plugins

```python
from redactai.hooks import on_pii_detected, HookEvent

@on_pii_detected
def alert(event: HookEvent):
    print(f"Found {event.entity_count} entities in {event.file}")
```

### Webhooks

```yaml
hooks:
  on_pii_detected:
    - url: "https://hooks.slack.com/services/..."
```

## MCP Server

Expose RedactAI as tools for Claude Desktop or other AI agents:

```bash
pip install "redactai[mcp]"  # quoted so shells like zsh do not expand the brackets
redactai mcp  # starts stdio transport
```

Add to Claude Desktop config (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "redactai": {
      "command": "redactai",
      "args": ["mcp"]
    }
  }
}
```

## API Server

```bash
# Start as background daemon (auto-starts on first CLI call)
redactai server start

# Or run in foreground
redactai server start --foreground

# Manage
redactai server status
redactai server restart
redactai server stop
```

The daemon exposes a REST API at `http://localhost:8000` with endpoints:
- `POST /analyze` -- detect PII entities
- `POST /anonymize` -- anonymize text
- `POST /upload` -- process files (multipart)
- `GET /health` -- health check
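
A client for these endpoints can be very small. The sketch below builds a request for `POST /analyze` with the standard library only. The JSON body shape (`{"text": ...}`) is an assumption, since the request schema is not documented here, and `build_analyze_request` is a hypothetical helper; check the running server's schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address given above


def build_analyze_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /analyze request."""
    # ASSUMPTION: the endpoint accepts a JSON object with a "text" field.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("Email me at john@acme.com")
print(request.get_method(), request.full_url)  # POST http://localhost:8000/analyze

# To send it against a running daemon:
#   with urllib.request.urlopen(request, timeout=5) as response:
#       print(json.load(response))
```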

## Persistence

Tokens and the audit log are stored in Postgres when `REDACTAI_DATABASE_URL` is set.
Without it, both fall back to in-process memory (convenient for local dev and tests,
but lost on restart and not safe across multiple instances).

```bash
# Supabase pooled connection (recommended for FastAPI)
export REDACTAI_DATABASE_URL="postgresql://postgres.PROJECT:PASSWORD@REGION.pooler.supabase.com:6543/postgres"
```

Schema is applied automatically on startup via `CREATE TABLE IF NOT EXISTS`.
Raw tokens are never stored; only their SHA-256 hash and a 12-character prefix
are kept. Revoked tokens are retained with a `revoked_at` timestamp for audit purposes.
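
The hash-plus-prefix scheme can be sketched in a few lines. This is an illustration of the idea, not RedactAI's code: the field names `token_hash` and `token_prefix` and the helpers `token_record` and `verify` are hypothetical.

```python
import hashlib
import hmac
import secrets

PREFIX_LENGTH = 12  # number of leading characters kept for identification


def token_record(raw_token: str) -> dict[str, str]:
    """Return the fields that would be persisted instead of the raw token."""
    return {
        "token_hash": hashlib.sha256(raw_token.encode("utf-8")).hexdigest(),
        "token_prefix": raw_token[:PREFIX_LENGTH],
    }


def verify(raw_token: str, record: dict[str, str]) -> bool:
    """Check a presented token against the stored hash in constant time."""
    candidate = hashlib.sha256(raw_token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, record["token_hash"])


raw = secrets.token_urlsafe(32)  # stand-in for a freshly issued token
record = token_record(raw)
print(record["token_prefix"], verify(raw, record))
```

The prefix lets an operator recognise a token in a listing, while the hash alone is enough to verify a presented token.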

### Testing

The default test suite runs entirely in-memory (no Postgres required).
Live-database smoke tests are marked `@pytest.mark.postgres` and auto-skip
unless `REDACTAI_DATABASE_URL` is set:

```bash
# Default — in-memory only
pytest

# Include Postgres smoke tests (requires live DB)
set -a; source .env; set +a
pytest -m postgres
```

## File Types

Supported: `.txt`, `.csv`, `.pdf`, `.docx`, `.png`, `.jpg`, `.jpeg`, `.bmp`, `.tiff`, `.json`

## Image Redaction

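Redact PII from images using OCR. The image dependencies (`presidio-image-redactor`) ship in the optional `image` extra, installed with `pip install "redactai[image]"`: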

```bash
# Single image
redactai redact-image screenshot.png -o screenshot.redacted.png

# Batch directory
redactai redact-image ./screenshots -o ./screenshots.redacted

# Custom fill color
redactai redact-image photo.jpg --fill "255,192,203"
```

## Structured Data

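Anonymize PII in CSV and JSON files with column-aware detection. The `pandas` and `presidio-structured` dependencies ship in the optional `structured` extra, installed with `pip install "redactai[structured]"`: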

```bash
# CSV
redactai structured data.csv -o data.clean.csv

# JSON
redactai structured data.json -o data.clean.json

# Custom strategy
redactai structured data.csv --strategy highest_confidence
```

## Pseudonymization

Consistent fake↔real mappings across files and sessions: with the same seed, the same input always produces the same output.

```bash
# Pseudonymize with deterministic seed
redactai pseudonymize data.txt --seed "project-alpha" --store mappings.json

# Restore originals
redactai pseudonymize data.pseudonymized.txt --restore --store mappings.json

# Show mapping stats
redactai pseudonymize data.txt --show-mapping --seed "project-alpha"
```
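
One way to get "same input, same output" is to derive the replacement from a keyed hash of the original value. The sketch below illustrates that idea with HMAC-SHA256 and a tiny name list. It is not RedactAI's implementation; `pseudonym`, `FAKE_NAMES`, and the in-memory `store` are hypothetical, and with a list this small two different inputs can collide on the same fake.

```python
import hashlib
import hmac

# Hypothetical replacement pool; a real tool would draw from a far larger one.
FAKE_NAMES = ["Marcia Wells", "Omar Haddad", "Lena Fischer", "Tomas Ruiz", "Priya Nair"]


def pseudonym(value: str, seed: str) -> str:
    """Map a value to a fake name, deterministically for a given seed."""
    digest = hmac.new(seed.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


# fake -> real lookup, the role a mapping store plays when restoring originals
store: dict[str, str] = {}

first = pseudonym("John Smith", seed="project-alpha")
second = pseudonym("John Smith", seed="project-alpha")
store[first] = "John Smith"

print(first == second)  # True: same seed and input give the same fake
print(store[first])     # John Smith
```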

## Multi-Language Support

20+ languages with dedicated spaCy models:

```bash
# List all supported languages
redactai languages

# Use a specific language
redactai clean document.txt --language de   # German
redactai clean document.txt --language ja   # Japanese
redactai clean document.txt --language zh   # Chinese
```

## Custom NER Recognizers

Plug in Transformers, GLiNER, or Flair models for domain-specific detection:

```bash
# Add a Transformers recognizer
redactai add-recognizer transformers --model obi/deid_roberta_i2b2 --threshold 0.5

# Add GLiNER zero-shot recognizer
redactai add-recognizer gliner --model urchade/gliner_medium-v2.1 --labels "PERSON,EMAIL,PHONE"

# Add Flair recognizer
redactai add-recognizer flair --model ner-multi
```

## PDF Annotation

Highlight PII in PDFs without destroying the original. Perfect for legal review and audit trails.

```bash
# Annotate with highlights
redactai annotate-pdf document.pdf -o document.annotated.pdf

# Use underline instead of highlight
redactai annotate-pdf document.pdf --type underline --color "0.0,0.0,1.0"

# Generate a PII report (JSON, CSV, or text)
redactai annotate-pdf document.pdf --report --report-format json
```

## Evaluation

Benchmark detection quality against ground truth labels. Critical for audit evidence.

```bash
# Run evaluation against ground truth
redactai evaluate ground_truth.json -o report --format both

# Custom threshold and entities
redactai evaluate ground_truth.json --threshold 0.5 --entities PERSON,EMAIL_ADDRESS
```

**Ground truth format (`ground_truth.json`):**
```json
[
  {
    "text": "My name is John Smith and email is john@example.com",
    "entities": [
      {"entity_type": "PERSON", "start": 11, "end": 21},
      {"entity_type": "EMAIL_ADDRESS", "start": 35, "end": 51}
    ]
  }
]
```
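
Offsets are 0-based with an exclusive `end`, so `text[start:end]` should yield exactly the labelled value. The sketch below checks that for a small sample and scores predictions by exact span match. The exact-match rule and the `spans` and `score` helpers are assumptions for illustration; RedactAI's own evaluator may match spans differently.

```python
def spans(entities: list[dict]) -> set[tuple[str, int, int]]:
    """Reduce entity dicts to comparable (type, start, end) tuples."""
    return {(e["entity_type"], e["start"], e["end"]) for e in entities}


def score(truth: list[dict], predicted: list[dict]) -> dict[str, float]:
    """Compute precision and recall using exact span matches."""
    expected, found = spans(truth), spans(predicted)
    true_positives = len(expected & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}


text = "Contact Jane Doe at jane@example.org"
truth = [
    {"entity_type": "PERSON", "start": 8, "end": 16},
    {"entity_type": "EMAIL_ADDRESS", "start": 20, "end": 36},
]

# Sanity-check the labels before trusting any metric built on them.
for entity in truth:
    print(entity["entity_type"], repr(text[entity["start"]:entity["end"]]))

# A detector that found the email but missed the name.
predicted = [{"entity_type": "EMAIL_ADDRESS", "start": 20, "end": 36}]
print(score(truth, predicted))  # {'precision': 1.0, 'recall': 0.5}
```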

## License

Apache 2.0 — see [LICENSE](LICENSE).

## Decision Trace

Explain exactly why each PII entity was detected — perfect for audit compliance and debugging.

```bash
# Show detailed decision trace
redactai trace --text "My name is John Smith and email is john@example.com"

# Trace from file
redactai trace --file document.txt --format json -o trace.json

# Show as markdown
redactai trace --file document.txt --format markdown
```

## DICOM Medical Redaction

HIPAA-compliant de-identification of medical images. Redacts both pixel text (OCR) and metadata tags.

```bash
# Single DICOM file
redactai redact-dicom scan.dcm -o scan.redacted.dcm

# Batch directory
redactai redact-dicom ./dicom_folder -o ./dicom_redacted

# Clean pixels only, keep metadata
redactai redact-dicom scan.dcm --no-clean-metadata
```

## K-Anonymity

Statistical anonymization guarantees: within the chosen quasi-identifier columns (`age,zip,gender` below), each record is indistinguishable from at least k-1 others.

```bash
# Apply k-anonymity (k=5)
redactai k-anonymity data.csv age,zip,gender -k 5 -o data.anonymous.csv

# With l-diversity check
redactai k-anonymity data.csv age,zip -k 5 --sensitive disease --l 3
```
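
To make the two properties concrete, the sketch below measures them on an already generalized table. It only checks k and l; it does not perform the generalization itself. `k_of` and `l_of` are hypothetical helpers, and the rows are invented sample data.

```python
from collections import Counter, defaultdict


def k_of(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return min(groups.values(), default=0)


def l_of(rows: list[dict], quasi_identifiers: list[str], sensitive: str) -> int:
    """Smallest number of distinct sensitive values within any group."""
    values: dict[tuple, set] = defaultdict(set)
    for row in rows:
        values[tuple(row[c] for c in quasi_identifiers)].add(row[sensitive])
    return min((len(v) for v in values.values()), default=0)


rows = [
    {"age": "30-39", "zip": "941**", "disease": "flu"},
    {"age": "30-39", "zip": "941**", "disease": "asthma"},
    {"age": "40-49", "zip": "100**", "disease": "flu"},
    {"age": "40-49", "zip": "100**", "disease": "flu"},
]

print(k_of(rows, ["age", "zip"]))             # 2
print(l_of(rows, ["age", "zip"], "disease"))  # 1
```

Here the table is 2-anonymous, but the second group contains only one distinct disease, so anyone known to be in that group is revealed to have the flu. That gap is what an l-diversity check is meant to catch.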

## Streaming Processing

Real-time PII masking for logs, telemetry, and data pipelines.

```bash
# Process log file
redactai stream app.log -o app.masked.log

# Process stdin (pipe)
tail -f /var/log/app.log | redactai stream

# Custom entities
redactai stream app.log --entities EMAIL_ADDRESS,IP_ADDRESS
```
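
The streaming model amounts to masking one line at a time so that memory use stays flat regardless of input size. The sketch below shows that shape with a generator. The two regular expressions are simplified stand-ins for RedactAI's detection, used here only to keep the example self-contained, and `mask_lines` is a hypothetical helper.

```python
import io
import re
from typing import Iterable, Iterator

# Simplified patterns for illustration; real detection is far more thorough.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each input line with emails and IPv4 addresses masked."""
    for line in lines:
        line = EMAIL.sub("<EMAIL_ADDRESS>", line)
        yield IPV4.sub("<IP_ADDRESS>", line)


# io.StringIO stands in for a log file or sys.stdin.
log = io.StringIO("login ok user=ann@example.com ip=10.0.0.7\nheartbeat\n")
for masked in mask_lines(log):
    print(masked, end="")
```

Passing `sys.stdin` instead of the `StringIO` object gives the piped behaviour, since both are iterated line by line.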

## Synthetic Data Generation

Generate realistic but fake datasets that preserve statistical patterns without exposing real PII.

```bash
# Generate synthetic CSV from real data
redactai synthetic real_data.csv -o synthetic.csv --num 1000

# Generate synthetic text
redactai synthetic "My name is John Smith, email john@test.com" -o synthetic.json

# Reproducible with seed
redactai synthetic data.csv --seed 42 --locale en_GB
```
