Metadata-Version: 2.4
Name: entropyguard
Version: 1.22.1
Summary: High-Performance Semantic Deduplication Tool for RAG Pipelines
License: MIT
License-File: LICENSE
Keywords: rag,llm,deduplication,data-engineering,nlp,cli,ai
Author: Damian Siuta
Author-email: dami.siuta@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Provides-Extra: all
Provides-Extra: gpu
Provides-Extra: logging
Provides-Extra: metrics
Requires-Dist: faiss-cpu (>=1.7.4,<2.0.0)
Requires-Dist: fastexcel (>=0.11.6,<0.12.0)
Requires-Dist: numpy (>=1.24.0,<2.0.0)
Requires-Dist: polars (>=0.20.0,<0.21.0)
Requires-Dist: prometheus-client (>=0.19.0,<0.20.0) ; extra == "metrics" or extra == "all"
Requires-Dist: pyarrow (>=15.0.0,<16.0.0)
Requires-Dist: pydantic (>=2.5.0,<3.0.0)
Requires-Dist: sentence-transformers (>=2.0.0,<3.0.0)
Requires-Dist: structlog (>=24.1.0,<25.0.0) ; extra == "logging" or extra == "all"
Requires-Dist: torch (>=2.1.0,<3.0.0) ; extra == "gpu" or extra == "all"
Requires-Dist: tqdm (>=4.66.0,<5.0.0)
Requires-Dist: typing-extensions (>=4.8.0,<5.0.0)
Requires-Dist: xxhash (>=3.4.0,<4.0.0)
Project-URL: Homepage, https://github.com/DamianSiuta/entropyguard
Project-URL: Repository, https://github.com/DamianSiuta/entropyguard
Description-Content-Type: text/markdown

# 🛡️ EntropyGuard v1.22.1

<div align="center">

**The Unbreakable RAG Data Cleaner**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://www.docker.com/)
[![Production Ready](https://img.shields.io/badge/status-production--ready-green.svg)](https://github.com/DamianSiuta/entropyguard)

**Enterprise-grade semantic data deduplication and sanitization engine for LLM training data.**

[Features](#-key-features) • [Quick Start](#-quick-start) • [Installation](#-installation) • [Documentation](#-documentation)

</div>

---

## Why EntropyGuard?

### The Problem: Dirty Data = Hallucinations & Wasted Money

Training Large Language Models on contaminated, redundant, or low-quality data leads to:
- **Model Collapse** — Degraded performance from duplicate content
- **Hallucinations** — Inaccurate outputs from poor training data
- **Wasted Compute** — Paying for processing duplicate data multiple times
- **Compliance Risks** — PII and sensitive data in training sets

### The Solution: Local CPU Processing with Hybrid Deduplication

EntropyGuard runs **100% locally** on your CPU—no data ever leaves your machine. Perfect for:
- **Air-gapped environments** (no cloud dependencies)
- **Privacy compliance** (GDPR, HIPAA, SOC 2)
- **Cost efficiency** (no API calls, no cloud fees)
- **Enterprise security** (complete data sovereignty)

---

## ✨ Key Features

### 🛡️ **Fault Tolerant**
- **Checkpoint/Resume System** — Automatic recovery from failures
- **Memory Safety** — Chunked processing prevents OOM errors
- **Graceful Shutdown** — SIGINT/SIGTERM handling (Windows + Unix)
- **Error Recovery** — Automatic retry with exponential backoff

### 🚀 **High Performance**
- **Hybrid Engine** — Hash-based exact dedup + AI semantic similarity
- **Unix Pipes Support** — Stream processing for data engineering workflows
- **Lazy Evaluation** — Polars LazyFrame for datasets larger than RAM
- **Optimized Memory** — Pre-materialization checks prevent OOM
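
The hash-based half of the hybrid engine can be pictured with a minimal sketch. This is an illustration only, not EntropyGuard's implementation: the helper name `drop_exact_duplicates`, the whitespace and case normalization, and the use of `hashlib.blake2b` (standing in for the `xxhash` dependency so the snippet needs only the standard library) are all assumptions.

```python
import hashlib


def drop_exact_duplicates(texts):
    """Keep the first occurrence of each text, comparing by content hash.

    Hypothetical helper for illustration; not part of EntropyGuard's API.
    """
    seen = set()
    unique = []
    for text in texts:
        # Assumed normalization: collapse whitespace runs and ignore case.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.blake2b(normalized.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["Hello world", "hello   world", "Something else"]
print(drop_exact_duplicates(docs))  # ['Hello world', 'Something else']
```

A cheap pass like this means only the surviving rows would need embeddings, which is the usual reason to run a hash stage before a semantic one.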

### 📉 **Memory Safe**
- **Chunked Processing** — Process datasets larger than available RAM
- **Memory Profiling** — Track memory usage per pipeline stage
- **Resource Guards** — Disk space and memory checks before operations

### 📊 **Observability**
- **Prometheus Metrics** — Export pipeline metrics for monitoring
- **Structured Logging** — JSON logs with correlation IDs
- **Progress Tracking** — Real-time ETA and throughput estimation
- **Audit Logs** — Complete audit trail of all operations

### 🔒 **Enterprise Ready**
- **Standard Exit Codes** — sysexits.h compliant for automation
- **Type Safety** — Full type hints (MyPy strict compatible)
- **Configuration Validation** — Pydantic-based schema validation
- **Input Validation** — Format detection and consistency checks

---

## ⚡ Quick Start

### The "Magic" Command

```bash
# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
```

### Basic Usage

```bash
# File-to-file processing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --text-column text \
  --dedup-threshold 0.95

# With custom settings
entropyguard \
  --input data.ndjson \
  --output cleaned.ndjson \
  --text-column content \
  --min-length 100 \
  --dedup-threshold 0.9 \
  --chunk-size 500
```

### Advanced: Checkpoint & Resume

```bash
# Enable automatic checkpoint recovery
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --text-column text

# Resume from checkpoint manually
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --resume \
  --text-column text
```

---

## 📦 Installation

### Option 1: pip from PyPI (Recommended)

```bash
pip install entropyguard
```

**Requirements:**
- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)

### Option 2: Install from Git

```bash
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
```

**Requirements:**
- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
- `git` available on your system

### Option 3: Docker

```bash
# Build image
docker build -t entropyguard:latest .

# Run container
docker run -v "$(pwd)":/data entropyguard:latest \
  --input /data/input.jsonl \
  --output /data/output.jsonl \
  --text-column text
```

### Option 4: Development Setup

```bash
git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install
```

---

## 📋 CLI Flags Reference

Complete reference for all available flags:

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| **Input/Output** |
| `--input` | string | `-` (stdin) | Path to input file (CSV, JSON, NDJSON). Use `-` for stdin |
| `--output` | string | `-` (stdout) | Path to output file (NDJSON). Use `-` for stdout |
| `--text-column` | string | auto-detect | Name of text column to process. Auto-detects first string column if omitted |
| `--required-columns` | string | None | Comma-separated list of required columns (optional schema validation) |
| **Processing Options** |
| `--min-length` | int | `50` | Minimum text length after sanitization (characters) |
| `--dedup-threshold` | float | `0.95` | Similarity threshold for semantic deduplication (0.0-1.0). Higher values require closer matches, so fewer rows are dropped as duplicates |
| `--model-name` | string | `all-MiniLM-L6-v2` | Sentence-transformers model for embeddings. Use `paraphrase-multilingual-MiniLM-L12-v2` for multilingual |
| `--batch-size` | int | `10000` | Batch size for embedding processing. Reduce for low-memory systems |
| **Chunking (RAG)** |
| `--chunk-size` | int | None | Chunk size (characters) for splitting long texts. Disabled if not set |
| `--chunk-overlap` | int | `50` | Overlap size (characters) between consecutive chunks. Only used with `--chunk-size` |
| `--separators` | list | default | Custom separators for chunking (space-separated). Use `\n` for newline, `\t` for tab |
| **Checkpoint & Resume** |
| `--checkpoint-dir` | string | None | Directory to save checkpoints for error recovery |
| `--resume` | flag | false | Resume from last checkpoint if available. Requires `--checkpoint-dir` |
| `--no-auto-resume` | flag | false | Disable automatic checkpoint recovery (requires explicit `--resume`) |
| **Logging & Output** |
| `--verbose` | flag | false | Enable verbose logging (INFO level) |
| `--debug` | flag | false | Enable debug mode (DEBUG level + full tracebacks). Implies `--verbose` |
| `--demo` | flag | false | Demo mode: Hide INFO logs, show only progress bars and final summary |
| `--quiet` | flag | false | Disable progress bars (useful for CI/CD) |
| `--json` | flag | false | Output results as JSON (machine-readable format) |
| `--json-logs` | flag | false | Output logs as JSON (for log aggregation systems) |
| **Monitoring & Profiling** |
| `--profile-memory` | flag | false | Enable memory profiling. Tracks usage at each pipeline stage |
| `--memory-report-path` | string | None | Path to save memory profiling report (JSON). Requires `--profile-memory` |
| `--metrics-port` | int | None | Start Prometheus metrics HTTP server on specified port |
| `--audit-log` | string | None | Path to JSON file for audit log of dropped/duplicate rows |
| **Configuration** |
| `--config` | string | auto-detect | Path to config file (JSON/YAML/TOML). Auto-detects `.entropyguardrc` in current/home dir |
| **Utility** |
| `--dry-run` | flag | false | Simulate processing without expensive operations. Shows statistics only |
| `--version` | flag | - | Show version number and exit |

### Flag Categories Explained

**Input/Output**: Control where data comes from and goes to. Supports Unix pipes (`-` for stdin/stdout).

**Processing Options**: Core deduplication settings. `--dedup-threshold` controls how similar texts must be to be considered duplicates (0.95 = 95% similarity).
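
As a rough illustration of how such a threshold behaves, the sketch below compares toy vectors. It assumes the similarity measure is cosine similarity and that the comparison is inclusive (`>=`); neither detail is stated here, and `is_duplicate` is a hypothetical name rather than an EntropyGuard function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_duplicate(a, b, threshold=0.95):
    # Assumption: a pair counts as duplicate when similarity >= threshold.
    return cosine_similarity(a, b) >= threshold


near = ([1.0, 0.0], [0.9, 0.1])       # similarity is roughly 0.994
unrelated = ([1.0, 0.0], [0.0, 1.0])  # similarity is 0.0

print(is_duplicate(*near))                   # True
print(is_duplicate(*near, threshold=0.999))  # False
print(is_duplicate(*unrelated))              # False
```

Raising the threshold makes the same near-identical pair stop counting as a duplicate, so higher values remove fewer rows.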

**Chunking (RAG)**: For Retrieval-Augmented Generation workflows. Splits long texts into smaller chunks with configurable overlap.
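
A minimal sketch of fixed-size chunking with overlap, measured in characters as in the flag table. The function `chunk_text` is hypothetical, and it deliberately ignores `--separators`: a separator-aware splitter would prefer to break at those boundaries, which this sketch does not attempt.

```python
def chunk_text(text, chunk_size, chunk_overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration; not EntropyGuard's splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```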

**Checkpoint & Resume**: Fault tolerance features. Automatically saves progress and can resume from failures.

**Logging & Output**: Control verbosity and output format. `--demo` is perfect for video demonstrations.

**Monitoring & Profiling**: Production observability. Memory profiling helps debug OOM issues, and Prometheus metrics enable monitoring.

**Configuration**: Use config files to avoid repeating flags. CLI arguments override config file values.
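
The precedence rule can be sketched as a layered merge: built-in defaults, then config file values, then CLI arguments. The `resolve_settings` helper and the convention that an unset flag is represented as `None` are assumptions made for illustration; the default values are taken from the flag table.

```python
import json

# Defaults as listed in the CLI flag table.
DEFAULTS = {"min_length": 50, "dedup_threshold": 0.95, "chunk_overlap": 50}


def resolve_settings(config_text, cli_args):
    """Merge defaults, config file values, and CLI arguments, in that order.

    Hypothetical helper; later layers override earlier ones.
    """
    file_values = json.loads(config_text) if config_text else {}
    # Assumption: flags the user did not pass arrive as None and are skipped.
    cli_values = {key: value for key, value in cli_args.items() if value is not None}
    return {**DEFAULTS, **file_values, **cli_values}


settings = resolve_settings(
    '{"min_length": 100, "chunk_size": 500}',
    {"min_length": 200, "dedup_threshold": None},
)
print(settings)
# {'min_length': 200, 'dedup_threshold': 0.95, 'chunk_overlap': 50, 'chunk_size': 500}
```

Here `min_length` comes from the command line, `chunk_size` from the file, and the remaining keys fall back to the defaults.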

---

## 🏢 Enterprise / Advanced Usage

### Configuration File (`.entropyguardrc.json`)

Create a configuration file in your home directory or project root:

```json
{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "chunk_size": 500,
  "chunk_overlap": 50,
  "remove_pii": true,
  "normalize_text": true,
  "show_progress": true
}
```

Then run:

```bash
entropyguard --input data.jsonl --output clean.jsonl
```

### Monitoring & Observability

```bash
# Enable Prometheus metrics
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --metrics-port 9090 \
  --text-column text

# Enable memory profiling
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --profile-memory \
  --text-column text

# JSON logs for machine parsing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --json-logs \
  --text-column text
```

### Exit Codes

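EntropyGuard uses `sysexits.h`-style exit codes. The numeric assignments below are the tool's own and differ in places from the canonical header, where `64` is `EX_USAGE`, `65` is `EX_DATAERR`, and `66` is `EX_NOINPUT`: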

| Code | Meaning |
|------|---------|
| `0` | Success |
| `1` | General error |
| `2` | Usage error (invalid arguments) |
| `64` | Data format error |
| `65` | Input file error |
| `66` | Output file error |
| `70` | Software error (internal bug) |
| `130` | Process interrupted (SIGINT/Ctrl+C) |
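
For automation, a wrapper script can translate these codes into messages. The sketch below copies the table into a dictionary; the helper names `describe_exit_code` and `run_entropyguard` are hypothetical, and the wrapper assumes the `entropyguard` executable is on `PATH`.

```python
import subprocess

# Meanings copied from the exit code table.
EXIT_MEANINGS = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid arguments)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (internal bug)",
    130: "Process interrupted (SIGINT/Ctrl+C)",
}


def describe_exit_code(code):
    """Return a human-readable meaning for an exit code."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")


def run_entropyguard(args):
    """Run the CLI with the given argument list and report how it ended.

    Assumes `entropyguard` is installed and on PATH.
    """
    result = subprocess.run(["entropyguard", *args])
    return result.returncode, describe_exit_code(result.returncode)


print(describe_exit_code(65))  # Input file error
```

A pipeline step could then call, for example, `run_entropyguard(["--input", "data.jsonl", "--output", "clean.jsonl"])` and decide from the returned code whether a retry makes sense.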

---

## 📊 Comparison

| Feature | EntropyGuard | Basic Scripts | Vector DBs |
|---------|-------------|---------------|------------|
| **Exact Deduplication** | ✅ Hash-based (fast) | ⚠️ Manual | ❌ |
| **Semantic Deduplication** | ✅ AI-powered | ❌ | ✅ |
| **Local Processing** | ✅ 100% local | ✅ | ⚠️ Requires DB |
| **Memory Safety** | ✅ Chunked processing | ⚠️ Manual | ⚠️ Depends on DB |
| **Fault Tolerance** | ✅ Checkpoint/Resume | ❌ | ⚠️ Depends on DB |
| **Unix Pipes** | ✅ Native support | ⚠️ Manual | ❌ |
| **Observability** | ✅ Metrics + Logs | ❌ | ⚠️ Depends on DB |
| **Configuration** | ✅ Pydantic validation | ❌ | ⚠️ DB-specific |
| **Type Safety** | ✅ Full type hints | ❌ | ⚠️ Depends on language |

---

## 🛠️ Tech Stack

- **Core:** Python 3.10+, Polars (LazyFrame)
- **AI/ML:** PyTorch (CPU), FAISS, Sentence-Transformers
- **Validation:** Pydantic v2
- **Logging:** structlog (optional)
- **Metrics:** Prometheus Client (optional)
- **Infrastructure:** Poetry, Docker-ready

---

## 📋 Edition Comparison

EntropyGuard is available in two editions:

| Feature | **Community (Open Source)** | **Enterprise** |
|---------|----------------------------|----------------|
| **CLI Tool** | ✅ Full-featured | ✅ Full-featured |
| **Semantic Deduplication** | ✅ Unlimited | ✅ Unlimited |
| **PII Removal** | ✅ Unlimited | ✅ Unlimited |
| **Data Formats** | ✅ All formats | ✅ All formats |
| **Docker Support** | ✅ Yes | ✅ Yes |
| **Audit Logs** | ✅ Yes | ✅ Enhanced |
| **Web Dashboard** | ❌ | ✅ Professional Analytics Platform |
| **Real-time Monitoring** | ❌ | ✅ Live telemetry & metrics |
| **Alert System** | ❌ | ✅ Custom alert rules (Watchtower) |
| **API Access** | ❌ | ✅ RESTful API |
| **SSO Integration** | ❌ | ✅ SAML 2.0, OAuth 2.0 |
| **Support** | Community | Priority support with SLA |
| **License** | MIT License | Commercial license required |

> **📌 Legal Notice:** Enterprise features (Control Plane, Dashboard, API, Alerting System) are **proprietary software** covered by a commercial license. These components are **NOT included** in the Open Source release and are **NOT** subject to the MIT license terms.

---

## 📚 Documentation

- [Checkpoint & Resume Guide](./CHECKPOINT_RESUME_GUIDE.md)
- [Project Comprehensive Documentation](./PROJECT_COMPREHENSIVE_DOCUMENTATION.md)
- [Open Core Strategy](./OPEN_CORE_STRATEGY.md)

---

## 🤝 Contributing

Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

Built with ❤️ by the EntropyGuard Team

**Special thanks to:**
- [Polars](https://www.pola.rs/) for the amazing DataFrame library
- [Sentence-Transformers](https://www.sbert.net/) for semantic embeddings
- [FAISS](https://github.com/facebookresearch/faiss) for vector similarity search

---

<div align="center">

**[⬆ Back to Top](#-entropyguard-v1221)**

Made with ❤️ for the LLM community

</div>

