Metadata-Version: 2.4
Name: prompt-groomer
Version: 0.2.2
Summary: A lightweight Python library for optimizing and cleaning LLM inputs
Project-URL: Homepage, https://github.com/JacobHuang91/prompt-groomer
Project-URL: Repository, https://github.com/JacobHuang91/prompt-groomer
Project-URL: Issues, https://github.com/JacobHuang91/prompt-groomer/issues
Project-URL: Documentation, https://jacobhuang91.github.io/prompt-groomer/
Author-email: Jacob Huang <jacobhuang91@gmail.com>
License: MIT
Keywords: ai,llm,optimization,prompt,tokens
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Provides-Extra: benchmark
Requires-Dist: matplotlib>=3.7.0; extra == 'benchmark'
Requires-Dist: openai>=1.0.0; extra == 'benchmark'
Requires-Dist: pandas>=2.0.0; extra == 'benchmark'
Requires-Dist: python-dotenv>=1.0.0; extra == 'benchmark'
Requires-Dist: scikit-learn>=1.3.0; extra == 'benchmark'
Requires-Dist: tqdm>=4.65.0; extra == 'benchmark'
Provides-Extra: dev
Requires-Dist: mike>=2.0.0; extra == 'dev'
Requires-Dist: mkdocs-autorefs>=0.5.0; extra == 'dev'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'dev'
Requires-Dist: mkdocs>=1.5.0; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# Prompt Groomer

<div align="center">

[![PyPI version](https://img.shields.io/pypi/v/prompt-groomer.svg)](https://pypi.org/project/prompt-groomer/)
[![Python Versions](https://img.shields.io/pypi/pyversions/prompt-groomer.svg)](https://pypi.org/project/prompt-groomer/)
[![Downloads](https://img.shields.io/pypi/dm/prompt-groomer.svg)](https://pypi.org/project/prompt-groomer/)
[![GitHub Stars](https://img.shields.io/github/stars/JacobHuang91/prompt-groomer)](https://github.com/JacobHuang91/prompt-groomer)
[![CI Status](https://github.com/JacobHuang91/prompt-groomer/workflows/CI/badge.svg)](https://github.com/JacobHuang91/prompt-groomer/actions)
[![codecov](https://codecov.io/gh/JacobHuang91/prompt-groomer/branch/main/graph/badge.svg)](https://codecov.io/gh/JacobHuang91/prompt-groomer)
[![License](https://img.shields.io/github/license/JacobHuang91/prompt-groomer)](https://github.com/JacobHuang91/prompt-groomer/blob/main/LICENSE)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)
[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue.svg)](https://jacobhuang91.github.io/prompt-groomer/)

</div>

> 🧹 **A lightweight Python library for optimizing and cleaning LLM inputs.**
> **Save 10-20% on API costs** by removing invisible tokens, stripping HTML, and redacting PII.

<div align="center">

⭐ **If you find this useful, please star us on GitHub!** ⭐

</div>

---

### 🎯 Perfect for:

**RAG Applications** • **Chatbots** • **Document Processing** • **Production LLM Apps** • **Cost Optimization**

---

## Why use Prompt Groomer?

Stop paying for invisible tokens and dirty data.

| Feature | Before (Dirty Input) | After (Groomed) |
| :--- | :--- | :--- |
| **HTML Cleaning** | `<div><b>Hello</b> world</div>` | `Hello world` |
| **Whitespace** | `User    input\n\n\n  here` | `User input here` |
| **PII Redaction** | `Call me at 555-0199` | `Call me at [PHONE]` |
| **Deduplication** | `Same text.\n\nSame text.\n\nDifferent.` | `Same text.\n\nDifferent.` |
| **Token Cost** | ❌ **150 Tokens** | ✅ **85 Tokens** (Saved 43%) |

### 📦 It's this easy:

```python
from prompt_groomer import StripHTML, NormalizeWhitespace

cleaned = (StripHTML() | NormalizeWhitespace()).run(dirty_input)
```

### ✨ Key Features

- **🪶 Zero Dependencies** - Lightweight core with no external dependencies
- **🔧 Modular Design** - 4 focused modules: Cleaner, Compressor, Scrubber, Analyzer
- **⚡ Production Ready** - Battle-tested operations with comprehensive test coverage
- **🎯 Type Safe** - Full type hints for better IDE support and fewer bugs
- **📦 Easy to Use** - Modern pipe operator syntax (`|`), compose operations like LEGO blocks

## Overview

Prompt Groomer helps you clean and optimize prompts before sending them to LLM APIs. By removing unnecessary whitespace, duplicate characters, and other inefficiencies, you can:

- Reduce token usage and API costs
- Improve prompt quality and consistency
- Process inputs more efficiently

## Status

This project is in early development. Features are being added iteratively.

## Installation

```bash
# Using uv (recommended)
uv pip install prompt-groomer

# Using pip
pip install prompt-groomer
```

## Quick Start

```python
from prompt_groomer import StripHTML, NormalizeWhitespace, TruncateTokens

# ✨ The Pythonic "Pipe" Syntax (Recommended)
pipeline = (
    StripHTML()
    | NormalizeWhitespace()
    | TruncateTokens(max_tokens=1000)
)

raw_input = "<div>  User input with <b>lots</b> of   spaces... </div>"
clean_prompt = pipeline.run(raw_input)
# Output: "User input with lots of spaces..."
```

<details>
<summary><b>Alternative: Fluent API</b></summary>

Prefer method chaining? Use the traditional fluent API:

```python
from prompt_groomer import Groomer, StripHTML, NormalizeWhitespace, TruncateTokens

pipeline = (
    Groomer()
    .pipe(StripHTML())
    .pipe(NormalizeWhitespace())
    .pipe(TruncateTokens(max_tokens=1000))
)

clean_prompt = pipeline.run(raw_input)
```

</details>

> 💡 **Why pipe operator?** More concise, Pythonic, and familiar to LangChain/LangGraph users.

## 📊 Proven Effectiveness

We benchmarked Prompt Groomer on 30 real-world test cases (SQuAD + RAG scenarios) to measure token reduction and response quality:

<div align="center">

| Strategy | Token Reduction | Quality (Cosine) | Judge Approval | Overall Equivalent |
|----------|----------------|------------------|----------------|--------------------|
| **Minimal** | 4.3% | 0.987 | 86.7% | 86.7% |
| **Standard** | 4.8% | 0.984 | 90.0% | 86.7% |
| **Aggressive** | **15.0%** | 0.964 | 80.0% | 66.7% |

</div>

**Key Insights:**
- **Aggressive strategy achieves 3x more savings (15%) vs Minimal** while maintaining 96.4% quality
- Individual RAG tests showed **17-74% token savings** with aggressive strategy
- **Deduplicate** (Standard) shows minimal gains on typical RAG contexts
- **TruncateTokens** (Aggressive) provides the largest cost reduction for long contexts
- **Trade-off**: More aggressive = more savings but slightly lower judge approval

**Example: RAG with duplicates**
- Minimal (HTML + Whitespace): 17% reduction
- Standard (+ Deduplicate): 31% reduction
- **Aggressive (+ Truncate 150 tokens): 49% reduction** 🎉

<div align="center">

![Token Reduction vs Quality](benchmark/custom/results/benchmark_results.png)

</div>

> 💰 **Cost Savings:** At scale (1M tokens/month), 15% reduction saves **~$54/month** on GPT-4 input tokens.
>
> 📖 **See full benchmark:** [benchmark/custom/README.md](benchmark/custom/README.md)

## 🎮 Interactive Demo

Try prompt-groomer in your browser - no installation required!

<div align="center">

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/JacobHuang91/prompt-groomer)

**[🚀 Launch Interactive Demo →](https://huggingface.co/spaces/JacobHuang91/prompt-groomer)**

</div>

Play with different strategies, see real-time token savings, and find the perfect configuration for your use case. Features:

- 🎯 6 preset examples (e-commerce, support tickets, docs, RAG, etc.)
- ⚡ Quick strategy presets (Minimal, Standard, Aggressive)
- 💰 Real-time cost savings calculator
- 🔧 All 7 operations configurable
- 📊 Visual metrics dashboard

## 4 Core Modules

Prompt Groomer is organized into 4 specialized modules:

### 1. **Cleaner** - Clean Dirty Data
- `StripHTML()` - Remove HTML tags, convert to Markdown
- `NormalizeWhitespace()` - Collapse excessive whitespace
- `FixUnicode()` - Remove zero-width spaces and problematic Unicode

### 2. **Compressor** - Reduce Size
- `TruncateTokens()` - Smart truncation with sentence boundaries
  - Strategies: `"head"`, `"tail"`, `"middle_out"`
- `Deduplicate()` - Remove similar content (great for RAG)

### 3. **Scrubber** - Security & Privacy
- `RedactPII()` - Automatically redact emails, phones, IPs, credit cards, URLs, SSNs

### 4. **Analyzer** - Show Value
- `CountTokens()` - Track token savings and optimization impact

## Complete Example

```python
from prompt_groomer import (
    # Cleaner
    StripHTML, NormalizeWhitespace, FixUnicode,
    # Compressor
    Deduplicate, TruncateTokens,
    # Scrubber
    RedactPII,
    # Analyzer
    CountTokens
)

original_text = """<div>Your messy input here...</div>"""

# Create token counter to track savings
counter = CountTokens(original_text=original_text)

# Build the complete pipeline with all 4 modules
pipeline = (
    StripHTML(to_markdown=True)
    | NormalizeWhitespace()
    | FixUnicode()
    | Deduplicate(similarity_threshold=0.85)
    | TruncateTokens(max_tokens=500, strategy="head")
    | RedactPII(redact_types={"email", "phone"})
)

# Run and analyze
result = pipeline.run(original_text)
counter.process(result)

print(counter.format_stats())
# Output:
# Original: 8 tokens
# Cleaned: 5 tokens
# Saved: 3 tokens (37.5%)
```

## Examples

Check out the [`examples/`](examples/) folder for detailed examples organized by module:
- `cleaner/` - HTML cleaning, whitespace normalization, Unicode fixing
- `compressor/` - Smart truncation, deduplication
- `scrubber/` - PII redaction
- `analyzer/` - Token counting and cost savings
- `all_modules_demo.py` - Complete demonstration

## Development

This project uses `uv` for dependency management and `make` for common tasks.

```bash
# Install dependencies
make install

# Run tests
make test

# Format code
make format
```

## License

MIT