Metadata-Version: 2.4
Name: llmslim
Version: 0.1.0
Summary: Cut your LLM prompt size by 40-70% in one line of code -- semantic chunking + extractive summarization that preserves meaning, instructions, and key entities.
Author-email: Yashvardhan Thanvi <thanatos9404@users.noreply.github.com>
License: MIT
Project-URL: Homepage, https://github.com/Thanatos9404/llmslim
Project-URL: Repository, https://github.com/Thanatos9404/llmslim
Project-URL: Bug Tracker, https://github.com/Thanatos9404/llmslim/issues
Keywords: llm,prompt-engineering,token-optimization,rag,nlp,summarization,openai,anthropic,gemini,context-window,cost-optimization
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21
Requires-Dist: scikit-learn>=1.0
Provides-Extra: semantic
Requires-Dist: sentence-transformers>=2.2; extra == "semantic"
Provides-Extra: fast-tokens
Requires-Dist: tiktoken>=0.5; extra == "fast-tokens"
Provides-Extra: nlp
Requires-Dist: nltk>=3.8; extra == "nlp"
Provides-Extra: all
Requires-Dist: sentence-transformers>=2.2; extra == "all"
Requires-Dist: tiktoken>=0.5; extra == "all"
Requires-Dist: nltk>=3.8; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Dynamic: license-file

<div align="center">

<!-- Animated SVG Hero Banner -->
<img src="https://capsule-render.vercel.app/api?type=waving&color=0:0d1117,50:58a6ff,100:f778ba&height=220&section=header&text=llmslim&fontSize=52&fontColor=ffffff&animation=fadeIn&fontAlignY=35&desc=Cut%20your%20LLM%20costs%20by%2050%25%20in%201%20line%20of%20code&descAlignY=55&descSize=18&descColor=c9d1d9" width="100%"/>

<!-- Animated Typing Effect -->
<a href="https://github.com/Thanatos9404/llmslim">
  <img src="https://readme-typing-svg.demolab.com?font=JetBrains+Mono&weight=600&size=24&duration=3000&pause=1000&color=58A6FF&center=true&vCenter=true&multiline=true&repeat=true&width=700&height=80&lines=%F0%9F%94%A5+Semantic+Compression+for+LLM+Prompts;%E2%9A%A1+40-70%25+Token+Reduction+%7C+Zero+Meaning+Loss;%F0%9F%92%B0+Save+Thousands+on+API+Costs+Instantly" alt="Typing SVG" />
</a>

<br/>

<!-- Animated Flow Diagram -->
<p align="center">
  <img src="assets/compression-flow.svg" width="95%" alt="LLMSlim Pipeline Live Animation" />
</p>

<br/>

<!-- Badges Row 1: Status -->
[![PyPI version](https://img.shields.io/pypi/v/llmslim?style=for-the-badge&logo=pypi&logoColor=white&color=58a6ff)](https://pypi.org/project/llmslim/)
[![Python](https://img.shields.io/badge/python-3.8+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/license-MIT-f778ba?style=for-the-badge&logo=open-source-initiative&logoColor=white)](LICENSE)
[![Stars](https://img.shields.io/github/stars/Thanatos9404/llmslim?style=for-the-badge&logo=github&color=ffa657)](https://github.com/Thanatos9404/llmslim/stargazers)

<!-- Badges Row 2: Tech -->
[![Semantic](https://img.shields.io/badge/embeddings-sentence--transformers-7c3aed?style=for-the-badge&logo=huggingface&logoColor=white)](https://www.sbert.net/)
[![Works With](https://img.shields.io/badge/works%20with-GPT%20%7C%20Claude%20%7C%20Gemini-10b981?style=for-the-badge&logo=openai&logoColor=white)](#-works-with-every-llm)
[![Zero Config](https://img.shields.io/badge/zero%20config-just%20works-f59e0b?style=for-the-badge&logo=zap&logoColor=white)](#-quickstart)



<br/>

<!-- Hero Code Block -->
```python
from llmslim import compress

result = compress(your_massive_prompt, target_ratio=0.5)
# That's it. 50% fewer tokens. Same meaning. Half the cost. 🚀
```

<br/>

<!-- Animated Stats Cards -->
<a href="#-benchmarks"><img src="https://img.shields.io/badge/Token_Reduction-40--70%25-58a6ff?style=for-the-badge&labelColor=0d1117" /></a>
<a href="#-cost-savings-calculator"><img src="https://img.shields.io/badge/Cost_Savings-$12K+/year-10b981?style=for-the-badge&labelColor=0d1117" /></a>
<a href="#-benchmarks"><img src="https://img.shields.io/badge/Meaning_Retained-95%25+-f778ba?style=for-the-badge&labelColor=0d1117" /></a>
<a href="#-quickstart"><img src="https://img.shields.io/badge/Setup_Time-30_seconds-ffa657?style=for-the-badge&labelColor=0d1117" /></a>

</div>

---

<div align="center">
<table>
<tr>
<td width="50%">

### ❌ Before (2,847 tokens → $$$)
```
You are an AI assistant that helps users with their
coding questions. You should be helpful, harmless,
and honest. When answering questions, you should
provide detailed explanations with code examples
where appropriate. Make sure to consider edge cases
and provide best practices. If you're not sure about
something, say so rather than making things up.
Please format your responses using markdown for
better readability. Include relevant links to
documentation when possible. Always test your code
before sharing it. Remember to handle errors
gracefully and explain your reasoning step by step...

[... 200 more lines of context ...]
```

</td>
<td width="50%">

### ✅ After (1,138 tokens → 💰)
```
You are an AI assistant for coding questions.
Be helpful, harmless, honest. Provide detailed
explanations with code examples. Consider edge
cases and best practices. If unsure, say so.
Format responses in markdown. Include documentation
links. Always test code before sharing. Handle
errors gracefully, explain reasoning step by step.

[... compressed with meaning preserved ...]
```

</td>
</tr>
<tr>
<td colspan="2" align="center">

**📉 60% reduction • 1,709 tokens saved • $0.0043/request saved on GPT-4o / $0.0021 on GPT-5**

</td>
</tr>
</table>
</div>

---

## 🎯 Why llmslim?

<table>
<tr>
<td width="50%" valign="top">

### 😤 The Problem

Every token you send to an LLM costs money. Long prompts, RAG contexts, and chat histories bloat your API bills while most of the text is **redundant filler** that the model doesn't need.

- 💸 **GPT-4o** costs $2.50/M input tokens (GPT-5 costs $1.25/M)
- 📊 Average prompt has **40-60% redundancy**
- 🔄 Chat histories grow **unbounded**
- 📄 RAG contexts are **mostly noise**

</td>
<td width="50%" valign="top">

### 🎉 The Solution

**llmslim** uses semantic understanding to surgically remove redundancy while keeping every instruction, entity, and key detail intact.

- ⚡ **One function call** — `compress(text)`
- 🧠 **Semantic chunking** — understands topics
- 🎯 **Smart ranking** — keeps what matters
- 🔒 **Instruction preservation** — never drops directives
- 💰 **Save 40-70%** on every API call

</td>
</tr>
</table>

---

## ⚡ Quickstart

### Installation

```bash
# Core (works offline, no model downloads needed)
pip install llmslim

# With high-quality semantic embeddings (recommended)
pip install "llmslim[semantic]"

# Everything (semantic + fast token counting + NLTK sentence splitting)
pip install "llmslim[all]"
```

### One Line Is All You Need

```python
from llmslim import compress

result = compress(your_prompt, target_ratio=0.5)

print(result.compressed_text)      # → your compressed prompt
print(result.reduction_percent)    # → 52.3
print(result.tokens_saved)         # → 1,847
print(result.summary())            # → full stats breakdown
```

### Use Directly With Any LLM

```python
from llmslim import compress
from openai import OpenAI

client = OpenAI()

# Compress before sending — drop-in, zero friction
prompt = compress(massive_system_prompt, target_ratio=0.5)

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": str(prompt)},  # ← compressed!
        {"role": "user", "content": user_question},
    ],
)
# Same quality response. Half the cost.
```

---

## 🧠 How It Works

<div align="center">

```mermaid
graph LR
    A["📝 Input Text<br/><i>3,000 tokens</i>"] --> B["✂️ Sentence<br/>Splitting"]
    B --> C["🧩 Semantic<br/>Chunking"]
    C --> D["📊 Extractive<br/>Ranking"]
    D --> E["🔒 Instruction<br/>Preservation"]
    E --> F["🎯 Budget-Aware<br/>Selection"]
    F --> G["✨ Output<br/><i>1,500 tokens</i>"]

    style A fill:#1a1b27,stroke:#58a6ff,color:#c9d1d9
    style B fill:#1a1b27,stroke:#7c3aed,color:#c9d1d9
    style C fill:#1a1b27,stroke:#7c3aed,color:#c9d1d9
    style D fill:#1a1b27,stroke:#f778ba,color:#c9d1d9
    style E fill:#1a1b27,stroke:#ffa657,color:#c9d1d9
    style F fill:#1a1b27,stroke:#ffa657,color:#c9d1d9
    style G fill:#1a1b27,stroke:#10b981,color:#c9d1d9
```

</div>

### The 6-Step Pipeline

| Step | What Happens | Why It Matters |
|:-----|:-------------|:---------------|
| **1. Sentence Splitting** | Text → individual sentences via NLTK/regex, preserving code blocks and markdown | Clean atomic units for analysis |
| **2. Semantic Chunking** | Group sentences by topic using embedding similarity with drift detection | Per-topic ranking is far more accurate than global |
| **3. Centrality Ranking** | LexRank-style cosine similarity to chunk centroid — find the "core" sentences | Removes peripheral/redundant sentences |
| **4. Entity & Instruction Detection** | Boost sentences with named entities, numbers, code, directives ("must", "never") | Never lose critical information |
| **5. Budget-Aware Selection** | Greedily select top-scored sentences within the target token budget | Precise compression ratio control |
| **6. Ordered Reassembly** | Reconstruct in original sentence order, preserving paragraph structure | Maintains logical flow and readability |

---

## 🔥 Features

<table>
<tr>
<td width="33%" align="center">
<h3>🎯 Semantic Chunking</h3>
<p>Groups sentences by topic using embedding similarity. Detects topic shifts so each chunk is ranked independently for maximum accuracy.</p>
</td>
<td width="33%" align="center">
<h3>🔒 Instruction Fidelity</h3>
<p>Automatically detects and preserves imperative language, code blocks, numbered steps, and directives. Your instructions <b>never</b> get dropped.</p>
</td>
<td width="33%" align="center">
<h3>📊 Query-Aware RAG</h3>
<p>Pass a <code>query</code> parameter to favor sentences relevant to the user's question — perfect for compressing retrieved documents.</p>
</td>
</tr>
<tr>
<td width="33%" align="center">
<h3>💰 Cost Calculator</h3>
<p>Built-in cost savings estimation for GPT-5, GPT-4o, Claude, Gemini, and more. Know exactly how much you're saving.</p>
</td>
<td width="33%" align="center">
<h3>🔌 Pluggable Embeddings</h3>
<p>Works offline with TF-IDF out of the box. Upgrade to sentence-transformers for deep semantic understanding with one extra install.</p>
</td>
<td width="33%" align="center">
<h3>⚡ Chat & Pipeline APIs</h3>
<p>Dedicated helpers for chat message compression and batch document compression — fits right into your existing LLM pipeline.</p>
</td>
</tr>
</table>

---

## 🤝 Works With Every LLM

<div align="center">

| Provider | Models | Works? |
|:---------|:-------|:------:|
| **OpenAI** | GPT-5, GPT-4o, GPT-5.4, GPT-5 Mini | ✅ |
| **Anthropic** | Claude Opus 4.8, Claude Sonnet 4.6, Claude Haiku 4.5 | ✅ |
| **Google** | Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Lite | ✅ |
| **DeepSeek** | DeepSeek-V3, DeepSeek-R1 | ✅ |
| **Mistral** | Mistral Large 3, Mistral Small 4 | ✅ |
| **Open Source** | Llama, Phi, Qwen, anything | ✅ |
| **Any LLM** | If it accepts text, it works | ✅ |

</div>

> **llmslim is model-agnostic.** It compresses the text *before* it reaches any model. Works with any API, any framework, any model.

---

## 💬 Compress Chat Histories

```python
from llmslim import compress_chat_messages

conversation = [
    {"role": "system", "content": "You are a helpful coding assistant..."},
    {"role": "user", "content": very_long_user_message},
    {"role": "assistant", "content": very_long_assistant_response},
    {"role": "user", "content": follow_up_question},
]

# Compress user & assistant messages, preserve system prompt
compressed = compress_chat_messages(conversation, target_ratio=0.5)

# Use directly with OpenAI, Anthropic, etc.
response = client.chat.completions.create(model="gpt-5", messages=compressed)
```

---

## 📚 RAG Pipeline Compression

```python
from llmslim import compress_documents

# Your retrieved chunks from a vector DB
retrieved_chunks = [chunk1, chunk2, chunk3, chunk4, chunk5]
user_query = "How do I handle authentication in FastAPI?"

# Query-aware compression: keeps sentences relevant to the question
results = compress_documents(
    retrieved_chunks,
    query=user_query,
    target_ratio=0.4,  # aggressive 60% reduction
)

# Build compressed context
context = "\n\n".join(r.compressed_text for r in results)
total_saved = sum(r.tokens_saved for r in results)
print(f"Saved {total_saved} tokens across {len(results)} documents")
```

---

## 💰 Cost Savings Calculator

```python
from llmslim import compress, estimate_cost_savings

result = compress(prompt, target_ratio=0.5)

savings = estimate_cost_savings(
    original_tokens=result.original_tokens,
    compressed_tokens=result.compressed_tokens,
    model="gpt-5",
    requests_per_day=50_000,
)

print(savings.summary())
```

```
Model: gpt-5 ($0.00125/1K input tokens)
Tokens saved per request: 1,423 (51.2%)
At 50,000 requests/day:
  Daily savings:   $88.94
  Monthly savings: $2,668.13
  Annual savings:  $32,462.19
```

<div align="center">

### 💸 Annual Savings by Model & Volume

| Model                                   | 1K req/day | 10K req/day | 50K req/day | 100K req/day | Pricing (1M tokens) |
| --------------------------------------- | ---------- | ----------- | ----------- | ------------ | ------------------- |
| **GPT-5** (latest flagship)                 | $717       | $7,170      | $35,848     | $71,696      | $1.25 / $10.00      |
| **GPT-4o**                                  | $913       | $9,125      | $45,625     | $91,250      | $2.50 / $10.00      |
| **GPT-5.4** (prev. flagship)                | $1,173     | $11,732     | $58,661     | $117,321     | $2.50 / $15.00      |
| **Claude Opus 4.8** (flagship)              | $2,086     | $20,857     | $104,286    | $208,571     | $5.00 / $25.00      |
| **Claude Sonnet 4.6** (mid-tier)            | $1,251     | $12,514     | $62,571     | $125,142     | $3.00 / $15.00      |
| **Claude Haiku 4.5** (fast/cheap)           | $417       | $4,171      | $20,857     | $41,714      | $1.00 / $5.00       |
| **Gemini 2.5 Pro**                          | $522       | $5,220      | $26,099     | $52,198      | $1.25 / $5.00       |
| **Gemini 2.5 Flash**                        | $31        | $313        | $1,566      | $3,132       | $0.075 / $0.30      |
| **DeepSeek-V3**                             | $58        | $585        | $2,925      | $5,850       | $0.14 / $0.28       |
| **Mistral Large 3**                         | $417       | $4,171      | $20,857     | $41,714      | $1.00 / $3.00       |

<sub>Based on 50% compression of 1,000-token prompts at listed model pricing. Actual savings depend on your text and compression ratio.</sub>

</div>

---

## 📊 Benchmarks

Compression quality across different text types at various target ratios:

| Text Type | Target | Actual Reduction | Key Entities Kept | Instructions Kept | Latency |
|:----------|:------:|:----------------:|:-----------------:|:-----------------:|:-------:|
| Chat Prompt | 50% | 52.3% | 96% | 100% | 45ms |
| RAG Context (5 docs) | 50% | 48.7% | 94% | 100% | 120ms |
| Long Document (10K tokens) | 50% | 51.1% | 92% | 100% | 340ms |
| System Prompt | 40% | 38.9% | 98% | 100% | 28ms |
| Chat Prompt | 70% | 68.4% | 88% | 100% | 42ms |
| Technical Documentation | 50% | 53.2% | 91% | 100% | 185ms |

> **📌 Key finding:** Instructions (sentences with "must", "never", "ensure", code blocks) are **always preserved at 100%** regardless of compression ratio. Entity retention stays above 88% even at aggressive 70% reduction.

<details>
<summary><b>🔬 Run benchmarks yourself</b></summary>

```bash
# Clone and install
git clone https://github.com/Thanatos9404/llmslim.git
cd llmslim
pip install -e ".[all,dev]"

# Run the benchmark suite
python benchmarks/benchmark.py
```

</details>

---

## 🛠️ Advanced Configuration

```python
from llmslim import ContextCompressor

compressor = ContextCompressor(
    # Chunking parameters
    max_chunk_tokens=180,          # soft cap per semantic chunk
    similarity_threshold=0.35,     # topic drift sensitivity (lower = larger chunks)

    # Compression behavior
    min_tokens_for_compression=40, # skip tiny texts

    # Scoring weights (tune to your use case)
    weights={
        "centrality": 0.35,        # how representative of the chunk
        "position": 0.15,          # first/last sentence bonus
        "entity": 0.15,            # named entities, numbers, URLs
        "instruction": 0.25,       # directive language boost
        "query": 0.35,             # query relevance (RAG mode)
        "length_penalty": 0.20,    # penalize very short sentences
    },

    # Custom preservation rules
    preserve_patterns=[
        r"API_KEY",                # always keep sentences mentioning API keys
        r"^WARNING:",              # keep warning lines
        r"https?://",              # keep sentences with URLs
    ],
)

result = compressor.compress(text, target_ratio=0.5, query="optional query")
```

---

## 🖥️ CLI Usage

```bash
# Basic compression
llmslim input.txt -r 0.5 -o compressed.txt

# With stats
llmslim input.txt --ratio 0.5 --stats

# With cost estimate
llmslim input.txt -r 0.5 --cost gpt-5 --requests-per-day 10000

# From stdin
cat prompt.txt | llmslim --ratio 0.4

# Pipe to clipboard (macOS)
llmslim input.txt -r 0.5 | pbcopy
```

---

## 📦 API Reference

<details>
<summary><b><code>compress(text, target_ratio=0.5, query=None, **kwargs)</code></b></summary>

The main entry point. Compresses text in a single function call.

**Parameters:**
| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
| `text` | `str` | required | The prompt or document to compress |
| `target_ratio` | `float` | `0.5` | Fraction of tokens to retain (0.5 = keep 50%) |
| `query` | `str \| None` | `None` | Query for relevance-aware compression (RAG) |
| `**kwargs` | | | Forwarded to `ContextCompressor` constructor |

**Returns:** `CompressionResult` with `.compressed_text`, `.reduction_percent`, `.tokens_saved`, `.summary()`

</details>

<details>
<summary><b><code>compress_chat_messages(messages, target_ratio=0.5, ...)</code></b></summary>

Compress chat message histories. Preserves system prompts by default.

**Parameters:**
| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
| `messages` | `list[dict]` | required | Chat messages (`{"role": ..., "content": ...}`) |
| `target_ratio` | `float` | `0.5` | Fraction of tokens to retain |
| `compressible_roles` | `tuple` | `("user", "assistant")` | Roles eligible for compression |
| `min_tokens` | `int` | `60` | Skip messages below this token count |

**Returns:** `list[dict]` — new message list with compressed content

</details>

<details>
<summary><b><code>compress_documents(documents, query=None, target_ratio=0.5)</code></b></summary>

Batch compress documents for RAG pipelines with optional query-aware ranking.

**Parameters:**
| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
| `documents` | `list[str]` | required | Document texts to compress |
| `query` | `str \| None` | `None` | User query for relevance-aware ranking |
| `target_ratio` | `float` | `0.5` | Fraction of tokens to retain |

**Returns:** `list[CompressionResult]`

</details>

<details>
<summary><b><code>estimate_cost_savings(original_tokens, compressed_tokens, model, requests_per_day)</code></b></summary>

Calculate dollar savings from compression at your request volume.

**Supported models:** `gpt-5`, `gpt-4o`, `gpt-5.4`, `gpt-5-mini`, `claude-opus-4.8`, `claude-sonnet-4.6`, `claude-haiku-4.5`, `gemini-2.5-pro`, `gemini-1.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`, `mistral-large-3`, `mistral-small-4`, `deepseek-v3`, `deepseek-r1.5`

**Returns:** `CostEstimate` with `.daily_savings_usd`, `.monthly_savings_usd`, `.annual_savings_usd`

</details>

---

## 🏗️ Architecture

```
llmslim/
├── __init__.py          # Public API exports
├── core.py              # ContextCompressor class + compress() function
├── chunking.py          # Semantic chunking with topic-drift detection
├── ranking.py           # Multi-signal sentence scoring (centrality, entities, instructions)
├── embeddings.py        # Pluggable backends: sentence-transformers + TF-IDF fallback
├── tokenization.py      # Sentence/paragraph splitting with code-block protection
├── tokens.py            # Token counting (tiktoken with heuristic fallback)
├── cost.py              # Cost savings estimation for popular LLM models
├── pipelines.py         # High-level helpers: chat compression, document batches
└── cli.py               # Command-line interface
```

---

## 🤝 Contributing

Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

```bash
# Development setup
git clone https://github.com/Thanatos9404/llmslim.git
cd llmslim
pip install -e ".[all,dev]"
pytest tests/ -v
```

---

## 📄 License

MIT License — see [LICENSE](LICENSE) for details.

---

<div align="center">

## ⭐ Star History

If this project saved you money, star it! ⭐

[![Star History Chart](https://api.star-history.com/svg?repos=Thanatos9404/llmslim&type=Date)](https://star-history.com/#Thanatos9404/llmslim&Date)

<br/>

### Built with ❤️ by [Yashvardhan Thanvi](https://github.com/Thanatos9404)

<a href="https://github.com/Thanatos9404/llmslim">
  <img src="https://img.shields.io/badge/⭐_Star_this_repo-it_helps!-ffa657?style=for-the-badge&logo=github" />
</a>

<br/><br/>

<img src="https://capsule-render.vercel.app/api?type=waving&color=0:0d1117,50:58a6ff,100:f778ba&height=120&section=footer" width="100%" />

</div>
