Metadata-Version: 2.4
Name: tokenpack-rag
Version: 0.1.2
Summary: Query-aware semantic chunk selection under LLM context-window budgets.
Author-email: Metehan Kizilcik <metekizilcik@gmail.com>
License: Business Source License 1.1
        
        Licensor: TokenPack Contributors
        Licensed Work: TokenPack
        Additional Use Grant: None
        Change Date: 2030-05-10
        Change License: Apache License, Version 2.0
        
        License text copyright © 2017 MariaDB Corporation Ab, All Rights Reserved.
        "Business Source License" is a trademark of MariaDB Corporation Ab.
        
        Terms
        
        The Licensor hereby grants you the right to copy, modify, create derivative
        works, redistribute, and make non-production use of the Licensed Work. The
        Licensor may make an Additional Use Grant, above, permitting limited production
        use.
        
        Effective on the Change Date, or the fourth anniversary of the first publicly
        available distribution of a specific version of the Licensed Work under this
        License, whichever comes first, the Licensor hereby grants you rights under the
        terms of the Change License, and the rights granted in the paragraph above
        terminate.
        
        If your use of the Licensed Work does not comply with the requirements currently
        in effect as described in this License, you must purchase a commercial license
        from the Licensor, its affiliated entities, or authorized resellers, or you must
        refrain from using the Licensed Work.
        
        All copies of the original and modified Licensed Work, and derivative works of
        the Licensed Work, are subject to this License. This License applies separately
        for each version of the Licensed Work and the Change Date may vary for each
        version of the Licensed Work released by Licensor.
        
        You must conspicuously display this License on each original or modified copy of
        the Licensed Work. If you receive the Licensed Work in original or modified form
        from a third party, the terms and conditions set forth in this License apply to
        your use of that work.
        
        Any use of the Licensed Work in violation of this License will automatically
        terminate your rights under this License for the current and all other versions
        of the Licensed Work.
        
        This License does not grant you any right in any trademark or logo of Licensor
        or its affiliates (provided that you may use a trademark or logo of Licensor as
        expressly required by this License).
        
        TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON AN
        "AS IS" BASIS. LICENSOR HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS
        OR IMPLIED, INCLUDING (WITHOUT LIMITATION) WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, AND TITLE.
        
        MariaDB hereby grants you permission to use this License's text to license your
        works, and to refer to it using the trademark "Business Source License", as long
        as you comply with the Covenants of Licensor below.
        
        Covenants of Licensor
        
        In consideration of the right to use this License's text and the "Business
        Source License" name and trademark, Licensor covenants to MariaDB, and to all
        other recipients of the licensed work to be provided by Licensor:
        
        1. To specify as the Change License the GPL Version 2.0 or any later version, or
           a license that is compatible with GPL Version 2.0 or a later version, where
           "compatible" means that software provided under the Change License can be
           included in a program with software provided under GPL Version 2.0 or a later
           version. Licensor may specify additional Change Licenses without limitation.
        
        2. To either: (a) specify an additional grant of rights to use that does not
           impose any additional restriction on the right granted in this License, as the
           Additional Use Grant; or (b) insert the text "None".
        
        3. To specify a Change Date.
        
        4. Not to modify this License in any other way.
        
Project-URL: Homepage, https://github.com/mo-tunn/TokenPack
Project-URL: Repository, https://github.com/mo-tunn/TokenPack
Project-URL: Paper, https://github.com/mo-tunn/TokenPack/blob/main/submission/TokenPack-paper.pdf
Keywords: rag,llm,context-compression,retrieval,knapsack
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sentence-transformers>=3.0.0
Provides-Extra: reranking
Requires-Dist: sentence-transformers>=3.0.0; extra == "reranking"
Provides-Extra: pdf
Requires-Dist: PyMuPDF>=1.24.0; extra == "pdf"
Requires-Dist: pypdf>=4.0.0; extra == "pdf"
Provides-Extra: tokens
Requires-Dist: tiktoken>=0.7.0; extra == "tokens"
Provides-Extra: compression
Requires-Dist: llmlingua>=0.2.2; extra == "compression"
Provides-Extra: modal
Requires-Dist: modal>=0.64.0; extra == "modal"
Requires-Dist: pandas>=2.0.0; extra == "modal"
Provides-Extra: mcp
Requires-Dist: mcp>=1.2.0; extra == "mcp"
Provides-Extra: office
Requires-Dist: python-docx>=1.1.0; extra == "office"
Requires-Dist: python-pptx>=0.6.23; extra == "office"
Requires-Dist: openpyxl>=3.1.0; extra == "office"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: coverage>=7.0.0; extra == "dev"
Requires-Dist: mcp>=1.2.0; extra == "dev"
Dynamic: license-file

<h1 align="center">TokenPack-RAG</h1>

<p align="center">
  <strong>Turn long files into compact, evidence-dense LLM context.</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/tokenpack-rag/"><img src="https://img.shields.io/pypi/v/tokenpack-rag?label=PyPI" alt="PyPI version"></a>
  <a href="https://pypi.org/project/tokenpack-rag/"><img src="https://img.shields.io/pypi/pyversions/tokenpack-rag" alt="Python versions"></a>
  <a href="https://github.com/mo-tunn/TokenPack/actions/workflows/tests.yml"><img src="https://github.com/mo-tunn/TokenPack/actions/workflows/tests.yml/badge.svg" alt="Tests"></a>
  <img src="https://img.shields.io/badge/coverage-74%25-2f6654" alt="Package coverage">
  <a href="https://pypi.org/project/tokenpack-rag/"><img src="https://img.shields.io/badge/downloads-PyPI-2f6654" alt="PyPI downloads"></a>
  <img src="https://img.shields.io/badge/MCP-local%20stdio-7b5d46" alt="Local MCP server">
  <img src="https://img.shields.io/badge/inputs-PDF%20%7C%20Office%20%7C%20Code%20%7C%20Data-476a8a" alt="Supported inputs">
  <img src="https://img.shields.io/badge/license-BSL--1.1-7b5d46" alt="Business Source License 1.1">
</p>

TokenPack-RAG selects the most useful parts of documents, code, PDFs, tables, and folders under a strict token budget. It does **not** call an LLM during packing: it runs local embeddings, evidence scoring, and budget-aware selection, then writes a Markdown context file you can give to any LLM or agent.

In plain English, TokenPack-RAG does three things:

| Step | What happens |
|---|---|
| **Split intelligently** | Breaks the source into chunks that respect headings, paragraphs, code blocks, and semantic shifts. |
| **Score by evidence value** | Ranks chunks by how useful they look for your query, using semantic similarity, keyword support, document position, and structure signals. |
| **Pack the best context** | Fills your token budget with the highest-value chunks first, avoiding the waste of blindly pasting everything. |

Internally, the default pipeline is:

```text
structure-aware semantic chunks + evidence-hybrid scoring + hybrid-greedy packing
```
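
To make the packing step concrete, here is a toy greedy pass in plain Python. It illustrates the budget-aware idea only; it is not the package's hybrid-greedy selector, and the `(score, tokens)` pairs are invented:

```python
# Toy greedy packing, not TokenPack's actual hybrid-greedy selector.
# Each candidate is (score, token_count); scores come from chunk scoring.
candidates = [(0.91, 240), (0.88, 310), (0.54, 120), (0.35, 500)]
budget = 600

packed, used = [], 0
for score, tokens in sorted(candidates, reverse=True):
    if used + tokens <= budget:  # keep a chunk only if it still fits
        packed.append((score, tokens))
        used += tokens

print(used, packed)  # 550 [(0.91, 240), (0.88, 310)]
```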

<p align="center">
  <img src="assets/tokenpack-headline-result.png" alt="TokenPack + LongLLMLingua saves 74.6% context tokens, keeps a +15.6% relative pilot lift, gives a 3.90x latency speedup, and saves about $1.86 per 1M input tokens">
</p>

## Why Use It

Long-context LLMs make it tempting to paste everything into the prompt. In practice, that is expensive, slow, and often noisy. Naive RAG has the opposite problem: top-k retrieval can collect locally relevant chunks while missing the best global use of a fixed token budget.

TokenPack-RAG is built for that middle layer:

- Turns a file or folder into a compact, LLM-ready context file with one command.
- Selects globally useful evidence under a token budget instead of blindly taking top-k chunks.
- Reduces redundant or low-utility context before it reaches the LLM.
- Helps agents work with large local workspaces through MCP without uploading everything.
- Supports broad real-world inputs: docs, code, PDFs, HTML, CSV/JSON, and Office files.
- Can optionally run LLMLingua / LongLLMLingua after evidence selection for extra compression.

## Install

Basic install:

```bash
pip install tokenpack-rag
```

Recommended document install:

```bash
pip install "tokenpack-rag[pdf,office,tokens]"
```

Agent/MCP install:

```bash
pip install "tokenpack-rag[mcp,pdf,office,tokens]"
```

Development install:

```bash
git clone https://github.com/mo-tunn/TokenPack.git
cd TokenPack
pip install -e ".[pdf,office,tokens,compression,mcp,dev]"
```

TokenPack-RAG uses `sentence-transformers/all-MiniLM-L6-v2` as the default embedding model. The CLI tries local model files first and prints progress while loading; first-time users may see a Hugging Face download unless they pass `--offline-models`.
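
If you want to sanity-check that model on its own, a minimal sentence-transformers snippet (independent of TokenPack's embedder wrapper) looks like this:

```python
from sentence_transformers import SentenceTransformer

# Resolves from the local Hugging Face cache when present;
# otherwise the first call downloads the model.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(
    ["a chunk of document text", "another chunk"],
    normalize_embeddings=True,  # unit vectors: cosine similarity becomes a dot product
)
print(embeddings.shape)  # (2, 384) for all-MiniLM-L6-v2
```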

## 30-Second Start

Pick the path that matches what you want:

| Goal | Use this | Output |
|---|---|---|
| **Fast default** | Selection-only packing. No LLM call, no prompt-compression model. | `paper-tp.md` |
| **Best combination** | TokenPack selection + LongLLMLingua compression for the strongest current context-saving setup. | smaller `paper-tp.md` |
| **Folder pack** | Pack a whole project or document folder into one context file. | `docs-tp.md` |

**Fast default**

Use this first for most documents:

```bash
tokenpack-rag pack paper.pdf --query "What are the main contributions?"
```

Writes:

```text
paper-tp.md
```

**Best combination**

Use this when you want the most aggressive current setup from the paper-style experiments: select the best evidence first, then compress the selected context with LongLLMLingua.

```bash
tokenpack-rag pack paper.pdf \
  --query "What evidence supports the main claim?" \
  --compress llmlingua \
  --longllmlingua \
  --compression-rate 0.50 \
  --overwrite
```

This is the setup behind the headline result: about **74.6% context-token saving**, **3.90x mean-latency speedup**, and roughly **$1.86 saved per 1M input tokens** at the paper's illustrative `$2.50 / 1M input tokens` price, while retaining TokenPack's **+15.6% relative pilot lift** over full-context prompting. It requires the `compression` extra and a local/cached compression model unless you intentionally add `--allow-download`.
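
The dollar figure is simple arithmetic on the token saving; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the headline cost figure.
price_per_million = 2.50            # illustrative $ per 1M input tokens
saving = 0.746                      # 74.6% context-token saving
billed = 1_000_000 * (1 - saving)   # tokens still paid for
saved = 1_000_000 * saving * price_per_million / 1_000_000
print(f"{billed:,.0f} tokens billed, ${saved:.2f} saved per 1M input tokens")
# 254,000 tokens billed, $1.86 saved per 1M input tokens
```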

**Folder pack**

Use this for a repo, notes folder, or mixed document set:

```bash
tokenpack-rag pack docs/ --query "Summarize the design decisions in this project."
```

Writes:

```text
docs-tp.md
```

**Manual budget**

Use this when you already know your target context size:

```bash
tokenpack-rag pack paper.pdf \
  --query "What evidence supports the main claim?" \
  --budget 32000 \
  --overwrite
```

The output is a packed Markdown context file, not a modified PDF. You can paste it into a chat model, upload it to your own LLM workflow, or let an agent read it through MCP.

## Results Snapshot

<p align="center">
  <img src="assets/tokenpack-results-table.png" alt="TokenPack-RAG results table">
</p>

<details>
<summary>Technical result details behind the summary</summary>

| Setting | Technical Result |
|---|---|
| Relevant evidence kept | TokenPack preserves 93.4% of QASPER evidence vs 71.3% for compression-only. |
| All required evidence kept | TokenPack keeps complete evidence for 87.0% of QASPER questions vs 12.0% for compression-only. |
| Selection + compression | TokenPack + LLMLingua-2 reaches 58.4% context saving while keeping 85.1% of required evidence. |
| Pilot answer accuracy | On an 83-case LongBench v2 pilot, TokenPack improves relative accuracy by 15.6% over full-context prompting while saving 50.6% context. |
| Aggressive cascade | TokenPack + LongLLMLingua reaches 74.6% context saving while retaining TokenPack's +15.6% relative pilot lift over full context. |
| Latency impact | The same cascade reduces mean total latency from 4.140s to 1.060s, a 3.90x speedup in the pilot. |
| Cost-scale example | At the paper's illustrative $2.50 per 1M input-token price, the cascade reduces 1M paid input tokens to about 254k, saving about $1.86. |

</details>

The practical takeaway: pack the useful evidence first, then optionally compress it. This is different from blindly compressing the whole retrieved context.

For the full methodology, tables, limitations, and experiment details, read the paper: [`submission/TokenPack-paper.pdf`](submission/TokenPack-paper.pdf).

## Use With Agents / MCP

Run TokenPack-RAG as a local stdio MCP server:

```bash
tokenpack-rag-mcp --workspace /path/to/project
```

Example MCP config:

```json
{
  "mcpServers": {
    "tokenpack-rag": {
      "command": "tokenpack-rag-mcp",
      "args": ["--workspace", "/path/to/project"]
    }
  }
}
```

Or use `uvx` without a permanent install:

```json
{
  "mcpServers": {
    "tokenpack-rag": {
      "command": "uvx",
      "args": [
        "--from",
        "tokenpack-rag[mcp,pdf,office,tokens]",
        "tokenpack-rag-mcp",
        "--workspace",
        "/path/to/project"
      ]
    }
  }
}
```

MCP tools:

| Tool | Purpose |
|---|---|
| `pack_context` | Packs a file or folder into Markdown context and writes the `-tp.md` artifact. |
| `read_packed_context` | Reads a packed context artifact, optionally in slices for large files. |

By default the MCP server can only read and write inside `--workspace`. Use `--allow-any-path` only for trusted local setups.
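
For a quick programmatic smoke test outside an agent, the official `mcp` Python SDK can drive the server over stdio. This sketch only lists the available tools; the exact tool schemas come from the server itself:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(
        command="tokenpack-rag-mcp",
        args=["--workspace", "/path/to/project"],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Expect pack_context and read_packed_context here.
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```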

## Supported Inputs

TokenPack-RAG accepts a single file or a folder. Folder inputs are scanned recursively and unsupported binary/media files are skipped.

| Category | Extensions |
|---|---|
| Text and docs | `.txt`, `.text`, `.md`, `.markdown`, `.rst`, `.adoc`, `.tex`, `.log` |
| PDF | `.pdf` with the `pdf` extra |
| Web | `.html`, `.htm` |
| Data/config | `.json`, `.jsonl`, `.csv`, `.tsv`, `.yaml`, `.yml`, `.toml` |
| Office | `.docx`, `.pptx`, `.xlsx` with the `office` extra |
| Code | `.py`, `.js`, `.jsx`, `.ts`, `.tsx`, `.java`, `.go`, `.rs`, `.c`, `.cpp`, `.cs`, `.php`, `.rb`, `.swift`, `.kt`, `.scala`, `.sh`, `.ps1`, `.sql`, `.css`, `.xml`, and related variants |
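
The folder walk behaves roughly like the sketch below. This is an illustration of the skip-unsupported behavior, not the package's internal scanner, and `SUPPORTED` holds only a subset of the extensions above:

```python
from pathlib import Path

# Illustrative subset of the supported extensions listed above.
SUPPORTED = {".md", ".txt", ".py", ".pdf", ".html", ".json", ".csv"}

def iter_supported(root: str):
    """Yield supported files under root, skipping everything else."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path

for path in iter_supported("docs/"):
    print(path)
```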

## Auto Budget

`--budget` is optional. When omitted, TokenPack-RAG estimates a context budget from the source:

```text
source_tokens = sum(chunk.token_count for chunk in index.chunks)
raw_budget = ceil(source_tokens * 0.50)
budget = clamp(raw_budget, min_budget=1200, max_budget=64000)
reserve_output = min(4000, max(512, int(budget * 0.10)))
selection_budget = budget - reserve_output
```
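
The same formula as runnable Python (a direct transcription of the pseudocode, not the package's internal function):

```python
import math

def auto_budget(source_tokens: int, ratio: float = 0.50,
                min_budget: int = 1200, max_budget: int = 64000):
    raw = math.ceil(source_tokens * ratio)
    budget = max(min_budget, min(raw, max_budget))
    reserve = min(4000, max(512, int(budget * 0.10)))
    return budget, reserve, budget - reserve

# 142,000 source tokens -> (64000, 4000, 60000),
# matching the terminal summary below.
print(auto_budget(142_000))
```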

Example terminal summary:

```text
Source: paper.pdf
Output: paper-tp.md
Source tokens: 142,000
Auto budget: 64,000 tokens (ratio=50%, capped by max-budget)
Reserved for answer: 4,000
Selection budget: 60,000
Selected: 188 chunks / 59,240 tokens
```

Useful controls:

```bash
tokenpack-rag pack paper.pdf --query "..." --budget-ratio 0.35
tokenpack-rag pack paper.pdf --query "..." --max-budget 128000
tokenpack-rag pack paper.pdf --query "..." --reserve-output 2000
```

## Output Files

Default output paths:

| Source | Output |
|---|---|
| `paper.pdf` | `paper-tp.md` |
| `notes.txt` | `notes-tp.md` |
| `docs/` | `docs-tp.md` |

Existing outputs are protected:

```bash
tokenpack-rag pack paper.pdf --query "..."
```

If `paper-tp.md` exists, the command stops. Use:

```bash
tokenpack-rag pack paper.pdf --query "..." --overwrite
tokenpack-rag pack paper.pdf --query "..." --out packed-context.md
```

Internal artifacts go under `.tokenpack/runs/<timestamp>/` unless paths are provided:

```bash
tokenpack-rag pack paper.pdf \
  --query "..." \
  --index-out .tokenpack/paper.index.json \
  --selection-out paper-tp.selection.json
```

The default Markdown is intentionally clean: it keeps the query, source, selected-token summary, and source/page markers, but leaves chunk ids, token counts, and artifact paths out of the LLM context. Use debug output only when you are inspecting the pipeline:

```bash
tokenpack-rag pack paper.pdf --query "..." --output-detail debug
tokenpack-rag pack paper.pdf --query "..." --output-detail none
```

## Optional Compression

TokenPack-RAG is selection-first by default. You can optionally compress the selected evidence:

```bash
tokenpack-rag pack paper.pdf \
  --query "What evidence supports the main claim?" \
  --compress llmlingua \
  --compression-rate 0.85
```

LongLLMLingua-style query-conditioned compression:

```bash
tokenpack-rag pack paper.pdf \
  --query "What evidence supports the main claim?" \
  --compress llmlingua \
  --longllmlingua \
  --compression-rate 0.85
```

By default, compression models are expected to be cached locally. Add `--allow-download` only when you intentionally want Hugging Face downloads during compression.
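
If you want to see what this stage does in isolation, llmlingua's documented `PromptCompressor` interface can be exercised directly. This is a standalone sketch of the library, not TokenPack's internal wiring, and it will download the listed model unless it is already cached:

```python
from llmlingua import PromptCompressor

# LLMLingua-2 compressor, as documented by the LLMLingua project.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
result = compressor.compress_prompt(
    ["...selected evidence chunks go here..."],
    question="What evidence supports the main claim?",
    rate=0.85,  # keep roughly 85% of the tokens
)
print(result["compressed_prompt"])
```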

## Python API

```python
from tokenpack.embeddings import make_embedder
from tokenpack.pipeline import ingest_path
from tokenpack.scoring import score_chunks
from tokenpack.selectors import select_chunks

# Build the default local embedder, then chunk and embed the source file.
embedder = make_embedder()
index = ingest_path(
    "README.md",
    ".tokenpack/readme-index.json",
    embedder=embedder,
    chunker_name="structure-aware",
    target_tokens=250,
    min_tokens=40,
    max_tokens=320,
)

# Embed the query with the same model used for the chunks.
query = "How does TokenPack reduce LLM context cost?"
query_embedding = embedder.embed([query])[0]

# Score each chunk against the query using the evidence-hybrid signals.
scored = score_chunks(
    query_embedding,
    index.chunks,
    index.embeddings,
    scoring="evidence-hybrid",
    query_text=query,
    redundancy_penalty=0.35,
)

# Pack the highest-value chunks under a 3,000-token budget.
result = select_chunks(
    scored,
    strategy="budget-top-k",
    budget=3000,
    candidate_pool=250,
)

print(result.used_tokens, [item.chunk.id for item in result.selected])
```
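
To hand the selection to an LLM yourself, joining the selected chunk contents is enough; note that the `text` attribute below is an assumption about the chunk object rather than a documented field:

```python
# Continues the example above; `.text` is assumed, not documented.
context = "\n\n".join(item.chunk.text for item in result.selected)
print(context[:500])
```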

## Advanced CLI

The one-command `pack` workflow is the main user-facing interface. Lower-level commands remain available for experiments and reproducible paper runs.

```bash
tokenpack-rag ingest README.md --index .tokenpack/readme-index.json

tokenpack-rag select \
  --index .tokenpack/readme-index.json \
  --query "How does TokenPack reduce LLM context cost?" \
  --budget 3000 \
  --reserve-output 500 \
  --output .tokenpack/selection.json

tokenpack-rag export-context \
  --selection .tokenpack/selection.json \
  --output .tokenpack/context.txt
```

Defaults:

```text
chunker: structure-aware semantic boundaries
chunk-size-preset: low-budget
scoring: evidence-hybrid
selector: budget-top-k (TokenPack hybrid-greedy)
```

Historical selectors such as `knapsack` and `knapsack-redundancy`, along with `semantic-threshold` chunking, remain available for ablation work, but the main pipeline is hybrid-greedy.

## Reproduce Paper Runs

LongBench v2 Modal pilot used in the current paper:

```bash
python -m modal run submission/longbench_eval/app.py::build_and_run \
  --output-dir submission/results/longbench_v2_modal_hybrid_greedy_83_latency \
  --limit 83 \
  --source-min-tokens 8000 \
  --source-max-tokens 24000 \
  --max-scanned 503 \
  --model-id Qwen/Qwen2.5-14B-Instruct \
  --batch-size 1 \
  --context-order score-then-source \
  --latency-mode
```

See [`submission/source_code_manifest.md`](submission/source_code_manifest.md) for the full artifact map.

## Repository Layout

```text
src/tokenpack/                     Python package and CLI implementation
tests/                             Unit and smoke tests
assets/                            README visual result assets
examples/                          Small local examples for the CLI
submission/paper/                  LaTeX paper source, tables, figures
submission/experiments/            QASPER, LongBench, compression, and ablation scripts
submission/results/                Paper result artifacts and readouts
submission/longbench_eval/         Modal LongBench v2 generation harness
submission/modal_generation_eval/  Modal QASPER generation/judge harness
```

## Notes

- The default workflow is output-first: create a packed context file and send that file to your own LLM.
- Ollama is not required for `pack`; MCP support is optional and local-first.
- Evidence-hybrid scoring weights are engineering defaults. The paper calls out weight calibration as future work.

## Limitations

- The LLM answer-quality experiments are pilot-scale and were not fully human-reviewed.
- QASPER results primarily measure evidence preservation, not end-to-end human-judged answer quality.
- LongBench v2 results are descriptive pilot results, not a statistically definitive benchmark claim.
- TokenPack-RAG improves context selection, but it cannot recover information that is missing from the source or unreadable after extraction.
- The default scoring weights are engineering defaults; stronger calibration is future work.

## License

TokenPack-RAG is licensed under the Business Source License 1.1. See [`LICENSE`](LICENSE).

## Citation

If you use TokenPack-RAG in research, cite the paper PDF in [`submission/TokenPack-paper.pdf`](submission/TokenPack-paper.pdf). A BibTeX entry will be added when the public preprint is available.
