Metadata-Version: 2.4
Name: aerochunk
Version: 0.2.0
Summary: A zero-framework, low-latency semantic chunker with visual HTML debugging.
Author: Harsh Rana
Project-URL: Homepage, https://github.com/HarshStats/aerochunk
Project-URL: Bug Tracker, https://github.com/HarshStats/aerochunk/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: sentence-transformers>=2.0.0
Requires-Dist: numpy>=1.21.0

<!-- Shields/Badges -->
![PyPI Version](https://img.shields.io/pypi/v/aerochunk.svg)
![Python Version](https://img.shields.io/pypi/pyversions/aerochunk.svg)
![License](https://img.shields.io/pypi/l/aerochunk.svg)

# AeroChunk: Stateless Semantic Chunker for Production RAG

## 1. Overview

AeroChunk is a high-performance, stateless semantic text chunker engineered for production-grade Retrieval-Augmented Generation (RAG) systems. It provides a robust, thread-safe solution for segmenting text into conceptually coherent blocks, operating with a minimal memory footprint and zero external dependencies or API calls.

The library has been re-architected from a prototype to a production-ready tool, replacing fragile regex-based sentence splitting with a sophisticated NLP-powered pipeline. It is designed for high-concurrency environments (e.g., FastAPI, Django) where statelessness and computational efficiency are critical.

## 2. Quick Start

Install the library and its "enterprise" dependencies, which include `spacy` and the required NLP model.

```bash
pip install "aerochunk[enterprise]"
python -m spacy download en_core_web_sm
```

The `chunk_text` method now returns a `ChunkingResult` object, providing access to the chunks and other metadata in a stateless manner.

```python
from aerochunk import AeroChunker

text = """Machine learning (ML) is a field of study in artificial intelligence. It is concerned with the development and study of statistical algorithms that can learn from data. For example, Mr. Smith noted a 45.5% increase. These algorithms generalize to unseen data. Recently, artificial neural networks have surpassed many previous approaches in performance."""

# Initialize the chunker for stateless, production use
chunker = AeroChunker(enterprise=True)

# Process the text; returns a ChunkingResult object
result = chunker.chunk_text(text)

# Access the chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk}\n")

# The result object also contains the similarity scores and sentence boundaries
# print(result.sentences)
# print(result.similarities)
```

## 3. Core Architecture (v0.2.1+ "Enterprise" Overhaul)

The AeroChunk pipeline is a multi-stage process designed for semantic accuracy and computational efficiency.

1.  **NLP Sentence Boundary Detection:** The input text is first segmented into sentences using `spaCy`'s `sentencizer`. This provides robust, context-aware tokenization that correctly handles complex cases like abbreviations ("Mr. Smith"), decimals ("45.5%"), and nested punctuation, eliminating the fragility of regex-based splitting.
2.  **Vectorization:** Each sentence is converted into a 384-dimensional vector embedding using a local `sentence-transformers` model (`all-MiniLM-L6-v2` by default).
3.  **Windowed Semantic Similarity:** To determine chunk boundaries, a new sentence is not just compared to its immediate predecessor. Instead, its embedding is compared against the **rolling mean embedding** of the current chunk's sentences. This method respects the progressive narrative flow and ensures that new sentences are evaluated against the broader semantic context of the chunk.
4.  **Chunking with Contextual Overlap:** A semantic boundary is declared when the similarity score drops below a threshold. To maintain contextual continuity (e.g., for pronoun resolution in RAG), the `overlap_sentences` parameter carries over the last N sentences of a completed chunk to the beginning of the next one.
5.  **Structural Guardrails:** The chunking process is constrained by `min_sentences` and `max_sentences` parameters. These act as a floor and ceiling, preventing the formation of semantically fragmented micro-chunks or excessively long chunks that could overflow an LLM's context window.

## 4. Performance Benchmarks

AeroChunk's stateless architecture and optimized processing deliver significant performance advantages, especially under high-load conditions. The following benchmark was conducted on a 15,000+ word document.

| Method                  | Peak RAM (MB) | Execution Time (s) | Chunks Generated | Outcome         |
| ----------------------- | ------------- | ------------------ | ---------------- | --------------- |
| **AeroChunk (Batch 32)**| **6.62**      | **11.18**          | **1500**         | **Optimal**     |
| LangChain (Semantic)    | 21.56         | 18.73              | 4                | Failed to Split |

**Key Takeaway:** AeroChunk achieves a **~69% reduction in peak RAM usage** and executes significantly faster, demonstrating its suitability for memory-constrained and low-latency applications.

## 5. Visual Debugger

AeroChunk includes an industry-first tool for visualizing the chunking process. After processing text, call the `export_debug_html()` method to generate an HTML report. This file provides a transparent, interpretable view of the semantic boundaries identified, making it invaluable for fine-tuning the similarity threshold and other parameters.

```python
# (Continuing from the Quick Start example)

# Generate the HTML report
debug_file = chunker.export_debug_html(
    result=result, 
    output_file="aero_debug_report.html"
)

print(f"Visual debug report saved to: {debug_file}")
```

## 6. License

This project is licensed under the **MIT License**.
