Metadata-Version: 2.4
Name: krira-augment
Version: 2.0.5
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: openpyxl>=3.0 ; extra == 'xlsx'
Requires-Dist: pdfplumber>=0.10 ; extra == 'pdf'
Requires-Dist: python-docx>=0.8 ; extra == 'docx'
Requires-Dist: polars>=0.20 ; extra == 'csv'
Requires-Dist: openpyxl>=3.0 ; extra == 'all'
Requires-Dist: pdfplumber>=0.10 ; extra == 'all'
Requires-Dist: python-docx>=0.8 ; extra == 'all'
Requires-Dist: polars>=0.20 ; extra == 'all'
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0 ; extra == 'dev'
Requires-Dist: black>=23.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0 ; extra == 'dev'
Requires-Dist: ruff>=0.1 ; extra == 'dev'
Provides-Extra: xlsx
Provides-Extra: pdf
Provides-Extra: docx
Provides-Extra: csv
Provides-Extra: all
Provides-Extra: dev
Summary: Production-grade document chunking library for RAG systems - Rust-powered Python library
Keywords: rag,chunking,nlp,document-processing,ai,rust,pyo3
Author-email: Krira Labs <contact@kriralabs.com>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/Krira-Labs/krira-chunker
Project-URL: Repository, https://github.com/Krira-Labs/krira-chunker
Project-URL: Documentation, https://github.com/Krira-Labs/krira-chunker#readme
Project-URL: Issues, https://github.com/Krira-Labs/krira-chunker/issues

# Krira Augment ⚡🦀

**The High-Performance Rust Chunking Engine for RAG Pipelines**

[![PyPI version](https://badge.fury.io/py/krira-augment.svg)](https://badge.fury.io/py/krira-augment)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Rust](https://img.shields.io/badge/Built_with-Rust-orange)](https://www.rust-lang.org/)

**Krira Augment** is a production-grade Python library backed by a highly optimized Rust core. It is designed to replace slow, memory-intensive preprocessing steps in large-scale Retrieval Augmented Generation (RAG) systems.

It processes gigabytes of raw unstructured data (CSV, PDF, DOCX, JSON, URLs, etc.) into clean chunks in seconds to minutes, using **zero-copy memory mapping** and **segment-based parallel CPU execution**.

---

## 🚀 Performance Benchmarks

Benchmarks run on a standard 8-core machine (M2 Air equivalent).

| Dataset Size | Legacy (LangChain/Pandas) | Krira V2 (Rust Core) | Speedup |
| :--- | :--- | :--- | :--- |
| **100 MB** | ~45 sec | **~0.8 sec** | **56x** 🚀 |
| **1 GB** | ~8.0 min | **~12.0 sec** | **40x** 🚀 |
| **5.28 GB** | *Crash / OOM* | **~58.0 sec** | **Stable** ✅ |
| **10 GB+** | *N/A* | **~2.1 min** | **Scalable** ✅ |

> **Note:** Krira uses a segment-based parallel strategy. It divides large files into 32MB segments so that all CPU cores stay busy while memory use stays low and bounded.

---

## 📦 Installation

```bash
# Basic installation
pip install krira-augment

# Install with support for every optional format
pip install "krira-augment[all]"

# Or install a single extra: xlsx, pdf, docx, or csv
pip install "krira-augment[pdf]"
```

*Requirements: Python 3.10+*

---

## 🛠️ Usage

### 1. Quick Start
The `process` method needs only an input path. If no `output_path` is provided, Krira generates one from the input filename (for example, `my_data.csv` becomes `my_data_processed.jsonl`).

```python
from krira_augment import Pipeline

# Initialize the pipeline
pipeline = Pipeline()

# Process any supported file (CSV, JSONL, TXT, XML, etc.)
# With no output_path, results go to 'my_data_processed.jsonl'
stats = pipeline.process(input_path="my_data.csv")

print("✅ Processing complete!")
print(f"Output saved to: {stats.output_file}")
print(f"Throughput: {stats.mb_per_second:.2f} MB/s")
```
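
The naming rule described above can be pictured with a small sketch. `default_output_path` is a hypothetical helper written for illustration, not part of the Krira API, and it assumes the output lands next to the input file; how Krira names output for URLs or other non-file inputs is not covered here.

```python
from pathlib import Path

def default_output_path(input_path: str) -> Path:
    """Hypothetical helper: 'my_data.csv' -> 'my_data_processed.jsonl'."""
    src = Path(input_path)
    # Keep the parent directory, swap the file name for '<stem>_processed.jsonl'.
    return src.with_name(f"{src.stem}_processed.jsonl")

print(default_output_path("my_data.csv"))  # my_data_processed.jsonl
```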

### 2. Multi-Format Support
Krira Augment extracts text from complex formats and passes it to the Rust core for chunking. Formats such as PDF, DOCX, and XLSX use the optional dependencies installed with `krira-augment[all]`.

```python
from krira_augment import Pipeline

pipeline = Pipeline()

# Process a Website URL
pipeline.process("https://example.com/docs")

# Process a PDF Document
pipeline.process("internal_report.pdf")

# Process an Excel spreadsheet or a DOCX file
pipeline.process("user_feedback.xlsx")
pipeline.process("contract.docx")
```

### 3. Advanced Configuration (Professional)
For production RAG, you need fine-grained control over chunking strategies and data cleaning.

```python
from krira_augment import Pipeline, PipelineConfig, SplitStrategy

# Define a robust configuration
config = PipelineConfig(
    chunk_size=512,               # Target characters per chunk
    strategy=SplitStrategy.SMART, # Respects sentence/paragraph boundaries
    clean_html=True,              # Remove <div>, <br>, etc.
    clean_unicode=True,           # Normalize whitespace and emojis
)

pipeline = Pipeline(config=config)

# Execute
result = pipeline.process("large_corpus.csv", output_path="custom_output.jsonl")

print(f"Chunks created: {result.chunks_created}")  # -1 if the count is unknown (streaming)
```
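
To give a feel for what a boundary-respecting strategy does, here is a simplified pure-Python sketch. It is not Krira's `SplitStrategy.SMART` implementation: `smart_split` is a hypothetical function that greedily packs whole sentences into chunks of at most `chunk_size` characters, and it keeps a single oversized sentence intact rather than cutting it.

```python
import re

def smart_split(text: str, chunk_size: int = 512) -> list[str]:
    """Greedy sentence packing: a simplified stand-in for a boundary-aware strategy."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # Fits, or starts a new chunk (even if the sentence alone is too long).
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(smart_split("One. Two. Three.", chunk_size=9))  # ['One. Two.', 'Three.']
```

A fixed-width splitter would cut the same text mid-sentence; packing whole sentences trades slightly uneven chunk lengths for chunks that read coherently on their own.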

---

## 🏗️ Architecture

Krira differs from standard Python loaders by offloading chunking, cleaning, and serialization to a compiled Rust core.

1.  **Memory Mapping (mmap):** Files are mapped directly from disk, so large inputs are never loaded in full into Python memory.
2.  **Segmented Parallelism:** The file is sliced into 32MB segments processed via the Rayon work-stealing scheduler.
3.  **Bounded Backpressure:** A 1024-item bounded MPSC channel manages data flow from processing threads to the disk writer, preventing runaway memory growth even if processing speed exceeds disk I/O.
4.  **Serde Serialization:** Chunks are serialized to JSONL directly on Rust threads, bypassing the Python GIL.
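
The shape of this pipeline can be sketched in plain Python. This is an analogue for illustration only, not Krira's Rust code: `ThreadPoolExecutor` stands in for Rayon, `queue.Queue` stands in for the bounded MPSC channel, and the per-segment "work" is just counting newline bytes. The function name `count_lines_segmented` and the `workers` parameter are assumptions made for the example.

```python
import mmap
import os
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SIZE = 32 * 1024 * 1024  # assumption: 32MB segments, as described above
QUEUE_CAPACITY = 1024            # assumption: mirrors the 1024-item channel

def count_lines_segmented(path, segment_size=SEGMENT_SIZE, workers=4):
    """Count newline bytes in a file by scanning memory-mapped segments in parallel."""
    size = os.path.getsize(path)
    if size == 0:
        return 0  # mmap cannot map an empty file

    results = queue.Queue(maxsize=QUEUE_CAPACITY)
    totals = []

    def writer():
        # Single consumer, standing in for the disk-writer thread.
        running = 0
        while (item := results.get()) is not None:
            running += item
        totals.append(running)

    consumer = threading.Thread(target=writer)
    consumer.start()
    try:
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            def work(start):
                end = min(start + segment_size, size)
                # Slicing copies one segment; put() blocks while the queue is full.
                results.put(mm[start:end].count(b"\n"))

            with ThreadPoolExecutor(max_workers=workers) as pool:
                list(pool.map(work, range(0, size, segment_size)))
    finally:
        results.put(None)  # sentinel: tell the consumer to stop
        consumer.join()
    return totals[0]
```

Because `put()` blocks once 1024 results are waiting, producers slow down to the consumer's pace instead of buffering without limit, which is the backpressure idea from step 3. The sketch shows the structure only; it does not reproduce the zero-copy behavior or the performance of the Rust core.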

---

## 🤝 Integration Example

```python
import json

def stream_chunks(jsonl_path):
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

# Usage
for chunk in stream_chunks("my_data_processed.jsonl"):
    # Send to Vector DB or OpenAI Embedding API
    pass
```
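
Embedding APIs and vector databases usually accept several items per request, so it can help to group the streamed chunks into batches. The sketch below is one way to do that; `batched` is a hypothetical helper, and the batch size of 100 in the commented usage is an arbitrary illustrative value.

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of up to `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch lazily; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Usage with the stream_chunks generator above (embed() is a placeholder):
# for batch in batched(stream_chunks("my_data_processed.jsonl"), 100):
#     embed(batch)
print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Since both `stream_chunks` and `batched` are lazy, only one batch is held in memory at a time, regardless of how large the JSONL file is.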

---

## 🧑‍💻 Development

1.  **Clone the repo**
2.  **Install Maturin and the `build` frontend**
    ```bash
    pip install maturin build
    ```
3.  **Build and Install locally**
    ```bash
    python -m build
    pip install dist/*.whl --force-reinstall
    ```

---

## License

MIT License. (c) 2024 Krira Labs.

