Metadata-Version: 2.4
Name: kiru
Version: 0.1.11
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Keywords: text,chunking,nlp,rag
Requires-Python: >=3.11
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/bitswired/kiru
Project-URL: Repository, https://github.com/bitswired/kiru
Project-URL: Documentation, https://github.com/bitswired/kiru
Project-URL: Bug Tracker, https://github.com/bitswired/kiru/issues
Project-URL: Changelog, https://github.com/bitswired/kiru/releases

# kiru ⚡🗡️

> **Cut through text at the speed of light**

The fastest text chunking library for RAG applications. Available for both Rust and Python.

[![Crates.io](https://img.shields.io/crates/v/kiru.svg)](https://crates.io/crates/kiru)
[![PyPI](https://img.shields.io/pypi/v/kiru.svg)](https://pypi.org/project/kiru/)
[![Documentation](https://docs.rs/kiru/badge.svg)](https://docs.rs/kiru)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## What is kiru?

kiru is a high-performance text chunking library designed for modern RAG (Retrieval-Augmented Generation) systems. When you need to split millions of documents for vector databases or process streaming data in real-time, kiru delivers unmatched speed without sacrificing correctness.

### Key Features

- **⚡ Blazing Fast (Python)**: 1000+ MB/s throughput for bytes, 300+ MB/s for characters
- **🎯 UTF-8 Safe**: Never breaks multi-byte characters or emoji
- **💾 Memory Efficient**: Stream gigabyte files with constant memory usage
- **🚀 Parallel Processing**: Utilize all CPU cores automatically
- **🔌 Multiple Sources**: Files, URLs, strings, and glob patterns
- **🛠️ Flexible Strategies**: Chunk by bytes or characters
- **🦀 Rust Core**: Rust performance and memory safety
- **🐍 Python Bindings**: Pythonic API for ease of use

## Performance

**Benchmarked on 1MB text file, 1MB chunks, 1KB overlap:**

| Implementation    | Strategy | Source | Time (ms) | Memory (MB) | Throughput (MB/s) |
|-------------------|----------|--------|-----------|-------------|-------------------|
| **kiru (Rust)**   | bytes    | string | 0.23      | -           | **4,370**         |
| **kiru (Python)** | bytes    | string | 0.71      | 2.9         | **1,408**         |
| **kiru (Python)** | chars    | string | 3.13      | 2.9         | **319**           |
| LangChain         | chars    | string | 2,982     | 18.6        | 0.34              |

**kiru is 4,000x faster than LangChain for byte chunking and 940x faster for character chunking!**

Key insights:
- **Rust native performance**: Up to 4,370 MB/s for byte chunking
- **Python bindings overhead**: Still 1,400+ MB/s, beating all pure Python alternatives
- **Character-aware chunking**: 300+ MB/s while respecting grapheme boundaries
- **Memory efficient**: Uses 6x less memory than LangChain

---

## Quick Start

### Python 🐍

```bash
pip install kiru
```

```python
from kiru import Chunker

# Create a chunker
chunker = Chunker.by_bytes(
    chunk_size=1024,  # 1KB chunks
    overlap=128       # 128 bytes overlap
)

# Chunk text
chunks = chunker.on_string("Your text here...").all()

# Chunk files in parallel
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
for chunk in chunker.on_sources_par(sources):
    process(chunk)
```

### Rust 🦀

Add to your `Cargo.toml`:

```toml
[dependencies]
kiru = "0.1"
```

```rust
use kiru::{BytesChunker, Chunker};

// Create a chunker
let chunker = BytesChunker::new(1024, 128)?;

// Chunk text
let chunks: Vec<String> = chunker
    .chunk_string("Your text here...".to_string())
    .collect();

// Stream large files
use kiru::{Source, StreamType};
let stream = StreamType::from_source(&Source::File("huge.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    process(chunk);
}
```

---

## Use Cases

### Building RAG Systems

```python
# Perfect for vector database ingestion
chunker = Chunker.by_bytes(512, 50)  # Tuned for embedding models

documents = ["glob://knowledge_base/**/*.md"]
chunks = chunker.on_sources_par(documents, channel_size=10000)

for chunk in chunks:
    embedding = model.encode(chunk)
    vector_db.insert(chunk, embedding)
```

### Real-time Processing

```python
# Stream processing without memory overhead
for chunk in chunker.on_file("10GB_file.txt"):
    # Each chunk generated on-demand
    send_to_queue(chunk)
```

### Parallel Document Processing

```rust
// Process hundreds of documents concurrently
use kiru::{ChunkerBuilder, ChunkerEnum};

let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
    chunk_size: 4096,
    overlap: 512,
});

let sources = vec!["glob://docs/**/*.txt"];
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
```

---

## Chunking Strategies

### Bytes Chunking
- Splits on byte boundaries while respecting UTF-8
- Fastest performance (1000+ MB/s in Rust, 1400+ MB/s in Python)
- Ideal for token-limited models and consistent memory usage

### Characters Chunking  
- Splits on character (grapheme) boundaries
- Ensures exact character counts regardless of byte representation
- Perfect for character-limited APIs (300+ MB/s in Python)

---

## API Reference

### Python API

#### Creating Chunkers

```python
from kiru import Chunker

# Byte-based chunking
chunker = Chunker.by_bytes(chunk_size=1024, overlap=128)

# Character-based chunking
chunker = Chunker.by_characters(chunk_size=1000, overlap=100)
```

#### Input Sources

```python
# Single string
chunks = chunker.on_string("text...").all()

# Single file
chunks = chunker.on_file("/path/to/file.txt").all()

# HTTP/HTTPS URL
chunks = chunker.on_http("https://example.com/page").all()

# Multiple sources (serial)
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
chunks = chunker.on_sources(sources).all()

# Multiple sources (parallel)
chunks = chunker.on_sources_par(sources, channel_size=1000).all()

# Or iterate lazily
for chunk in chunker.on_sources_par(sources):
    process(chunk)
```

#### Source Prefixes

- `file://path/to/file.txt` - Local files
- `http://example.com` or `https://example.com` - URLs
- `text://Inline text content` - Raw text strings
- `glob://*.md` - Glob patterns
- No prefix - Treated as raw text

### Rust API

#### Creating Chunkers

```rust
use kiru::{BytesChunker, CharactersChunker, Chunker};

// Byte-based chunking
let chunker = BytesChunker::new(1024, 128)?;

// Character-based chunking
let chunker = CharactersChunker::new(1000, 100)?;
```

#### Basic Usage

```rust
use kiru::Chunker;

// Chunk a string
let chunks: Vec<String> = chunker
    .chunk_string("Your text here".to_string())
    .collect();

// Stream a file
use kiru::{Source, StreamType};
let stream = StreamType::from_source(&Source::File("file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    // Process chunk
}
```

#### Advanced Usage

```rust
use kiru::{ChunkerBuilder, ChunkerEnum, Source, HigherOrderSource, SourceGenerator};

// Create chunker with builder pattern
let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
    chunk_size: 4096,
    overlap: 512,
});

// Single source
let chunks = chunker.on_source(Source::File("doc.txt".to_string()))?;

// Multiple sources (serial)
let sources = vec![
    Source::File("doc1.txt".to_string()),
    Source::Http("https://example.com".to_string()),
];
let chunks = chunker.on_sources(sources)?;

// Multiple sources (parallel) - returns Vec
let chunks: Vec<String> = chunker.on_sources_par(sources)?;

// Multiple sources (parallel streaming) - returns iterator
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
for chunk in chunks {
    // Process as they arrive
}

// Using glob patterns
let sources = vec![HigherOrderSource::SourceGenerator(
    SourceGenerator::Glob("**/*.md".to_string())
)];
let flattened = HigherOrderSource::into_flattened_sources(sources)?;
```

---

## Architecture

```
┌─────────────────────────────────────────┐
│           Application Layer              │
│     (Python or Rust Application)        │
├─────────────────────────────────────────┤
│          kiru-py (PyO3 Bindings)        │
│              [Python only]               │
├─────────────────────────────────────────┤
│         kiru-core (Rust Library)        │
│                                          │
│        ┌──────────┬───────────┐         │
│        │ Chunkers │ Streaming │         │  
│        │  Engine  │   Engine  │         │
│        └──────────┴───────────┘         │
└─────────────────────────────────────────┘
```

---

## Project Structure

```
kiru/
├── README.md              # This file (shared documentation)
├── kiru-core/             # Rust implementation
│   ├── src/               # Core chunking algorithms
│   │   ├── bytes_chunker.rs
│   │   ├── characters_chunker.rs
│   │   ├── chunker.rs     # Builder pattern & parallel processing
│   │   └── stream.rs      # File/HTTP streaming
│   ├── benches/           # Criterion benchmarks
│   └── tests/             # Property-based tests
├── kiru-py/               # Python bindings (PyO3)
│   ├── src/lib.rs         # Python wrapper
│   └── python/            # Python tests & benchmarks
└── utils/                 # Version management scripts
```

---

## Streaming & Memory Efficiency

**kiru's killer feature: true streaming with constant memory usage.**

Unlike traditional chunkers that load entire files into memory, kiru processes data as it arrives using an intelligent buffering system. This means you can chunk **gigabyte-sized files** with minimal RAM usage.

### How Streaming Works

```
File/HTTP Source → Read Blocks (8KB) → UTF-8 Buffer → Chunk Iterator → Your Code
                      ↓                      ↓
                 As needed              Constant size
```

**Key advantages:**

1. **Constant Memory**: Process 10GB files with ~10MB RAM
2. **Immediate Results**: First chunks available instantly, no waiting for full file load
3. **Works Everywhere**: Local files, HTTP/HTTPS streams, any data source
4. **UTF-8 Safe**: Buffer maintains character boundaries automatically

### Python Examples

```python
from kiru import Chunker

chunker = Chunker.by_bytes(chunk_size=4096, overlap=512)

# ⚡ Stream a 10GB file - uses only ~10MB RAM
for chunk in chunker.on_file("huge_dataset.txt"):
    # Process chunk immediately as it arrives
    vector_db.insert(chunk)
    # No waiting, no memory explosion!

# ⚡ Stream from HTTP - process as data downloads
for chunk in chunker.on_http("https://example.com/large_document.txt"):
    process(chunk)
    # Chunks ready while download continues

# ⚡ Stream multiple sources in parallel
sources = [
    "file://10gb_file1.txt",
    "https://example.com/doc.txt",
    "file://10gb_file2.txt"
]
for chunk in chunker.on_sources_par(sources, channel_size=1000):
    # All sources stream in parallel
    # Memory stays constant regardless of file sizes
    send_to_queue(chunk)
```

### Rust Examples

```rust
use kiru::{BytesChunker, Chunker, Source, StreamType};

let chunker = BytesChunker::new(4096, 512)?;

// ⚡ Stream a massive file with constant memory
let stream = StreamType::from_source(&Source::File("10gb_file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    // Process immediately, no memory buildup
    vector_db.insert(chunk);
}

// ⚡ Stream from HTTP as data arrives
let stream = StreamType::from_source(&Source::Http("https://example.com/doc.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    process(chunk);
}
```

### Memory Comparison

Processing a **1GB file** with 4KB chunks:

| Library    | Memory Usage | Loads Full File? | Streaming? |
|------------|--------------|------------------|------------|
| **kiru**   | **~10 MB**   | ❌ No            | ✅ Yes     |
| LangChain  | **1000+ MB** | ✅ Yes           | ❌ No      |
| tiktoken   | **1000+ MB** | ✅ Yes           | ❌ No      |

**Result**: kiru uses **100x less memory** while being **4,000x faster**!

---

## Development

### Setup

```bash
# Clone repository
git clone https://github.com/yourusername/kiru.git
cd kiru

# Run all tests
cargo test --workspace

# Run Rust benchmarks
cd kiru-core
cargo bench

# Build Python package
cd ../kiru-py
pip install maturin
maturin develop --release

# Run Python tests
pip install pytest hypothesis
pytest python/test.py

# Run Python benchmarks
python python/bench.py
```

### Running Benchmarks

```bash
# Rust benchmarks
cd kiru-core
cargo bench

# Python benchmarks
cd kiru-py
python python/bench.py
```

---

## Performance Tips

1. **Use byte chunking** for maximum throughput (1000+ MB/s)
2. **Use character chunking** when exact character counts matter (300+ MB/s)
3. **Enable parallel processing** with `on_sources_par()` for multiple files
4. **Tune chunk size** based on your embedding model's context window
5. **Adjust overlap** to balance context preservation and storage
6. **Stream large files** to maintain constant memory usage

---

## Why "kiru"?

"Kiru" (切る) is Japanese for "to cut" - reflecting the library's purpose of cutting text into chunks at lightning speed ⚡🗡️

---

## Contributing

We welcome contributions! Please check out our [Contributing Guide](CONTRIBUTING.md) for guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

---

## Credits

Built with:
- [PyO3](https://pyo3.rs) - Rust bindings for Python
- [Rayon](https://github.com/rayon-rs/rayon) - Data parallelism for Rust
- [maturin](https://www.maturin.rs) - Build and publish Rust Python extensions

---

**Ready to cut through text at the speed of light?**

- 🐍 **Python**: `pip install kiru`
- 🦀 **Rust**: Add `kiru = "0.1"` to Cargo.toml

Get started with [PyPI](https://pypi.org/project/kiru/) | [Crates.io](https://crates.io/crates/kiru) | [Documentation](https://docs.rs/kiru)
