Metadata-Version: 2.4
Name: chunkin
Version: 0.1.3
Summary: Document chunking and indexing library for vector stores
Author-email: thevyanshu <thevyanshu@github.com>
License: MIT
Project-URL: Homepage, https://github.com/thevyanshu/chunkin
Project-URL: Documentation, https://thevyanshu.github.io/chunkin
Project-URL: Repository, https://github.com/thevyanshu/chunkin
Project-URL: Issues, https://github.com/thevyanshu/chunkin/issues
Keywords: document,chunking,vector-store,rag,langchain,embeddings,indexing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain>=0.3.0
Requires-Dist: langchain-text-splitters>=0.3.0
Requires-Dist: langchain-community>=0.3.0
Requires-Dist: pypdf>=5.0.0
Requires-Dist: openpyxl>=3.1.0
Provides-Extra: core
Requires-Dist: langchain-openai>=0.3.0; extra == "core"
Requires-Dist: faiss-cpu; extra == "core"
Provides-Extra: semantic
Requires-Dist: langchain-experimental>=0.3.0; extra == "semantic"
Provides-Extra: local
Requires-Dist: langchain-chroma>=0.3.0; extra == "local"
Requires-Dist: langchain-milvus>=0.3.0; extra == "local"
Requires-Dist: langchain-lancedb>=0.3.0; extra == "local"
Requires-Dist: langchain-lambdadb>=0.3.0; extra == "local"
Provides-Extra: aws
Requires-Dist: langchain-aws>=0.3.0; extra == "aws"
Requires-Dist: boto3; extra == "aws"
Provides-Extra: azure
Requires-Dist: langchain-azure-ai>=0.3.0; extra == "azure"
Requires-Dist: langchain-azure-cosmosdb>=0.3.0; extra == "azure"
Provides-Extra: gcp
Requires-Dist: langchain-databricks>=0.3.0; extra == "gcp"
Requires-Dist: langchain-google-vertexai>=0.3.0; extra == "gcp"
Requires-Dist: langchain-google-community>=0.3.0; extra == "gcp"
Provides-Extra: other
Requires-Dist: langchain-pinecone>=0.3.0; extra == "other"
Requires-Dist: langchain-qdrant>=0.3.0; extra == "other"
Requires-Dist: langchain-weaviate>=0.3.0; extra == "other"
Requires-Dist: langchain-postgres>=0.3.0; extra == "other"
Requires-Dist: langchain-mongodb>=0.3.0; extra == "other"
Requires-Dist: langchain-astradb>=0.3.0; extra == "other"
Requires-Dist: langchain-elasticsearch>=0.3.0; extra == "other"
Requires-Dist: langchain-neo4j>=0.3.0; extra == "other"
Requires-Dist: langchain-oracledb>=0.3.0; extra == "other"
Requires-Dist: langchain-cockroachdb>=0.3.0; extra == "other"
Requires-Dist: langchain-couchbase>=0.3.0; extra == "other"
Requires-Dist: langchain-singlestore>=0.3.0; extra == "other"
Requires-Dist: langchain-supabase>=0.3.0; extra == "other"
Requires-Dist: langchain-myscale>=0.3.0; extra == "other"
Requires-Dist: langchain-zilliz>=0.3.0; extra == "other"
Requires-Dist: langchain-meilisearch>=0.3.0; extra == "other"
Requires-Dist: langchain-typesense>=0.3.0; extra == "other"
Requires-Dist: langchain-vectara>=0.3.0; extra == "other"
Requires-Dist: langchain-marqo>=0.3.0; extra == "other"
Requires-Dist: langchain-deeplake>=0.3.0; extra == "other"
Requires-Dist: langchain-turbopuffer>=0.3.0; extra == "other"
Provides-Extra: all
Requires-Dist: langchain-openai>=0.3.0; extra == "all"
Requires-Dist: langchain-experimental>=0.3.0; extra == "all"
Requires-Dist: langchain-chroma>=0.3.0; extra == "all"
Requires-Dist: langchain-milvus>=0.3.0; extra == "all"
Requires-Dist: langchain-lancedb>=0.3.0; extra == "all"
Requires-Dist: langchain-lambdadb>=0.3.0; extra == "all"
Requires-Dist: langchain-aws>=0.3.0; extra == "all"
Requires-Dist: langchain-azure-ai>=0.3.0; extra == "all"
Requires-Dist: langchain-azure-cosmosdb>=0.3.0; extra == "all"
Requires-Dist: langchain-databricks>=0.3.0; extra == "all"
Requires-Dist: langchain-google-vertexai>=0.3.0; extra == "all"
Requires-Dist: langchain-google-community>=0.3.0; extra == "all"
Requires-Dist: langchain-pinecone>=0.3.0; extra == "all"
Requires-Dist: langchain-qdrant>=0.3.0; extra == "all"
Requires-Dist: langchain-weaviate>=0.3.0; extra == "all"
Requires-Dist: langchain-postgres>=0.3.0; extra == "all"
Requires-Dist: langchain-mongodb>=0.3.0; extra == "all"
Requires-Dist: langchain-astradb>=0.3.0; extra == "all"
Requires-Dist: langchain-elasticsearch>=0.3.0; extra == "all"
Requires-Dist: langchain-neo4j>=0.3.0; extra == "all"
Requires-Dist: langchain-oracledb>=0.3.0; extra == "all"
Requires-Dist: langchain-cockroachdb>=0.3.0; extra == "all"
Requires-Dist: langchain-couchbase>=0.3.0; extra == "all"
Requires-Dist: langchain-singlestore>=0.3.0; extra == "all"
Requires-Dist: langchain-supabase>=0.3.0; extra == "all"
Requires-Dist: langchain-myscale>=0.3.0; extra == "all"
Requires-Dist: langchain-zilliz>=0.3.0; extra == "all"
Requires-Dist: langchain-meilisearch>=0.3.0; extra == "all"
Requires-Dist: langchain-typesense>=0.3.0; extra == "all"
Requires-Dist: langchain-vectara>=0.3.0; extra == "all"
Requires-Dist: langchain-marqo>=0.3.0; extra == "all"
Requires-Dist: langchain-deeplake>=0.3.0; extra == "all"
Requires-Dist: langchain-turbopuffer>=0.3.0; extra == "all"
Requires-Dist: faiss-cpu; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Dynamic: license-file

# Chunkin

A Python library for document chunking and indexing into vector stores, built on [LangChain](https://python.langchain.com/).

## Built on LangChain

Chunkin leverages [LangChain](https://python.langchain.com/) for:

- **Document Loaders**: Load PDF, DOCX, TXT, MD, CSV, XLSX, and PPTX files
- **Text Splitters**: Six chunking strategies, including semantic chunking
- **Vector Stores**: 50+ vector store integrations (FAISS, Chroma, Pinecone, etc.)

Learn more about [LangChain's document processing capabilities](https://python.langchain.com/docs/modules/data_connection/).

## Modules

| Module | Description |
|--------|-------------|
| `chunkin` | Document chunking using [LangChain text splitters](docs/strategies.md) |
| `chunkin_indexer` | Index chunks to 50+ vector stores via [LangChain integrations](https://python.langchain.com/docs/integrations/vectorstores/) |
| `chunkin_processor` | Unified end-to-end processing |

## Quick Start

```python
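# Requires the `core` extra: pip install "chunkin[core]"
# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

# Load, chunk, and index the file, then query the resulting index.
processor.process_file("document.pdf")
results = processor.search("your query", k=3)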
```

## Installation

```bash
# Base package only (no extras)
pip install chunkin

# With OpenAI + FAISS (recommended)
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores (Chroma, Milvus, LanceDB, etc.)
pip install "chunkin[local]"

# Specific cloud providers
pip install "chunkin[aws]"     # Amazon AWS
pip install "chunkin[azure]"   # Microsoft Azure
pip install "chunkin[gcp]"     # Google Cloud

# Other vector stores (Pinecone, Qdrant, Weaviate, etc.)
pip install "chunkin[other]"

# All vector stores
pip install "chunkin[all]"
```
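
To check which optional dependencies are importable in the current environment, something like the following may help. This is a minimal sketch: the `EXTRA_MODULES` mapping and the `missing_modules` helper are hypothetical and not part of Chunkin, and only a few extras are covered.

```python
from importlib.util import find_spec

# Import names for some of the optional packages pulled in by the extras.
# The mapping and helper below are illustrative, not part of Chunkin's API.
EXTRA_MODULES = {
    "core": ["langchain_openai", "faiss"],
    "semantic": ["langchain_experimental"],
    "aws": ["langchain_aws", "boto3"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the import names from `extra` that cannot be found."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


if __name__ == "__main__":
    for extra in EXTRA_MODULES:
        missing = missing_modules(extra)
        status = "ok" if not missing else f"missing {', '.join(missing)}"
        print(f"chunkin[{extra}]: {status}")
```

An empty list for an extra suggests its packages are installed; any names returned point to the extra that still needs installing.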

## Documentation

- [Overview](docs/index.md)
- [Installation](docs/installation.md)
- [Usage Guide](docs/usage.md)
- [API Reference](docs/api.md)
- [Chunking Strategies](docs/strategies.md)
- [Vector Stores](docs/indexer.md)
- [Doc Processor](docs/processor.md)

## Supported Formats

Chunkin uses [LangChain document loaders](https://python.langchain.com/docs/integrations/document_loaders/):

| Format | Extensions |
|--------|-----------|
| PDF | `.pdf` |
| Word | `.docx`, `.doc` |
| Text | `.txt` |
| Markdown | `.md` |
| CSV | `.csv` |
| Excel | `.xlsx`, `.xls` |
| PowerPoint | `.pptx`, `.ppt` |

## Supported Vector Stores

Built on [LangChain vector store integrations](https://python.langchain.com/docs/integrations/vectorstores/):

### Local (No External Service)
FAISS, Chroma, Milvus, LanceDB, LambdaDB, Deep Lake, Annoy

### Amazon AWS
OpenSearch, Valkey, DocumentDB

### Microsoft Azure
Azure AI Search, Azure Cosmos DB, Azure Cosmos DB NoSQL

### Google Cloud
Databricks Vector Search, Vertex AI Vector Search, BigQuery, AlloyDB

### Other
Qdrant, Weaviate, Pinecone, MongoDB Atlas, PGVector, Astra DB,
Elasticsearch, Oracle, Neo4j, SingleStore, Supabase, MyScale,
Zilliz, Marqo, Vectara, Meilisearch, Typesense, and more...

See [docs/indexer.md](docs/indexer.md) for the full list.

## Supported Chunking Strategies

Uses [LangChain text splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/):

| Strategy | LangChain Class | Description |
|----------|-----------------|-------------|
| `recursive` | RecursiveCharacterTextSplitter | Recursively splits by paragraphs, sentences, words |
| `character` | CharacterTextSplitter | Simple character-based splitting |
| `markdown` | MarkdownTextSplitter | Markdown-aware splitting |
| `markdown_headers` | MarkdownHeaderTextSplitter | Split by markdown headers |
| `html_headers` | HTMLHeaderTextSplitter | Split by HTML header tags |
| `semantic` | SemanticChunker | Embedding-based semantic splitting |

See [docs/strategies.md](docs/strategies.md) for details.
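
To illustrate the idea behind the `recursive` strategy, here is a simplified pure-Python sketch. It is not LangChain's `RecursiveCharacterTextSplitter` implementation: it has no chunk overlap, and the `recursive_split` name and the default separators are assumptions chosen to mirror the "paragraphs, sentences, words" description in the table.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        # Greedily merge neighbouring parts while they still fit.
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("alpha beta\n\ngamma delta epsilon", chunk_size=12)
print(chunks)  # ['alpha beta', 'gamma delta', 'epsilon']
```

The coarsest separator is tried first, so paragraph boundaries are preserved whenever a paragraph fits within `chunk_size`; finer separators are used only for pieces that are still too long.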

## Project Structure

```
chunkin/
├── chunkin/                 # Document chunking module
│   └── chunker.py          # DocumentChunker class
├── chunkin_indexer/         # Vector store indexing module
│   └── indexer.py          # DocIndexer class
├── chunkin_processor/       # Unified module
│   └── doc_processor.py    # DocProcessor class
├── docs/                    # MkDocs documentation
├── pyproject.toml           # Package configuration
└── README.md
```

## Development

```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Build package
python -m build

# Serve docs locally
cd docs && pip install -r requirements.txt && mkdocs serve
```

## LangChain Resources

- [LangChain Documentation](https://python.langchain.com/)
- [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- [Vector Stores](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)

## License

MIT License
