Metadata-Version: 2.2
Name: markdown-chunker
Version: 0.1.0
Summary: A robust Markdown chunking library that preserves structure and context
Home-page: https://github.com/hadjebi/markdown_chunker
Author: Saeed Hajebi
Author-email: hajebis@tcd.ie
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=5.1
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# <img src="https://raw.githubusercontent.com/FortAwesome/Font-Awesome/6.x/svgs/solid/puzzle-piece.svg" width="30" height="30"> Markdown Chunker

[![Python](https://img.shields.io/badge/Python-3.7%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Markdown](https://img.shields.io/badge/Markdown-Friendly-green.svg)](https://daringfireball.net/projects/markdown/)
[![PyPI version](https://badge.fury.io/py/markdown-chunker.svg)](https://badge.fury.io/py/markdown-chunker)

A robust Python library for intelligently chunking Markdown documents while preserving structural integrity and maintaining context.

## 📋 Overview

Markdown Chunker helps you divide large Markdown files into smaller, more manageable chunks while preserving the structure, meaning, and context of the original document. It's ideal for:

- Integrating long documents with AI models that have token limits
- Creating semantic chunks for vector databases
- Preparing content for efficient processing by NLP systems
- Splitting documents for parallel processing while maintaining integrity

## ✨ Features

- **🧠 Smart Chunking**: Splits Markdown documents intelligently, preserving structure and meaning
- **🔍 Content-Aware**: Handles various Markdown elements with specialized intelligence:
  - Headings (never split)
  - Tables (split with headers preserved)
  - Code blocks (kept intact)
  - Lists (split between items)
  - Blockquotes (split at paragraph boundaries)
  - Footnotes (kept with their references when possible)
  - YAML Front Matter (kept intact)
  - HTML (preserves tag structure)
- **🔄 Automatic Header/Footer Detection**: Identifies and removes repeating headers and footers
- **🚫 Duplicate Prevention**: Automatically detects and removes duplicate chunks
- **⚙️ Configurable Size Constraints**: Customize minimum and maximum chunk sizes
- **🏗️ Structure Preservation**: Maintains Markdown syntax and document structure
- **📝 Metadata Generation**: Optionally embeds metadata in each chunk as YAML front matter
- **⚡ Parallel Processing**: Efficiently processes large documents using multiple cores
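As an illustration of the "split with headers preserved" behaviour listed for tables, here is a minimal standalone sketch. It is not the library's implementation: the helper name `split_table_keep_header` and the `max_rows` parameter are hypothetical, and it splits by row count for simplicity, whereas the library works with character lengths.

```python
from typing import List


def split_table_keep_header(table: str, max_rows: int) -> List[str]:
    """Split a pipe table into parts with at most max_rows body rows each.

    Hypothetical helper for illustration; not part of markdown_chunker.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    lines = [line for line in table.strip().splitlines() if line.strip()]
    # Assumes the first two lines are the header row and the delimiter row
    header, rows = lines[:2], lines[2:]
    if not rows:
        return ["\n".join(header)]
    parts = []
    for start in range(0, len(rows), max_rows):
        # Repeat the header at the top of every part
        parts.append("\n".join(header + rows[start:start + max_rows]))
    return parts


table = """
| Name | Qty |
|------|-----|
| A    | 1   |
| B    | 2   |
| C    | 3   |
"""

for part in split_table_keep_header(table, max_rows=2):
    print(part)
    print()
```

Each part remains a valid table on its own, so a reader or model that sees only one chunk still knows what the columns mean.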

## 🔧 Installation

```bash
# Install from PyPI (recommended)
pip install markdown-chunker

# Install the development version directly from GitHub
pip install git+https://github.com/hadjebi/markdown_chunker.git
```

## 🚀 Quick Start

```python
from markdown_chunker import MarkdownChunkingStrategy

# Create a chunking strategy with default configuration
strategy = MarkdownChunkingStrategy()

# Or customize the parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=512,    # Minimum chunk size (default: 512)
    soft_max_len=1024,    # Preferred maximum chunk size (default: 1024)
    hard_max_len=2048,    # Absolute maximum chunk size (default: 2048)
    detect_headers_footers=True,  # Detect and remove repeating headers/footers
    remove_duplicates=True        # Remove duplicate chunks
)

# Chunk a Markdown document
with open('document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Process the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 80)
```

## 🖥️ Command Line Interface

The package includes a command-line interface for chunking Markdown files:

```bash
# Use the markdown-chunker command after installing with pip
markdown-chunker examples/sample.md

# Specify a custom output directory
markdown-chunker examples/sample.md custom_output_dir

# Customize parameters
markdown-chunker --min-chunk-len=256 --soft-max-len=512 --hard-max-len=1024 examples/sample.md

# Enable metadata embedding
markdown-chunker --add-metadata examples/sample.md

# Enable parallel processing for large documents
markdown-chunker --parallel --max-workers=4 examples/large_document.md
```

### Available Options

- `--min-chunk-len`: Minimum chunk length in characters (default: 512)
- `--soft-max-len`: Soft maximum chunk length in characters (default: 1024)
- `--hard-max-len`: Hard maximum chunk length in characters (default: 2048)
- `--no-headers-footers`: Disable header and footer detection
- `--no-duplicates`: Disable duplicate detection
- `--add-metadata`: Embed metadata in each chunk as YAML front matter
- `--document-title`: Specify a document title for metadata (auto-detected if not provided)
- `--parallel`: Enable parallel processing for large documents
- `--max-workers`: Maximum number of worker processes for parallel processing
- `--verbose`: Enable verbose output

## 🔬 Chunking Strategy

The library applies the following chunking rules:

1. **Structure Preservation**
   - Headings are never split
   - Code blocks are kept intact
   - Tables are split only when necessary, with headers preserved
   - Lists are split between items to maintain structure
   - Blockquotes are split at paragraph boundaries
   - Footnotes are kept with their references when possible
   - HTML tags are preserved in their structure

2. **Size Management**
   - Chunks are kept between `min_chunk_len` and `soft_max_len` when possible
   - No chunk is allowed to exceed `hard_max_len`
   - Small chunks are merged when below `min_chunk_len`

3. **Header/Footer Handling**
   - Automatically detects repeating headers and footers
   - Removes redundant elements while preserving unique content
   - Uses pattern matching to identify common elements

4. **Duplicate Prevention**
   - Detects and removes duplicate chunks
   - Preserves the first occurrence of duplicate content
   - Uses MD5 hashing for efficient comparison

5. **Embedded Metadata**
   - Optionally embeds metadata in each chunk as YAML front matter
   - Includes document information (title, source)
   - Provides chunk details (id, position, next/previous chunks)
   - Maintains heading hierarchy information
   - Identifies content types (tables, lists, code blocks, etc.)
   - Preserves and merges with existing YAML front matter
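The duplicate-prevention rule above (hash each chunk with MD5, keep the first occurrence) can be sketched with the standard library alone. This is a simplified illustration, not the library's code: the function name `remove_duplicate_chunks` is hypothetical, and stripping surrounding whitespace before hashing is an assumption.

```python
import hashlib
from typing import List


def remove_duplicate_chunks(chunks: List[str]) -> List[str]:
    """Return chunks with later duplicates dropped, preserving order.

    Hypothetical helper for illustration; not part of markdown_chunker.
    """
    seen = set()
    unique = []
    for chunk in chunks:
        # Assumption: surrounding whitespace is ignored when comparing chunks
        digest = hashlib.md5(chunk.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)  # first occurrence wins
    return unique


chunks = ["# Intro\n\nHello.", "## Details\n\nMore.", "# Intro\n\nHello.\n"]
print(len(remove_duplicate_chunks(chunks)))
```

MD5 is used here only as a fast fingerprint for equality checks, not for security. Storing fixed-size digests instead of full chunk text keeps memory use low on large documents.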

## 📝 Examples

### Basic Usage

````python
from markdown_chunker import MarkdownChunkingStrategy

strategy = MarkdownChunkingStrategy()

# Simple document with various elements
content = """
# Main Title

## Section 1

This is a paragraph with some content.

```python
def example():
    return "Hello, World!"
```

1. First item
2. Second item
   - Subitem
   - Another subitem

> Important quote
> spanning multiple lines
"""

chunks = strategy.chunk_markdown(content)
````

### Custom Configuration

```python
from markdown_chunker import MarkdownChunkingStrategy

# Create a strategy with custom parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=100,
    soft_max_len=200,
    hard_max_len=300,
    detect_headers_footers=False  # Disable header/footer detection
)

with open('document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)
```
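To build intuition for how the three size limits might interact, the following standalone sketch greedily packs pre-split blocks into chunks. It is a simplified stand-in under stated assumptions, not the library's algorithm: `pack_blocks` is a hypothetical name, blocks are joined with a blank line, and blocks longer than `hard_max` are not split further here.

```python
from typing import List


def pack_blocks(blocks: List[str], min_len: int, soft_max: int, hard_max: int) -> List[str]:
    """Greedily pack blocks into chunks using three size limits.

    Hypothetical helper for illustration; not part of markdown_chunker.
    """
    chunks: List[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= soft_max:
            # Still within the preferred size: keep growing the chunk
            current = candidate
        elif len(current) < min_len and len(candidate) <= hard_max:
            # Chunk is too small to emit, so overshoot the soft limit
            # as long as the hard limit is respected
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = block  # start a new chunk with this block
    if current:
        chunks.append(current)  # the final chunk may be shorter than min_len
    return chunks


blocks = ["a" * 60, "b" * 180, "c" * 150, "d" * 40]
chunks = pack_blocks(blocks, min_len=100, soft_max=200, hard_max=300)
print([len(c) for c in chunks])  # [242, 192]
```

The first chunk exceeds the soft limit because its first block alone was shorter than `min_len`, while both chunks stay under the hard limit.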

### Embedded Metadata

```python
from markdown_chunker import MarkdownChunkingStrategy

# Create a strategy with embedded metadata
strategy = MarkdownChunkingStrategy(
    add_metadata=True,
    document_title="My Document",
    source_document="document.md"
)

with open('document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Each chunk will include YAML front matter like:
'''
---
chunk:
  id: 1
  total: 10
  previous: null
  next: 2
  length: 1024
  position: 10%
document:
  title: My Document
  source: document.md
content:
  types:
  - heading
  - paragraph
  word_count: 180
  characters: 1024
headings:
  main: Section 1
  all:
  - Main Title
  - Section 1
---

# Section 1

This is the beginning of section 1...
'''
```
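When consuming chunks downstream, you will often want to separate the embedded metadata from the Markdown body. The sketch below does this with the standard library only. The helper name `split_front_matter` is hypothetical and not part of the library, and it assumes the front matter is delimited by `---` lines as in the example above.

```python
from typing import Tuple


def split_front_matter(chunk: str) -> Tuple[str, str]:
    """Return (front_matter_text, body); front matter is '' if absent.

    Hypothetical helper for illustration; not part of markdown_chunker.
    """
    lines = chunk.lstrip("\n").splitlines()
    if not lines or lines[0].strip() != "---":
        return "", chunk
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return header, body
    return "", chunk  # no closing delimiter: treat everything as content


chunk = """---
chunk:
  id: 1
  total: 10
document:
  title: My Document
---

# Section 1

Body text.
"""

header, body = split_front_matter(chunk)
print(header)
print(body)
```

The returned header text can then be parsed with `yaml.safe_load(header)` from PyYAML, which the package already declares as a dependency.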

### Processing Large Documents

```python
from markdown_chunker import MarkdownChunkingStrategy
import os

# Create a strategy with parallel processing for large documents
strategy = MarkdownChunkingStrategy(
    parallel_processing=True,  # Enable parallel processing
    max_workers=4,             # Use 4 worker processes
    add_metadata=True          # Optionally include metadata
)

# Process a large document
with open('large_document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Save chunks to files
os.makedirs('output', exist_ok=True)
for i, chunk in enumerate(chunks):
    with open(f'output/chunk_{i+1:03d}.md', 'w', encoding='utf-8') as f:
        f.write(chunk)
```

## 📂 Sample Output

Example output directories are included in the repository:

- `examples/outputs/basic_example/`: Basic chunking with default parameters
- `examples/outputs/metadata_example/`: Chunking with embedded metadata
- `examples/outputs/custom_params_example/`: Chunking with custom size parameters
- `examples/outputs/bmw_example/`: Chunking of a large document (BMW Annual Report) with parallel processing

## 🔍 Advanced Usage

### Parallel Processing

For large documents, you can enable parallel processing to spread the work across multiple worker processes and reduce processing time:

```python
from markdown_chunker import MarkdownChunkingStrategy

strategy = MarkdownChunkingStrategy(
    parallel_processing=True,
    max_workers=4  # Number of worker processes
)
```

### Custom Content Handlers

The library is designed to be extensible. You can create custom content handlers for specialized Markdown elements:

```python
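from markdown_chunker import ContentHandler, MarkdownChunkingStrategy
from markdown_chunker.utils import is_special_element

class CustomElementHandler(ContentHandler):
    def can_handle(self, content):
        # Claim only the content this handler knows how to split
        return is_special_element(content)

    def split(self, content, max_length):
        # Placeholder splitting logic: fixed-size slices.
        # Replace this with logic suited to your element type.
        split_parts = [
            content[i:i + max_length]
            for i in range(0, len(content), max_length)
        ]
        return split_parts

# Add to strategy
strategy = MarkdownChunkingStrategy()
strategy.content_handlers.append(CustomElementHandler())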
```

## 🤝 Contributing

Contributions are welcome! Here's how you can help:

1. Fork the repository
2. Create a feature branch: `git checkout -b new-feature`
3. Make your changes and commit: `git commit -m 'Add new feature'`
4. Push to your branch: `git push origin new-feature`
5. Create a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 📚 Documentation

For complete documentation, see the [docs](docs/) directory.

## 🙏 Acknowledgements

- Inspired by the needs of AI developers working with large documents
- Built on top of the Python Markdown ecosystem
