Metadata-Version: 2.4
Name: markitdown-chunker
Version: 0.1.0
Summary: Convert documents to markdown, chunk them intelligently, and export structured data
Home-page: https://github.com/Naveenkumarar/markitdown-chunker
Author: Naveen Kumar Rajarajan
Author-email: Naveen Kumar Rajarajan <rajarajannaveenkumar@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Naveenkumarar/markitdown-chunker
Project-URL: Bug Reports, https://github.com/Naveenkumarar/markitdown-chunker/issues
Project-URL: Source Code, https://github.com/Naveenkumarar/markitdown-chunker
Keywords: markdown,converter,chunker,document-processing,langchain,markitdown
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: markitdown>=0.0.1
Requires-Dist: langchain>=0.1.0
Requires-Dist: langchain-text-splitters>=0.0.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: images
Requires-Dist: pymupdf>=1.23.0; extra == "images"
Requires-Dist: python-docx>=0.8.11; extra == "images"
Requires-Dist: python-pptx>=0.6.21; extra == "images"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# MarkitDown Chunker

A Python package that converts documents to markdown, splits them into structure-aware chunks, and exports the chunks with their metadata as JSON. It is built as an add-on to the [markitdown](https://github.com/microsoft/markitdown) package and uses LangChain text splitters for chunking.

## ✨ Features

- 📄 **Multi-format Support**: Convert PDF, DOCX, PPTX, XLSX, HTML, RTF, ODT, and more to markdown
- 🖼️ **Image Extraction**: Automatically extract images from PDF, DOCX, and PPTX files ([requires optional dependencies](docs/IMAGE_EXTRACTION.md))
- 🎨 **Image Summarization**: Optional AI-powered image descriptions for better context
- ✂️ **Smart Chunking**: Markdown-aware text splitting that respects document structure
- 📊 **Structured Export**: Export chunks with metadata to JSON format
- 🔧 **Flexible Pipeline**: Run individual steps or the complete pipeline as needed
- 🎯 **CLI & Python API**: Use from the command line or integrate into your Python applications

## 📦 Installation

### Basic Installation

```bash
pip install markitdown-chunker
```

### With Image Extraction Support

To extract images from PDF, DOCX, and PPTX files:

```bash
pip install "markitdown-chunker[images]"
```

See [Image Extraction Guide](docs/IMAGE_EXTRACTION.md) for details.

### From Source

```bash
git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e .
# Or with image support:
pip install -e ".[images]"
```

## 🚀 Quick Start

### Command Line Interface

```bash
# Convert, chunk, and export (full pipeline)
markitdown-chunker input.pdf output/

# Convert only
markitdown-chunker document.docx output/ --convert-only

# Chunk existing markdown
markitdown-chunker document.md output/ --chunk-only

# Custom chunk size and overlap
markitdown-chunker input.pdf output/ --chunk-size 2000 --overlap 400

# List supported formats
markitdown-chunker --list-formats
```

### Python API

#### Complete Pipeline

```python
from markitdown_chunker import MarkitDownProcessor

# Initialize processor with custom settings
processor = MarkitDownProcessor(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True
)

# Process a document (all steps)
result = processor.process(
    file_path="document.pdf",
    output_dir="output/"
)

print(f"Markdown saved to: {result['conversion']['markdown_path']}")
print(f"Created {len(result['chunking']['chunks'])} chunks")
print(f"JSON exported to: {result['export']['json_path']}")
```

#### Step-by-Step Processing

```python
from markitdown_chunker import MarkdownConverter, DocumentChunker, JSONExporter

# Step 1: Convert to Markdown
converter = MarkdownConverter()
conversion_result = converter.convert(
    file_path="document.pdf",
    output_dir="output/",
    save_images=True
)

# Step 2: Chunk the markdown
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = chunker.chunk_file(
    markdown_path=conversion_result['markdown_path']
)

# Step 3: Export to JSON
exporter = JSONExporter()
json_path = exporter.export(
    chunks=chunks,
    output_path="output/chunks.json"
)
```

## 📚 Supported File Formats

- **Documents**: PDF, DOCX, DOC, RTF, ODT, TXT, MD
- **Presentations**: PPTX, PPT, ODP
- **Spreadsheets**: XLSX, XLS, ODS
- **Web**: HTML, HTM

> **Note**: Audio/video files (MP3, MP4, etc.) require ffmpeg. See [docs/FFMPEG_AUDIO.md](docs/FFMPEG_AUDIO.md) for details.
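To check a file before handing it to the pipeline, the list above can be mirrored in a small lookup. This is a minimal sketch: `SUPPORTED_EXTENSIONS` and `format_category` are hypothetical names, the extensions are copied from this README, and the package's own registry (see `--list-formats`) may differ.

```python
from pathlib import Path

# Extensions copied from the list above; the package's own registry may differ.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".docx", ".doc", ".rtf", ".odt", ".txt", ".md"},
    "presentations": {".pptx", ".ppt", ".odp"},
    "spreadsheets": {".xlsx", ".xls", ".ods"},
    "web": {".html", ".htm"},
}


def format_category(file_path):
    """Return the category for a file's extension, or None if it is not listed."""
    suffix = Path(file_path).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


print(format_category("Report.PDF"))  # documents
print(format_category("clip.mp4"))    # None
```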

## 📂 Output Directory Structure

After processing a document, the output directory will contain:

```
output/
├── document.md                    # Converted markdown file
├── document_chunks.json           # Chunks with metadata and statistics
└── images/                        # Extracted images (if any)
    ├── page1_img1.png
    ├── page2_img1.jpg
    ├── page3_img1.png
    └── page3_img2.jpg
```
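The layout above suggests that output names are derived from the input file's stem. The sketch below predicts those paths under that assumption; `expected_outputs` is a hypothetical helper, not part of the package, and the real pipeline result remains the authoritative source for paths.

```python
from pathlib import Path


def expected_outputs(input_file, output_dir):
    """Predict output paths, assuming the naming shown in the tree above."""
    stem = Path(input_file).stem
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "chunks_json": out / f"{stem}_chunks.json",
        "images_dir": out / "images",  # only present if images were extracted
    }


paths = expected_outputs("reports/document.pdf", "output/")
for label, path in paths.items():
    print(f"{label}: {path}")
```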

### Example Output Files

**`document.md`** - Markdown conversion with image references:
```markdown
# Document Title

Document content converted to markdown format...

## Extracted Images

![Page 1 Image 1](images/page1_img1.png)

![Page 2 Image 1](images/page2_img1.jpg)
```

**`document_chunks.json`** - Structured chunk data:
```json
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Document content chunk...",
      "metadata": {
        "Header 1": "Introduction",
        "chunk_index": 0,
        "source_file": "output/document.md"
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 36540,
    "avg_chunk_size": 870.0,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-10T10:30:45.123456"
}
```

**`images/`** - Extracted images with organized naming:
- PDF images: `page{N}_img{M}.{ext}` (e.g., `page1_img1.png`)
- DOCX images: `docx_img{N}.{ext}` (e.g., `docx_img1.jpg`)
- PPTX images: `slide{N}_img{M}.{ext}` (e.g., `slide1_img1.png`)

> 💡 **Tip**: The images directory is only created if the document contains images and `save_images=True` (default).
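Because the file names encode where each image came from, they can be parsed back into structured fields. The following is a sketch based only on the three naming patterns listed above; `parse_image_name` and the `container` field name are hypothetical.

```python
import re

# Patterns follow the naming conventions listed above.
IMAGE_NAME_PATTERNS = [
    ("pdf", re.compile(r"^page(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$")),
    ("pptx", re.compile(r"^slide(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$")),
    ("docx", re.compile(r"^docx_img(?P<index>\d+)\.(?P<ext>\w+)$")),
]


def parse_image_name(filename):
    """Return source type, page/slide number, image index, and extension."""
    for source, pattern in IMAGE_NAME_PATTERNS:
        match = pattern.match(filename)
        if match:
            groups = match.groupdict()
            container = groups.get("container")  # DOCX names carry no page number
            return {
                "source": source,
                "container": int(container) if container else None,
                "index": int(groups["index"]),
                "ext": groups["ext"],
            }
    return None  # the name does not follow any documented pattern


print(parse_image_name("page3_img2.jpg"))
print(parse_image_name("docx_img1.jpg"))
```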

## 🎛️ Configuration Options

### Chunking Parameters

```python
processor = MarkitDownProcessor(
    chunk_size=1000,           # Maximum characters per chunk
    chunk_overlap=200,          # Overlap between consecutive chunks
    use_markdown_splitter=True, # Use markdown-aware splitting
    json_indent=2              # JSON formatting
)
```
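To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a naive fixed-window splitter. It is not how the package or LangChain splits text (those respect separators and markdown structure); `sliding_window_chunks` is a hypothetical function that only illustrates the arithmetic.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("x" * 2500, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With the default values each window advances by 800 characters, so a larger overlap yields more chunks for the same input.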

### Processing Options

```python
result = processor.process(
    file_path="input.pdf",
    output_dir="output/",
    save_images=True,                    # Save extracted images
    include_image_summaries=False,       # Set True to add image summaries to chunks
    image_summarizer=my_summarizer_func, # Your own callable: image path -> summary text
    skip_conversion=False,               # Set True if the input is already markdown
    skip_chunking=False,                 # Set True to convert only
    skip_export=False                    # Set True to skip the JSON export
)
```

## 🔬 Advanced Usage

### Custom Image Summarization

```python
def summarize_image(image_path: str) -> str:
    """Your custom image summarization logic."""
    # Example: call a vision AI model.
    # `my_vision_model` is a placeholder; substitute your own module.
    from my_vision_model import analyze_image
    return analyze_image(image_path)

processor = MarkitDownProcessor()
result = processor.process(
    file_path="document.pdf",
    output_dir="output/",
    include_image_summaries=True,
    image_summarizer=summarize_image
)
```
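If no vision model is at hand, a placeholder with the same `(image_path: str) -> str` signature can be used to exercise the pipeline. This is a sketch only: `file_info_summarizer` is a hypothetical name, and it describes the file rather than the image contents.

```python
import os


def file_info_summarizer(image_path: str) -> str:
    """Placeholder summarizer: describes the file instead of its contents."""
    name = os.path.basename(image_path)
    try:
        size_kb = os.path.getsize(image_path) / 1024
    except OSError:
        # Missing or unreadable file: still return a usable string.
        return f"Image {name} (file not readable)"
    return f"Image {name} ({size_kb:.1f} KB)"


print(file_info_summarizer("output/images/page1_img1.png"))
```

It can be passed as `image_summarizer=file_info_summarizer` in the call shown above.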

### Batch Processing

```python
processor = MarkitDownProcessor()

files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
results = processor.process_batch(
    file_paths=files,
    output_dir="output/"
)

for result in results:
    if "error" in result:
        print(f"Failed: {result['input_file']} - {result['error']}")
    else:
        print(f"Success: {result['input_file']}")
```
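For larger batches it can help to collect successes and failures rather than print them one by one. The sketch below assumes only what the loop above relies on: each result carries `input_file`, and failed results carry `error`. `summarize_batch` is a hypothetical helper and the sample data is hand-written.

```python
def summarize_batch(results):
    """Split batch results into successes and failures keyed by input file."""
    succeeded = [r["input_file"] for r in results if "error" not in r]
    failed = {r["input_file"]: r["error"] for r in results if "error" in r}
    return {"succeeded": succeeded, "failed": failed}


# Hand-written sample shaped like the entries the loop above reads.
sample = [
    {"input_file": "doc1.pdf"},
    {"input_file": "doc2.docx", "error": "file not found"},
]
report = summarize_batch(sample)
print(f"{len(report['succeeded'])} succeeded, {len(report['failed'])} failed")
```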

### Individual Step Processing

```python
processor = MarkitDownProcessor()

# Only convert to markdown
conversion = processor.convert_only(
    file_path="document.pdf",
    output_dir="output/"
)

# Only chunk existing markdown
chunks = processor.chunk_only(
    markdown_path="document.md"
)

# Only export chunks
processor.export_only(
    chunks=chunks,
    output_path="output/chunks.json",
    source_info={"source": "document.md"}
)
```

### Custom Markdown Header Splitting

```python
from markitdown_chunker import DocumentChunker

chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True,
    headers_to_split_on=[
        ("#", "Title"),
        ("##", "Section"),
        ("###", "Subsection"),
        ("####", "Paragraph")
    ]
)

chunks = chunker.chunk_file("document.md")
```
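To show what splitting on headers means in practice, here is a simplified pure-Python stand-in. The package delegates this work to LangChain's text splitters, so `split_by_headers` is a hypothetical illustration rather than the actual implementation. It ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

HEADERS_TO_SPLIT_ON = [("#", "Title"), ("##", "Section"), ("###", "Subsection")]


def split_by_headers(markdown_text, headers_to_split_on=HEADERS_TO_SPLIT_ON):
    """Group lines under their nearest headers and record them as metadata."""
    labels = {len(marker): label for marker, label in headers_to_split_on}
    header_re = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
    sections = []
    active = {}  # header level -> header text currently in effect
    body = []    # lines collected since the last recognised header

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {labels[lvl]: active[lvl] for lvl in sorted(active)}
            sections.append({"text": text, "metadata": metadata})
        body.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        level = len(match.group(1)) if match else None
        if level in labels:
            flush()
            # Drop headers at the same or deeper level before recording this one.
            for stale in [lvl for lvl in active if lvl >= level]:
                del active[stale]
            active[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nintro\n## Setup\nsteps\n## Usage\nrun it\n"
for section in split_by_headers(doc):
    print(section["metadata"], "->", section["text"])
```

Headers deeper than the configured levels stay in the section body, which is why the choice of `headers_to_split_on` controls chunk granularity.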

## 📤 Output Format

### JSON Structure

```json
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "output_dir": "output/",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Chunk content here...",
      "metadata": {
        "Header 1": "Introduction",
        "Header 2": "Overview",
        "sub_chunk_index": 0,
        "total_sub_chunks": 1,
        "source_file": "output/document.md",
        "chunk_size_config": 1000,
        "chunk_overlap_config": 200
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 36540,
    "avg_chunk_size": 870.0,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-09T10:30:45.123456"
}
```
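The `statistics` block can be recomputed from the chunk texts, which is a quick sanity check on an export. This sketch assumes that sizes are plain character counts of each chunk's `text` field; `chunk_statistics` is a hypothetical helper and the inline sample stands in for a real exported file.

```python
import json


def chunk_statistics(export):
    """Recompute the statistics block from the chunk texts in an export dict."""
    sizes = [len(chunk["text"]) for chunk in export["chunks"]]
    if not sizes:
        return {"total_characters": 0, "avg_chunk_size": 0.0,
                "min_chunk_size": 0, "max_chunk_size": 0}
    return {
        "total_characters": sum(sizes),
        "avg_chunk_size": round(sum(sizes) / len(sizes), 2),
        "min_chunk_size": min(sizes),
        "max_chunk_size": max(sizes),
    }


# Inline sample so the sketch runs without an exported file.
sample = json.loads("""
{
  "chunks": [
    {"text": "First chunk.", "metadata": {"Header 1": "Introduction"}},
    {"text": "Second, longer chunk.", "metadata": {"Header 1": "Usage"}}
  ],
  "total_chunks": 2
}
""")
print(chunk_statistics(sample))
```

For a real export, replace the inline sample with `json.load` on the exported file.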

## 🛠️ CLI Reference

```bash
markitdown-chunker [-h] [--convert-only | --chunk-only | --no-export]
                    [--chunk-size CHUNK_SIZE] [--overlap OVERLAP]
                    [--no-markdown-splitter] [--no-images]
                    [--include-image-summaries] [--json-indent JSON_INDENT]
                    [--list-formats] [--version] [-v]
                    input output

Positional Arguments:
  input                 Input file path
  output                Output directory

Optional Arguments:
  -h, --help            Show help message
  --convert-only        Only convert to markdown
  --chunk-only          Only chunk existing markdown
  --no-export           Skip JSON export
  --chunk-size SIZE     Maximum chunk size (default: 1000)
  --overlap SIZE        Chunk overlap (default: 200)
  --no-markdown-splitter
                        Disable markdown-aware splitting
  --no-images           Don't save extracted images
  --include-image-summaries
                        Add image summaries to chunks
  --json-indent N       JSON indentation (default: 2)
  --list-formats        List supported formats
  --version             Show version
  -v, --verbose         Enable verbose output
```
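The usage line above can be read as an `argparse` definition: the bracketed `--convert-only | --chunk-only | --no-export` group is mutually exclusive, and the remaining flags are independent. The sketch below is a reconstruction for illustration, not the package's own parser. It omits `--list-formats` and `--version`, which the real CLI accepts without positional arguments.

```python
import argparse


def build_parser():
    """Rebuild the documented usage with argparse (not the package's own parser)."""
    parser = argparse.ArgumentParser(prog="markitdown-chunker")
    mode = parser.add_mutually_exclusive_group()  # at most one mode flag
    mode.add_argument("--convert-only", action="store_true")
    mode.add_argument("--chunk-only", action="store_true")
    mode.add_argument("--no-export", action="store_true")
    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--overlap", type=int, default=200)
    parser.add_argument("--no-markdown-splitter", action="store_true")
    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--include-image-summaries", action="store_true")
    parser.add_argument("--json-indent", type=int, default=2)
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("input")
    parser.add_argument("output")
    return parser


args = build_parser().parse_args(["input.pdf", "output/", "--chunk-size", "2000"])
print(args.chunk_size, args.overlap, args.convert_only)  # 2000 200 False
```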

## 🧪 Development

### Setup Development Environment

```bash
git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e ".[dev]"
```

### Run Tests

```bash
pytest tests/
pytest --cov=markitdown_chunker tests/
```

### Formatting, Linting, and Type Checking

```bash
black markitdown_chunker/
flake8 markitdown_chunker/
mypy markitdown_chunker/
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Built on top of [markitdown](https://github.com/microsoft/markitdown) by Microsoft
- Uses [LangChain](https://github.com/langchain-ai/langchain) text splitters for intelligent chunking

## 📞 Support

- 🐛 [Report a bug](https://github.com/Naveenkumarar/markitdown-chunker/issues)
- 💡 [Request a feature](https://github.com/Naveenkumarar/markitdown-chunker/issues)
- 📖 [Documentation](https://github.com/Naveenkumarar/markitdown-chunker)
- 🖼️ [Image Extraction Guide](docs/IMAGE_EXTRACTION.md)
- 🎵 [Audio/Video Processing Guide](docs/FFMPEG_AUDIO.md)

## 🗺️ Roadmap

- [ ] Support for more document formats
- [ ] Advanced chunking strategies (semantic, sentence-based)
- [ ] Integration with vector databases
- [ ] Web UI for document processing
- [ ] Cloud storage integration (S3, GCS, Azure)
- [ ] Parallel batch processing
- [ ] Custom output formats (CSV, Parquet, etc.)

---

Made with ❤️ by the MarkitDown Chunker community

