Metadata-Version: 2.4
Name: docpipe-mini
Version: 0.1.0a1
Summary: Minimal document-to-jsonl serializer with coordinates for AI
Project-URL: Homepage, https://github.com/docpipe/docpipe-mini
Project-URL: Repository, https://github.com/docpipe/docpipe-mini
Project-URL: Documentation, https://docpipe-mini.readthedocs.io
Project-URL: Issues, https://github.com/docpipe/docpipe-mini/issues
Author-email: DocPipe Team <team@docpipe.ai>
License: MIT
Keywords: ai,coordinates,document,jsonl,pdf,serialization
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Python: >=3.11
Requires-Dist: pymupdf>=1.26.5
Provides-Extra: all
Requires-Dist: mypy>=1.8.0; extra == 'all'
Requires-Dist: openpyxl>=3.1.0; extra == 'all'
Requires-Dist: pillow>=10.0.0; extra == 'all'
Requires-Dist: pypdfium2>=4.18.0; extra == 'all'
Requires-Dist: pytest-benchmark>=4.0.0; extra == 'all'
Requires-Dist: pytest>=8.0.0; extra == 'all'
Requires-Dist: python-docx>=0.8.11; extra == 'all'
Requires-Dist: rich>=13.0.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Requires-Dist: typer>=0.9.0; extra == 'all'
Requires-Dist: types-pyyaml>=6.0.0; extra == 'all'
Provides-Extra: cli
Requires-Dist: rich>=13.0.0; extra == 'cli'
Requires-Dist: typer>=0.9.0; extra == 'cli'
Provides-Extra: dev
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pypdfium2>=4.18.0; extra == 'dev'
Requires-Dist: pytest-benchmark>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: rich>=13.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: typer>=0.9.0; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0.0; extra == 'dev'
Provides-Extra: docx
Requires-Dist: python-docx>=0.8.11; extra == 'docx'
Provides-Extra: image
Requires-Dist: pillow>=10.0.0; extra == 'image'
Provides-Extra: pdf
Requires-Dist: pypdfium2>=4.18.0; extra == 'pdf'
Provides-Extra: pdf-fast
Requires-Dist: pymupdf>=1.23.0; extra == 'pdf-fast'
Provides-Extra: xlsx
Requires-Dist: openpyxl>=3.1.0; extra == 'xlsx'
Description-Content-Type: text/markdown

# docpipe-mini

**Minimal document-to-jsonl serializer with coordinates for AI**

`docpipe-mini` converts documents into JSONL (JSON Lines) with coordinate information for every chunk, ready for AI consumption. It focuses on speed, minimal dependencies, and clean output.

## 🚀 Quick Start

```bash
# Install (5 MB core, zero dependencies)
pip install docpipe-mini

# Install PDF support (+11 MB, BSD license)
pip install "docpipe-mini[pdf]"  # quotes keep shells such as zsh from expanding the brackets

# Convert document to JSONL
python -m docpipe serialize-cmd document.pdf > document.jsonl
```

## 📖 Usage

### Python API

```python
import docpipe_mini as dp

# Simple serialization
for chunk in dp.serialize("paper.pdf"):
    print(chunk.to_jsonl())
    # {"doc_id":"uuid","page":1,"x":0.1,"y":0.2,"w":0.8,"h":0.1,"type":"text","text":"...","tokens":42}

# Direct JSONL output
for line in dp.serialize_to_jsonl("paper.pdf"):
    print(line)

# List supported formats
print(dp.list_formats())
```

### Command Line

```bash
# Basic usage
python -m docpipe serialize-cmd document.pdf > output.jsonl

# Save to file
python -m docpipe serialize-cmd document.pdf -o output.jsonl

# Include images and export them
python -m docpipe serialize-cmd document.docx --include-binary --export-images ./images

# Filter content types
python -m docpipe serialize-cmd document.pdf --types text,table

# Show processing statistics
python -m docpipe serialize-cmd document.pdf --stats

# List supported formats
python -m docpipe formats

# Show system information
python -m docpipe info

# Validate document without full processing
python -m docpipe validate document.pdf
```

## ✨ Key Features

### 🖼️ Image Extraction
- **PDF Images**: Accurate extraction with coordinates using PyMuPDF
- **Word Images**: Standard library extraction from DOCX files
- **Multiple Formats**: PNG, JPEG, GIF, BMP, TIFF, WebP support
- **Export Options**: Base64 encoding in JSON or save to separate files

### 🎛️ Rich CLI Interface
- **Progress Bars**: Real-time processing progress with Rich
- **Statistics**: Detailed processing metrics and content breakdown
- **Content Filtering**: Filter by content type (text, table, image)
- **Memory Management**: Built-in memory limits and monitoring

### 📍 Coordinate-Based Ordering
- **Reading Order**: Content appears in document reading order
- **Accurate Positioning**: Normalized coordinates (0-1 range)
- **Multi-Content Support**: Text, tables, and images positioned correctly

## 📊 Output Format

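Each line is a JSON object with the fields below. The `#` annotations are explanatory only and are not part of the actual output, which is plain JSON on a single line: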

```json
{
  "doc_id": "uuid",           # Document identifier
  "page": 1,                  # Page number (1-based)
  "x": 0.123,                 # Normalized X coordinate (0-1)
  "y": 0.456,                 # Normalized Y coordinate (0-1)
  "w": 0.7,                   # Normalized width (0-1)
  "h": 0.08,                  # Normalized height (0-1)
  "type": "text",             # Content type: "text" | "table" | "image"
  "text": "...",              # Text content (null for images)
  "tokens": 42,               # Estimated token count
  "binary_data": "base64...", # Binary data for images (base64 encoded, optional)
  "binary_encoding": "base64",# Binary encoding format
  "metadata": {               # Additional metadata
    "source_file": "doc.pdf",
    "file_name": "doc.pdf",
    "file_extension": ".pdf",
    "file_size": 1048576,
    "extraction_method": "pymupdf"
  }
}
```
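
Records in this shape can be consumed with the standard library alone. The following is a minimal sketch, not part of the package: `decode_record` is a hypothetical helper, and the sample line is hand-built from the field names listed above rather than produced by `docpipe-mini`.

```python
import base64
import json
from typing import Optional, Tuple

# Hand-built sample record; the payload is just the 8-byte PNG signature.
png_header = b"\x89PNG\r\n\x1a\n"
line = json.dumps({
    "doc_id": "demo", "page": 1,
    "x": 0.1, "y": 0.2, "w": 0.3, "h": 0.4,
    "type": "image", "text": None, "tokens": 0,
    "binary_data": base64.b64encode(png_header).decode("ascii"),
    "binary_encoding": "base64",
})


def decode_record(jsonl_line: str) -> Tuple[dict, Optional[bytes]]:
    """Parse one JSONL line and decode its binary payload when present."""
    record = json.loads(jsonl_line)
    payload = None
    if record.get("binary_data") and record.get("binary_encoding") == "base64":
        payload = base64.b64decode(record["binary_data"])
    return record, payload


record, payload = decode_record(line)
print(record["type"], len(payload))
```

Using `.get()` for the binary fields keeps the helper safe for text and table records, where `binary_data` is optional.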

## 📦 Installation

### Core Installation (5 MB)
```bash
pip install docpipe-mini
```
Zero third-party dependencies. Add format support as needed.

### Optional Formats

```bash
# PDF support with pypdfium2 (BSD, +11 MB)
pip install "docpipe-mini[pdf]"

# PDF support with PyMuPDF (AGPL, recommended)
pip install "docpipe-mini[pdf-fast]"

# CLI with Rich interface (typer, +2 MB)
pip install "docpipe-mini[cli]"

# All optional dependencies
pip install "docpipe-mini[all]"
```

### Development
```bash
git clone https://github.com/docpipe/docpipe-mini
cd docpipe-mini
uv sync --extra dev
pytest
```

## 🎯 Design Goals

- **Minimal Dependencies**: Core uses only Python standard library
- **Fast Processing**: ~300 ms/MB on typical hardware
- **AI-Ready Output**: Clean JSONL with coordinates for LLM consumption
- **Type Safety**: Full type hints and mypy strict compliance
- **Memory Safe**: Built-in memory limits and lazy processing
- **Rich CLI**: Beautiful command-line interface with progress bars and statistics
- **Image Support**: Automatic image extraction with base64 encoding and file export
- **Coordinate Ordering**: Content is output in document reading order (top-to-bottom, left-to-right)
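
As a rough illustration of the top-to-bottom, left-to-right ordering goal, records can be sorted by page, then `y`, then `x`. This is a simplified sketch under the assumption that smaller `y` means higher on the page; the function name `reading_order` is hypothetical, and the package's own ordering logic may handle cases such as multi-column layouts differently.

```python
def reading_order(records: list) -> list:
    """Sort records by page, then top-to-bottom (y), then left-to-right (x)."""
    return sorted(records, key=lambda r: (r["page"], r["y"], r["x"]))


# Deliberately shuffled sample records.
records = [
    {"page": 1, "x": 0.6, "y": 0.1, "text": "B"},
    {"page": 2, "x": 0.1, "y": 0.05, "text": "D"},
    {"page": 1, "x": 0.1, "y": 0.1, "text": "A"},
    {"page": 1, "x": 0.1, "y": 0.5, "text": "C"},
]

ordered = [r["text"] for r in reading_order(records)]
print(ordered)  # ['A', 'B', 'C', 'D']
```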

## 🏗️ Architecture

```
Document → Serializer → DocumentChunk → JSONL
```

1. **Loaders**: Zero-dependency document parsers
2. **Processors**: Coordinate extraction and text chunking
3. **Output**: Standardized JSONL format
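
The flow above can be pictured with a small standard-library sketch. Everything here is hypothetical and for illustration only: `SketchChunk` and `sketch_serialize` are stand-ins, not the package's actual classes, and the "loader" simply treats each input string as one full-page text block.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Iterator, List, Optional


@dataclass
class SketchChunk:
    """Hypothetical stand-in for a chunk: position, type, and text."""
    doc_id: str
    page: int
    x: float
    y: float
    w: float
    h: float
    type: str
    text: str

    def to_jsonl(self) -> str:
        # One compact JSON object per line, as in the output format.
        return json.dumps(asdict(self), separators=(",", ":"))


def sketch_serialize(pages: List[str], doc_id: Optional[str] = None) -> Iterator[SketchChunk]:
    """Lazily yield one full-page text chunk per input page."""
    doc_id = doc_id or str(uuid.uuid4())
    for number, text in enumerate(pages, start=1):
        yield SketchChunk(doc_id, number, 0.0, 0.0, 1.0, 1.0, "text", text)


lines = [c.to_jsonl() for c in sketch_serialize(["first page", "second page"], doc_id="demo")]
for line in lines:
    print(line)
```

Yielding chunks from a generator rather than building a list mirrors the lazy-processing goal: a consumer can write each line out before the next page is parsed.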

## 📋 Supported Formats

| Format | Status | Library | License | Features |
|--------|--------|---------|---------|----------|
| PDF | ✅ | PyMuPDF | AGPL | Text, images, tables with accurate coordinates |
| DOCX | ✅ | Standard Library | MIT | Text, images with coordinate estimation |
| XLSX | 🚧 | Planned | - | Coming soon |
| Images | 🚧 | Planned | - | Coming soon |

## 🔧 Configuration

```python
import docpipe_mini as dp

# Memory limit
for chunk in dp.serialize("large.pdf", max_mem_mb=256):
    # Process with 256MB memory limit
    pass

# Custom document ID
for chunk in dp.serialize("paper.pdf", doc_id="my-paper"):
    # Use custom ID instead of UUID
    pass

# Process with image extraction
for chunk in dp.serialize("document.docx"):
    # Images are automatically extracted and base64 encoded
    if chunk.type == "image":
        print(f"Found image: {chunk.metadata['image_format']}, size: {chunk.metadata['image_size_bytes']} bytes")
```

### CLI Options

```bash
# Memory management
python -m docpipe serialize-cmd large.pdf --max-mem 256

# Content filtering
python -m docpipe serialize-cmd document.pdf --types text,table  # Only text and tables
python -m docpipe serialize-cmd document.docx --types image      # Only images

# Image handling
python -m docpipe serialize-cmd document.docx --include-binary   # Include base64 image data
python -m docpipe serialize-cmd document.docx --export-images ./img  # Export images to files

# Output control
python -m docpipe serialize-cmd document.pdf --no-jsonl  # Plain text output
python -m docpipe serialize-cmd document.pdf --stats     # Show processing statistics
```
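
Content filtering can also be applied after the fact to JSONL that has already been written. The sketch below uses only the standard library and assumes nothing beyond the `type` field from the output format; `filter_types` is a hypothetical helper, not a function shipped with the package.

```python
import json
from typing import Iterable, Iterator, Set


def filter_types(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the JSONL lines whose "type" field is in `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if json.loads(line).get("type") in wanted:
            yield line


sample = [
    '{"type":"text","text":"hello"}',
    '{"type":"image","text":null}',
    '',
    '{"type":"table","text":"a|b"}',
]

kept = list(filter_types(sample, {"text", "table"}))
print(len(kept))  # 2
```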

## 🧪 Testing

```bash
# Run tests
pytest

# Run benchmarks
pytest -m benchmark

# Type checking
mypy --strict
```

## 📈 Performance

- **Installation**: 5 MB core, zero dependencies
- **Processing**: ~300 ms/MB for PDF documents
- **Memory**: ~3x the document size at peak
- **Output**: 1-2x the input size (JSON overhead)
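
These figures translate into a quick back-of-the-envelope estimate for a given input size. The helper below is a hypothetical sketch that applies the approximate ratios listed above; actual numbers will vary with hardware and document content.

```python
MS_PER_MB = 300          # ~300 ms of processing per MB (PDF)
PEAK_MEMORY_FACTOR = 3   # ~3x document size at peak
OUTPUT_FACTORS = (1, 2)  # output is 1-2x the input size


def estimate(file_size_mb: float) -> dict:
    """Rough resource estimate derived from the approximate ratios above."""
    return {
        "seconds": MS_PER_MB * file_size_mb / 1000,
        "peak_memory_mb": PEAK_MEMORY_FACTOR * file_size_mb,
        "output_mb_range": tuple(f * file_size_mb for f in OUTPUT_FACTORS),
    }


est = estimate(20)
print(est)  # a 20 MB PDF: about 6 s, 60 MB peak memory, 20-40 MB of JSONL
```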

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure `mypy --strict` passes
5. Submit a pull request

## 📄 License

MIT License. See the [LICENSE](LICENSE) file for details.

## 🔗 Links

- [Documentation](https://docpipe-mini.readthedocs.io)
- [Repository](https://github.com/docpipe/docpipe-mini)
- [Issues](https://github.com/docpipe/docpipe-mini/issues)

---

**docpipe-mini** - Fast, minimal document serialization for AI applications.