Metadata-Version: 2.4
Name: xgen-doc2chunk
Version: 0.2.21
Summary: Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking
Project-URL: Homepage, https://github.com/master0419/doc2chunk
Project-URL: Documentation, https://github.com/master0419/doc2chunk#readme
Project-URL: Repository, https://github.com/master0419/doc2chunk.git
Project-URL: Issues, https://github.com/master0419/doc2chunk/issues
Project-URL: Changelog, https://github.com/master0419/doc2chunk/releases
Author-email: master0419 <7slwm7@khu.ac.kr>
Maintainer-email: master0419 <7slwm7@khu.ac.kr>
License: Apache-2.0
License-File: LICENSE
Keywords: ai,chunking,document-processing,docx,hwp,langchain,llm,ocr,pdf,text-extraction,xlsx
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Python: >=3.12
Requires-Dist: beautifulsoup4==4.14.3
Requires-Dist: cachetools==6.2.4
Requires-Dist: chardet==5.2.0
Requires-Dist: docx2pdf==0.1.8
Requires-Dist: langchain-anthropic==1.3.1
Requires-Dist: langchain-aws==1.2.0
Requires-Dist: langchain-community==0.4.1
Requires-Dist: langchain-core==1.2.6
Requires-Dist: langchain-google-genai==4.1.3
Requires-Dist: langchain-openai==1.1.7
Requires-Dist: langchain-text-splitters==1.1.0
Requires-Dist: langchain==1.2.3
Requires-Dist: langgraph==1.0.5
Requires-Dist: langsmith==0.6.2
Requires-Dist: olefile==0.47
Requires-Dist: openpyxl==3.1.5
Requires-Dist: orjson==3.11.5
Requires-Dist: pandas==2.3.3
Requires-Dist: pdf2image==1.17.0
Requires-Dist: pdfminer-six==20231228
Requires-Dist: pdfplumber==0.11.5
Requires-Dist: pi-heif==1.1.1
Requires-Dist: psutil==7.0.0
Requires-Dist: pydantic-core==2.41.5
Requires-Dist: pydantic-settings==2.12.0
Requires-Dist: pydantic==2.12.5
Requires-Dist: pyhwp==0.1b15
Requires-Dist: pymupdf==1.26.5
Requires-Dist: pytesseract==0.3.13
Requires-Dist: python-docx==1.2.0
Requires-Dist: python-dotenv==1.2.1
Requires-Dist: python-multipart==0.0.20
Requires-Dist: python-pptx==1.0.2
Requires-Dist: striprtf==0.0.29
Requires-Dist: xlrd==2.0.2
Description-Content-Type: text/markdown

# xgen-doc2chunk

**xgen-doc2chunk** is a document processing library that converts raw documents into AI-understandable context. It analyzes, restructures, and normalizes content so that language models can reason over documents with higher accuracy and consistency.

## Features

- **Multi-format Support**: Processes a wide range of document formats, including:
  - PDF (with table detection, OCR fallback, and complex layout handling)
  - Microsoft Office: DOCX, DOC, PPTX, PPT, XLSX, XLS
  - Korean documents: HWP, HWPX (Hangul Word Processor)
  - Text formats: TXT, MD, RTF, CSV, HTML
  - Code files: Python, JavaScript, TypeScript, and 20+ languages

- **Intelligent Text Extraction**:
  - Preserves document structure (headings, paragraphs, lists)
  - Extracts tables as HTML with proper `rowspan`/`colspan` handling
  - Handles merged cells and complex table layouts
  - Extracts and processes inline images

- **OCR Integration**:
  - Pluggable OCR engine architecture
  - Supports OpenAI, Anthropic, Google Gemini, and vLLM backends
  - Automatic OCR fallback for scanned documents or image-based PDFs

- **Smart Chunking**:
  - Semantic text chunking with configurable size and overlap
  - Table-aware chunking that preserves table integrity
  - Protected regions for code blocks and special content

- **Metadata Extraction**:
  - Extracts document metadata (title, author, creation date, etc.)
  - Formats metadata in a structured, parseable format
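
To illustrate what "tables as HTML with proper `rowspan`/`colspan` handling" can look like, here is a minimal, self-contained sketch. It does not use the library: the `Cell` class and `render_table` function are hypothetical names invented for this example, and the library's real output may differ in structure and attributes.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    """Hypothetical cell model: text plus the number of rows/columns it spans."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def render_table(rows: list[list[Cell]]) -> str:
    """Render rows of cells as HTML, emitting span attributes only when > 1."""
    lines = ["<table>"]
    for row in rows:
        parts = []
        for cell in row:
            attrs = ""
            if cell.rowspan > 1:
                attrs += f' rowspan="{cell.rowspan}"'
            if cell.colspan > 1:
                attrs += f' colspan="{cell.colspan}"'
            parts.append(f"<td{attrs}>{escape(cell.text)}</td>")
        lines.append("<tr>" + "".join(parts) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# "Region" is merged down two rows; "Sales" is merged across two columns.
table_html = render_table([
    [Cell("Region", rowspan=2), Cell("Sales", colspan=2)],
    [Cell("Q1"), Cell("Q2")],
    [Cell("EMEA"), Cell("10"), Cell("12")],
])
print(table_html)
```

Keeping merged cells as span attributes, rather than flattening them into repeated values, lets a downstream model see which header applies to which columns.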

## Installation

```bash
pip install xgen-doc2chunk
```

Or using uv:

```bash
uv add xgen-doc2chunk
```

## Quick Start

### Basic Usage

```python
from xgen_doc2chunk import DocumentProcessor

# Create processor instance
processor = DocumentProcessor()

# Extract text from a document
text = processor.extract_text("document.pdf")
print(text)

# Extract text and chunk in one step
result = processor.extract_chunks(
    "document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

# Access chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i + 1}: {chunk[:100]}...")

# Save chunks to markdown file
result.save_to_md("output/chunks.md")
```
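
To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below implements a plain character-based sliding window. This is an assumption-laden illustration, not the library's algorithm: the README describes semantic and table-aware chunking, which this example does not attempt, and `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Small values make the overlap visible: each chunk repeats the previous
# chunk's last character.
demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

In this simplified model, the settings from the example above (`chunk_size=1000`, `chunk_overlap=200`) advance each window by 800 characters, so neighbouring chunks share 200 characters of context.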

### With OCR Processing

```python
from xgen_doc2chunk import DocumentProcessor
from xgen_doc2chunk.ocr.ocr_engine.openai_ocr import OpenAIOCREngine

# Initialize OCR engine
ocr_engine = OpenAIOCREngine(api_key="sk-...", model="gpt-4o")

# Create processor with OCR
processor = DocumentProcessor(ocr_engine=ocr_engine)

# Extract text with OCR processing enabled
text = processor.extract_text(
    "scanned_document.pdf",
    ocr_processing=True
)
```

## Supported Formats

| Category | Extensions |
|----------|------------|
| Documents | `.pdf`, `.docx`, `.doc`, `.pptx`, `.ppt`, `.hwp`, `.hwpx` |
| Spreadsheets | `.xlsx`, `.xls`, `.csv`, `.tsv` |
| Text | `.txt`, `.md`, `.rtf` |
| Web | `.html`, `.htm`, `.xml` |
| Code | `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, and more |
| Config | `.json`, `.yaml`, `.yml`, `.toml`, `.ini`, `.env` |

## Architecture

```
libs/
├── core/
│   ├── document_processor.py    # Main entry point
│   ├── processor/               # Format-specific handlers
│   │   ├── pdf_handler.py       # PDF processing with V4 engine
│   │   ├── docx_handler.py      # DOCX processing
│   │   ├── ppt_handler.py       # PowerPoint processing
│   │   ├── excel_handler.py     # Excel processing
│   │   ├── hwp_processor.py     # HWP 5.0 OLE processing
│   │   ├── hwpx_processor.py    # HWPX (ZIP/XML) processing
│   │   └── ...
│   └── functions/
│       └── img_processor.py     # Image handling utilities
├── chunking/
│   ├── chunking.py              # Main chunking interface
│   ├── text_chunker.py          # Text-based chunking
│   ├── table_chunker.py         # Table-aware chunking
│   └── page_chunker.py          # Page-based chunking
└── ocr/
    ├── base.py                  # OCR base class
    ├── ocr_processor.py         # OCR processing utilities
    └── ocr_engine/              # OCR engine implementations
        ├── openai_ocr.py
        ├── anthropic_ocr.py
        ├── gemini_ocr.py
        └── vllm_ocr.py
```

## Requirements

- Python 3.12+
- Required dependencies are automatically installed (see `pyproject.toml`)

### System Dependencies

For full functionality, you may need:

- **Tesseract OCR**: For local OCR fallback
- **LibreOffice**: For DOC/RTF conversion (optional)
- **Poppler**: For PDF image extraction

## Configuration

```python
from xgen_doc2chunk import DocumentProcessor

# Custom configuration
config = {
    "pdf": {
        "extract_images": True,
        "ocr_fallback": True,
    },
    "chunking": {
        "default_size": 1000,
        "default_overlap": 200,
    }
}

processor = DocumentProcessor(config=config)
```
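
This README does not state whether a partial `config` is merged with the library's defaults. If you maintain your own base configuration and want per-project overrides, one option is to deep-merge the dictionaries yourself before constructing the processor. The sketch below uses only the keys shown above; `BASE_CONFIG` and `merge_config` are hypothetical names.

```python
from copy import deepcopy

# Base values taken from the configuration example above.
BASE_CONFIG = {
    "pdf": {"extract_images": True, "ocr_fallback": True},
    "chunking": {"default_size": 1000, "default_overlap": 200},
}


def merge_config(base: dict, overrides: dict) -> dict:
    """Return a new dict where nested keys in `overrides` replace those in `base`."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # recurse into sections
        else:
            merged[key] = deepcopy(value)
    return merged


# Override one chunking value; every other setting keeps its base value.
config = merge_config(BASE_CONFIG, {"chunking": {"default_size": 500}})
print(config)
```

The merged dictionary can then be passed as `config=config` in the same way as the example above.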

## License

Apache License 2.0 - see [LICENSE](LICENSE) for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
