Metadata-Version: 2.4
Name: kiwi-pdf-chunker
Version: 0.2.1
Summary: A tool for parsing PDF document layouts and chunking content
Author-email: Vahan Martirosyan / Kiwi Data <vahan@kiwidata.com>
Maintainer-email: Vahan Martirosyan <vahan@kiwidata.com>
License: MIT
Project-URL: Homepage, https://github.com/NeolicenseVahan/kiwi-pdf-chunker
Project-URL: Repository, https://github.com/NeolicenseVahan/kiwi-pdf-chunker
Project-URL: Issues, https://github.com/NeolicenseVahan/kiwi-pdf-chunker
Keywords: pdf,ocr,parsing,document,layout,detection,chunking,cv
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf==1.25.3
Requires-Dist: PyPDF2==3.0.1
Requires-Dist: doclayout-yolo==0.0.3
Requires-Dist: pdf2image==1.17.0
Requires-Dist: opencv-python==4.11.0.86
Requires-Dist: huggingface_hub==0.28.1
Requires-Dist: tqdm==4.67.1
Requires-Dist: pillow==11.1.0
Requires-Dist: numpy==1.26.4
Requires-Dist: click==8.1.8
Requires-Dist: python-dotenv==1.0.1
Requires-Dist: azure-ai-documentintelligence==1.0.2
Requires-Dist: azure-core==1.33.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: openai==1.77.0
Requires-Dist: docling==2.28.4
Dynamic: license-file

# PDF Parser

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

## Features

- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- **(Optional)** Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Generate paragraph embeddings using an OpenAI embedder.
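
To make the bounding-box refinement step more concrete, here is a minimal sketch of one common approach: dropping near-duplicate detections by intersection-over-union (IoU), then sorting the survivors top to bottom. It is an illustration only. The `Box` type, the `refine_boxes` helper, and the 0.8 threshold are assumptions and not the package's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: pixel corners, a class label, and a confidence score."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0:
        return 0.0  # boxes do not overlap
    inter = w * h
    return inter / (a.area + b.area - inter)


def refine_boxes(boxes: list[Box], iou_threshold: float = 0.8) -> list[Box]:
    """Keep the highest-scoring box of each near-duplicate group,
    then sort the result top to bottom, left to right."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, other) < iou_threshold for other in kept):
            kept.append(box)
    return sorted(kept, key=lambda b: (b.y0, b.x0))
```

The simple top-to-bottom sort assumes a single-column page; multi-column layouts would need a more careful reading-order heuristic.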

## Installation

### Prerequisites

- Python 3.10+
- `pip` package manager
- (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.

### Steps

1.  **Install the Package:**
    ```bash
    pip install kiwi-pdf-chunker
    ```

## User-Provided Data

This package requires the user to provide certain data externally:

1.  **Input Directory (`input/`):** Place the PDF documents you want to process in a directory (e.g., `input/`). You must provide the path to your input file(s) when using the package.
2.  **Models Directory (`models/`):** Download the necessary YOLO model(s) (e.g., `doclayout_yolo_docstructbench_imgsz1024.pt`) and place them in a dedicated directory (e.g., `models/`). The parser needs the path to this directory (or to the specific model file).
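
Before running the parser, it can help to confirm that both directories are laid out as described above. The following is a minimal sketch using only the standard library. The helper names `find_pdfs` and `resolve_model` are hypothetical and not part of this package; the default model filename is the example given above.

```python
from pathlib import Path


def find_pdfs(input_dir: str) -> list[Path]:
    """Return the PDF files directly inside input_dir, sorted by path."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Input directory not found: {root}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def resolve_model(
    models_dir: str,
    filename: str = "doclayout_yolo_docstructbench_imgsz1024.pt",
) -> Path:
    """Return the path to the model weights, failing early if they are missing."""
    model_path = Path(models_dir) / filename
    if not model_path.is_file():
        raise FileNotFoundError(f"Model file not found: {model_path}")
    return model_path
```

Calling `find_pdfs("input/")` and `resolve_model("models/")` up front surfaces a missing file before any model loading or page rendering starts.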
