Metadata-Version: 2.4
Name: llm-data-converter
Version: 2.1.1
Summary: Convert any document, text, or URL into LLM-ready data format with advanced intelligent document processing capabilities powered by pre-trained models
Project-URL: Homepage, https://github.com/nanonets/llm-data-converter
Project-URL: Repository, https://github.com/nanonets/llm-data-converter
Project-URL: Documentation, https://github.com/nanonets/llm-data-converter#readme
Project-URL: Issues, https://github.com/nanonets/llm-data-converter/issues
Author-email: Nanonets <team@nanonets.com>
License: MIT
License-File: LICENSE
Keywords: document-conversion,document-processing,document-understanding,image-processing,intelligent-document-processing,llm,markdown,pdf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.8
Requires-Dist: beautifulsoup4>=4.9.0
Requires-Dist: docling-ibm-models>=0.1.0
Requires-Dist: easyocr>=1.7.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: lxml>=4.6.0
Requires-Dist: markdownify>=0.11.6
Requires-Dist: numpy<2.0.0,>=1.21.0
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: pdf2image>=1.17.0
Requires-Dist: pillow>=9.0.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pypandoc>=1.11
Requires-Dist: python-docx>=0.8.11
Requires-Dist: python-pptx>=0.6.21
Requires-Dist: requests>=2.25.0
Requires-Dist: tqdm>=4.64.0
Provides-Extra: dev
Requires-Dist: black>=21.0.0; extra == 'dev'
Requires-Dist: flake8>=3.8.0; extra == 'dev'
Requires-Dist: mypy>=0.800; extra == 'dev'
Requires-Dist: pytest-cov>=2.10.0; extra == 'dev'
Requires-Dist: pytest>=6.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# LLM Data Converter

Convert any document format into LLM-ready data format (markdown) with advanced intelligent document processing capabilities powered by pre-trained models.

## Installation

```bash
pip install llm-data-converter
```

**Requirements:**
- Python 3.8 or higher

### System Dependencies for Intelligent Document Processing

For this library to work properly, you may need to install additional system dependencies:

**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools
```

**macOS:**
```bash
# Usually not needed, but if you encounter OpenGL issues:
brew install mesa
```

**Note:** The package will automatically download and cache intelligent models on first use.

## Quick Start

```python
from llm_converter import FileConverter

# Basic conversion 
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```

## Features

- **Multiple Input Formats**: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
- **Multiple Output Formats**: Markdown, HTML, JSON, Plain Text
- **LLM Integration**: Seamless integration with LiteLLM and other LLM libraries
- **Local Processing**: Process documents locally without external dependencies
- **Layout Preservation**: Maintain document structure and formatting
- **Intelligent Document Processing**: Advanced document understanding and conversion powered by pre-trained models:
  - **Layout Detection**: Intelligent models for document structure understanding
  - **Text Recognition**: High-accuracy text extraction with confidence scoring
  - **Table Structure**: Intelligent table detection and conversion to markdown format
  - **Automatic Model Download**: Models are automatically downloaded and cached


## Usage Examples

### Convert PDF to Markdown

```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```


### Convert Image to HTML

```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("sample.png").to_html()
print(result)
```

### Chain with LLM

```python
from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
```

## Supported Formats

### Input Formats
- **Documents**: PDF, DOCX, TXT
- **Web**: URLs, HTML files
- **Data**: Excel (XLSX, XLS), CSV
- **Images**: PNG, JPG, JPEG 

### Output Formats
- **Markdown**: Clean, structured markdown with proper table formatting
- **HTML**: Formatted HTML with styling
- **JSON**: Structured JSON data
- **Plain Text**: Simple text extraction


## CLI usage

The `llm-converter` command-line tool provides easy access to all conversion features:

### Basic Usage

```bash
# Convert a PDF to markdown (default)
llm-converter document.pdf

# Convert to different output formats
llm-converter document.pdf --output html
llm-converter document.pdf --output json
llm-converter document.pdf --output text


```

### Advanced Options

```bash
# Save output to file
llm-converter document.pdf --output-file output.md

# For image input
llm-converter image.png 

# Convert multiple files at once
llm-converter file1.pdf file2.docx file3.xlsx --output markdown
```

### List Supported Formats

```bash
# See all supported input formats
llm-converter --list-formats
```

### Examples

```bash
# Convert PDF to markdown
llm-converter scanned_document.pdf --output markdown

# Convert image to HTML with layout preservation
llm-converter screenshot.png --output html

# Convert multiple documents to JSON
llm-converter report.pdf presentation.pptx data.xlsx --output json --output-file combined.json

# Convert URL content to markdown
llm-converter https://blog.example.com --output markdown --output-file blog_content.md
```

### Output Formats

- **markdown** (default): Clean, structured markdown
- **html**: Formatted HTML with styling
- **json**: Structured JSON data
- **text**: Plain text extraction


## API Reference for library

### FileConverter

Main class for converting documents to LLM-ready formats.

#### Methods

- `convert(file_path: str) -> ConversionResult`: Convert a file to internal format
- `convert_url(url: str) -> ConversionResult`: Convert a URL page contents to internal format
- `convert_text(text: str) -> ConversionResult`: Convert plain text to internal format

### ConversionResult

Result object with methods to export to different formats.

#### Methods

- `to_markdown() -> str`: Export as markdown
- `to_html() -> str`: Export as HTML
- `to_json() -> dict`: Export as JSON
- `to_text() -> str`: Export as plain text


## License

MIT License - see LICENSE file for details. 