Metadata-Version: 2.4
Name: llm-data-converter
Version: 0.4.0
Summary: Convert any document, text, or URL into LLM-ready data format
Project-URL: Homepage, https://github.com/nanonets/llm-data-converter
Project-URL: Repository, https://github.com/nanonets/llm-data-converter
Project-URL: Documentation, https://github.com/nanonets/llm-data-converter#readme
Project-URL: Issues, https://github.com/nanonets/llm-data-converter/issues
Author-email: Nanonets <team@nanonets.com>
License: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: litellm>=1.0.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: markdown>=3.4.0
Requires-Dist: markdownify>=0.11.6
Requires-Dist: opencv-python>=4.8.0
Requires-Dist: openpyxl>=3.0.10
Requires-Dist: paddleocr>=2.7.0
Requires-Dist: paddlepaddle>=2.5.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: pillow>=9.0.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pypandoc>=1.15.0
Requires-Dist: pypdf2>=3.0.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: python-docx>=0.8.11
Requires-Dist: python-pptx>=0.6.21
Requires-Dist: requests>=2.28.0
Provides-Extra: dev
Requires-Dist: black>=22.0.0; extra == 'dev'
Requires-Dist: flake8>=5.0.0; extra == 'dev'
Requires-Dist: isort>=5.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# LLM Data Converter

Convert any document, text, or URL into LLM-ready data format.

## Installation

```bash
pip install llm-data-converter
```

**Requirements:**
- Python 3.8 or higher

### System Dependencies for OCR

OCR may require additional system libraries, depending on your platform:

**Ubuntu/Debian:**
```bash
sudo apt update
# On newer releases where libgl1-mesa-glx is no longer packaged, install libgl1 instead
sudo apt install -y libgl1-mesa-glx libglib2.0-0
```

**macOS:**
```bash
# Usually not needed, but if you encounter OpenGL issues:
brew install mesa
```

**Note:** The package detects whether OpenGL is available and emits a warning if system dependencies are missing.
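
To check for OpenGL yourself before installing, something like the following may help. This is a minimal sketch using only the standard library, not the package's own detection logic; `opengl_available` is a hypothetical helper name, and the library names it tries are assumptions covering common Linux, macOS, and Windows setups.

```python
import ctypes.util


def opengl_available() -> bool:
    """Return True if the system can locate an OpenGL library.

    Hypothetical helper: this is not part of llm-data-converter.
    """
    # Common library names: "GL" (Linux), "OpenGL" (macOS), "opengl32" (Windows).
    for name in ("GL", "OpenGL", "opengl32"):
        # find_library returns a name or path, or None if nothing was found.
        if ctypes.util.find_library(name) is not None:
            return True
    return False
```

If `opengl_available()` returns `False`, installing the system packages listed above is a reasonable first step.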

## Quick Start

```python
from llm_converter import FileConverter

# Basic conversion
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```

## Features

- **Multiple Input Formats**: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
- **Multiple Output Formats**: Markdown, HTML, JSON, Plain Text
- **LLM Integration**: Seamless integration with LiteLLM and other LLM libraries
- **Local Processing**: Process documents on your own machine without relying on external services
- **Layout Preservation**: Maintain document structure and formatting

## Usage Examples

### Convert PDF to Markdown

```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```

### Convert URL to HTML

```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert_url("https://example.com").to_html()
print(result)
```

### Convert Excel to JSON

```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("data.xlsx").to_json()
print(result)
```

### Chain with LLM

```python
from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
```

## Supported Formats

### Input Formats
- **Documents**: PDF, DOCX, TXT
- **Web**: URLs, HTML files
- **Data**: Excel (XLSX, XLS), CSV
- **Images**: PNG, JPG, JPEG (with OCR capabilities)

### Output Formats
- **Markdown**: Clean, structured markdown
- **HTML**: Formatted HTML with styling
- **JSON**: Structured JSON data
- **Plain Text**: Simple text extraction

## Advanced Usage

### Custom Configuration

```python
from llm_converter import FileConverter

# No options are passed here, so the converter runs with its default settings.
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```

### Batch Processing

```python
from llm_converter import FileConverter

converter = FileConverter()
files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]

results = []
for file in files:
    result = converter.convert(file).to_markdown()
    results.append(result)
```
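
In the loop above, a single unreadable file stops the whole batch. One way to keep going is to collect failures separately, as in this sketch. `convert_batch` and its `convert_one` parameter are hypothetical names introduced for illustration; the helper takes any callable, so it makes no assumptions about which exceptions the library raises.

```python
from typing import Callable, Dict, Iterable, Tuple


def convert_batch(
    convert_one: Callable[[str], str],
    paths: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Convert each path, returning (outputs, failures) keyed by path.

    Hypothetical helper: this is not part of llm-data-converter.
    """
    outputs: Dict[str, str] = {}
    failures: Dict[str, Exception] = {}
    for path in paths:
        try:
            outputs[path] = convert_one(path)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            failures[path] = exc
    return outputs, failures
```

With the `converter` and `files` from the example above, this could be called as `convert_batch(lambda p: converter.convert(p).to_markdown(), files)`.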

## API Reference

### FileConverter

Main class for converting documents to LLM-ready formats.

#### Methods

- `convert(file_path: str) -> ConversionResult`: Convert a local file and return a `ConversionResult`
- `convert_url(url: str) -> ConversionResult`: Fetch a URL and return a `ConversionResult`
- `convert_text(text: str) -> ConversionResult`: Wrap plain text in a `ConversionResult`
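
When inputs arrive as a mix of file paths and URLs, a small dispatcher can choose between `convert` and `convert_url`. The sketch below is an assumption about how you might do that; `pick_method_name` is a hypothetical helper, and it only treats `http` and `https` sources as URLs.

```python
from urllib.parse import urlparse


def pick_method_name(source: str) -> str:
    """Return the name of the FileConverter method that fits `source`.

    Hypothetical helper: this is not part of llm-data-converter.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "convert_url"
    # Anything else, including Windows drive paths, is treated as a file path.
    return "convert"
```

Given a `FileConverter` instance named `converter`, the chosen method can then be called with `getattr(converter, pick_method_name(source))(source)`.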

### ConversionResult

Result object with methods to export to different formats.

#### Methods

- `to_markdown() -> str`: Export as markdown
- `to_html() -> str`: Export as HTML
- `to_json() -> dict`: Export as JSON
- `to_text() -> str`: Export as plain text
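
If the output format is chosen at runtime (for example from a command-line flag), the four export methods above can be looked up by name. This is a minimal sketch; `EXPORTERS` and `export` are hypothetical names, and the only assumption about the result object is that it exposes the methods listed above.

```python
from typing import Any

# Format names mapped to the ConversionResult methods listed above.
EXPORTERS = {
    "markdown": "to_markdown",
    "html": "to_html",
    "json": "to_json",
    "text": "to_text",
}


def export(result: Any, fmt: str) -> Any:
    """Call the export method on `result` that matches `fmt`.

    Hypothetical helper: this is not part of llm-data-converter.
    """
    try:
        method_name = EXPORTERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"Unsupported format: {fmt!r}") from None
    return getattr(result, method_name)()
```

For instance, `export(converter.convert("document.pdf"), "html")` would be equivalent to calling `.to_html()` on the result directly.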

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Third-Party Dependencies

This project uses several third-party libraries:

- **PaddleOCR** - Apache 2.0 License (https://github.com/PaddlePaddle/PaddleOCR)
- **PyMuPDF** - GNU Affero General Public License v3.0
- **python-docx** - MIT License
- **pandas** - BSD 3-Clause License
- **Pillow** - HPND License

All dependencies are used in accordance with their respective licenses.