Metadata-Version: 2.4
Name: llmaix
Version: 0.0.3
Summary: Add your description here
Project-URL: Homepage, http://github.com/KatherLab/llmaixlib
Project-URL: Documentation, http://github.com/KatherLab/llmaixlib
Project-URL: Issues, http://github.com/KatherLab/llmaixlib/issues
License-Expression: MIT
License-File: LICENSE
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.12
Requires-Dist: click~=8.1.8
Requires-Dist: hatchling>=1.26
Requires-Dist: markitdown[docx,pdf,xls,xlsx]~=0.1.1
Requires-Dist: ocrmypdf~=16.10.1
Requires-Dist: openai~=1.77.0
Requires-Dist: pymupdf4llm~=0.0.17
Requires-Dist: pymupdf==1.25.3
Requires-Dist: pytest>=8.3.5
Provides-Extra: dev
Requires-Dist: pytest>=8.3.5; extra == 'dev'
Provides-Extra: surya
Requires-Dist: surya-ocr~=0.13.1; extra == 'surya'
Description-Content-Type: text/markdown

![Tests](https://github.com/KatherLab/llmaixlib/actions/workflows/tests.yml/badge.svg?branch=main)

# LLMAIx (v2) Library

The llmaix library contains the core functionality of the LLMAIx framework.

> [!CAUTION]
> The interface of the library is still in development and may change in the future. The library is not yet ready for production use.

## Features

- **Preprocessing**: Tools for extracting text from various file formats, including PDF, DOCX, and TXT. OCR can be applied to images and PDFs using Tesseract, surya-ocr, and other backends.

- **Information Extraction**: A wrapper that helps you get a JSON response from an LLM. All OpenAI-API-compatible models are supported.

## Installation

```bash
pip install llmaix
```

## Usage

### CLI

```bash
llmaix --help
```

### Python

**Preprocessing a PDF file without OCR:**
```python
from llmaix import preprocess_file

filename = "tests/testfiles/987462_text.pdf"

extracted_text = preprocess_file(filename)
```

**Preprocessing a PDF file with OCR:**
```python
from llmaix import preprocess_file

filename = "tests/testfiles/987462_notext.pdf"

extracted_text = preprocess_file(filename, use_ocr=True)
```
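
If it is not known in advance whether a PDF has a text layer, the two calls above can be combined into a fallback. The following is a minimal sketch, not part of the library: the helper name `extract_with_ocr_fallback` and the `min_chars` threshold are hypothetical, and it assumes that `preprocess_file` returns the extracted text as a string. The preprocessing function is passed in as an argument so the helper can be tried without the library installed.

```python
from typing import Callable


def extract_with_ocr_fallback(
    filename: str,
    preprocess: Callable[..., str],
    min_chars: int = 20,
) -> str:
    """Try plain text extraction first and fall back to OCR if little text is found.

    `preprocess` is expected to behave like `llmaix.preprocess_file`:
    it takes a filename and an optional `use_ocr` keyword argument.
    Assumption: it returns the extracted text as a string.
    """
    text = preprocess(filename)
    # A (nearly) empty result suggests a scanned document without a text layer.
    if len(text.strip()) >= min_chars:
        return text
    # Second pass with OCR enabled.
    return preprocess(filename, use_ocr=True)
```

With the library installed, this would be called as `extract_with_ocr_fallback(filename, preprocess_file)`. The threshold is arbitrary and should be tuned to the documents at hand.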

**Extracting information from a text:**

1. Provide a .env file with your OpenAI API key:
```bash
echo "OPENAI_API_KEY=your_openai_api_key" > .env
```
2. To use a custom base URL, set the `OPENAI_API_BASE` environment variable:
```bash
echo "OPENAI_API_BASE=https://your_custom_base_url/v1" >> .env
```

3. Use the `extract_info` function to extract information from a text. In this example, a pydantic model is used to define the expected output format. The output will be a JSON object.
```python
from llmaix import extract_info
from pydantic import BaseModel

extracted_text = "The KatherLab is a research group at the University of Technology Dresden, led by Prof. Jakob N. Kather."

class LabInformation(BaseModel):
    name: str
    location: str
    lead: str

extracted_info = extract_info(
    prompt=f"Extract the name, location and lead of the lab from the following text: {extracted_text}",
    llm_model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    pydantic_model=LabInformation,
)
```
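
The exact return type of `extract_info` is not specified here, so downstream code may want to normalize the result before using it. The sketch below is an assumption-laden illustration, not library API: the helper `normalize_result` and the constant `REQUIRED_FIELDS` are hypothetical, and the helper accepts a JSON string, a dict, or an object with a `model_dump()` method (as Pydantic v2 models provide). It uses only the standard library.

```python
import json
from typing import Any

# Field names taken from the LabInformation model in the example above.
REQUIRED_FIELDS = ("name", "location", "lead")


def normalize_result(result: Any) -> dict[str, Any]:
    """Return the extraction result as a plain dict with the expected fields.

    Assumption: `result` is a JSON string, a mapping, or an object
    exposing `model_dump()`; the real return type may differ.
    """
    if isinstance(result, str):
        data = json.loads(result)
    elif hasattr(result, "model_dump"):
        data = result.model_dump()
    else:
        data = dict(result)

    # Fail early if the model omitted one of the expected fields.
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    # Drop any extra keys the model may have added.
    return {field: data[field] for field in REQUIRED_FIELDS}
```

For example, `normalize_result(extracted_info)["lead"]` would then give the lab lead regardless of which of the three shapes the result has.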

## Development

Clone the repository and install the dependencies:
```bash
git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
```

## Tests

Run the tests using the following command:

```bash
uv run pytest
```

To run only the preprocessing tests with the ocrmypdf backend:
```bash
uv run pytest tests/test_preprocess.py --ocr-backend ocrmypdf
```