Metadata-Version: 2.4
Name: FragenAntwortLLMGPU
Version: 0.1.15
Summary: A package for processing documents and generating questions and answers using LLMs on GPU and CPU.
Author: Mehrdad Almasi, Demival Vasques, and Lars Wieneke
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyMuPDF
Requires-Dist: tokenizers
Requires-Dist: semantic-text-splitter==0.13.3
Requires-Dist: langchain
Requires-Dist: langchain_community
Requires-Dist: torchvision
Requires-Dist: torchaudio
Requires-Dist: ctransformers
Requires-Dist: transformers>=4.37.0
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# FragenAntwortLLMGPU

[![Downloads](https://static.pepy.tech/badge/FragenAntwortLLMGPU)](https://pepy.tech/project/FragenAntwortLLMGPU)

**FragenAntwortLLMGPU** is a Python package for processing PDF documents and generating **question & answer (Q&A) pairs** with LLM backends on **CPU or GPU**. The generated Q&A pairs can be used for **LLM fine-tuning**.

It supports two LLM backends:
- **Mistral (GGUF via CTransformers)** — local GGUF models
- **Qwen (Hugging Face Transformers)** — HF models like Qwen2.5 Instruct

---

## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
- [Model selection](#model-selection)
- [Features](#features)
- [Contributing](#contributing)
- [License](#license)
- [Authors](#authors)

---

## Installation

### Install the package
```bash
pip install FragenAntwortLLMGPU
```

### Notes on PyTorch / GPU
This project uses **PyTorch**. Install a PyTorch build that matches your system (CPU or your CUDA version).  
If you already have PyTorch installed, you can skip this.

PyTorch install guide:
- https://pytorch.org/get-started/locally/
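
For example, a CUDA 12.1 build can be installed from the official PyTorch index (swap the `cu121` tag for the tag matching your CUDA version, or use the CPU index if you have no GPU):
```bash
# CUDA 12.1 build
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CPU-only alternative
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```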

### Optional backend dependencies

#### Qwen (Transformers backend)
```bash
pip install transformers accelerate
```

#### Mistral GGUF (CTransformers backend)
```bash
pip install ctransformers
```

---

## Usage

### Python example
```python
from FragenAntwortLLMGPU import DocumentProcessor

processor = DocumentProcessor(
    book_path="/path/to/your/book/",      # directory containing the PDF
    temp_folder="/path/to/temp/folder",   # scratch space for intermediate files
    output_file="/path/to/output/QA.jsonl",  # Q&A pairs are written here as JSON Lines
    book_name="example.pdf",              # PDF file name inside book_path
    start_page=9,                         # first page to process
    end_page=77,                          # last page to process
    gpu_layers=100,                       # model layers to offload to the GPU
    number_Q_A="five",                    # how many Q&A pairs to generate, as a written number ("one", "two", ...)
    target_information="foods and locations",  # topic the questions should focus on
    max_new_tokens=1000,                  # cap on generated tokens per response
    temperature=0.1,                      # low temperature for more deterministic output
    context_length=2100,                  # model context window
    max_tokens_chunk=800,                 # maximum tokens per text chunk
    arbitrary_prompt="",                  # optional custom prompt (empty string = default)
    model="mistral",                      # backend: "mistral" (default) or "qwen"
)

processor.process_book()
processor.generate_prompts()
processor.save_to_jsonl()
```
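
After `save_to_jsonl()` completes, the output file contains one JSON record per line. A minimal sketch for inspecting it (the exact field names inside each record depend on the package version, so this simply prints whatever is stored):
```python
import json

# JSON Lines: each line of the output file is one standalone JSON record.
with open("/path/to/output/QA.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)  # field names depend on the FragenAntwortLLMGPU version
```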

---

## Model selection

### Default (Mistral GGUF via CTransformers)
The default backend loads local GGUF models through CTransformers.

Example model source (GGUF):
- https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF
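
Whether `DocumentProcessor` downloads the model itself or expects a local file depends on your setup; if you want a local copy, one option is the Hugging Face CLI (assuming `huggingface_hub` is installed; the quantization filename below is one of the variants published in that repository):
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF \
  mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir ./models
```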

Use:
```python
processor = DocumentProcessor(..., model="mistral")
```

### Qwen (Hugging Face Transformers)
Use:
```python
processor = DocumentProcessor(
    ...,
    model="qwen",
    # optionally override the HF model id
    hf_model_id="Qwen/Qwen2.5-7B-Instruct",
)
```

---

## Features
- Extracts text from PDF documents
- Splits text into manageable chunks for processing
- Generates Q&A pairs focused on the target information you specify
- Supports custom prompts for question generation
- Runs on CPU and GPU (depending on backend and installation)
- Multilingual input: accepts PDF books in French, German, or English and generates Q&A pairs in English

---

## Contributing
Contributions are welcome! Please fork the repository and submit pull requests.

---

## License
This project is licensed under the **MIT License**. See the `LICENSE` file for details.

---

## Authors
Mehrdad Almasi, Lars Wieneke, and Demival Vasques Filho
