Metadata-Version: 2.4
Name: pdf2tex
Version: 0.2.0
Summary: High-performance RAG-based PDF to LaTeX conversion module
Project-URL: Documentation, https://github.com/pdf2tex/pdf2tex#readme
Project-URL: Repository, https://github.com/pdf2tex/pdf2tex
Project-URL: Issues, https://github.com/pdf2tex/pdf2tex/issues
Author: PDF2TeX Team
License-Expression: MIT
Keywords: document-conversion,latex,llm,ocr,pdf,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Markup :: LaTeX
Requires-Python: >=3.11
Requires-Dist: aiofiles>=23.2.0
Requires-Dist: alembic>=1.13.0
Requires-Dist: asyncpg>=0.29.0
Requires-Dist: fastapi>=0.111.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: marker-pdf>=0.2.0
Requires-Dist: minio>=7.2.0
Requires-Dist: nougat-ocr>=0.1.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: paddleocr>=2.7.0
Requires-Dist: paddlepaddle>=2.6.0
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: pydantic-settings>=2.2.0
Requires-Dist: pydantic>=2.7.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: qdrant-client>=1.9.0
Requires-Dist: ray[default]>=2.20.0
Requires-Dist: redis>=5.0.0
Requires-Dist: rich>=13.7.0
Requires-Dist: sentence-transformers>=2.7.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: structlog>=24.1.0
Requires-Dist: tenacity>=8.3.0
Requires-Dist: torch>=2.2.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: typer>=0.12.0
Requires-Dist: uvicorn[standard]>=0.29.0
Provides-Extra: dev
Requires-Dist: ipython>=8.24.0; extra == 'dev'
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pre-commit>=3.7.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=5.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# PDF2TeX

High-performance, RAG-based PDF-to-LaTeX conversion module for large documents (2000+ pages).

## Features

- **Intelligent PDF Extraction**: Multi-path content processing with PyMuPDF, Nougat, and PaddleOCR
- **Math-First Approach**: 95%+ accuracy on mathematical content using neural equation recognition
- **RAG-Powered Generation**: Context-aware LaTeX synthesis with Hugging Face LLMs
- **Distributed Processing**: Ray-based parallel processing for high throughput
- **Chapter-Based Output**: One `.tex` file per chapter, plus a master document
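
The chapter-based layout can be pictured with a short sketch. It assumes a hypothetical naming scheme (`chapter_01.tex`, `chapter_02.tex`, ...), chapter files sitting next to the master file, and a `book` document class. None of these are confirmed by the project, and the real master document likely carries a fuller preamble.

```python
from pathlib import Path


def build_master(chapter_files: list[str], document_class: str = "book") -> str:
    """Return the text of a master .tex file that includes each chapter file."""
    lines = [
        rf"\documentclass{{{document_class}}}",
        r"\begin{document}",
    ]
    for name in chapter_files:
        # \include expects the file name without the .tex extension
        lines.append(rf"\include{{{Path(name).stem}}}")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical chapter file names, for illustration only
    print(build_master(["chapter_01.tex", "chapter_02.tex"]))
```

Keeping each chapter in its own file means a single chapter can be regenerated or compiled on its own (for example with `\includeonly`) without touching the rest.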

## Architecture

```
PDF Input → Extract + OCR → Chunk + Index → RAG + LLM → LaTeX Output
                ↓                ↓              ↓
            Ray Distributed Workers Pool
                        ↓
        Qdrant | Redis | MinIO | Postgres
```

## Quick Start

### Prerequisites

- Python 3.11+
- Docker & Docker Compose
- NVIDIA GPU (recommended)
- Hugging Face API token

### Installation

```bash
# Clone repository
git clone https://github.com/pdf2tex/pdf2tex.git
cd pdf2tex

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

# Start infrastructure
docker-compose up -d

# Run conversion
pdf2tex convert input.pdf --output ./output
```

### Configuration

Create a `.env` file:

```env
# Hugging Face
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx

# Database
POSTGRES_URL=postgresql+asyncpg://pdf2tex:password@localhost:5432/pdf2tex

# Vector Store
QDRANT_URL=http://localhost:6333

# Redis
REDIS_URL=redis://localhost:6379

# MinIO
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin

# Ray
RAY_ADDRESS=auto
```
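
Before starting a long conversion it can help to check that the `.env` file is complete. The sketch below uses only the standard library and is not the package's own settings loader. Which keys count as required is an assumption (here `RAY_ADDRESS` is treated as optional), and `parse_env` and `missing_keys` are hypothetical helper names.

```python
# ASSUMPTION: these keys are mandatory; the project may treat some as optional.
REQUIRED_KEYS = (
    "HUGGINGFACE_TOKEN",
    "POSTGRES_URL",
    "QDRANT_URL",
    "REDIS_URL",
    "MINIO_ENDPOINT",
    "MINIO_ACCESS_KEY",
    "MINIO_SECRET_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "="
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    sample = "# Redis\nREDIS_URL=redis://localhost:6379\n"
    print(missing_keys(parse_env(sample)))
```

In practice the text would come from `Path(".env").read_text()`. An empty list from `missing_keys` means every key in `REQUIRED_KEYS` has a value.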

## Usage

### CLI

```bash
# Convert a PDF
pdf2tex convert document.pdf --output ./output

# Resume a failed conversion
pdf2tex resume doc_abc123

# Check status
pdf2tex status doc_abc123
```

### API

```bash
# Start API server
uvicorn pdf2tex.api.app:app --host 0.0.0.0 --port 8000

# Submit document
curl -X POST http://localhost:8000/documents \
  -F "file=@textbook.pdf"

# Check status
curl http://localhost:8000/documents/doc_abc123
```
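
Since conversions are submitted and then checked separately, a client usually polls the status endpoint. The helper below is a minimal sketch of such a loop. The response schema is not documented here, so the `status` field and its `completed` and `failed` values are assumptions, and `wait_for_document` is a hypothetical name. The HTTP call is passed in as a function so the loop can be exercised without a running server.

```python
import time
from typing import Callable


def wait_for_document(
    fetch: Callable[[str], dict],
    doc_id: str,
    interval: float = 5.0,
    timeout: float = 3600.0,
) -> dict:
    """Poll fetch(doc_id) until the document reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        body = fetch(doc_id)
        # ASSUMPTION: the response JSON has a "status" field with these values
        if body.get("status") in {"completed", "failed"}:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{doc_id} did not finish within {timeout} s")
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for the real endpoint: reports "processing" once, then "completed"
    fake = iter([{"status": "processing"}, {"status": "completed"}])
    print(wait_for_document(lambda _doc_id: next(fake), "doc_abc123", interval=0.0))
```

Against a live server, `fetch` could be something like `lambda d: httpx.get(f"http://localhost:8000/documents/{d}").json()`, adjusted to whatever the API actually returns.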

### Python SDK

```python
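import asyncio

from pdf2tex import PDF2TeX

async def main() -> None:
    converter = PDF2TeX()
    result = await converter.convert("textbook.pdf", output_dir="./output")
    print(f"Converted {result.total_pages} pages in {result.duration}")

asyncio.run(main())  # top-level await is not valid in a plain script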
```

## Project Structure

```
pdf2tex/
├── src/pdf2tex/
│   ├── extraction/     # PDF parsing, OCR, math extraction
│   ├── chunking/       # Text splitting, chapter detection
│   ├── rag/            # Embeddings, vector store, retrieval
│   ├── generation/     # LLM integration, LaTeX synthesis
│   ├── pipeline/       # Orchestration, distributed workers
│   └── api/            # FastAPI endpoints
├── tests/
├── docker-compose.yml
└── pyproject.toml
```

## Performance

| Document Size | Processing Time | Workers |
|--------------|-----------------|---------|
| 500 pages    | ~20 min         | 10      |
| 1000 pages   | ~40 min         | 20      |
| 2000 pages   | ~72 min         | 20      |
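
To compare the rows, it helps to normalize them to throughput. The snippet below works only from the approximate figures in the table above; `pages_per_worker_minute` is a helper name introduced for illustration, and the results are rough because the timings themselves are rounded.

```python
# (pages, minutes, workers) copied from the table above
RUNS = [(500, 20, 10), (1000, 40, 20), (2000, 72, 20)]


def pages_per_worker_minute(pages: int, minutes: float, workers: int) -> float:
    """Average pages handled by one worker in one minute."""
    return pages / (minutes * workers)


if __name__ == "__main__":
    for pages, minutes, workers in RUNS:
        overall = pages / minutes
        per_worker = pages_per_worker_minute(pages, minutes, workers)
        print(f"{pages:>5} pages: {overall:.1f} pages/min, {per_worker:.2f} pages/worker-min")
```

On these numbers the 500-page and 1000-page runs both come out at 25 pages per minute overall, and the 2000-page run at roughly 28.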

## License

MIT License - see [LICENSE](LICENSE) for details.
