Metadata-Version: 2.4
Name: pdf2mcq
Version: 1.2.1
Summary: Convert PDF files (text, scanned, mixed) into MCQ questions using AI
License-Expression: MIT
Project-URL: Homepage, https://github.com/manjur-ai/pdf2mcq
Project-URL: Issues, https://github.com/manjur-ai/pdf2mcq/issues
Keywords: mcq,quiz,ai,education,pdf,ocr,llm,openrouter
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Education
Classifier: Topic :: Text Processing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.25
Requires-Dist: openai>=1.30
Requires-Dist: pymupdf>=1.24
Requires-Dist: Pillow>=10.0
Requires-Dist: pytesseract>=0.3
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# pdf2mcq

Convert PDF files (text PDFs, scanned books, and mixed documents) into multiple-choice questions (MCQs) using AI.

pdf2mcq is the PDF pipeline from **html2mcq**, extracted into a standalone library focused purely on PDF-to-MCQ generation.

---

## Features

- **Smart PDF detection** — automatically detects text PDFs, scanned PDFs, and mixed documents
- **Text PDFs** — fast extraction via PyMuPDF with chunking at sentence boundaries
- **Scanned PDFs** — renders pages as images → vision API OCR (or pytesseract fallback)
- **Mixed PDFs** — text pages via PyMuPDF + scanned pages via OCR, combined intelligently
- **Multiple AI providers:** OpenRouter, Anthropic, OpenAI, Ollama
- **Auto model failover** for MCQ generation
- **CLI & Python API**
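
The detection and chunking steps listed above can be pictured with a small sketch. This is not pdf2mcq's actual implementation: `classify_pdf`, `chunk_at_sentences`, the `min_chars` threshold, and the `max_chars` limit are hypothetical names and values chosen for illustration.

```python
import re


def classify_pdf(page_char_counts, min_chars=50):
    """Label a document from the number of extractable characters on each page.

    Assumption: a page with fewer than `min_chars` characters is treated as scanned.
    """
    if not page_char_counts:
        raise ValueError("document has no pages")
    has_text = [count >= min_chars for count in page_char_counts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "mixed"


# Split after ., ! or ? when followed by whitespace (a deliberately simple rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_at_sentences(text, max_chars=2000):
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut.
    """
    chunks, current = [], ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(classify_pdf([1200, 0, 950]))
print(chunk_at_sentences("One. Two. Three.", max_chars=9))
```

With PyMuPDF, the per-page counts could plausibly come from `len(page.get_text())` for each page of the opened document; the threshold that separates a text page from a scanned one is a tuning choice.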

---

## Quick Start

### CLI

```bash
# Single PDF
pdf2mcq --pdf-path textbook.pdf -n 10

# Multiple PDF URLs
pdf2mcq --pdf-url https://example.com/chapter1.pdf --pdf-url https://example.com/chapter2.pdf

# Process every PDF in a folder
pdf2mcq --pdf-folder ./textbooks/

# Output as JSON
pdf2mcq --pdf-path notes.pdf -o questions.json --format json
```

### Python API

```python
from pdf2mcq import PDFMCQGenerator

gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    provider="openrouter",
    mcq_model="google/gemini-2.5-flash-lite",
)

# From local PDF
mcq = gen.from_pdf_paths("textbook.pdf", n=5)
print(mcq.to_pretty_str())

# From URL
mcq = gen.from_pdf_urls("https://example.com/notes.pdf", n=3)
print(mcq.to_json())

# Multiple PDFs
mcq = gen.from_pdf_paths(["chapter1.pdf", "chapter2.pdf", "chapter3.pdf"])
```

### Custom Instructions

```python
mcq = gen.from_pdf_paths(
    "lecture-notes.pdf",
    n=10,
    difficulty_mix="50% easy, 50% hard",
    focus_topics=["machine learning", "neural networks"],
    custom_instructions="Focus on mathematical derivations",
)
```

### Auto Model Selection

```python
gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    mcq_model="auto",
    mcq_model_list=[
        "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
        "google/gemma-4-31b-it:free",
    ],
)
```
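
Conceptually, failover over a priority list means trying each model in order and keeping the first successful response. The sketch below illustrates that idea only; `generate_with_failover` and `fake_call` are hypothetical helpers, not part of the pdf2mcq API, and the library's real retry rules may differ.

```python
def generate_with_failover(models, call_model):
    """Try each model in priority order; return (model, result) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model):
    # Stand-in for a provider request; the first model simulates an outage.
    if model == "model-a":
        raise TimeoutError("simulated outage")
    return f"response from {model}"


print(generate_with_failover(["model-a", "model-b"], fake_call))
```

Collecting every error before raising makes it easier to see why each model in the list was skipped.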

### Environment Variables

| Variable | Purpose |
|---|---|
| `OPENROUTER_API_KEY` | Default API key for OpenRouter |
| `ANTHROPIC_API_KEY` | API key for Anthropic |
| `OPENAI_API_KEY` | API key for OpenAI |
| `PDF2MCQ_MCQ_MODELS` | Comma-separated MCQ model priority list for `mcq_model="auto"` |
| `PDF2MCQ_OCR_MODELS` | Comma-separated OCR model priority list for scanned PDFs |
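
As a rough illustration of how a comma-separated priority list might be read from one of these variables, here is a minimal sketch. The helper `models_from_env` is hypothetical, and whether pdf2mcq trims whitespace or skips empty entries in the same way is an assumption.

```python
import os


def models_from_env(var_name, default=()):
    """Read a comma-separated model list from the environment, in priority order."""
    raw = os.environ.get(var_name, "")
    # Trim whitespace and drop empty entries such as a trailing comma.
    models = [name.strip() for name in raw.split(",") if name.strip()]
    return models or list(default)


os.environ["PDF2MCQ_MCQ_MODELS"] = "google/gemma-4-31b-it:free, google/gemini-2.5-flash-lite"
print(models_from_env("PDF2MCQ_MCQ_MODELS"))
```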

---

## Output Format

```python
# Pretty-print
print(mcq.to_pretty_str())

# JSON
print(mcq.to_json())
# {
#   "total_exam_time": 20,
#   "questions": [
#     {
#       "question_html": "What is gradient descent?",
#       "options": ["...", "...", "...", "..."],
#       "answers": [0],
#       "multi": false,
#       "marks": 1.0,
#       "negative_marks": 0.25,
#       "difficulty": "easy",
#       "explaination": "..."
#     }
#   ]
# }
```
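
The JSON can be consumed with the standard library alone. The sketch below scores one attempt using the `answers`, `marks`, and `negative_marks` fields shown above; `score_attempt` is a hypothetical helper, and the rule that a wrong answer costs `negative_marks` while a skipped question costs nothing is an assumption about how the fields are meant to be used.

```python
import json


def score_attempt(mcq_json, chosen):
    """Score one attempt; `chosen` maps a question index to the selected option indexes."""
    data = json.loads(mcq_json)
    total = 0.0
    for index, question in enumerate(data["questions"]):
        picked = chosen.get(index)
        if picked is None:
            continue  # unanswered: no marks gained or lost (assumption)
        if sorted(picked) == sorted(question["answers"]):
            total += question["marks"]
        else:
            total -= question["negative_marks"]
    return total


# Hand-written sample in the documented shape, not real library output.
SAMPLE = json.dumps({
    "total_exam_time": 20,
    "questions": [
        {"answers": [0], "multi": False, "marks": 1.0, "negative_marks": 0.25},
        {"answers": [1], "multi": False, "marks": 1.0, "negative_marks": 0.25},
    ],
})

print(score_attempt(SAMPLE, {0: [0], 1: [2]}))
```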

---

## Installation

```bash
pip install pdf2mcq
```

Requires **PyMuPDF** (fitz) — installed automatically as a dependency.

The pytesseract OCR fallback for scanned PDFs also needs the [Tesseract](https://github.com/tesseract-ocr/tesseract) binary installed on your system; `pip` installs only the Python wrapper.
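
A quick way to check whether the Tesseract binary is reachable is to look for it on `PATH`. This is a small sketch; `tesseract_available` is a hypothetical helper, and it assumes the executable is named `tesseract`, which is the command pytesseract calls by default.

```python
import shutil


def tesseract_available():
    """Return True when a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


if not tesseract_available():
    print("Tesseract not found; install it before relying on the pytesseract fallback.")
```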

---

## License

MIT
