Metadata-Version: 2.4
Name: pdf2mj
Version: 0.1.1
Summary: Convert PDF documents to Markdown and structured JSON for RAG and LLM pipelines
Project-URL: Homepage, https://github.com/Ronit-Pai/pdf2mj
Project-URL: Documentation, https://github.com/Ronit-Pai/pdf2mj#readme
Project-URL: Repository, https://github.com/Ronit-Pai/pdf2mj
Author: PDF2MJ Contributors
License-Expression: MIT
Keywords: document-conversion,json,llm,markdown,pdf,rag
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Requires-Dist: pandas>=2.0.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: platformdirs>=4.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pymupdf4llm>=0.0.17
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: rich>=13.7.0
Requires-Dist: typer>=0.12.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=5.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Provides-Extra: ocr
Requires-Dist: opencv-python-headless>=4.8.0; extra == 'ocr'
Requires-Dist: pytesseract>=0.3.10; extra == 'ocr'
Description-Content-Type: text/markdown

# PDF2MJ

Convert PDF documents to **Markdown** and **structured JSON** for RAG pipelines, LLM preprocessing, and knowledge bases.

## Installation (For Users)

**PyPI:** `pdf2mj` is not published on PyPI yet. Until it is, install from source (see [Development Setup](#development-setup-for-contributors)).

Once the package is available on PyPI:

```bash
pip install pdf2mj
```

With OCR support:

```bash
pip install "pdf2mj[ocr]"
```

### OCR Requirements

OCR is optional and requires:

* Tesseract OCR installed on your system
* OCR extras installed via:

```bash
pip install "pdf2mj[ocr]"
```
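
Both requirements can be checked from Python before running a conversion with `--ocr`. The snippet below is a minimal sketch using only the standard library. The function name `ocr_prerequisites` is hypothetical and not part of `pdf2mj`, and it assumes the Tesseract executable is called `tesseract` and is on `PATH`. For the supported check, use `pdf2mj doctor`.

```python
import importlib.util
import shutil


def ocr_prerequisites() -> dict[str, bool]:
    """Report which OCR prerequisites appear to be present.

    Hypothetical helper: it only looks for the Tesseract executable on PATH
    and for the two modules installed by the `ocr` extra.
    """
    return {
        # System dependency: the Tesseract binary itself.
        "tesseract": shutil.which("tesseract") is not None,
        # Python dependencies from `pip install "pdf2mj[ocr]"`.
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "cv2": importlib.util.find_spec("cv2") is not None,
    }


status = ocr_prerequisites()
for name, present in status.items():
    print(f"{name}: {'found' if present else 'missing'}")
```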

### First Run

The first time you run `pdf2mj` with no arguments, it shows a Rich-powered welcome screen. The screen appears only once; the state that records this is stored in:

- Linux/macOS: `~/.config/pdf2mj/config.json`
- Windows: `%APPDATA%\pdf2mj\config.json`

```bash
pdf2mj welcome   # show the welcome screen again
pdf2mj doctor    # verify dependencies and environment
```
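
To inspect or reset the first-run state from a script, the documented locations can be resolved as in the sketch below. This is an illustration and not the package's own lookup: the function name `pdf2mj_config_path` is hypothetical, and it reproduces only the paths listed above. The package depends on `platformdirs`, so the real directory may differ on some systems.

```python
import os
import sys
from pathlib import Path


def pdf2mj_config_path(platform: str = sys.platform) -> Path:
    """Return the state-file path documented in this README.

    Assumption: mirrors the documented locations only; the installed
    package may resolve its config directory differently.
    """
    if platform.startswith("win"):
        # %APPDATA% is normally set on Windows; fall back to its usual default.
        base = Path(os.environ.get("APPDATA", Path.home() / "AppData" / "Roaming"))
    else:
        base = Path.home() / ".config"
    return base / "pdf2mj" / "config.json"


path = pdf2mj_config_path()
print(path, "(exists)" if path.exists() else "(not created yet)")
```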

### Quick Start

Convert a PDF to Markdown and JSON:

```bash
pdf2mj document.pdf
```

Output files are generated next to the source PDF:

```text
document.md
document.json
```

Specify an output directory:

```bash
pdf2mj document.pdf --output ./output
```
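
When a downstream step needs to pick up the generated files, their paths can be predicted from the input path. The sketch below assumes the naming shown above (the PDF's stem plus `.md` and `.json`) and assumes that `--output` changes only the directory, which this README does not state. `expected_outputs` is a hypothetical helper and not part of `pdf2mj`.

```python
from pathlib import Path


def expected_outputs(pdf, output_dir=None):
    """Return the (markdown, json) paths a conversion is expected to produce.

    Assumption: outputs keep the PDF's stem and land next to the PDF
    unless an output directory is given.
    """
    pdf = Path(pdf)
    target = Path(output_dir) if output_dir is not None else pdf.parent
    return target / f"{pdf.stem}.md", target / f"{pdf.stem}.json"


markdown_path, json_path = expected_outputs("document.pdf", "./output")
print(markdown_path)
print(json_path)
```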

### Common Examples

Generate all outputs:

```bash
pdf2mj document.pdf --all --output ./output
```

Extract images:

```bash
pdf2mj document.pdf --extract-images
```

Generate RAG chunks:

```bash
pdf2mj document.pdf --chunk-size 1000
```

Use OCR for scanned PDFs:

```bash
pdf2mj document.pdf --ocr
```
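
The same commands can be driven from Python to convert a whole folder. The sketch below shells out to the `pdf2mj` CLI with only the flags documented here (`--output`, `--ocr`). The helper names `build_command` and `convert_folder` are hypothetical, and it assumes `pdf2mj` is installed and on `PATH`.

```python
import subprocess
from pathlib import Path


def build_command(pdf, output_dir, ocr=False):
    """Build the argument list for one `pdf2mj` invocation."""
    command = ["pdf2mj", str(pdf), "--output", str(output_dir)]
    if ocr:
        command.append("--ocr")
    return command


def convert_folder(folder, output_dir, ocr=False):
    """Convert every PDF directly inside `folder`, stopping on the first failure."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if pdf2mj exits with a non-zero status.
        subprocess.run(build_command(pdf, output_dir, ocr), check=True)


print(build_command("document.pdf", "./output", ocr=True))
```

Passing the arguments as a list, not as a single shell string, avoids quoting problems with file names that contain spaces.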

### CLI Options

| Flag                           | Description                            |
| ------------------------------ | -------------------------------------- |
| `--markdown` / `--no-markdown` | Generate Markdown (default: on)        |
| `--json` / `--no-json`         | Generate structured JSON (default: on) |
| `--ocr`                        | OCR scanned pages                      |
| `--extract-images`             | Extract embedded images                |
| `--figures`                    | Alias for `--extract-images`           |
| `--chunk-size N`               | Generate RAG chunks                    |
| `--chunk-overlap N`            | Chunk overlap (default: 200)           |
| `--output`, `-o`               | Output directory                       |
| `--verbose`, `-v`              | Detailed logging                       |
| `--metadata`                   | Export metadata JSON                   |
| `--tables` / `--no-tables`     | Extract tables                         |
| `--all`                        | Enable all supported outputs           |
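
To make the interaction between `--chunk-size` and `--chunk-overlap` concrete, the sketch below splits a string into fixed-size windows that share a fixed overlap. It is a generic sliding-window illustration and not the package's chunker: the function name `chunk_text` is hypothetical, and it measures sizes in characters, which is an assumption because this README does not state the unit.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Illustrative only;
    the real chunker may split on tokens, sentences, or sections instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
for chunk in chunk_text(sample):
    print(len(chunk))
```

With a size of 1000 and an overlap of 200, each window starts 800 characters after the previous one, so a 2500-character input produces three chunks.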

### Utility Commands

| Command          | Description                                       |
| ---------------- | ------------------------------------------------- |
| `pdf2mj welcome` | Show the onboarding welcome screen                |
| `pdf2mj doctor`  | Check Python, dependencies, OCR, and write access |


# Development Setup (For Contributors)

## Prerequisites

* Python 3.10+ (the package metadata declares `Requires-Python: >=3.10`)
* Git
* Optional: Tesseract OCR

## Clone the Repository

```bash
git clone https://github.com/Ronit-Pai/pdf2mj.git
cd pdf2mj
```

## Create a Development Environment

Using pip:

```bash
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

pip install -e ".[dev]"
```

Using uv:

```bash
uv venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

uv pip install -e ".[dev]"
```

With OCR support:

```bash
pip install -e ".[dev,ocr]"
```

## Running Tests

```bash
pytest
```

Coverage:

```bash
pytest --cov=pdf2mj --cov-report=html
```

## Project Structure

```text
src/pdf2mj/
  cli.py
  config.py
  welcome.py
  doctor.py
  converter.py
  models.py
  markdown.py
  json_export.py
  metadata.py
  table_extractor.py
  image_extractor.py
  ocr.py
  chunker.py
  console_util.py

tests/
sample_pdfs/
```

## Local Development

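After the editable install, the `pdf2mj` command runs straight from your working tree, so source changes take effect without reinstalling: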

```bash
pdf2mj sample.pdf
```

or

```bash
python -m pdf2mj sample.pdf
```
