Metadata-Version: 2.4
Name: documint2md
Version: 2.0.0
Summary: Convert PDF, DOCX, CSV, and image files to Markdown.
Author: documint2md Contributors
License-Expression: MIT
Keywords: markdown,pdf,docx,csv,cli
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: Microsoft :: Windows
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4==4.14.3
Requires-Dist: lxml==6.1.0
Requires-Dist: mammoth==1.11.0
Requires-Dist: markdownify==1.2.2
Requires-Dist: pandas==2.3.3
Requires-Dist: pdfminer.six==20251230
Requires-Dist: prompt-toolkit==3.0.48
Requires-Dist: tabulate==0.9.0
Provides-Extra: dev
Requires-Dist: pytest==9.0.3; extra == "dev"
Requires-Dist: pip-tools==7.5.3; extra == "dev"
Requires-Dist: build==1.4.0; extra == "dev"
Requires-Dist: backports.tarfile==1.2.0; python_version < "3.12" and extra == "dev"
Requires-Dist: colorama==0.4.6; sys_platform == "win32" and extra == "dev"
Requires-Dist: importlib-metadata==9.0.0; python_version < "3.12" and extra == "dev"
Requires-Dist: pywin32-ctypes==0.2.3; sys_platform == "win32" and extra == "dev"
Requires-Dist: setuptools==81.0.0; extra == "dev"
Requires-Dist: twine==6.2.0; extra == "dev"
Requires-Dist: zipp==3.23.0; python_version < "3.12" and extra == "dev"
Provides-Extra: pdftext
Requires-Dist: pdftext==0.6.3; extra == "pdftext"
Provides-Extra: marker
Requires-Dist: marker-pdf==1.10.1; extra == "marker"
Provides-Extra: pymupdf4llm
Requires-Dist: pymupdf4llm==1.27.2.3; extra == "pymupdf4llm"
Provides-Extra: markdown
Requires-Dist: mdformat==1.0.0; extra == "markdown"
Requires-Dist: mdformat-gfm==1.0.0; extra == "markdown"
Provides-Extra: universal-lite
Requires-Dist: markitdown==0.1.5; extra == "universal-lite"
Provides-Extra: docling
Requires-Dist: docling==2.92.0; extra == "docling"
Provides-Extra: universal
Requires-Dist: docling==2.92.0; extra == "universal"
Requires-Dist: markitdown==0.1.5; extra == "universal"
Provides-Extra: pypdfium2
Requires-Dist: pypdfium2==4.30.0; extra == "pypdfium2"
Provides-Extra: ocr
Requires-Dist: paddleocr==3.4.0; extra == "ocr"
Requires-Dist: paddlepaddle==3.2.2; extra == "ocr"
Requires-Dist: pypdfium2==4.30.0; extra == "ocr"
Requires-Dist: pi-heif==0.15.0; extra == "ocr"
Provides-Extra: all
Requires-Dist: pdftext==0.6.3; extra == "all"
Requires-Dist: pymupdf4llm==1.27.2.3; extra == "all"
Requires-Dist: mdformat==1.0.0; extra == "all"
Requires-Dist: mdformat-gfm==1.0.0; extra == "all"
Requires-Dist: docling==2.92.0; extra == "all"
Requires-Dist: markitdown==0.1.5; extra == "all"
Requires-Dist: pypdfium2==4.30.0; extra == "all"
Requires-Dist: paddleocr==3.4.0; extra == "all"
Requires-Dist: paddlepaddle==3.2.2; extra == "all"
Requires-Dist: pi-heif==0.15.0; extra == "all"
Requires-Dist: setuptools==81.0.0; extra == "all"
Dynamic: license-file

# documint2md - Convert PDF, DOCX, CSV, and Images to Markdown
 
documint2md is a small Python CLI and library (package `doc2md`) that turns PDF, DOCX, CSV, and image files into consistent, deterministic Markdown. It is built for documentation workflows where the same source should always produce the same Markdown output, even when run on different machines or in CI.

## Highlights

- Text-first conversions for the formats you care about: PDF (`pdfminer.six`), DOCX (Mammoth → BeautifulSoup → `markdownify`), and CSV (pandas + Markdown table).
- OCR support for images and scanned PDFs (opt-in for PDFs), including HEIC/HEIF via an optional decoder.
- Small CLI plus a library API that can drop right into scripts, CI, or exploratory sessions.
- Deterministic normalization (newline, whitespace, blank lines) and CLI contracts that keep automation predictable.
- Interactive terminal UI with a short `/` command list plus `/more` for advanced tools and OCR/session controls.

## Quick start

### WSL quick start

Use the native WSL checkout for development:

```text
/home/marco/dev/documint2md
```

From PowerShell:

```powershell
wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc "~/.local/bin/uv venv .venv --python 3.12 --seed"
wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc ".venv/bin/python -m pip install --require-hashes -r requirements-dev.txt"
wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc ".venv/bin/python -m pip install -e '.[markdown,ocr,universal-lite,pymupdf4llm]'"
wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc "./scripts/verify_wsl_workflows.sh"
```

See [docs/WSL_DEVELOPMENT.md](docs/WSL_DEVELOPMENT.md) for clean-environment checks, live OCR testing, and optional engine notes. Avoid using `/mnt/c/dev/documint2md` for active WSL development.

### Windows quick start

1. Create and activate a virtualenv, then install the hash-pinned dependencies (Python 3.11+):
   ```powershell
   Set-Location 'C:\path\to\documint2md'
   py -m venv .venv
   & .\.venv\Scripts\Activate.ps1
   python -m pip install --upgrade pip
   python -m pip install --require-hashes -r requirements.txt
   ```
2. Convert a few sample files to confirm the install works:
   ```powershell
   doc2md .\tests\fixtures\in\sample.docx
   python -m doc2md.cli .\tests\fixtures\in\sample.pdf
   python -m doc2md.cli .\tests\fixtures\in\sample.csv
   python -m doc2md.cli .\tests\fixtures\in\sample.png
   python -m doc2md.cli .\docs_in\iphone_scan.heic
   ```
3. Drop into interactive mode (no inputs) to explore `/files`, `/format`, and `/output`.

### Reproducible installs (Windows)

- Core runtime:
  ```powershell
  python -m pip install --require-hashes -r requirements.txt
  ```
- Full feature set (PDF engines + OCR):
  ```powershell
  python -m pip install --require-hashes -r requirements-all.txt
  ```
- Dev/test dependencies:
  ```powershell
  python -m pip install --require-hashes -r requirements-dev.txt
  ```
- Regenerate lock files when dependencies change:
  ```powershell
  .\scripts\lock_requirements.ps1
  ```

## Installation

### From TestPyPI (for testing)

```powershell
py -m pip install --upgrade pip
py -m pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --pre documint2md
doc2md --help
```

### From PyPI (production)

```powershell
py -m pip install --upgrade pip
py -m pip install documint2md
doc2md --help
```

Optional extras when installing from PyPI:

```powershell
py -m pip install "documint2md[all]"
py -m pip install "documint2md[markdown]"
py -m pip install "documint2md[pymupdf4llm]"
py -m pip install "documint2md[universal-lite]"
py -m pip install "documint2md[docling]"
py -m pip install "documint2md[universal]"
```

`universal-lite` installs MarkItDown only. `docling` installs the heavier structured conversion stack. `universal` remains a backward-compatible alias that installs both.

## CLI usage

Run `doc2md <file>` (or `python -m doc2md.cli <file>`) to convert a single input. By default the Markdown lands in `docs_out/<input filename>.md`. Use `-o <file>` to force a path and `-o -` to stream to stdout. Omit inputs to open the interactive picker, or pass `--interactive` for the picker even inside scripts.

```
python -m doc2md.cli file.docx -o file.md
python -m doc2md.cli file.pdf
python -m doc2md.cli table.csv
python -m doc2md.cli scan.png
doc2md  # interactive mode
```
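
The default destination rule can be sketched in a few lines of Python. This is an illustration only, not the package's own code: `default_output_path` is a hypothetical helper, and it assumes the full input filename (extension included) is kept before `.md` is appended, as in `sample.docx` → `sample.docx.md`.

```python
from pathlib import Path


def default_output_path(input_path: str, out_dir: str = "docs_out") -> Path:
    """Return <out_dir>/<input filename>.md for a given input file."""
    # Assumption: the original extension is kept and ".md" is appended,
    # so sample.docx becomes sample.docx.md rather than sample.md.
    return Path(out_dir) / (Path(input_path).name + ".md")


print(default_output_path("tests/fixtures/in/sample.docx").as_posix())
```

If the real CLI names files differently, pass `-o <file>` to pin the destination explicitly.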

### CLI contract

- Default output is `docs_out/<input filename>.md`; `-o <file>` overrides the destination, `-o -` writes to stdout.
- Interactive mode (no input) opens a curses-like UI tied to `docs_in`; `/files` loads the list and `/more` exposes advanced commands (history, profiles, UI, session toggles).
- Errors and diagnostics stream to stderr.
- Exit codes: `2` usage/argument error, `3` unsupported format, `4` conversion failure, `5` output write failure.
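
For scripts that wrap the CLI, the contract above might be consumed roughly as follows. This is a minimal sketch: `describe_exit_code` and `convert_to_stdout` are hypothetical helpers, and treating `0` as success is an assumption, since the list above names only the failure codes.

```python
import subprocess
import sys

# Exit codes as listed in the CLI contract; 0 as "success" is an assumption.
EXIT_MEANINGS = {
    0: "success (assumed)",
    2: "usage/argument error",
    3: "unsupported format",
    4: "conversion failure",
    5: "output write failure",
}


def describe_exit_code(code: int) -> str:
    """Translate a doc2md exit code into a human-readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def convert_to_stdout(path: str) -> str:
    """Run the CLI on one file with `-o -` and return the Markdown text."""
    result = subprocess.run(
        [sys.executable, "-m", "doc2md.cli", path, "-o", "-"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Diagnostics arrive on stderr, so stdout stays clean Markdown.
        label = describe_exit_code(result.returncode)
        raise RuntimeError(f"{label}: {result.stderr.strip()}")
    return result.stdout
```

Because errors go to stderr and Markdown goes to stdout, the two streams can be captured separately without parsing.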

### CLI options

- `--format pdf|docx|csv|image|any` forces the parser instead of inferring from the extension. Use `any` with a universal engine for formats such as PPTX, XLSX, HTML, JSON, XML, and EPUB.
- `--engine pdfminer|pdftext|marker|pymupdf4llm|docling|markitdown` selects the conversion engine. `pdfminer` remains the default; `docling` and `markitdown` are universal opt-in engines.
- `--md-style normalize|gfm|none` controls Markdown post-processing (default `normalize`). `gfm` requires the optional `markdown` extra.
- `--ocr` or `--ocr-mode auto` enables OCR fallback for PDFs when text extraction is empty.
- `--ocr-mode never|auto|always` controls OCR behavior for PDFs (default `never`).
- `--ocr-lang es` sets OCR language (default `es`).
- `--ocr-device cpu|gpu:0` overrides OCR device selection.
- `--ocr-render-scale 2.0` controls PDF render scale for OCR.
- `--ocr-min-score 0.5` filters low-confidence OCR text.
- `--ocr-layout plain|blocks|heuristic` controls OCR layout reconstruction. `plain` preserves OCR lines, `blocks` merges lines into paragraph blocks, and `heuristic` also promotes high-confidence headings and simple grid tables.
- `--ocr-debug-json <path>` writes OCR geometry, row/block roles, confidence, and rendered Markdown to JSON for diagnosis.
- `--ocr-debug-image <path>` writes a bbox overlay image for image inputs.
- `--csv-na ""` controls how empty values render.
- `--csv-float-format "%.6g"` stabilizes floating-point output when needed.
- `--profile <name>` loads defaults from `doc2md.toml`.
- `--stats`, `--profile-report`, `--quiet`, `--debug`, `--version`, `--theme`, `--interactive`, `--no-input` toggle output, logging, and interactivity.
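
The `--csv-float-format` value is a printf-style pattern. The sketch below shows what `%.6g` does to a few floats in plain Python; how the CLI applies the pattern internally is an assumption, and `format_float` is a hypothetical helper used only for illustration.

```python
def format_float(value: float, pattern: str = "%.6g") -> str:
    """Format one float with a printf-style pattern such as "%.6g"."""
    return pattern % value


# Binary rounding noise and long tails collapse to six significant digits.
for value in (0.1 + 0.2, 3.141592653589793, 1234567.0):
    print(repr(value), "->", format_float(value))
# 0.30000000000000004 -> 0.3
# 3.141592653589793 -> 3.14159
# 1234567.0 -> 1.23457e+06
```

Note that `%g` switches to scientific notation for large magnitudes, which may or may not be what you want in a table.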

### OCR setup (optional)

Recommended (CPU + GPU side-by-side):

```powershell
.\scripts\setup_ocr_envs.ps1
```

See `docs/OCR Dual Environment Setup.md` for GPU verification, fallback index, and usage.

Quick run (GPU):

```powershell
.\scripts\doc2md-gpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device gpu:0 --yes -o docs_out\sample_text.gpu.md
```

Quick run (CPU):

```powershell
.\scripts\doc2md-cpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device cpu --yes -o docs_out\sample_text.cpu.md
```

Project skill for folder-based image OCR:
- [image-folder-ocr-to-markdown](.opencode/skills/image-folder-ocr-to-markdown/SKILL.md) defines the exact repo workflow for converting `JPG` and `HEIC` images in one folder into `.md` files in a chosen output folder.

CPU:
```powershell
python -m pip install paddlepaddle==3.2.2
python -m pip install paddleocr==3.4.0
```

GPU (Windows; choose one CUDA index):
```powershell
python -m pip install paddlepaddle-gpu==3.2.2 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddleocr==3.4.0
```

If model downloads fail:
```powershell
$env:PADDLE_PDX_MODEL_SOURCE = "BOS"
$env:PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK = "True"
```

Performance tips:
- Batch multiple files in one command to reuse OCR initialization.
- For scanned PDFs, use `--ocr-render-scale 1.0` to trade accuracy for speed.
- Prefer `--ocr-mode auto` for PDFs so OCR runs only on textless pages.
- First OCR run is slow due to model downloads; subsequent runs are faster.

Layout tips:
- Use `--ocr-layout plain` when exact OCR line order matters.
- Use `--ocr-layout blocks` for readable prose without heading/table inference.
- Use `--ocr-layout heuristic --md-style gfm` for best-effort formatted Markdown.
- Use `--ocr-debug-json` and `--ocr-debug-image` when headings, tables, or reading order need inspection.
- Keep OCR layout changes covered by `tests/fixtures/ocr_layout_benchmarks.json`; it provides deterministic receipt, table, prose, and low-confidence-noise cases without requiring live OCR model output.

## Interactive mode

When you run `doc2md` without inputs, the CLI opens a full-screen picker. Interact with `/files` (space to select, enter to convert), type `/` to see the short command list, and use `/more` for advanced tools (history, profiles, UI theme, session toggles). OCR is configured via `/ocr` subcommands (e.g. `/ocr mode auto`, `/ocr lang es`). The footer keeps the current format/engine/output in view while the header shows version + cwd. Use `Ctrl+P/Ctrl+N` for command history.

## Library API

- `doc2md.pdf_to_markdown(path)` – extracts text-only Markdown from PDFs (OCR optional via `ocr_mode`).
- `doc2md.docx_to_markdown(path)` – converts DOCX → Mammoth HTML → Markdown via `markdownify` with deterministic heading/list settings.
- `doc2md.csv_to_markdown(path)` – parses CSV files with `pandas` and emits clean Markdown tables.
- `doc2md.image_to_markdown(path)` – runs OCR on image files and returns Markdown text.
- `doc2md.any_to_markdown(path, engine)` – uses an optional universal engine (`docling` or `markitdown`) for additional formats.
- Input types: `str | PathLike`; return type: `str`.
- Exceptions: `ConversionError` for failures, `UnsupportedFormatError` for unsupported formats/engines.
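
A small dispatcher built on the functions listed above might look like this. It is a sketch under stated assumptions: the suffix map and the helpers `converter_name` and `convert` are hypothetical, and the package's own format detection may recognise a different set of extensions.

```python
from pathlib import Path

# Hypothetical suffix map for illustration only.
CONVERTER_BY_SUFFIX = {
    ".pdf": "pdf_to_markdown",
    ".docx": "docx_to_markdown",
    ".csv": "csv_to_markdown",
    ".png": "image_to_markdown",
    ".heic": "image_to_markdown",
}


def converter_name(path: str) -> str:
    """Pick the doc2md function name for a file, based on its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONVERTER_BY_SUFFIX:
        raise ValueError(f"no converter mapped for {suffix!r}")
    return CONVERTER_BY_SUFFIX[suffix]


def convert(path: str) -> str:
    """Convert one file to Markdown text using the library API."""
    import doc2md  # imported lazily so the mapping works without the package

    # ConversionError / UnsupportedFormatError propagate to the caller.
    return getattr(doc2md, converter_name(path))(path)
```

Usage would be along the lines of `markdown = convert("tests/fixtures/in/sample.docx")`, with the result being a `str` as described above.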

## Normalization rules

- Normalize newlines to `\n`.
- Strip trailing whitespace per line.
- Cap consecutive blank lines at two.
- Remove trailing blank lines and end every non-empty output with a single newline.
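
The four rules above can be expressed as a short function. This is an independent sketch of the stated rules, not the package's implementation; `normalize_markdown` is a hypothetical name, and leading blank lines are left alone because the rules do not mention them.

```python
import re


def normalize_markdown(text: str) -> str:
    """Apply the four normalization rules to a Markdown string."""
    # 1. Normalize newlines to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 2. Strip trailing whitespace per line.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # 3. Cap consecutive blank lines at two (at most three "\n" in a row).
    text = re.sub(r"\n{4,}", "\n\n\n", text)
    # 4. Remove trailing blank lines; end non-empty output with one newline.
    text = text.rstrip("\n")
    return text + "\n" if text else ""


print(repr(normalize_markdown("a  \r\n\r\n\r\n\r\n\r\nb\r\n\r\n")))
# 'a\n\n\nb\n'
```

Applying the function twice gives the same result as applying it once, which is the property that keeps repeated runs byte-identical.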

## Markdown formatting

Every CLI conversion has a post-conversion Markdown style step:

- `normalize` keeps the existing deterministic newline and trailing-whitespace contract.
- `gfm` runs `mdformat` with the GFM plugin for consistent tables, task lists, strikethrough, and autolinks.
- `none` leaves converter output untouched after the converter itself returns.

Version `2.0.0` keeps `normalize` as the default. This can change exact Markdown bytes compared with `1.x`; use `--md-style none` for raw converter output in compatibility-sensitive workflows.

Examples:

```powershell
python -m doc2md.cli .\tests\fixtures\in\sample.csv --preview --md-style normalize
python -m doc2md.cli .\tests\fixtures\in\sample.csv --preview --md-style gfm
python -m doc2md.cli .\docs_in\slides.pptx --format any --engine markitdown --md-style gfm
```

## Testing & fixtures

```powershell
python -m pip install --require-hashes -r requirements-dev.txt
python -m pytest
python -m compileall .
python -m doc2md.cli .\tests\fixtures\in\sample.docx -o .\docs_out\sample.docx.md
python -m doc2md.cli .\tests\fixtures\in\sample.pdf -o .\docs_out\sample.pdf.md
python -m doc2md.cli .\tests\fixtures\in\sample.csv -o .\docs_out\sample.csv.md
```

Live OCR tests are opt-in because they can download models and vary by runtime:

```powershell
$env:DOC2MD_RUN_LIVE_OCR = "1"
$env:PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK = "True"
python -m pytest tests\test_ocr_integration.py
```

Edge-case fixtures live in `tests/fixtures/in` with golden Markdown in `tests/fixtures/out`. Use `docs_in` as your local drop zone.

## Publishing

Releases are tag-driven via GitHub Actions + Trusted Publishing.

- TestPyPI: push a tag like `v1.0.1rc1` to trigger `release-testpypi.yml`.
- PyPI: push a tag like `v1.0.1` to trigger `release-pypi.yml`.

## Release checklist

- Update `pyproject.toml` version.
- Regenerate `requirements.txt`, `requirements-all.txt`, and `requirements-dev.txt`.
- Run tests and CLI smoke conversions.
- For WSL, run `./scripts/verify_wsl_workflows.sh`.
- Build and check distributions before upload.

## Contributing

- Start feature work from the latest `dev` on a short-lived branch such as `codex/<topic>`.
- Open feature branch PRs into `dev`. Do not target `main` directly for feature work.
- When `dev` is validated and release-ready, open a separate `dev` -> `main` PR. Tags on `main` trigger publishing workflows.
- Repo-local PR workflow skill: `.opencode/skills/documint-pr-workflow/SKILL.md`.
- Drop samples into `docs_in` and run the CLI to confirm conversions. Read `.github/copilot-instructions.md` for repo-specific guidance, keep diffs small, and explain fixture changes when extraction output shifts.

## Notes

- After a successful conversion, the interactive UI pauses for about 2 seconds so the confirmation stays on screen; pass `--quiet` to skip the pause.
- History helpers: `doc2md history`, `search`, `rerun`, `jump`, `recent`, `explain`, and `ui`.
- The CLI exposes both quick (`/files`, `/format`, `/output`) and advanced (`/more`) helpers to explore settings without re-running the command.
