Converts messy PDF-extracted text into clean Markdown files. Fully offline, modular pipeline, CLI + SDK. No AI APIs.
When you run a CLI command, execution passes through exactly four stages:
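1. CLI (cli.py): Parses the arguments (PDF path, --output, --verbose). Validates that the file exists. Sets up logging. Calls Converter.convert_to_file(). Catches all errors and shows user-friendly messages, so no raw stack traces ever reach the terminal.
2. Extractor (extractor/pdfplumber_extractor.py): Reads the PDF and groups words by vertical position into TextBlock objects, capturing font size and name per line. Produces a Document (list of PageData) passed to the next stage.
3. Cleaners (cleaners/pipeline.py): Runs the enabled cleaners (page numbers, whitespace, headings, line wrap) over the Document, one after another in list order, and passes the cleaned Document to the formatter.
4. Formatter (formatter/markdown.py): Reads each block's _heading_level attribute (set by HeadingDetector) and renders #, ##, ### or plain paragraph text. Collapses blank lines. Returns a clean Markdown string.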
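The Document handed between stages, and the final rendering step, can be pictured with a small stand-alone sketch. Only the names TextBlock, PageData, Document, `pages`, `blocks`, `text`, and `_heading_level` come from this README; the other fields (`font_size`, `font_name`) and the `render_markdown` function are assumptions for illustration, not the package's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextBlock:
    text: str
    font_size: float = 0.0                 # assumed field name
    font_name: str = ""                    # assumed field name
    _heading_level: Optional[int] = None   # 1, 2, 3, or None for body text


@dataclass
class PageData:
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class Document:
    pages: List[PageData] = field(default_factory=list)


def render_markdown(document: Document) -> str:
    """Hypothetical formatter: headings become #, ##, ###; the rest is plain text."""
    parts = []
    for page in document.pages:
        for block in page.blocks:
            level = block._heading_level
            prefix = "#" * level + " " if level in (1, 2, 3) else ""
            parts.append(prefix + block.text.strip())
    # Exactly one blank line between non-empty blocks.
    return "\n\n".join(p for p in parts if p) + "\n"
```

Because every stage only reads and returns a Document, any stage can be replaced without the others needing to change.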
| File / Folder | Responsibility |
|---|---|
| cli.py | User-facing entry point. Parses CLI args, handles errors, calls Converter. Never touches PDF logic directly. |
| converter.py | Public SDK class. Wires extractor + pipeline + formatter together. The only DI seam — swap any layer here. |
| config.py | Singleton Config loaded from .env once at startup. Single source of truth for every tunable value. |
| models.py | Shared data structures: TextBlock, PageData, Document. The "baton" passed between all layers. |
| exceptions.py | Custom exception hierarchy. Keeps error messages user-friendly and typed. |
| logging_setup.py | Configures Rich structured logging for CLI; plain stderr for library use. |
| extractor/base.py | Abstract contract for extractors. Swap pdfplumber for OCR here in future. |
| extractor/pdfplumber_extractor.py | Concrete extractor. Reads PDF, groups words by vertical position into TextBlocks with font metadata. |
| cleaners/base.py | Abstract contract for cleaners. One method: clean(document) → document. |
| cleaners/pipeline.py | Iterates the cleaner list in order. Catches and re-raises errors with context. |
| cleaners/page_number.py | Removes standalone digits, "Page N of M", and repeated footer/header text. |
| cleaners/whitespace.py | Collapses spaces, strips control chars, removes empty blocks. |
| cleaners/heading.py | Tags blocks as H1/H2/H3 via font-size heuristic + ALL-CAPS/numbered regex. |
| cleaners/line_wrap.py | Merges soft-wrapped lines: no terminal punctuation + next line starts lowercase = merge. |
| formatter/base.py | Abstract contract for formatters. Swap Markdown for HTML here in future. |
| formatter/markdown.py | Renders Document to a Markdown string using heading tags and plain paragraphs. |
| .env.example | All configurable values with defaults. Copy to .env to override. |
| pyproject.toml | Package metadata, dependencies, CLI entry point, test config. One file to rule them all. |
| tests/fixtures/generate_fixture.py | Creates deterministic test PDFs via reportlab. Run once to regenerate fixtures. |
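The line-wrap rule in the table (no terminal punctuation, next line starts lowercase, so merge) can be sketched on plain strings. This is a minimal illustration under stated assumptions, not the code in cleaners/line_wrap.py: the function name `merge_soft_wraps` is hypothetical, it works on a list of strings rather than TextBlocks, and it drops empty lines instead of treating them as paragraph breaks.

```python
from typing import List


def merge_soft_wraps(lines: List[str], terminal_punctuation: str = ".!?:;") -> List[str]:
    """Join a line onto the previous one when the previous line looks unfinished."""
    merged: List[str] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: empty lines are simply skipped in this sketch
        previous_unfinished = bool(merged) and merged[-1][-1] not in terminal_punctuation
        if previous_unfinished and text[0].islower():
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged
```

A line that starts with an uppercase letter is never merged, which is what keeps headings and new sentences on their own lines.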
The CLI command name is defined in exactly one place — pyproject.toml:
```toml
# pyproject.toml
[project.scripts]
pdfcleanerx = "pdfcleanerx.cli:app"
# ↑ the name on the left is what you type in the terminal
```
To rename to, say, cleanpdf:
```toml
[project.scripts]
cleanpdf = "pdfcleanerx.cli:app"
```
Then reinstall:
```shell
pip install -e .
```
You can pass any path — relative, absolute, or shell glob. The PDF does not need to be in your current directory.
```shell
# relative path
pdfcleanerx convert report.pdf
pdfcleanerx convert ./docs/report.pdf

# absolute path
pdfcleanerx convert /home/user/Downloads/report.pdf

# glob: the shell expands it, and all matched PDFs are converted
pdfcleanerx convert ~/Documents/*.pdf --output-dir ./out
```
Internally, cli.py passes the path to converter.py, which wraps it in Python's pathlib.Path. Typer validates the file exists before any PDF logic runs. Relative paths resolve against wherever you run the command.
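The path handling described above can be sketched with pathlib alone. The function names `resolve_input_path` and `resolve_output_path` are hypothetical, and the rule that the output file takes the PDF's stem plus `.md` is inferred from the usage examples (report.pdf → output/report.md), not from the package source.

```python
from pathlib import Path
from typing import Optional


def resolve_input_path(pdf_path: str) -> Path:
    """Expand ~ and resolve a relative path against the current working directory."""
    return Path(pdf_path).expanduser().resolve()


def resolve_output_path(pdf_path: str,
                        output: Optional[str] = None,
                        output_dir: str = "./output") -> Path:
    """Pick the Markdown target: an explicit --output wins, else <output_dir>/<stem>.md."""
    if output is not None:
        return Path(output)
    return Path(output_dir) / (Path(pdf_path).stem + ".md")
```

For example, `resolve_output_path("docs/report.pdf")` yields `output/report.md`, while passing `output="clean.md"` returns that path unchanged.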
The --output flag only works in single-file mode. For multiple PDFs, use --output-dir.

Copy .env.example to .env. All values have defaults; you only need to set what you want to override.
| Variable | Default | What it controls |
|---|---|---|
| LOG_LEVEL | INFO | DEBUG / INFO / WARNING / ERROR / CRITICAL |
| OUTPUT_DIR | ./output | Default directory for generated .md files |
| HEADING_FONT_SIZE_THRESHOLD | 1.2 | Ratio of a line's font size to the body median above which the line qualifies as a heading |
| HEADING_MIN_LENGTH | 3 | Min characters for heading candidate |
| HEADING_MAX_LENGTH | 120 | Max characters for heading candidate |
| LINE_TERMINAL_PUNCTUATION | .!?:; | Characters that mark a line as "complete" (no merge) |
| PAGE_NUMBER_STRIP_STANDALONE | true | Remove lines that are just a number (e.g. "42") |
| PAGE_NUMBER_STRIP_PAGE_N_OF_M | true | Remove "Page 3 of 10" style lines |
| PAGE_NUMBER_STRIP_REPEATED_FOOTER | true | Remove text repeating on consecutive pages (3+ by default, see FOOTER_REPEAT_THRESHOLD) |
| FOOTER_REPEAT_THRESHOLD | 3 | Pages a footer must repeat on to be stripped |
| CLEANER_PAGE_NUMBERS | true | Enable/disable page-number removal stage |
| CLEANER_LINE_WRAP | true | Enable/disable line-wrap merging stage |
| CLEANER_WHITESPACE | true | Enable/disable whitespace normalisation stage |
| CLEANER_HEADINGS | true | Enable/disable heading detection stage |
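The three heading settings in the table can be read as a candidate test. The sketch below is an assumption-heavy illustration: the function name, the numbered-heading regex, and the way the signals are combined (length bounds first, then any one of larger font, ALL-CAPS, or numbered) are hypothetical. How the real HeadingDetector combines them and how it assigns H1/H2/H3 is not specified in this README.

```python
import re
from statistics import median

# Assumed pattern for numbered headings such as "1 Intro" or "2.3 Methods".
NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def is_heading_candidate(text: str, font_size: float, body_median: float,
                         threshold: float = 1.2,
                         min_len: int = 3, max_len: int = 120) -> bool:
    """Apply the length bounds, then accept any one of the three heading signals."""
    stripped = text.strip()
    if not (min_len <= len(stripped) <= max_len):
        return False
    larger_font = font_size >= threshold * body_median
    all_caps = stripped.isupper()
    numbered = bool(NUMBERED.match(stripped))
    return larger_font or all_caps or numbered


# The body median would be computed from the font sizes seen in the document.
body_median = median([10.0, 10.0, 10.0, 18.0])
```

With a body median of 10 and the default threshold of 1.2, a line needs a font size of at least 12 to qualify on size alone.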
Install in editable mode with the dev extras, then create your .env:

```shell
pip install -e ".[dev]"
cp .env.example .env
```
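Once the overrides are in the environment, reading them with defaults could look roughly like this. It is a sketch only: the README does not show how config.py parses .env, so the `Settings` class, `load_settings`, and the accepted boolean spellings are assumptions, and only four of the variables are covered to keep it short.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Treat 1/true/yes/on (any case) as True; fall back to the default if unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    log_level: str
    heading_font_size_threshold: float
    footer_repeat_threshold: int
    cleaner_headings: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default) using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        heading_font_size_threshold=float(env.get("HEADING_FONT_SIZE_THRESHOLD", "1.2")),
        footer_repeat_threshold=int(env.get("FOOTER_REPEAT_THRESHOLD", "3")),
        cleaner_headings=_env_bool(env, "CLEANER_HEADINGS", True),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.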
CLI usage:

```shell
# single file → output/report.md
pdfcleanerx convert report.pdf

# named output
pdfcleanerx convert report.pdf --output clean.md

# print to stdout
pdfcleanerx convert report.pdf --stdout

# batch
pdfcleanerx convert docs/*.pdf --output-dir ./markdown

# debug logging
pdfcleanerx convert report.pdf --verbose
```
SDK usage:

```python
from pdfcleanerx import Converter

converter = Converter()
markdown = converter.convert("report.pdf")
converter.convert_to_file("report.pdf", "report.md")
```
SDK usage with a custom cleaner pipeline:

```python
from pdfcleanerx import Converter
from pdfcleanerx.cleaners import CleanerPipeline, PageNumberCleaner, WhitespaceCleaner

pipeline = CleanerPipeline([PageNumberCleaner(), WhitespaceCleaner()])
converter = Converter(pipeline=pipeline)
markdown = converter.convert("report.pdf")
```
Run the tests:

```shell
pytest                      # 43 tests
pytest --cov=pdfcleanerx    # with coverage
```
Build and publish:

```shell
pip install build twine
python -m build
twine upload --repository testpypi dist/*   # test first
twine upload dist/*                         # then publish
```
To add a custom cleaner, subclass BaseCleaner:

```python
from pdfcleanerx.cleaners.base import BaseCleaner
from pdfcleanerx.models import Document


class MyCleaner(BaseCleaner):
    """Example cleaner that strips copyright symbols from every block."""

    def clean(self, document: Document) -> Document:
        for page in document.pages:
            for block in page.blocks:
                block.text = block.text.replace("©", "")
        return document
```
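How a cleaner like this is run alongside others can be sketched without the package installed. The classes below are stand-ins, not the real pdfcleanerx ones: `CleanerError`, the `run` method name, and the use of a list of strings in place of a Document are assumptions made to keep the example self-contained. The behaviour mirrors the description of cleaners/pipeline.py: run cleaners in order, and re-raise failures with context.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class CleanerError(Exception):
    """Hypothetical error type carrying the name of the cleaner that failed."""


class BaseCleaner(ABC):
    @abstractmethod
    def clean(self, document: List[str]) -> List[str]:
        """Return the cleaned document."""


class StripCopyright(BaseCleaner):
    def clean(self, document: List[str]) -> List[str]:
        return [line.replace("©", "") for line in document]


class DropEmpty(BaseCleaner):
    def clean(self, document: List[str]) -> List[str]:
        return [line.strip() for line in document if line.strip()]


class CleanerPipeline:
    def __init__(self, cleaners: Iterable[BaseCleaner]) -> None:
        self.cleaners = list(cleaners)

    def run(self, document: List[str]) -> List[str]:
        # Each cleaner receives the output of the previous one, so order matters.
        for cleaner in self.cleaners:
            try:
                document = cleaner.clean(document)
            except Exception as exc:
                name = type(cleaner).__name__
                raise CleanerError(f"{name} failed: {exc}") from exc
        return document
```

Running `StripCopyright` before `DropEmpty` means a line containing only "©" becomes empty and is then removed; in the opposite order it would survive as an empty string.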
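To add a custom formatter, subclass BaseFormatter and inject it into Converter:

```python
from pdfcleanerx import Converter
from pdfcleanerx.formatter.base import BaseFormatter
from pdfcleanerx.models import Document


class HtmlFormatter(BaseFormatter):
    def format(self, document: Document) -> str:
        ...


converter = Converter(formatter=HtmlFormatter())
```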
| Feature | What to do |
|---|---|
| OCR support | Subclass BaseExtractor, inject into Converter |
| HTML output | Subclass BaseFormatter, inject into Converter |
| Custom cleaner | Subclass BaseCleaner, add to CleanerPipeline |
| Plugin system | Auto-discover subclasses via entry points in pyproject.toml |
| Rename CLI command | Edit left side of [project.scripts] in pyproject.toml |
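The plugin-system row could be approached with the standard library's importlib.metadata. This is a sketch of one possible design, not an existing feature: the group name `pdfcleanerx.cleaners` and the `discover_plugins` function are hypothetical, and a plugin package would have to declare its classes under a matching `[project.entry-points."pdfcleanerx.cleaners"]` table in its own pyproject.toml.

```python
from importlib.metadata import entry_points
from typing import Any, Dict


def discover_plugins(group: str = "pdfcleanerx.cleaners") -> Dict[str, Any]:
    """Load every installed entry point registered under the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

The returned classes could then be instantiated and passed to CleanerPipeline; when no plugin is installed, the function returns an empty dict.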