Python Package · v0.1.0

pdfcleanerx

Converts messy PDF-extracted text into clean Markdown files. Fully offline, modular pipeline, CLI + SDK. No AI APIs.

43 tests passing · Python 3.10+ · PyPI-ready · MIT license · pdfplumber

1 · Project Workflow

When you run a CLI command, execution passes through five stages:

cli.py → converter.py → extractor/ → cleaners/ → formatter/ → report.md
  1. cli.py (entry point): Typer parses your arguments (PDF path, --output, --verbose), validates that the file exists, sets up logging, and calls Converter.convert_to_file(). It catches all errors and shows user-friendly messages, so no raw stack traces reach the terminal.
  2. converter.py (orchestrator): The only file that knows all layers exist. Wires together the extractor, pipeline, and formatter via dependency injection. Calls them in order and writes the final string to disk.
  3. extractor/pdfplumber_extractor.py (reads PDF): Opens the PDF with pdfplumber. Groups words into line-level TextBlock objects, capturing font size and name per line. Produces a Document (list of PageData) passed to the next stage.
  4. cleaners/pipeline.py (cleaning chain): Runs four cleaners in order:
     - PageNumberCleaner strips standalone digits and "Page N of M".
     - WhitespaceCleaner collapses spaces and removes empty blocks.
     - HeadingDetector tags blocks by font size and regex.
     - LineWrapCleaner merges soft-wrapped lines.
  5. formatter/markdown.py (renders output): Reads each block's _heading_level attribute (set by HeadingDetector) and renders #, ##, ### or plain paragraph text. Collapses blank lines. Returns a clean Markdown string.
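
The cleaning chain in stage 4 can be pictured with a small stand-alone sketch. Everything in it is a hypothetical stand-in: the TextBlock and Document classes are simplified (the real Document holds PageData objects), the cleaners are plain functions rather than BaseCleaner subclasses, and the regular expression is an assumption about what "standalone digits" and "Page N of M" might mean. It only illustrates the idea of passing one document through an ordered list of cleaners.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    # Simplified stand-in; the package's TextBlock may carry more fields.
    text: str
    font_size: float = 10.0


@dataclass
class Document:
    # Simplified stand-in: a flat list of blocks instead of pages.
    blocks: list[TextBlock] = field(default_factory=list)


def strip_page_numbers(doc: Document) -> Document:
    # Drop blocks that contain only digits or look like "Page N of M".
    pattern = re.compile(r"^\s*(\d+|page \d+ of \d+)\s*$", re.IGNORECASE)
    return Document([b for b in doc.blocks if not pattern.match(b.text)])


def collapse_whitespace(doc: Document) -> Document:
    # Collapse whitespace runs, then drop blocks that end up empty.
    cleaned = [TextBlock(" ".join(b.text.split()), b.font_size) for b in doc.blocks]
    return Document([b for b in cleaned if b.text])


def run_pipeline(doc: Document, cleaners) -> Document:
    # Apply each cleaner in order; each one takes and returns a Document.
    for cleaner in cleaners:
        doc = cleaner(doc)
    return doc


raw = Document([
    TextBlock("Annual   Report", 18.0),
    TextBlock("Page 1 of 10"),
    TextBlock("   "),
    TextBlock("Revenue grew  in Q3."),
    TextBlock("42"),
])
result = run_pipeline(raw, [strip_page_numbers, collapse_whitespace])
print([b.text for b in result.blocks])
```

Because every cleaner shares the same Document-in, Document-out shape, reordering or removing a stage only means changing the list passed to run_pipeline.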

2 · File Roles

File / Folder | Responsibility
cli.py | User-facing entry point. Parses CLI args, handles errors, calls Converter. Never touches PDF logic directly.
converter.py | Public SDK class. Wires extractor + pipeline + formatter together. The only DI seam; swap any layer here.
config.py | Singleton Config loaded from .env once at startup. Single source of truth for every tunable value.
models.py | Shared data structures: TextBlock, PageData, Document. The "baton" passed between all layers.
exceptions.py | Custom exception hierarchy. Keeps error messages user-friendly and typed.
logging_setup.py | Configures Rich structured logging for CLI; plain stderr for library use.
extractor/base.py | Abstract contract for extractors. Swap pdfplumber for OCR here in future.
extractor/pdfplumber_extractor.py | Concrete extractor. Reads PDF, groups words by vertical position into TextBlocks with font metadata.
cleaners/base.py | Abstract contract for cleaners. One method: clean(document) → document.
cleaners/pipeline.py | Iterates the cleaner list in order. Catches and re-raises errors with context.
cleaners/page_number.py | Removes standalone digits, "Page N of M", and repeated footer/header text.
cleaners/whitespace.py | Collapses spaces, strips control chars, removes empty blocks.
cleaners/heading.py | Tags blocks as H1/H2/H3 via font-size heuristic + ALL-CAPS/numbered regex.
cleaners/line_wrap.py | Merges soft-wrapped lines: no terminal punctuation + next line starts lowercase = merge.
formatter/base.py | Abstract contract for formatters. Swap Markdown for HTML here in future.
formatter/markdown.py | Renders Document to a Markdown string using heading tags and plain paragraphs.
.env.example | All configurable values with defaults. Copy to .env to override.
pyproject.toml | Package metadata, dependencies, CLI entry point, test config. One file to rule them all.
tests/fixtures/generate_fixture.py | Creates deterministic test PDFs via reportlab. Run once to regenerate fixtures.
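
The merge rule listed for cleaners/line_wrap.py (no terminal punctuation, and the next line starts lowercase) can be sketched as follows. This is an illustration working on plain strings, not the package's implementation: the function names are hypothetical, and the default punctuation set is taken from the documented LINE_TERMINAL_PUNCTUATION default.

```python
TERMINAL_PUNCTUATION = ".!?:;"  # documented default of LINE_TERMINAL_PUNCTUATION


def should_merge(current: str, following: str, terminal: str = TERMINAL_PUNCTUATION) -> bool:
    # Merge only when the current line does not end a sentence
    # and the following line starts with a lowercase letter.
    current, following = current.rstrip(), following.lstrip()
    if not current or not following:
        return False
    return current[-1] not in terminal and following[0].islower()


def merge_soft_wraps(lines: list[str]) -> list[str]:
    # Fold each line into the previous one whenever the rule applies.
    merged: list[str] = []
    for line in lines:
        if merged and should_merge(merged[-1], line):
            merged[-1] = f"{merged[-1].rstrip()} {line.lstrip()}"
        else:
            merged.append(line)
    return merged


lines = [
    "The quarterly results were",
    "better than expected.",
    "Costs fell in every",
    "region except one",
    "Outlook",
]
print(merge_soft_wraps(lines))
```

In this sample "Outlook" stays on its own line because it starts with an uppercase letter, which is why heading-like lines tend to survive this heuristic.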

3 · Context Diagram

[Diagram: USER (terminal command; report.pdf at any path) → cli.py (parses args, handles errors) → converter.py (orchestrates layers, dependency injection, public SDK entry) → extractor/ (pdfplumber → Document, font metadata per block) → cleaners/pipeline (PageNumbers, Whitespace, HeadingDetector, LineWrap; runs in order, each swappable) → formatter/ (renders Markdown: #, ##, ### + paragraphs) → writes report.md (to the output dir or printed to stdout). Shared by all layers: config.py, models.py, exceptions.py, logging_setup.py.]

4 · Changing the Command Name

The CLI command name is defined in exactly one place, pyproject.toml:

# pyproject.toml
[project.scripts]
pdfcleanerx = "pdfcleanerx.cli:app"
# ↑ this is what you type in terminal

To rename to, say, cleanpdf:

[project.scripts]
cleanpdf = "pdfcleanerx.cli:app"

Then reinstall:

pip install -e .

The command name on the left is independent of the Python module name on the right. They can differ, and no other file needs updating.
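
To see why the two sides are independent, it helps to look at how a "module:attribute" string is resolved. The sketch below is a simplified approximation of what an installer-generated launcher does, not pip's actual code; resolve_entry_point is a hypothetical helper, and a standard-library target is used so the example runs without the package installed.

```python
from importlib import import_module


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    target = import_module(module_name)
    if attr_path:
        # Walk dotted attribute paths such as "Class.method".
        for attr in attr_path.split("."):
            target = getattr(target, attr)
    return target


# The package's own target would be "pdfcleanerx.cli:app"; "json:dumps"
# stands in for it here so the sketch is runnable anywhere.
dumps = resolve_entry_point("json:dumps")
print(dumps({"command": "cleanpdf"}))
```

The command name never appears in the resolution step, so renaming it only changes the name of the launcher script that gets installed.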

5 · PDF Path Handling

You can pass any path — relative, absolute, or shell glob. The PDF does not need to be in your current directory.

# relative path
pdfcleanerx convert report.pdf
pdfcleanerx convert ./docs/report.pdf

# absolute path
pdfcleanerx convert /home/user/Downloads/report.pdf

# glob — shell expands, all matched PDFs are converted
pdfcleanerx convert ~/Documents/*.pdf --output-dir ./out

Internally, cli.py passes the path to converter.py, which wraps it in Python's pathlib.Path. Typer validates the file exists before any PDF logic runs. Relative paths resolve against wherever you run the command.

The --output flag only works for single-file mode. For multiple PDFs, use --output-dir.
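
The relationship between --output and --output-dir can be sketched as a small planning function. This is an assumption about the behaviour rather than the CLI's real code: plan_outputs is a hypothetical name, and the "input stem plus .md" naming is inferred from the "report.pdf → output/report.md" example in the Quick Start.

```python
from pathlib import Path


def plan_outputs(pdf_paths, output=None, output_dir="./output"):
    """Pair each input PDF with the Markdown path it would be written to."""
    inputs = [Path(p).expanduser() for p in pdf_paths]
    if output is not None:
        # A single named output cannot receive several documents.
        if len(inputs) != 1:
            raise ValueError("--output only works with a single PDF; use --output-dir")
        return [(inputs[0], Path(output))]
    out_dir = Path(output_dir)
    return [(pdf, out_dir / f"{pdf.stem}.md") for pdf in inputs]


for source, target in plan_outputs(["docs/a.pdf", "reports/b.pdf"], output_dir="out"):
    print(source, "->", target)
```

Relative inputs are left relative here, so they resolve against the current working directory when opened, matching the behaviour described above.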

6 · Configuration Reference (.env)

Copy .env.example to .env. All values have defaults — you only need to set what you want to override.

Variable | Default | What it controls
LOG_LEVEL | INFO | DEBUG / INFO / WARNING / ERROR / CRITICAL
OUTPUT_DIR | ./output | Default directory for generated .md files
HEADING_FONT_SIZE_THRESHOLD | 1.2 | Font size multiplier above body median to qualify as heading
HEADING_MIN_LENGTH | 3 | Min characters for heading candidate
HEADING_MAX_LENGTH | 120 | Max characters for heading candidate
LINE_TERMINAL_PUNCTUATION | .!?:; | Characters that mark a line as "complete" (no merge)
PAGE_NUMBER_STRIP_STANDALONE | true | Remove lines that are just a number (e.g. "42")
PAGE_NUMBER_STRIP_PAGE_N_OF_M | true | Remove "Page 3 of 10" style lines
PAGE_NUMBER_STRIP_REPEATED_FOOTER | true | Remove text repeating on 3+ consecutive pages
FOOTER_REPEAT_THRESHOLD | 3 | Pages a footer must repeat on to be stripped
CLEANER_PAGE_NUMBERS | true | Enable/disable page-number removal stage
CLEANER_LINE_WRAP | true | Enable/disable line-wrap merging stage
CLEANER_WHITESPACE | true | Enable/disable whitespace normalisation stage
CLEANER_HEADINGS | true | Enable/disable heading detection stage
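
As a rough illustration of how the three HEADING_* values could interact, the sketch below reads a float setting from the environment and applies a font-size check against the body median. It is an assumption, not the package's Config or HeadingDetector code: env_float and is_heading_candidate are hypothetical names, and treating the threshold as inclusive (>=) is a guess.

```python
import os
from statistics import median


def env_float(name: str, default: float) -> float:
    # Fall back to the documented default when the variable is unset.
    return float(os.environ.get(name, default))


def is_heading_candidate(text, font_size, body_sizes,
                         threshold=1.2, min_len=3, max_len=120):
    # Length bounds mirror HEADING_MIN_LENGTH and HEADING_MAX_LENGTH.
    text = text.strip()
    if not (min_len <= len(text) <= max_len):
        return False
    # The block must be at least `threshold` times the median body font size.
    return font_size >= median(body_sizes) * threshold


threshold = env_float("HEADING_FONT_SIZE_THRESHOLD", 1.2)
body_sizes = [10, 10, 10, 12, 18]  # median is 10
print(is_heading_candidate("Annual Report", 18, body_sizes, threshold))
print(is_heading_candidate("Regular body text", 11, body_sizes, threshold))
```

Using the median rather than the mean keeps a few very large title blocks from inflating the baseline that body text is compared against.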

7 · Quick Start

Install

pip install -e ".[dev]"
cp .env.example .env

CLI Usage

# single file → output/report.md
pdfcleanerx convert report.pdf

# named output
pdfcleanerx convert report.pdf --output clean.md

# print to stdout
pdfcleanerx convert report.pdf --stdout

# batch
pdfcleanerx convert docs/*.pdf --output-dir ./markdown

# debug logging
pdfcleanerx convert report.pdf --verbose

Python SDK

from pdfcleanerx import Converter

converter = Converter()
markdown = converter.convert("report.pdf")
converter.convert_to_file("report.pdf", "report.md")

Custom Pipeline (Dependency Injection)

from pdfcleanerx import Converter
from pdfcleanerx.cleaners import CleanerPipeline, PageNumberCleaner, WhitespaceCleaner

pipeline = CleanerPipeline([PageNumberCleaner(), WhitespaceCleaner()])
converter = Converter(pipeline=pipeline)
markdown = converter.convert("report.pdf")

Run Tests

pytest                            # 43 tests
pytest --cov=pdfcleanerx         # with coverage

Build & Publish

pip install build twine
python -m build
twine upload --repository testpypi dist/*   # test first
twine upload dist/*                         # then publish

8 · Adding Custom Extensions

Custom Cleaner

from pdfcleanerx.cleaners.base import BaseCleaner
from pdfcleanerx.models import Document

class MyCleaner(BaseCleaner):
    def clean(self, document: Document) -> Document:
        for page in document.pages:
            for block in page.blocks:
                block.text = block.text.replace("©", "")
        return document

Custom Formatter (e.g. HTML output)

from pdfcleanerx import Converter
from pdfcleanerx.formatter.base import BaseFormatter
from pdfcleanerx.models import Document

class HtmlFormatter(BaseFormatter):
    def format(self, document: Document) -> str:
        ...

converter = Converter(formatter=HtmlFormatter())

Future Extension Points

Feature | What to do
OCR support | Subclass BaseExtractor, inject into Converter
HTML output | Subclass BaseFormatter, inject into Converter
Custom cleaner | Subclass BaseCleaner, add to CleanerPipeline
Plugin system | Auto-discover subclasses via entry points in pyproject.toml
Rename CLI command | Edit left side of [project.scripts] in pyproject.toml