Metadata-Version: 2.4
Name: pdfa-parser
Version: 0.1.0
Summary: Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF.
Project-URL: Homepage, https://github.com/joao/pdfa-parser
Project-URL: Issues, https://github.com/joao/pdfa-parser/issues
Project-URL: Repository, https://github.com/joao/pdfa-parser
Author: João
License-Expression: GPL-3.0-or-later
License-File: LICENSE
Keywords: conversion,ghostscript,pdf,pdf-a,pdfa,validation,verapdf
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: reportlab>=4.1; extra == 'dev'
Description-Content-Type: text/markdown

# pdfa-parser

> Convert PDFs to **PDF/A** using [GhostScript](https://www.ghostscript.com/) and validate compliance with [VeraPDF](https://verapdf.org/).

**pdfa-parser** is a lightweight Python library (Python ≥ 3.10) that wraps
GhostScript for PDF → PDF/A conversion and VeraPDF for conformance validation.
Both synchronous and asynchronous APIs are provided out of the box.

---

## Features

- **PDF → PDF/A conversion** via GhostScript (levels 1, 2, 3).
- **PDF/A validation** via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
- **Sync & async** – every public method has an `a_` async counterpart.
- **Factory function** (`create_parser()`) for zero-config quick start.
- **Adapter pattern** – swap GhostScript/VeraPDF for any CLI tool by
  implementing `IBaseAdapter`.
- **CLI** – `python -m pdfa_parser input.pdf output.pdf`.

---

## Project structure

```
pdfa-parser/
├── scripts/
│   └── setup_binaries.py          # Download & install GhostScript + VeraPDF
├── src/
│   ├── bin/
│   │   ├── ghostscript/            # GhostScript binary (gswin64c.exe / gs)
│   │   └── verapdf/                # VeraPDF CLI (verapdf.bat / verapdf)
│   └── pdfa_parser/
│       ├── __init__.py             # Public API + create_parser()
│       ├── __main__.py             # python -m pdfa_parser
│       ├── main.py                 # CLI entry-point
│       ├── pdf_parser.py           # PdfParser – convert / validate
│       ├── settings.py             # Binary path resolution
│       ├── interfaces/
│       │   ├── base_adapter.py     # IBaseAdapter (ABC)
│       │   └── binary_executer.py  # BinaryExecuter (facade)
│       └── implementations/
│           ├── ghostscript_adapter.py
│           └── verapdf_adapter.py
├── tests/
│   ├── conftest.py                 # Fixtures: PDF generation, skip markers
│   ├── test_unit.py                # 26 unit tests (no binaries needed)
│   └── test_integration.py         # 20 integration tests (need binaries)
├── pyproject.toml
└── README.md
```

---

## Requirements

| Dependency    | Required for     | Notes                                    |
| ------------- | ---------------- | ---------------------------------------- |
| Python ≥ 3.10 | Everything       | Uses `match`/`case`, `X \| Y` unions, etc. |
| GhostScript   | Conversion       | `gswin64c.exe` (Win) or `gs` (Unix)      |
| VeraPDF       | Validation       | Requires **Java ≥ 11** on `PATH`         |

---

## Installation

### 1. Clone & create virtual environment

```bash
git clone <repo-url> pdfa-parser
cd pdfa-parser
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate
```

### 2. Install the package (editable + dev dependencies)

```bash
# Using uv (recommended)
uv pip install -e ".[dev]"

# Or plain pip
pip install -e ".[dev]"
```

### 3. Install binaries

The automated setup script downloads GhostScript and VeraPDF into `src/bin/`:

```bash
python scripts/setup_binaries.py          # both
python scripts/setup_binaries.py --gs      # GhostScript only
python scripts/setup_binaries.py --verapdf # VeraPDF only
```

**Prerequisites for the script:**

| Binary      | Windows requirement | Unix requirement |
| ----------- | ------------------- | ---------------- |
| GhostScript | [7-Zip](https://7-zip.org/) on `PATH` (`7z`) | `tar` (pre-installed) |
| VeraPDF     | Java ≥ 11 on `PATH` | Java ≥ 11 on `PATH` |

> **Tip:** You can also install GhostScript / VeraPDF manually and copy (or
> symlink) the executables into `src/bin/ghostscript/` and `src/bin/verapdf/`.
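
After a manual install it can help to confirm that the executables are where this README expects them. The following is a minimal sketch, not part of pdfa-parser: `expected_binaries` and `missing_binaries` are hypothetical helpers. The file names come from the project structure above, and the assumption that each executable sits directly inside its folder may not match how `settings.py` resolves paths.

```python
import sys
from pathlib import Path


def expected_binaries(bin_root: Path, platform: str = sys.platform) -> dict[str, Path]:
    """Map each tool to the executable path implied by the README layout (assumed)."""
    windows = platform.startswith("win")
    return {
        "ghostscript": bin_root / "ghostscript" / ("gswin64c.exe" if windows else "gs"),
        "verapdf": bin_root / "verapdf" / ("verapdf.bat" if windows else "verapdf"),
    }


def missing_binaries(bin_root: Path, platform: str = sys.platform) -> list[str]:
    """Return the names of the tools whose executable file is not present."""
    return [
        name
        for name, path in expected_binaries(bin_root, platform).items()
        if not path.is_file()
    ]
```

Calling `missing_binaries(Path("src/bin"))` from the repository root returns an empty list when both tools are in place.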

---

## Quick start

### Python API

```python
from pdfa_parser import create_parser

# Create a parser with default adapters
parser = create_parser()                    # GhostScript + VeraPDF
parser = create_parser(with_verapdf=False)  # GhostScript only

# Convert PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")

# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant)   # True / False
print(result.profile)     # "PDF/A-2B validation profile"

# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
```
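
The one-shot call above extends naturally to a whole folder. The sketch below is illustrative only: `convert_folder` is a hypothetical helper, not part of the library, and it assumes nothing beyond what the example above shows, namely that `convert_and_validate()` takes two path strings and returns an object with a `compliant` attribute.

```python
from pathlib import Path


def convert_folder(parser, src_dir: str, dst_dir: str) -> dict[str, bool]:
    """Convert every *.pdf in src_dir and record whether each output is compliant."""
    out_root = Path(dst_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    report: dict[str, bool] = {}
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = out_root / pdf.name
        # Paths are passed as strings, as in the README example.
        result = parser.convert_and_validate(str(pdf), str(target))
        report[pdf.name] = bool(result.compliant)
    return report
```

With a real parser this would be called as `convert_folder(create_parser(), "incoming", "archive")`; files mapped to `False` in the returned dictionary failed validation.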

### Async API

Every public method has an async twin with an `a_` prefix:

```python
import asyncio
from pdfa_parser import create_parser

async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
```
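
The async methods are most useful when several files are processed at once. The sketch below shows one way to cap concurrency with a semaphore. `convert_many` and its `limit` parameter are hypothetical, and the only library call assumed is `a_convert(input, output)` as used above.

```python
import asyncio


async def convert_many(parser, jobs: list[tuple[str, str]], limit: int = 4) -> list:
    """Run parser.a_convert() for each (input, output) pair, at most `limit` at a time."""
    gate = asyncio.Semaphore(limit)

    async def one(src: str, dst: str):
        # Wait for a free slot before starting the conversion.
        async with gate:
            return await parser.a_convert(src, dst)

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(one(src, dst) for src, dst in jobs))
```

Assuming each conversion launches its own GhostScript process, a small `limit` keeps CPU and memory use predictable on large batches.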

### CLI

```bash
# Basic conversion
python -m pdfa_parser input.pdf output.pdf

# With validation
python -m pdfa_parser input.pdf output.pdf --validate

# PDF/A level 1, flavour 1b
python -m pdfa_parser input.pdf output.pdf --level 1 --validate --flavour 1b
```
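
The CLI can also be driven from another Python program. The following sketch builds the argument list for the commands shown above. `build_cli_args` is a hypothetical helper, and it uses only the `--level`, `--validate` and `--flavour` flags that appear in this README.

```python
import sys


def build_cli_args(src, dst, level=None, validate=False, flavour=None) -> list[str]:
    """Assemble an argv list for `python -m pdfa_parser` using the current interpreter."""
    args = [sys.executable, "-m", "pdfa_parser", src, dst]
    if level is not None:
        args += ["--level", str(level)]
    if validate:
        args.append("--validate")
    if flavour is not None:
        args += ["--flavour", flavour]
    return args
```

The list can be passed to `subprocess.run(...)`. How the CLI reports failures through its exit code is not documented here, so check the return code and output before relying on it.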

---

## Advanced usage

### Custom adapters

```python
from pdfa_parser import IBaseAdapter, BinaryExecuter, PdfParser
from pathlib import Path

class MyGSAdapter(IBaseAdapter):
    def get_binary_path(self) -> Path:
        return Path("/usr/local/bin/gs")

parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,
    extra_gs_args=("-dQUIET",),
)
```

### Custom PDF/A level & extra GhostScript flags

```python
parser = create_parser(pdfa_level=3, extra_gs_args=("-dQUIET", "-r300"))
```

---

## Testing

```bash
# Unit tests (no binaries required) – always runnable
pytest tests/test_unit.py -v

# Integration tests (require GhostScript + VeraPDF in src/bin/)
pytest tests/test_integration.py -v

# Everything
pytest -v
```

### Test coverage summary

| Suite                 | Tests | Requires binaries | What it covers |
| --------------------- | ----: | ----------------- | -------------- |
| `test_unit.py`        |    26 | No                | Helpers, XML parsing, arg building, mocked convert/validate, async, factory |
| `test_integration.py` |    20 | Yes               | Real conversion (portrait, landscape, color, multi-page, text-heavy), VeraPDF validation, round-trip, async, PDF/A-1b |

Integration tests generate various PDF types using **reportlab** (portrait,
landscape, coloured shapes, multi-page, text-heavy) and run them through the
full GhostScript → VeraPDF pipeline.  Tests are **auto-skipped** when binaries
are missing.

---

## License

See [LICENSE](LICENSE).
