Metadata-Version: 2.4
Name: codex-pdf
Version: 1.45.0
Summary: Authoritative, versioned PDF facts contract for Think Neverland tools. v1.21.x adds codex_pdf.errors (RFC 7807 Problem Details) as the cross-stack HTTP error envelope.
Author-email: Think Neverland <dev@thinkneverland.com>
License-Expression: AGPL-3.0-or-later
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: defusedxml>=0.7.1
Requires-Dist: fastapi>=0.115
Requires-Dist: gunicorn>=23.0
Requires-Dist: httpx>=0.27
Requires-Dist: jsonschema>=4.23
Requires-Dist: numpy>=1.26
Requires-Dist: pikepdf>=9.0
Requires-Dist: pillow>=10.0
Requires-Dist: pydantic>=2.8
Requires-Dist: pymupdf>=1.24
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: structlog>=24.1
Requires-Dist: tifffile>=2024.1
Requires-Dist: uvicorn>=0.30
Provides-Extra: ai
Requires-Dist: anthropic>=0.40; extra == 'ai'
Requires-Dist: pillow>=10.0; extra == 'ai'
Requires-Dist: pytesseract>=0.3.10; extra == 'ai'
Provides-Extra: ai-barcodes
Requires-Dist: pillow>=10.0; extra == 'ai-barcodes'
Requires-Dist: pylibdmtx>=0.1.10; extra == 'ai-barcodes'
Requires-Dist: pyzbar>=0.1.9; extra == 'ai-barcodes'
Provides-Extra: geom
Requires-Dist: pyclipr>=0.1.8; extra == 'geom'
Provides-Extra: redis
Requires-Dist: redis>=5.0; extra == 'redis'
Provides-Extra: retain
Requires-Dist: boto3>=1.34; extra == 'retain'
Provides-Extra: vision
Requires-Dist: httpx>=0.27; extra == 'vision'
Requires-Dist: imagehash>=4.3; extra == 'vision'
Requires-Dist: pillow>=10.0; extra == 'vision'
Description-Content-Type: text/markdown

---
title: "Overview"
description: "Authoritative read-only PDF facts + render engine for the Print with Synergy tool family. Versioned contract, schema-validated output, deployed as three services."
group: "Getting started"
order: 1
slug: "overview"
---

# codexPDF

[![Deploy on Railway](https://railway.com/button.svg)](https://railway.com/deploy/codexpdf)

`codexPDF` is the authoritative, read-only PDF facts + render reference
for the Print with Synergy tool family.

Other engines consult `codexPDF` for canonical document facts instead
of re-parsing PDFs independently. The contract is versioned and
schema-validated.

## Status

`codex-pdf 1.45.0`. Current surface includes:

- Python package (`codex_pdf`) with typed `pydantic` models.
- CLI (`codex-pdf extract|schema|contract|validate|probe|parity|render|serve`).
- HTTP API (`/v1/extract`, `/v1/probe`, `/v1/extract/stream`,
  `/v1/render/{page,separations,heatmap,layer}`,
  `/v1/coverage`, `/v1/plates/extract`,
  `/v1/sample/{color,density}`, `/v1/walk/{type4,content-stream}`,
  `/v1/color/{resolve,match-pantone,neutral-density,inkbook}`,
  `/v1/geom/{tile,intersect,union,difference,offset}`,
  `/v1/retention/delete`).
- **Unified input (1.36.0).** `POST /v1/extract` and `POST /v1/probe` accept
  the input families listed below, and every accepted input returns a normal
  `CodexDocument`. `summary.source_format` records `pdf` / `ai` / `eps` /
  `plate_tiff` / `plate_len` / `plate_tiff_it` / `plate_dcs` / `plate_cip3`,
  or one of the structural die families `cff2` / `ddes` / `dxf`, alongside a
  human-friendly `summary.source_format_label` (e.g. `"1-bit TIFF plate"`).
  Plate inputs additionally carry
  `summary.{ink_coverage,embellishments,plate_set}`. On the default full PDF
  extract (Ghostscript present), `summary.ink_coverage` also carries compact
  base64 PNG previews: a `tac_heatmap_png` plus a per-separation
  `preview_png` (downsampled, size-capped; disable with
  `CODEX_EXTRACT_INK_COVERAGE=false`). Encrypted `.lenx`, proprietary Scitex
  CT/LW (`.ct`/`.lw` CEPS), and proprietary binary structural CAD (`.ard`
  ArtiosCAD / `.dwg` AutoCAD) are detected and rejected cleanly with a precise
  remediation note. Accepted inputs:
  - PDFs.
  - Adobe Illustrator `.ai` files, sliced to their embedded PDF; legacy `.ai`
    converts via Ghostscript when available.
  - EPS / composite PostScript (`.eps`/`.ps`), always normalized to PDF via
    Ghostscript, with a clean 422 when gs is absent.
  - Raster plate / tooling files: 1-bit TIFF + Esko LEN (single or repeatable
    set), multi-page TIFF where each page is a separation, TIFF/IT
    (ISO 12639), DCS / DCS 2.0 + copydot, and CIP3 PPF ink-coverage. Coverage,
    screen ruling, and min-dot are computed directly from the rasters
    (gs-free, deterministic).
  - Structural die / CAD files (CFF2, DDES, DXF), parsed gs-free into
    `summary.dieline` (authoritative die size plus cut/crease/score/perf
    candidates).
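
The base64 PNG previews described above can be unpacked with the standard library alone. A minimal sketch, assuming each preview value is a plain base64 string with no `data:` URI prefix; the sample payload is synthetic and is not real extractor output:

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png_preview(b64_text: str) -> bytes:
    """Decode a base64 preview and confirm the bytes start like a PNG."""
    raw = base64.b64decode(b64_text, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded preview is not a PNG")
    return raw


# Synthetic stand-in for summary.ink_coverage (assumed shape: the key
# name comes from the README, the value here is a stub, not an image).
ink_coverage = {
    "tac_heatmap_png": base64.b64encode(PNG_SIGNATURE + b"stub").decode("ascii"),
}

heatmap = decode_png_preview(ink_coverage["tac_heatmap_png"])
```

Checking the signature before writing bytes to disk keeps a truncated or mislabeled field from being saved as a `.png`.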

### Supported input formats

Every accepted input flows through `POST /v1/extract` + `POST /v1/probe` and
returns a normal `CodexDocument`; `summary.source_format` records the family.
Tier-1 = fully decoded; Tier-2 = detected and rejected with a remediation note
(never a silent zero-finding result).

| Format | Ext | Tier | `source_format` | Notes |
|---|---|---|---|---|
| PDF | `.pdf` | 1 | `pdf` | the canonical path |
| Adobe Illustrator | `.ai` | 1 | `ai` | embedded-PDF slice; legacy → Ghostscript |
| EPS / composite PostScript | `.eps`/`.ps` | 1 | `eps` | always normalized to PDF via Ghostscript |
| 1-bit TIFF / Esko LEN | `.tif`/`.tiff`/`.len` | 1 | `plate_tiff` / `plate_len` | gs-free raster facts |
| Multi-page 1-bit TIFF | `.tif` | 1 | `plate_tiff` | one separation per page |
| TIFF/IT (ISO 12639) | `.ct`/`.lw`/`.fp`/`.tif` | 1 | `plate_tiff_it` | contone CT split into CMYK channels |
| DCS / DCS 2.0 + copydot | `.dcs`/`.eps` | 1 | `plate_dcs` | embedded/sidecar separations |
| CIP3 PPF | `.ppf` | 1 | `plate_cip3` | per-separation ink coverage + sheet geometry |
| CFF2 / CF2 (structural die) | `.cf2`/`.cff2` | 1 | `cff2` | ASCII vector die — cut/crease/score/perf geometry + die size |
| DDES2 / DDES3 (structural die) | `.dd3`/`.ds2`/`.ddes` | 1 | `ddes` | ASCII die-cutting exchange — line-type → subtype + die size |
| DXF (structural / CAD) | `.dxf` | 1 | `dxf` | ASCII ENTITIES (LINE/LWPOLYLINE/POLYLINE/ARC) grouped by layer → subtype |
| Encrypted Esko LEN | `.lenx` | 2 | — | closed CDI-Crystal — clean 422 |
| Scitex CT / LW (CEPS) | `.ct`/`.lw` | 2 | — | proprietary — clean 415, convert to TIFF/IT |
| ArtiosCAD ARD | `.ard` | 2 | — | proprietary Esko binary — clean 415, export CFF2/DDES/DXF |
| AutoCAD DWG | `.dwg` | 2 | — | proprietary binary CAD — clean 415, export ASCII DXF |
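
Several extensions in the table map to more than one family (`.eps`, `.tif`, `.ct`, `.lw`), so an extension alone only narrows the candidates. A minimal sketch of that lookup, transcribed from the table; the function name `classify` and the idea of pre-checking on the client are assumptions, and nothing here describes how codexPDF itself detects formats:

```python
from pathlib import PurePath

# Tier-1 rows: extension -> possible `source_format` values.
TIER1 = {
    ".pdf": ("pdf",),
    ".ai": ("ai",),
    ".eps": ("eps", "plate_dcs"),
    ".ps": ("eps",),
    ".tif": ("plate_tiff", "plate_tiff_it"),
    ".tiff": ("plate_tiff",),
    ".len": ("plate_len",),
    ".ct": ("plate_tiff_it",),
    ".lw": ("plate_tiff_it",),
    ".fp": ("plate_tiff_it",),
    ".dcs": ("plate_dcs",),
    ".ppf": ("plate_cip3",),
    ".cf2": ("cff2",),
    ".cff2": ("cff2",),
    ".dd3": ("ddes",),
    ".ds2": ("ddes",),
    ".ddes": ("ddes",),
    ".dxf": ("dxf",),
}

# Tier-2 rows: extensions that can belong to a rejected family.
TIER2 = {".lenx", ".ct", ".lw", ".ard", ".dwg"}


def classify(filename: str) -> tuple[tuple[str, ...], bool]:
    """Return (candidate source_format values, may_be_rejected)."""
    ext = PurePath(filename).suffix.lower()
    return TIER1.get(ext, ()), ext in TIER2
```

For example, `classify("scan.ct")` yields one Tier-1 candidate and also flags a possible rejection, because `.ct` is shared by TIFF/IT and Scitex CT.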

DCS / CIP3 separation rasters decode gs-free when the embedded data is TIFF;
EPS / legacy `.ai` always need Ghostscript and return a clean 422 when it is
absent. **Structural die / CAD files** (CFF2, DDES, DXF) are parsed gs-free
into a `CodexDocument` carrying `summary.dieline` — an authoritative die
**size** (high confidence) plus one **candidate per cutting sub-type**
(cut / crease / score / perf / kiss_cut / fold / glue / bleed) with
`source="structural"`. CFF2 / DDES sub-types come from the file's numeric
line-type code; DXF sub-types come from the entity's layer name (mapped via the
shared dieline vocabulary). The proprietary binary structural formats (ARD,
DWG) are detected and rejected with a remediation note (export CFF2/DDES/DXF).
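
The package summary names RFC 7807 Problem Details as the HTTP error envelope. A client-side sketch for handling the 415 / 422 rejections described above, assuming those responses use that envelope and carry the remediation note in `detail` (both are assumptions); the sample body is hypothetical:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    status: int
    title: str
    detail: str


def parse_problem(http_status: int, body: bytes) -> Problem:
    """Read the RFC 7807 members this sketch cares about from a JSON body."""
    data = json.loads(body)
    return Problem(
        # RFC 7807 makes every member optional, so fall back to the HTTP status.
        status=int(data.get("status", http_status)),
        title=str(data.get("title", "")),
        detail=str(data.get("detail", "")),
    )


def is_format_rejection(problem: Problem) -> bool:
    """415 and 422 are the two rejection statuses named in this section."""
    return problem.status in (415, 422)


# Hypothetical body; the real wording of the remediation note is not shown here.
sample = json.dumps({
    "type": "about:blank",
    "title": "Unsupported Media Type",
    "status": 415,
    "detail": "Proprietary binary CAD; export CFF2, DDES, or DXF.",
}).encode("utf-8")

problem = parse_problem(415, sample)
```
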
- TypeScript client (`@printwithsynergy/codex-client`) mirroring the
  Python `codex_pdf.client` surface, with SSE streaming for probe
  and extract, plus `computeCoverage` / `platesExtract`.
- Versioned schemas in `schemas/v1/` (document, color, geom, embellishment,
  ink-coverage, plate-set).
- Cloudflare Worker (`codex-edge`) providing a KV-backed
  write-through cache layer in front of the API.
- Redis-Streams speculator (`codex-speculator`) that pre-warms
  Phase 1 + Phase 2 caches.
- Opt-in retention to Cloudflare R2 for the marketing demo:
  `retain_for_training=true` on `POST /v1/extract` persists the
  PDF + extract + metadata under a hive-partitioned key; the
  default remains "delete bytes on response". See
  [`CLAUDE.md`](./CLAUDE.md) for the deployed bucket layout.

See [`CLAUDE.md`](./CLAUDE.md) for the full deployed-service map
(URLs, account IDs, version-bump checklist).

## Quickstart

```bash
uv sync
uv run codex-pdf probe input.pdf --json
uv run codex-pdf extract input.pdf --pretty > out.json
uv run codex-pdf validate out.json
uv run codex-pdf parity --fixtures-root tests/fixtures --profile summary --max-files 5
```

Run the HTTP API locally:

```bash
uv run codex-pdf serve --host 0.0.0.0 --port 8080
curl localhost:8080/v1/version
```

## Contract

The public surface is the JSON contract rooted at `CodexDocument`,
plus the per-section contracts under color and geom.

- Document schema: `schemas/v1/codex-document.schema.json`
- Runtime model: `codex_pdf.models.v1.CodexDocument`
- Stability policy: SemVer (`major` for breaking contract changes;
  field additions are minor bumps).
- Live contract endpoint: `GET /v1/contract` returns the endpoint
  inventory plus `section_schema_versions`.
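
The stability policy above reduces to a simple check on the consumer side. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` version strings with no pre-release suffix; `can_consume` is a hypothetical helper, not part of `codex_pdf`:

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def can_consume(built_against: str, producer: str) -> bool:
    """True when output from `producer` should satisfy a consumer that was
    written against `built_against`.

    Same major means no breaking contract change; an equal or newer minor
    means every field the consumer expects has already been added.
    """
    want = parse_version(built_against)
    have = parse_version(producer)
    return have[0] == want[0] and have[1] >= want[1]
```

A consumer built against 1.36.0 can read output from a later 1.x release because field additions are minor bumps, while output from an older minor or a different major needs review.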

## Documentation

| Topic | Doc |
| --- | --- |
| **Standard data-request pattern** (`requestAsset`) | [docs/data-requests.md](./docs/data-requests.md) |
| Architecture and boundaries | [docs/architecture.md](./docs/architecture.md) |
| CLI commands and usage | [docs/cli.md](./docs/cli.md) |
| Contract and schema versioning | [docs/contract.md](./docs/contract.md) |
| Determinism + transparency posture (model disclosure, version-pinned cache) | [docs/determinism.md](./docs/determinism.md) |
| Accuracy methodology (deterministic lanes) | [docs/accuracy.md](./docs/accuracy.md) |
| Deploying the API + speculator + edge | [docs/deploy.md](./docs/deploy.md) |
| Parity profiles and baselines | [docs/parity.md](./docs/parity.md) |
| Preflight ingest adapters | [docs/preflight-ingest.md](./docs/preflight-ingest.md) |
| Codex change ripple rule | [docs/operations/codex-change-ripple.md](./docs/operations/codex-change-ripple.md) |
| Marketing deploy template | [docs/operations/marketing-deploy-template.md](./docs/operations/marketing-deploy-template.md) |

## Contributing

We welcome PRs that fit codex's lane (extraction, normalization,
detection signals). Display concerns belong in **Lens**; file
comparison (two-file difference facts) belongs in **collate**; rule
pass/fail logic belongs in **Lint**.

Read [`CONTRIBUTING.md`](./CONTRIBUTING.md) for the dev setup, test
commands, schema-bump rules, and release checklist.

## Security

Please report vulnerabilities privately to
**`security@thinkneverland.com`** — do not open a public issue.

The full disclosure policy, supported-version matrix, and scope
(including the read-only PDF invariant) live in
[`SECURITY.md`](./SECURITY.md).

## License

`codexPDF` is distributed under the **GNU Affero General Public
License v3.0 or later** (`SPDX-License-Identifier:
AGPL-3.0-or-later`). The full license text is in
[`LICENSE`](./LICENSE).

AGPL applies in particular when codex is reachable over a network —
modifications served to remote users must be made available to
those users under the same terms.
