Metadata-Version: 2.4
Name: profgen
Version: 0.0.1rc1
Summary: Convert candidate CVs into a standardised Word profile, with no invented facts.
Author-email: Kevin Steptoe <kevin.steptoe@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/ksteptoe/profgen
Project-URL: Repository, https://github.com/ksteptoe/profgen
Project-URL: Documentation, https://profgen.readthedocs.io/
Keywords: cv,resume,docx,anthropic,claude,semiconductor,recruitment
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Other Audience
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE.txt
License-File: AUTHORS.md
Requires-Dist: click>=8.1
Requires-Dist: pydantic>=2.6
Requires-Dist: python-docx>=1.1
Requires-Dist: pdfplumber>=0.11
Requires-Dist: anthropic>=0.40
Provides-Extra: pdf-fast
Requires-Dist: pymupdf>=1.24; extra == "pdf-fast"
Provides-Extra: docs
Requires-Dist: sphinx>=7; extra == "docs"
Requires-Dist: myst-parser>=2; extra == "docs"
Provides-Extra: dev
Requires-Dist: profgen[docs,pdf-fast]; extra == "dev"
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-cov>=5; extra == "dev"
Requires-Dist: pytest-xdist>=3.6; extra == "dev"
Requires-Dist: pytest-timeout>=2.3; extra == "dev"
Requires-Dist: ruff>=0.6; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Requires-Dist: reportlab>=4.0; extra == "dev"
Requires-Dist: pre-commit>=3.7; extra == "dev"
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine>=5; extra == "dev"
Dynamic: license-file

# profgen (cv_formatter)

[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)

> Convert candidate CVs into standardised Word profiles — without inventing facts.

`profgen` (the tool is called **cv_formatter**) turns a candidate CV
(PDF/DOCX/TXT) into a standardised Word profile through a
verbatim-extract → typed-structure → grounding-check → render → review-report
pipeline. The profile is rendered against a template you supply, so any house
style — including a private or corporate one — can be applied without the
template living in the package.

## The one hard rule: no invented facts

Omitting information is acceptable; fabricating a company, tool, date,
qualification, institution or project is a defect. Concretely:

- Anything absent from the source CV is marked `"Not stated"` (scalars) or left
  as an empty list — never guessed.
- A deterministic, LLM-independent **grounding check** verifies that every
  extracted tool, certification, institution and project name actually appears in
  the source text. Anything it cannot find is flagged in the review report.
- **Employers are anonymised.** Experience is rendered as `Project N | <domain>`
  rather than by company name (the company is still extracted, purely so the
  grounding check can confirm nothing was invented).
- **No derived fields.** Years of experience, seniority and similar figures are
  never computed; the skills table's "Years Experience" column always reads
  `"Not stated"` unless the CV states a figure explicitly.

Each conversion therefore writes two files: the `.docx` profile **and** a sibling
`*.review.md` listing missing information and everything to verify before customer
submission.
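The grounding check described above can be illustrated with a minimal sketch. It is not the package's actual implementation: `check_grounding`, `GroundingResult` and the whitespace/case normalisation are assumptions made for illustration. The only behaviour taken from this README is that each extracted name must appear in the source text and that anything missing is flagged.

```python
import re
from dataclasses import dataclass, field


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so a line break in the CV cannot hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


@dataclass
class GroundingResult:
    """Hypothetical result type: names found in the source vs. names to flag."""

    grounded: list[str] = field(default_factory=list)
    ungrounded: list[str] = field(default_factory=list)


def check_grounding(names: list[str], source_text: str) -> GroundingResult:
    """Deterministic substring check: no model call, no fuzzy matching."""
    haystack = _normalise(source_text)
    result = GroundingResult()
    for name in names:
        needle = _normalise(name)
        if needle and needle in haystack:
            result.grounded.append(name)
        else:
            # Anything not found would be listed in the review report.
            result.ungrounded.append(name)
    return result


source = "Designed RTL in SystemVerilog.\nVerified with UVM at the University of\nBristol."
result = check_grounding(["SystemVerilog", "University of Bristol", "Kubernetes"], source)
print(result.ungrounded)  # ['Kubernetes']
```

A plain substring test is deliberately conservative: it may flag a genuine item that the CV spells differently, which errs on the side of human review rather than silent acceptance.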

## Installation

Not yet published to PyPI. Install from source (Python 3.13+):

```bash
git clone https://github.com/ksteptoe/profgen
cd profgen
make dev                  # editable install with all dev/docs extras
# or, equivalently:
pip install -e ".[dev]"
```

## Quickstart

```bash
# 1. Generate a starter .docx style-donor template (neutral default styles).
profgen make-template templates/profile_template.docx

# 2. Convert a CV offline (no API key, no network) — produces out.docx AND out.review.md.
profgen convert cv.pdf --output out.docx --offline
```

When `--output` is omitted the profile is written to `<source-stem>_profile.docx`
in the current directory (so `cv.pdf` becomes `cv_profile.docx`), with the review
report alongside.
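The naming rule above can be sketched with `pathlib`. The helper names `default_output_path` and `review_path_for` are hypothetical, and the assumption that the review report reuses the profile's stem with a `.review.md` suffix is inferred from the `out.docx` / `out.review.md` pair in the Quickstart.

```python
from pathlib import Path


def default_output_path(source: Path, cwd: Path) -> Path:
    """Profile path used when --output is omitted: <source-stem>_profile.docx in cwd."""
    return cwd / f"{source.stem}_profile.docx"


def review_path_for(profile: Path) -> Path:
    """Sibling review report, assumed to share the profile's stem."""
    return profile.with_name(f"{profile.stem}.review.md")


profile = default_output_path(Path("inbox/cv.pdf"), Path("."))
print(profile)                   # cv_profile.docx
print(review_path_for(profile))  # cv_profile.review.md
```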

`cv-formatter` is an identical alias for `profgen`, and `python -m profgen` works
too. Run `profgen convert --help` for the full option list.

### Bring your own template

The renderer binds content to five **logical roles** — `title`, `date_heading`,
`body`, `bullet` and `legal` — rather than to fixed style names. By default each
role maps to a neutral built-in or starter style (`DEFAULT_STYLE_MAP`):

| Role           | Default style   |
|----------------|-----------------|
| `title`        | `Profile Title` |
| `date_heading` | `Profile Date`  |
| `body`         | `Normal`        |
| `bullet`       | `List Bullet`   |
| `legal`        | `Profile Legal` |

To apply your own house style, pass your branded document with `--template` and a
TOML **style map** with `--style-map` that points each role at the real paragraph
style names in *your* document:

```bash
profgen convert cv.pdf --template my_template.docx --style-map my-style-map.toml
```

```toml
# my-style-map.toml — map the logical roles to YOUR template's style names.
title        = "My Heading Style"
date_heading = "My Date Style"
legal        = "My Legal Style"
```

The map may be partial: any role you omit falls back to its default. This is how
a private or corporate template can be applied without it ever living in the
package.

### One-step branded profile (`make profile`)

For repeated runs against a confidential template there is a convenience target:

```bash
cp examples/style-map.example.toml local/style-map.toml   # then edit to taste
# drop your branded template at local/template.docx
make profile CV=cv.pdf            # Claude path (needs ANTHROPIC_API_KEY)
make profile CV=cv.pdf OFFLINE=1  # deterministic, network-free path
make profile CV=cv.pdf OUT=out.docx  # choose the output path explicitly
```

`make profile` renders against `local/template.docx` using `local/style-map.toml`.
The `local/` directory and `.env` are **gitignored**, so confidential templates
and API keys stay out of the repository.

## Offline vs real Claude path

The structuring stage has two interchangeable backends behind one interface:

- **Offline (`--offline`)** — the deterministic, network-free
  `HeuristicStructuringClient`. Needs no API key, makes no network call, and is
  what the entire test suite uses. Ideal for plumbing checks and CI.
- **Real Claude (default)** — the `ClaudeStructuringClient`, which calls the
  Anthropic API and needs `ANTHROPIC_API_KEY`. This path is deliberately **never**
  exercised in CI; it is smoke-tested only behind an explicit opt-in (see
  `examples/smoke_real_path.py`).

## Example

A runnable, fully-offline example builds a profile from a bundled synthetic CV
with no API key:

```bash
.venv/bin/python examples/build_example_profile.py
```

It reads `examples/input_cvs/sample_cv.txt`, runs the offline pipeline, and writes
the profile and its review report into `examples/output_profiles/` (gitignored).

## Development

```bash
make dev      # editable install with all dependencies
make test     # run the fully-offline test suite
make lint     # ruff
make format   # ruff --fix
make docs     # build the Sphinx HTML User Guide
make docs-pdf # build a single PDF of the docs (needs a LaTeX toolchain)
```

Quality gates: `ruff` clean, `mypy --strict` clean (scoped to `src/`), and
`pytest` green with the network disabled. See the Sphinx **User Guide**
(`make docs`) for the full pipeline walkthrough, and the `cv_formatter_SPEC.md`
file in the repository root for the build contract.

`make docs-pdf` produces `docs/_build/latex/profgen.pdf`. It needs a system LaTeX
toolchain on `PATH` — `xelatex`, `latexmk`, and `makeindex` (install
[TeX Live](https://www.tug.org/texlive/) or, on Windows,
[MiKTeX](https://miktex.org/)). The toolchain is **not** pip-installable and is
optional: the target fails fast with a clear message if `latexmk` is missing.

## Note

This project has been set up using [PyScaffold](https://pyscaffold.org/) 4.6
with the [ClickStart](https://github.com/ksteptoe/pyscaffoldext-ClickStart) extension.
