Metadata-Version: 2.4
Name: vaxtract
Version: 0.1.0
Summary: Schema-validated extraction of neoantigen cancer-vaccine immunogenicity data from primary papers, on the Claude Agent SDK (bring-your-own-key).
Author-email: Samuel Ahuno <ekwame001@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/sahuno/democancerVaccineDBAgent
Project-URL: Repository, https://github.com/sahuno/democancerVaccineDBAgent
Keywords: neoantigen,cancer-vaccine,immunogenicity,extraction,claude,bioinformatics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2
Provides-Extra: agent
Requires-Dist: claude-agent-sdk>=0.2.87; extra == "agent"
Requires-Dist: pypdf>=4; extra == "agent"
Requires-Dist: openpyxl>=3.1; extra == "agent"
Requires-Dist: python-docx>=1.1; extra == "agent"
Provides-Extra: figures
Requires-Dist: vaxtract[agent]; extra == "figures"
Requires-Dist: pymupdf>=1.23; extra == "figures"
Requires-Dist: Pillow>=10; extra == "figures"
Provides-Extra: all
Requires-Dist: vaxtract[agent,figures]; extra == "all"
Provides-Extra: dev
Requires-Dist: vaxtract[agent,figures]; extra == "dev"
Requires-Dist: pytest>=7; extra == "dev"
Dynamic: license-file

# vaxtract

Schema-validated extraction of **neoantigen cancer-vaccine immunogenicity data**
from primary papers, built on the [Claude Agent SDK](https://github.com/anthropics/claude-agent-sdk-python).
Point it at a folder of paper files (PDF / XLSX / DOCX) and it returns a
**schema-validated, provenance-tracked JSON** extraction (per-peptide / per-epitope
immunogenicity, HLA restriction, evidence, survival outcomes, …) for human sign-off.

> **Bring your own key (BYOK).** You run the agent and pay for your own Anthropic
> usage (~$3/paper, varies). Your files never leave your machine — there is no
> hosted service.

> **Output is *silver*, not gold.** Every record carries provenance and is meant for
> a curator to review before use, not to be treated as ground truth.

## Install

```bash
pip install vaxtract                    # core: the schema/vocab data contract only
pip install "vaxtract[agent]"           # + the extraction agent (Claude Agent SDK + readers)
pip install "vaxtract[agent,figures]"   # + figure/image reading (PyMuPDF + Pillow)
```

`pip install vaxtract` pulls only `pydantic`, so you can `import vaxtract.schema`
to validate records without the Claude Agent SDK. **Running the agent** (the
`vaxtract` console script or `vaxtract.extract_paper`) needs the `[agent]` extra.

Running the agent also requires Python ≥ 3.10 **and** the Claude Code CLI on your
`PATH` — the Claude Agent SDK shells out to the `claude` binary:

```bash
npm install -g @anthropic-ai/claude-code
```

(The project's Docker image bundles the CLI for you.)
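
A quick preflight check can catch a missing CLI before a run starts. The sketch below uses only the standard library; the helper names `find_claude_cli` and `preflight_message` are hypothetical and not part of vaxtract.

```python
import shutil


def find_claude_cli(binary="claude"):
    """Return the absolute path of the CLI if it is on PATH, else None."""
    # shutil.which performs the same PATH lookup the shell would.
    return shutil.which(binary)


def preflight_message(binary="claude"):
    """Build a human-readable status line for the CLI lookup."""
    path = find_claude_cli(binary)
    if path is None:
        return (
            f"'{binary}' not found on PATH; install it with: "
            "npm install -g @anthropic-ai/claude-code"
        )
    return f"'{binary}' found at {path}"
```

Calling `print(preflight_message())` before invoking the agent gives an early, readable failure instead of an error from deep inside the SDK.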

## Authenticate (pick one)

```bash
# A) API key — pay-per-token
export ANTHROPIC_API_KEY=sk-ant-...

# B) Claude subscription: use a plan already logged in through the `claude` CLI
#    (pass --subscription to vaxtract; ANTHROPIC_API_KEY is then ignored)
```

## Run

```bash
vaxtract ./my_paper_dir out.json
vaxtract --subscription ./my_paper_dir out.json   # use plan quota
```

`my_paper_dir` is a folder containing the paper's `.pdf` and any supplementary
`.xlsx` / `.docx`. The agent reads the tables/text/figures, builds the record,
self-validates against the schema, and writes `out.json`.
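
To see which files in a folder match the formats named above before spending any tokens, something like the following may help. It is a minimal sketch: `list_paper_files` and `has_main_pdf` are hypothetical helpers, and the assumption that only the top level of the folder matters is mine, not a documented vaxtract behavior.

```python
from pathlib import Path

# Extensions named in this README; this helper ignores everything else.
PAPER_SUFFIXES = {".pdf", ".xlsx", ".docx"}


def list_paper_files(paper_dir):
    """Return the PDF/XLSX/DOCX files directly inside paper_dir, sorted by path."""
    root = Path(paper_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in PAPER_SUFFIXES
    )


def has_main_pdf(paper_dir):
    """True if at least one PDF (presumably the paper itself) is present."""
    return any(p.suffix.lower() == ".pdf" for p in list_paper_files(paper_dir))
```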

### As a library

```python
import asyncio
from vaxtract import extract_paper

asyncio.run(extract_paper("./my_paper_dir", "out.json"))
```
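
For several papers, the same call can be wrapped in a small batch runner. This is a sketch under assumptions: it treats the extractor as a coroutine function taking `(paper_dir, out_path)`, as the `asyncio.run(...)` call above suggests, and `extract_many` and `max_concurrent` are hypothetical names. Whether concurrent runs are advisable is not stated here, so set `max_concurrent=1` to process papers one at a time.

```python
import asyncio
from pathlib import Path


async def extract_many(paper_dirs, out_dir, extract, max_concurrent=2):
    """Run extract(paper_dir, out_path) per folder; map each folder to its output path or exception."""
    out_root = Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    gate = asyncio.Semaphore(max_concurrent)  # cap the number of papers in flight

    async def one(paper_dir):
        # Name each output after its paper folder, e.g. ./paper_a -> paper_a.json
        out_path = out_root / f"{Path(paper_dir).name}.json"
        async with gate:
            try:
                await extract(str(paper_dir), str(out_path))
                return str(paper_dir), out_path
            except Exception as exc:  # record the failure and keep going
                return str(paper_dir), exc

    return dict(await asyncio.gather(*(one(d) for d in paper_dirs)))
```

Usage would look like `asyncio.run(extract_many(["./paper_a", "./paper_b"], "./out", extract_paper))`, with `extract_paper` imported from `vaxtract`. Passing the extractor in as an argument keeps the runner testable with a stand-in coroutine.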

The data contract is importable without the SDK:

```python
from vaxtract.schema import ExtractedPaper, SCHEMA_VERSION
```
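
One way to use that contract is to re-validate an output file after a curator has edited it. The sketch below assumes `ExtractedPaper` is a Pydantic v2 `BaseModel` subclass and that the output file is a direct JSON serialization of it; neither is confirmed here. `load_extraction` is a hypothetical helper name.

```python
from pathlib import Path


def load_extraction(json_path):
    """Parse and validate an output file; a schema mismatch raises pydantic.ValidationError."""
    # Imported lazily so this module still loads where vaxtract is not installed.
    from vaxtract.schema import ExtractedPaper

    raw = Path(json_path).read_text(encoding="utf-8")
    # model_validate_json is the standard Pydantic v2 entry point for JSON text
    # (assumption: ExtractedPaper is a pydantic.BaseModel subclass).
    return ExtractedPaper.model_validate_json(raw)
```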

## What it extracts

Per paper: studies, patients, immunizing peptides, minimal epitopes, pools,
immunogenicity evidence (assay/outcome/magnitude), neoantigen mutations, survival
outcomes, clinical-benefit signals, safety, and vaccine-delivery covariates — all
validated against a versioned Pydantic schema (`SCHEMA_VERSION`).
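
For a first look at an output file without depending on any field names, the list-valued top-level sections can be counted with the standard library alone. This is a sketch that assumes the output is a single JSON object; the helper names are hypothetical, and any keys you see in practice come from the schema, not from this example.

```python
import json
from pathlib import Path


def count_sections(record):
    """Map each top-level key that holds a list to the number of items in it."""
    return {key: len(value) for key, value in record.items() if isinstance(value, list)}


def summarize_file(json_path):
    """Load an output file and count its list-valued top-level sections."""
    record = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")
    return count_sections(record)
```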

## Notes

- The agent is restricted to a curated toolset and is headless-safe (no host shell
  access; it cannot read or write outside the files you give it and the output path).
- Cost/turn backstops (`max_turns`, `max_budget_usd`) guard against runaway runs.
- Figure reading is optional: install the `[figures]` extra, which adds PyMuPDF and
  Pillow. PyMuPDF bundles its own libraries, so no system Poppler is needed.

## License

MIT © 2026 Samuel Ahuno
