Metadata-Version: 2.4
Name: asta-papers
Version: 0.0.1
Summary: Legal-only paper full-text retrieval and conversion. DOI/PMID/PMCID/arXiv/Corpus ID + BYO PDF/JATS → markdown with license classification.
Project-URL: Homepage, https://github.com/allenai/asta-sdk
Project-URL: Repository, https://github.com/allenai/asta-sdk
Project-URL: Issues, https://github.com/allenai/asta-sdk/issues
Project-URL: Documentation, https://github.com/allenai/asta-sdk/tree/main/src/python/asta/papers/docs
Project-URL: Changelog, https://github.com/allenai/asta-sdk/blob/main/src/python/asta/papers/docs/CHANGELOG.md
Author: Allen Institute for AI
License: MIT
License-File: LICENSE
Keywords: arxiv,europepmc,jats,license-classification,ocr,open-access,pdf,pubmed,scientific-papers,text-mining,unpaywall
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: requests>=2.31
Provides-Extra: all
Requires-Dist: boto3>=1.34; extra == 'all'
Requires-Dist: mistralai<2,>=1.5; extra == 'all'
Requires-Dist: olmocr>=0.4; extra == 'all'
Requires-Dist: openai>=1.40; extra == 'all'
Requires-Dist: pillow>=10.0; extra == 'all'
Requires-Dist: pypdf>=4.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: freezegun>=1.5; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-benchmark>=4.0; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.6; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: responses>=0.25; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: mistral
Requires-Dist: mistralai<2,>=1.5; extra == 'mistral'
Provides-Extra: olmocr
Requires-Dist: olmocr>=0.4; extra == 'olmocr'
Requires-Dist: openai>=1.40; extra == 'olmocr'
Requires-Dist: pillow>=10.0; extra == 'olmocr'
Requires-Dist: pypdf>=4.0; extra == 'olmocr'
Provides-Extra: s3
Requires-Dist: boto3>=1.34; extra == 's3'
Description-Content-Type: text/markdown

# asta-papers

Legal-only paper full-text retrieval and conversion. Identifier (DOI / PMID /
PMCID / arXiv / Semantic Scholar Corpus ID) — or BYO PDF / JATS — to markdown
with explicit license classification.

Lifts paper recovery on biomedical literature from ~22% (Mistral OCR alone)
to ~85% using only publisher-blessed legal channels (NCBI E-utilities,
Unpaywall, EuropePMC, bioRxiv, institutional repositories).

## Install

```bash
pip install asta-papers                 # core (JATS conversion only)
pip install 'asta-papers[mistral]'      # + Mistral OCR for PDFs
pip install 'asta-papers[olmocr]'       # + local olmOCR for PDFs (offline)
pip install 'asta-papers[s3]'           # + s3:// BYO support
pip install 'asta-papers[all]'          # everything
```

## Quickstart

```python
import os
from asta_papers import Client
from asta_papers.converters.mistral import MistralConverter

c = Client(
    email="me@allenai.org",
    ncbi_api_key=os.environ.get("NCBI_API_KEY"),       # optional, lifts NCBI 3→10 rps
    converters=[MistralConverter()],                    # for PDF→markdown
)

# By identifier
r = c.fetch(doi="10.1186/s12943-024-02093-w")
print(r.success, r.license_class, r.markdown[:200])

# Storage-tier policy (save_artifact / extract_inline are your own functions)
if r.may_redistribute:                                   # CC BY / CC0 / CC BY-SA
    save_artifact(r.bytes)
elif r.may_use_for_tdm:                                  # TDM-permissive licenses; not BRONZE/UNKNOWN/CLOSED
    extract_inline(r.markdown)

# BYO PDF — bytes, local path, or URI
r = c.fetch(pdf=b"%PDF-...")
r = c.fetch(pdf="paper.pdf")
r = c.fetch(pdf="s3://my-bucket/paper.pdf")              # requires [s3]

# Batch with bounded concurrency + per-host rate limits
results = c.fetch_many([
    {"doi": "10.1038/foo"},
    {"pmcid": "PMC123"},
    {"pdf": "paper.pdf", "doi": "10.99/local"},
])
```

## How it works

A strategy ladder runs against legal aggregator APIs in order, returning the
first successful result:

1. **PMC E-utilities efetch** — JATS XML for PMC OA Subset articles
2. **NCBI elink** — PMID → PMC self-link when S2 didn't surface it
3. **Published-version handoff** — when input is a preprint DOI, route to
   the published version (bioRxiv API or Crossref `relation`) so callers
   get the most-recent public version of the paper
4. **arXiv** — direct PDF for arXiv papers
5. **bioRxiv / medRxiv API** — JATS XML or PDF for preprints
6. **Unpaywall** — best legal OA URL; routes PMC URLs through efetch
7. **Institutional repo scrape** — `hdl.handle.net`, `pure.eur.nl`, etc.
   (respects `robots.txt`)
8. **EuropePMC PDF render** — text-mining-licensed PDFs for free-to-read
   papers NCBI's OA Subset doesn't include
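
The ladder above can be pictured as a loop over callables that stops at the
first hit. This is only a minimal sketch of the control flow, not the
library's implementation: `Attempt`, `run_ladder`, and the two stand-in
strategies are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    """Hypothetical result of one strategy; not the library's FetchResult."""
    strategy: str
    content: str


# A strategy takes an identifier and returns an Attempt, or None on a miss.
Strategy = Callable[[str], Optional[Attempt]]


def run_ladder(identifier: str, strategies: list[Strategy]) -> Optional[Attempt]:
    """Try each strategy in order and return the first successful result."""
    for strategy in strategies:
        try:
            attempt = strategy(identifier)
        except Exception:
            # Treat an upstream failure as a miss and fall through to the next rung.
            continue
        if attempt is not None:
            return attempt
    return None


# Stand-in strategies for illustration only; they make no network calls.
def pmc_efetch(identifier: str) -> Optional[Attempt]:
    return None  # pretend the paper is not in the PMC OA Subset


def unpaywall(identifier: str) -> Optional[Attempt]:
    return Attempt(strategy="unpaywall", content=f"OA copy of {identifier}")


result = run_ladder("10.1234/example", [pmc_efetch, unpaywall])
print(result.strategy if result else "no legal copy found")
```

Ordering the list is the whole policy in this sketch: cheaper or
higher-fidelity sources (JATS XML) go first, and later rungs run only when
earlier ones miss.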

Per-host token-bucket rate limiting keeps each upstream within its published
quota: arXiv at 0.33 rps (one request every 3 seconds, their explicit rule)
and NCBI at 3 rps (10 with an API key), with the NCBI budget shared across
the `eutils.*`, `pmc.*`, and `www.*` hostnames. Independent services run
fully in parallel.
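
As a rough illustration of the idea, a token bucket with a shared key for
hostnames that draw on one quota might look like the sketch below. The
`TokenBucket` class, the `HOST_GROUPS` table, and the specific hostnames are
assumptions for the example, not the package's internals, and the sketch is
not thread-safe.

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def wait_time(self) -> float:
        """Refill from elapsed time; return seconds until one token is available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        delay = self.wait_time()
        if delay > 0:
            time.sleep(delay)
            self.wait_time()  # refill after sleeping
        self.tokens -= 1.0


# Hostnames that share one quota map to the same bucket key (illustrative table).
HOST_GROUPS = {
    "eutils.ncbi.nlm.nih.gov": "ncbi",
    "pmc.ncbi.nlm.nih.gov": "ncbi",
    "www.ncbi.nlm.nih.gov": "ncbi",
    "export.arxiv.org": "arxiv",
}

BUCKETS = {
    "ncbi": TokenBucket(rate=3.0),     # would be 10.0 with an API key
    "arxiv": TokenBucket(rate=1 / 3),  # one request every 3 seconds
}


def bucket_for(host: str) -> TokenBucket:
    """Look up the shared bucket for a hostname."""
    return BUCKETS[HOST_GROUPS[host]]


bucket_for("export.arxiv.org").acquire()  # first call does not block
```

Because buckets are keyed by group rather than by hostname, a burst against
`eutils.*` also slows requests to `pmc.*`, while arXiv traffic is unaffected.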

## License classification

Every successful fetch carries a `LicenseClass`:

```
cc-by  cc-by-sa  cc-by-nd  cc-by-nc  cc-by-nc-sa  cc-by-nc-nd  cc0
text-mining-only   bronze   arxiv-default   closed   unknown
```

`FetchResult` also exposes helper booleans (`may_redistribute`,
`may_redistribute_nc`, `may_make_derivatives`, `may_train_models`,
`may_use_for_tdm`) and a `source_type` field (`publisher` /
`repository` / `other`) for callers who want publisher-vs-repository
policy without parsing strings. With these, a storage-tier policy is a
one-line check.
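
One way to wrap that check in a reusable function is sketched below. The
`ResultFlags` dataclass is a stand-in for the two `FetchResult` booleans used
in the Quickstart, and the tier names are hypothetical labels chosen for the
example.

```python
from dataclasses import dataclass


@dataclass
class ResultFlags:
    """Stand-in carrying the two FetchResult booleans used by this policy."""
    may_redistribute: bool
    may_use_for_tdm: bool


def storage_tier(result: ResultFlags) -> str:
    """Map license booleans to a hypothetical storage tier name."""
    if result.may_redistribute:
        return "store-artifact"   # safe to keep and re-serve the full text
    if result.may_use_for_tdm:
        return "extract-inline"   # mine the text, do not persist a copy
    return "discard"              # e.g. bronze, unknown, or closed


print(storage_tier(ResultFlags(may_redistribute=False, may_use_for_tdm=True)))
```

Checking `may_redistribute` first means the most permissive tier always wins
when both flags are set.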

Successful results also carry an attribution blockquote at the top of
the markdown by default (source URL + DOI/PMCID + license + retrieval
strategy), so the markdown is self-attributing when it travels to end
users. Disable with `Client(include_attribution=False)`.
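
To make the idea concrete, the sketch below builds a one-line blockquote from
the fields listed above and prepends it to the markdown. The exact layout the
package emits is not documented here, so the field labels, the separator, and
both function names are assumptions.

```python
from typing import Optional


def attribution_blockquote(
    source_url: str,
    license_class: str,
    strategy: str,
    doi: Optional[str] = None,
    pmcid: Optional[str] = None,
) -> str:
    """Build a one-line markdown blockquote; the field layout is hypothetical."""
    parts = [f"Source: {source_url}"]
    if doi:
        parts.append(f"DOI: {doi}")
    if pmcid:
        parts.append(f"PMCID: {pmcid}")
    parts.append(f"License: {license_class}")
    parts.append(f"Retrieved via: {strategy}")
    return "> " + " | ".join(parts) + "\n\n"


def with_attribution(markdown: str, header: str, include_attribution: bool = True) -> str:
    """Prepend the header unless the caller opted out."""
    return header + markdown if include_attribution else markdown


header = attribution_blockquote(
    source_url="https://example.org/paper",
    license_class="cc-by",
    strategy="unpaywall",
    doi="10.1234/example",
)
print(with_attribution("# Title\n\nBody text.", header))
```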

## What's NOT here

- No Sci-Hub, no archive scraping, no UA spoofing past WAFs.
- No title-only paper search (use PaperFinder / S2 first to get an identifier).
- No multi-tenant `Credentials` per-call object (one Client per credential set).
- No async API (sync only in v0.1).

## Configuration

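See `Client.__init__` and `docs/concepts.md` for the full list. The only
required setting is `email` (the Crossref polite-pool identifier), passed as a
keyword argument or via the `ASTA_PAPERS_EMAIL` environment variable.
Recommended: an NCBI API key in the `NCBI_API_KEY` environment variable (free,
about five minutes to register; raises NCBI throughput from 3 to 10 rps,
about 3.3×).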

## Tests

```bash
pytest tests/integration -v                  # 29 real-API tests
python tools/check_test_legitimacy.py --strict  # asserts mock-ratio < 30%
```

The full integration suite hits live upstream APIs — no mocks. Tests run in
~90 seconds. A per-paper snapshot recovery benchmark (53 biomedical DOIs
that fail Mistral-OCR-only retrieval; 46/53 = 87% recovered) gates
regressions.

## Design

Full design at [`docs/DESIGN.md`](docs/DESIGN.md).
