Metadata-Version: 2.4
Name: mdtero
Version: 2026.4.26.1
Summary: Mdtero local CLI for source-first paper acquisition, parsing, Zotero import, and RAG handoff.
Author: Mdtero
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://mdtero.com
Project-URL: Repository, https://github.com/JonbinC/mdtero-backend
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: bibtexparser
Requires-Dist: citeproc-py==0.9.0
Requires-Dist: citeproc-py-styles==0.1.5
Requires-Dist: curl_cffi<0.12,>=0.11
Requires-Dist: email-validator
Requires-Dist: fastapi
Requires-Dist: google-cloud-storage==2.14.0
Requires-Dist: httpx
Requires-Dist: lxml
Requires-Dist: fastmcp
Requires-Dist: paperscraper
Requires-Dist: psycopg[binary]==3.2.1
Requires-Dist: pubmed-parser
Requires-Dist: pydantic-settings
Requires-Dist: PyMuPDF<1.26,>=1.24
Requires-Dist: python-multipart
Requires-Dist: pyzotero
Requires-Dist: requests
Requires-Dist: resend==2.23.0
Requires-Dist: stripe==14.4.1
Requires-Dist: uvicorn

# Mdtero Private Backend

This repository is the production backend source of truth for Mdtero.

Public beta messaging lives in the public repos. This repository stays focused on:

- FastAPI services and task orchestration
- auth, API keys, billing, and usage accounting
- parsing, translation, artifact packaging, and retrieval adapters
- source-first agent skill documents served from `api.mdtero.com`

## Public / Private Split

- `mdtero`: website, dashboard, guide, API docs, and public extension source/install guidance
- `mdtero-backend`: backend-only services, deployment assets, source-routing CLI surfaces, and private operator docs
- Current first-party extension scope is limited to web OAuth login/session handoff, parse, translate, PDF/EPUB upload, artifact download, and install guidance; do not add extension UI/source here.

## Deployment Truth

- Production `mdtero-api` is served from OCI SG AMD-2 behind Caddy at `https://api.mdtero.com`; pushes to `main` no longer deploy Cloud Run automatically.
- Use the manual GitHub Actions workflow in `.github/workflows/deploy-backend.yml` to run tests, build/push the production image to GHCR as `ghcr.io/jonbinc/mdtero-api`, execute the deploy job on the AMD-2 self-hosted runner, update `/opt/mdtero-api/docker-compose.yml`, and smoke-test production. See `docs/ops/amd2-backend-deploy.md`.
- `cloudbuild.yaml` is retained as a rollback-only Cloud Run definition. Do not attach it to an automatic Cloud Build trigger unless intentionally rolling production back to Cloud Run.
- Local filesystem storage on AMD-2 is the production default artifact backend: `STORAGE_BACKEND=local`, `LOCAL_STORAGE_DIR=/app/storage`, host path `/opt/mdtero-api/storage`, container path `/app/storage`.
- GCS artifact storage remains rollback-only / legacy compatibility; do not treat `GCS_BUCKET_NAME` or `GOOGLE_APPLICATION_CREDENTIALS` as the default AMD-2 production path.
- Live discovery search is production-enabled only after the runtime secret store contains the provider key; bind `OPENALEX_API_KEY` and set `MDTERO_ENABLE_DISCOVERY_SEARCH=true` in the same rollout. Do not put the key value in source, and do not reference a missing secret from the default deploy path.
- Standalone GROBID is retired and is not part of the GitHub/Cloud Run production deployment chain; `docs/ops/standalone-grobid-cloud-run.md` is kept only as an archival note.
- Direct upload-PDF / cloud PDF handling is the maintained PDF path. The longer-term direction is to move the remaining scholarly PDF credentials and routing into MinerU-managed production configuration rather than bringing back a separate GROBID deploy.
- This repository is the only backend code SSOT; do not recreate a shadow backend under `mdtero/private/backend` or any other parallel path.
- Do not assume a push to `mdtero` deploys the public backend API.
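
The storage and discovery-search rules above can be read as a small configuration resolver. The following is a minimal sketch, not the backend's actual settings code: the function names `resolve_storage` and `discovery_search_enabled` are hypothetical, and falling back to the documented defaults when a variable is unset is an assumption. Only the environment variable names and default values come from the list above.

```python
from collections.abc import Mapping


def resolve_storage(env: Mapping[str, str]) -> dict[str, str]:
    """Return artifact storage settings, assuming the documented AMD-2 defaults."""
    return {
        "backend": env.get("STORAGE_BACKEND", "local"),
        "local_dir": env.get("LOCAL_STORAGE_DIR", "/app/storage"),
    }


def discovery_search_enabled(env: Mapping[str, str]) -> bool:
    """Discovery search is on only when the flag is true AND the provider key is bound."""
    flag_on = env.get("MDTERO_ENABLE_DISCOVERY_SEARCH", "").strip().lower() == "true"
    key_bound = bool(env.get("OPENALEX_API_KEY", "").strip())
    return flag_on and key_bound
```

Passing `os.environ` to either function evaluates the current process environment. Requiring both the flag and the key mirrors the rule that they must ship in the same rollout.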

When the public install flow changes, update the matching skill or install guide here in the same round.

## Install / Bootstrap Truth

- The formal `mdtero` CLI release path is PyPI + `uv tool install mdtero`.
- `curl -Ls https://mdtero.com/install.sh | sh -s -- --agent <target>` remains a legacy/bootstrap path until the public install surfaces are fully switched over.
- `npm install -g mdtero-install@0.1.8` is a legacy/bootstrap package, not the formal `mdtero` CLI release channel.
- This repository already defines the intended Python package in `pyproject.toml`: `project.name = "mdtero"` and `project.scripts.mdtero = "mdtero_cli:main"`.
- `npx mdtero-install install <target>` installs the matching agent skill bundle; it is not part of the formal `mdtero` CLI release gate.
- `uv tool install mdtero` is the formal user install path for the published Python package; that package includes the local CLI surfaces for `curl_cffi` acquisition, Zotero import, RAG, MCP handoff, parse, translate, status, and download.
- OpenClaw remains separate: use `clawhub install mdtero`, not the install-script target list.

Maintainer-only Python package smoke test before publishing:

```bash
python3.12 -m venv .venv
./.venv/bin/python -m pip install -r requirements.txt
./.venv/bin/python -m pip wheel --no-deps -w /tmp/mdtero-wheel .
UV_TOOL_DIR=/tmp/mdtero-tools UV_TOOL_BIN_DIR=/tmp/mdtero-bin \
  uv tool install /tmp/mdtero-wheel/mdtero-*.whl --python 3.12
/tmp/mdtero-bin/mdtero --help
```

The smoke test must show exactly one installed `mdtero` executable, and `mdtero --help` must list `parse-files`, `zotero import`, `rag serve`, and the other local runtime commands. Publishing to PyPI is a separate maintainer action and is the only formal external release step for the CLI.
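
The help-output check can also be automated. The sketch below is illustrative only: `missing_commands` and `check_installed_cli` are hypothetical helpers, and plain substring matching against the help text is an assumption, since the exact layout of `mdtero --help` is not specified here.

```python
import subprocess

# Commands named explicitly above; the real help output is expected to list more.
REQUIRED_COMMANDS = ("parse-files", "zotero import", "rag serve")


def missing_commands(help_text: str, required: tuple[str, ...] = REQUIRED_COMMANDS) -> list[str]:
    """Return the required command names that do not appear in the help text."""
    return [name for name in required if name not in help_text]


def check_installed_cli(executable: str = "/tmp/mdtero-bin/mdtero") -> list[str]:
    """Run `<executable> --help` and report which required commands are missing."""
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=True
    )
    return missing_commands(result.stdout)
```

An empty list from `check_installed_cli()` would mean every named command was found in the help output.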

For maintainers, the repo now separates package verification from external release:

- `.github/workflows/package-mdtero-cli.yml` builds and smokes the package without publishing.
- `.github/workflows/publish-mdtero-cli.yml` is the manual Trusted Publishing path for PyPI after the `mdtero` project on PyPI is linked to this GitHub repository.
- Before the first external publish, create the PyPI account/project binding for `mdtero`, enable Trusted Publishing for this repository/workflow, and then trigger the publish workflow manually.

## Repository Map

- [`service`](service): app routes, orchestration, auth, billing, and provider adapters
- [`tests/service`](tests/service): backend test coverage
- [`service/legacy_parser`](service/legacy_parser): quarantined compatibility parser code that no longer owns runtime design
- [`tests/legacy_parser`](tests/legacy_parser): regression coverage for compatibility parser code
- [`skills`](skills): install and workflow documents served to agents
- [`scripts`](scripts): operator and migration utilities
- [`docs`](docs): internal product and engineering notes
  - [`docs/partner`](docs/partner): partner-facing API, capability, and maturity package
  - [`docs/feedback_audit_ssot.md`](docs/feedback_audit_ssot.md): quality feedback review SSOT for human and AI-assisted auditing
  - [`docs/architecture/backend-ssot-refactor-blueprint.md`](docs/architecture/backend-ssot-refactor-blueprint.md): backend-first SSOT and readability refactor blueprint
  - [`docs/superpowers/specs/2026-05-06-cloud-parse-sdk-design.md`](docs/superpowers/specs/2026-05-06-cloud-parse-sdk-design.md): planned Python import API boundary for cloud parse tasks
  - [`docs/PYTHON_SURFACE_MAP.md`](docs/PYTHON_SURFACE_MAP.md): runtime-vs-compatibility ownership map for Python files
  - [`docs/RUNTIME_BOUNDARY_RULES.md`](docs/RUNTIME_BOUNDARY_RULES.md): repository boundary rules for runtime owners vs. labs, runs, tmp, and scripts
  - [`docs/SCRIPT_SURFACE_INDEX.md`](docs/SCRIPT_SURFACE_INDEX.md): maintained script-bucket ownership index

## Experimental Parsing Progress

The production parse chain is still the release gate.

In parallel, the experimental `parser_v2` line under [`service/parser_v2`](service/parser_v2) is now materially ahead on architecture and source coverage:

- shared `AST -> Markdown` kernel
- `JATS`, `Elsevier XML`, and `TEI` importers
- server-side `curl_cffi` HTML/XML/EPUB parsing with publisher adapters and quality gates
- local/OA `EPUB` parsing
- experimental `PDF -> GROBID -> TEI -> AST -> Markdown` fallback
- experimental unified upload entrypoint `POST /tasks/parse-fulltext-v2` for structured full-text handoff
- authenticated runtime diagnostics entrypoint `GET /diagnostics/parser-v2/shadow` for shadow-flag visibility

Mainline adoption status as of `2026-03-25`:

- `arxiv_native` is already executed through the V2 AST/Markdown kernel on the production parse path
- generic structured XML uploads now route through the V2 structured-XML path for supported families (`Elsevier XML`, `JATS`, `TEI`) even when entering from the legacy `/tasks/parse-upload` surface
- remote structured-fulltext routes that land on uploaded XML parsing therefore also benefit from the same V2 normalization path
- `PDF -> GROBID -> TEI -> AST -> Markdown` is now considered the only maintained scholarly PDF fallback path through the V2 upload surfaces, and should remain low-profile rather than a promoted primary route
- non-PDF project attachments are being standardized behind a separate `MarkItDown` sidecar plan so generic file ingestion does not pollute the scholarly parser runtime
- Playwright, browser extension capture, browser bridge, helper-bundle upload, and helper self-update are retired from the maintained CLI/runtime path
- legacy XML upload wrappers are now thin delegators into `service/parser_v2/uploaded_parse.py`
- production parse subprocess now enters through `python -m service.parse_cli`, which keeps the runtime entry stable while the underlying implementation continues migrating
- `service.parse_cli` now delegates into `service/parser_v2/cli.py` instead of importing the root legacy parser directly
- the arXiv runtime flow is now owned by `service/parser_v2/arxiv_runtime.py`; legacy arXiv compatibility code lives under `service/legacy_parser/`
- AI-markdown sidecar rendering is now owned by `service/parser_v2/markdown.py` rather than the root parser module
- structured XML figure/table asset localization should now be source-first:
  - importers should preserve native figure references where available
  - missing assets should be fused from publisher HTML/PDF/MinerU sources where lawful and available
- raw full-text cache policy is now short-lived by default:
  - `L1` ephemeral execution cache stays at `24h`
  - `L2` user-private raw cache now defaults to `7 days`
  - `L4` public-open raw cache now defaults to `7 days`
- source-first production rollout should prefer native/API/XML/EPUB/HTML routes before PDF fallback:
  - `/tasks/parse` should route DOI / URL acquisition through server/native/legal machine-friendly formats when available
  - local-only content should enter through direct upload or Zotero attachment import, not browser automation
  - CLI / signed-in frontend can inspect current connector shadow posture through `GET /diagnostics/parser-v2/shadow`
  - local project authority for the helper CLI now starts with `mdtero init` and can be inspected with `mdtero status --json`, which reports `.mdtero/` identity, config-source precedence, and stored diagnostics without implying parse readiness
- legacy parser compatibility code now lives under `service/legacy_parser/`; repository root should not host parser entrypoints
- the current Python ownership map and archive boundary are tracked in [`docs/PYTHON_SURFACE_MAP.md`](docs/PYTHON_SURFACE_MAP.md)
- repository boundary rules for `runs/`, `labs/`, `tmp/`, and `scripts/` are tracked in [`docs/RUNTIME_BOUNDARY_RULES.md`](docs/RUNTIME_BOUNDARY_RULES.md)

Current experimentally validated connector status:

- promotion-ready now:
  - `arxiv_native`
  - `europe_pmc`
  - `plos`
  - `elife_jats_xml`
  - `biorxiv`
  - `medrxiv`
  - `mdpi_epub_asset`
  - `springer_openaccess_api`
  - `elsevier_article_retrieval_api`
  - `springer_subscription_connector`
  - `wiley_tdm`
  - `taylor_francis_tdm`

Important rollout interpretation:

- `promotion-ready` means a connector is strong enough for a `shadow / feature-flag` rollout, not that it should immediately become the new global production default
- the current single-source summary for `shadow` vs `default cutover` is:
  - `labs/vendor_promotion_validation/MATURITY_MATRIX.md`
  - `runs/vendor_promotion_validation/shadow-rollout.json`
- as of `2026-03-27`, the practical posture is:
  - already-live V2 behavior: `arxiv_native`, uploaded structured XML, uploaded PDF fallback, source-first HTML/XML/EPUB routing
  - next shadow-first connectors: `springer_subscription_connector`, `wiley_tdm`, `taylor_francis_tdm`, `springer_openaccess_api`, `elsevier_article_retrieval_api`
  - still not valid to describe as default production automatic acquisition: `Wiley browser-bridge HTML`, `MDPI` direct server fetch, `Taylor & Francis OA EPUB`
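
The distinction between `promotion-ready`, next-in-shadow, and already-live connectors can be sketched as a lookup. The connector names are taken from the lists above; the posture labels (`live`, `shadow_next`, `shadow_eligible`, `not_promoted`) and the `rollout_posture` function are hypothetical. The authoritative state remains the maturity matrix and `shadow-rollout.json` named above.

```python
PROMOTION_READY = frozenset({
    "arxiv_native", "europe_pmc", "plos", "elife_jats_xml", "biorxiv", "medrxiv",
    "mdpi_epub_asset", "springer_openaccess_api", "elsevier_article_retrieval_api",
    "springer_subscription_connector", "wiley_tdm", "taylor_francis_tdm",
})
# Only connector named under "already-live V2 behavior".
ALREADY_LIVE = frozenset({"arxiv_native"})
# Connectors listed as "next shadow-first".
NEXT_SHADOW = frozenset({
    "springer_subscription_connector", "wiley_tdm", "taylor_francis_tdm",
    "springer_openaccess_api", "elsevier_article_retrieval_api",
})


def rollout_posture(connector: str) -> str:
    """Map a connector name to an illustrative posture label."""
    if connector in ALREADY_LIVE:
        return "live"
    if connector in NEXT_SHADOW:
        return "shadow_next"
    if connector in PROMOTION_READY:
        return "shadow_eligible"  # promotion-ready, but not a default cutover
    return "not_promoted"
```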

Practical acquisition interpretation:

- `Europe PMC`, `PLOS`, `eLife`, `Springer OA`, `bioRxiv`, `medRxiv`, and `MDPI EPUB` are already executable open structured routes
- `Elsevier` is modeled as `api_first` under user entitlement or authorized API environment; Mdtero does not imply public access
- `Elsevier XML` uploaded or fetched through authorized acquisition lands on the V2 structured importer path on the main backend surface
- `Wiley`, `Springer subscription`, and `Taylor & Francis` should prefer source-first HTML/XML/PDF routes where `curl_cffi` or official APIs can fetch without browser automation
- `Wiley` has experimental official TDM evidence:
  - `Wiley TDM PDF -> GROBID -> Markdown`
  - live validation on this machine succeeded for `10.1002/er.7490`, `10.1002/er.6487`, and `10.1002/sam.11700`
  - observed fetch time was about `2.5s - 4.1s`, with end-to-end `PDF -> Markdown` around `8s` on a warm local GROBID container
- Browser acquisition is no longer a product route. Historical browser-extension and Playwright validation remains useful as archive evidence only; do not revive it as fallback.

Current architectural direction:

- `server` should increasingly act as discovery, routing, parsing, normalization, rendering, and structured persistence
- production acquisition should default to source-first server/native routes or explicit user-provided files/attachments
- helper/browser-first is retired; direct server-side DOI/URL fetching with lawful machine-friendly formats is the release target when quality gates pass
- local project ingest now has a project-owned pre-parse ledger: `mdtero ingest` records DOI/URL/local-file provenance into `.mdtero/state/ingest-ledger.sqlite3`, and `mdtero papers` projects honest readiness states such as `metadata_only`, `oa_location_found`, `fulltext_staged`, and `manual_action_required` without implying parse success
- local project parse now extends that same ledger with append-only `parse_attempts`, and `mdtero parse --json` operates on existing ingested project records instead of bypassing project authority
- local project Zotero import now supports both fixture replay and real read-only library access: `mdtero zotero import --fixture tests/service/fixtures/zotero/sample-library.json --json` replays fixture data, while `mdtero zotero import --library-id <id> --library-type user|group --api-key <key> --json` (or `--local` for Zotero local API) records durable Zotero mappings plus attachment discovery into the same ledger without promising sync or write-back
- local project RAG/chat now extends the same ledger with `rag_builds` and `rag_chunks`; `mdtero rag build --json` indexes parsed Markdown into project-owned retrieval state and `mdtero chat --json` answers from local lexical evidence while preserving the shared grounded-chat response shape
- local project dashboard now projects the same ledger through `mdtero dashboard --json` and plain-text `mdtero dashboard`, keeping `planned_lane`, `actual_lane`, `parser_label`, `artifact_outcome`, `reason_code`, Zotero attachment evidence, RAG readiness, and derived operator actions visible without introducing dashboard-owned state
- the first local parse path is intentionally narrow: staged local PDFs only, routed through the uploaded-PDF adapter seam so backend-owned lane/parser/failure vocabulary remains authoritative
- `mdtero papers --json` now exposes latest parse status plus durable parse history so downstream Zotero/RAG/TUI slices can consume the same local truth without reparsing CLI text
- server-side fetch should remain an optional convenience or coverage fallback, not the primary production ingestion posture
- `PDF` remains fallback, not the primary route
- when PDF fallback is used, `GROBID` is the only maintained engine on the scholarly parse path
- generic project-file fallback belongs in the isolated `MarkItDown` sidecar track, not in the default backend runtime dependency set
- canonical route semantics in the experimental line are now `source_first`, `jats_or_structured_xml_first`, `api_first`, `html_helper_first`, `epub_first`, `pdf_fallback_only`, and `legacy_parse`
- the planned Python import API should be a cloud parse SDK: `from mdtero import Mdtero` wraps hosted parse tasks and artifact download, while local parser modules remain backend internals
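
The append-only `parse_attempts` idea described above (latest status plus durable history from the same ledger) can be illustrated with the standard-library `sqlite3` module. This is a sketch under stated assumptions: only the table name `parse_attempts` and the field name `reason_code` appear in this README; the other columns, the status values, and the helper functions are hypothetical, and an in-memory database stands in for `.mdtero/state/ingest-ledger.sqlite3`.

```python
import sqlite3

# Hypothetical schema; the real ledger layout is not documented here.
SCHEMA = """
CREATE TABLE parse_attempts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id TEXT NOT NULL,
    status TEXT NOT NULL,
    reason_code TEXT
)
"""


def record_attempt(conn, paper_id, status, reason_code=None):
    """Append one attempt row; earlier rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO parse_attempts (paper_id, status, reason_code) VALUES (?, ?, ?)",
        (paper_id, status, reason_code),
    )


def latest_status(conn, paper_id):
    """Return the status of the most recent attempt, or None if there is none."""
    row = conn.execute(
        "SELECT status FROM parse_attempts WHERE paper_id = ? ORDER BY id DESC LIMIT 1",
        (paper_id,),
    ).fetchone()
    return row[0] if row else None


def history(conn, paper_id):
    """Return every recorded status for a paper, oldest first."""
    rows = conn.execute(
        "SELECT status FROM parse_attempts WHERE paper_id = ? ORDER BY id",
        (paper_id,),
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_attempt(conn, "paper-1", "failed", "example_reason")
record_attempt(conn, "paper-1", "succeeded")
```

Because rows are only ever inserted, a failed attempt stays visible in `history` even after a later attempt succeeds.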

Primary internal references:

- [`docs/superpowers/specs/2026-03-25-parser-kernel-v2-design.md`](docs/superpowers/specs/2026-03-25-parser-kernel-v2-design.md)
- [`docs/superpowers/specs/2026-03-25-parser-kernel-v2-alignment-audit.md`](docs/superpowers/specs/2026-03-25-parser-kernel-v2-alignment-audit.md)
- [`docs/superpowers/specs/2026-03-25-helper-extension-browser-bridge-design.md`](docs/superpowers/specs/2026-03-25-helper-extension-browser-bridge-design.md)
- [`labs/publisher_ingestion_probe/README.md`](labs/publisher_ingestion_probe/README.md)
- [`labs/local_helper_playbook/README.md`](labs/local_helper_playbook/README.md)
- [`labs/vendor_promotion_validation/README.md`](labs/vendor_promotion_validation/README.md)

## Grounded Chat Contract

The shared workspace `Notebook` rail depends on `POST /threads/{thread_id}/messages` as the backend SSOT for grounded project chat.

Current host posture:

- `apps/site-next` consumes this same thread-message contract through host-local transport layers
- frontend adapters normalize the payload before it reaches shared workspace components, so the backend response shape here remains the SSOT contract rather than a UI-only fork
- grounded chat is a workspace enhancement layer over parsed Mdtero documents, not a replacement for parsing and not a cross-project global assistant

Current V1 contract:

- request body always accepts `content`
- request body may also carry:
  - `scope_type`: `document`, `selection`, or `project`
  - `document_ids`: selected project document ids
  - `mode`: `grounded` or `synthesis`
  - `citation_limit`: optional citation cap for the returned answer
- backward compatibility is preserved:
  - plain `{ "content": "..." }` still works
  - omitted scope defaults to `project`
  - omitted mode defaults to `grounded`

Current V1 response shape includes:

- `answer`
- `citations`
- `used_embeddings`
- `retrieval_strategy`
- `scope_summary`
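
The V1 request and response shapes above can be sketched from a client's point of view. This is an illustrative sketch, not a published SDK: `build_message_payload` and `missing_response_keys` are hypothetical helpers, and only the field names, allowed values, and defaults come from the contract above.

```python
VALID_SCOPES = {"document", "selection", "project"}
VALID_MODES = {"grounded", "synthesis"}
RESPONSE_KEYS = ("answer", "citations", "used_embeddings", "retrieval_strategy", "scope_summary")


def build_message_payload(content, scope_type=None, document_ids=None,
                          mode=None, citation_limit=None) -> dict:
    """Build a V1 request body, spelling out the documented defaults."""
    payload = {
        "content": content,
        "scope_type": scope_type or "project",  # omitted scope defaults to project
        "mode": mode or "grounded",             # omitted mode defaults to grounded
    }
    if payload["scope_type"] not in VALID_SCOPES:
        raise ValueError(f"unknown scope_type: {payload['scope_type']!r}")
    if payload["mode"] not in VALID_MODES:
        raise ValueError(f"unknown mode: {payload['mode']!r}")
    if document_ids:
        payload["document_ids"] = list(document_ids)
    if citation_limit is not None:
        payload["citation_limit"] = citation_limit
    return payload


def missing_response_keys(response: dict) -> list[str]:
    """Return the documented V1 response keys absent from a decoded response."""
    return [key for key in RESPONSE_KEYS if key not in response]
```

Sending the defaults explicitly is a readability choice made in this sketch; per the contract, a plain `{ "content": "..." }` body is equally valid.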

Retrieval posture as of `2026-04-23`:

- default retrieval strategy is `lexical_v1`
- `used_embeddings` stays `false` in the default path
- local project chat now reuses this same contract shape through `mdtero chat --json`, with evidence restricted to project-owned `rag_chunks`
- scope-aware retrieval is supported for:
  - single document
  - selected documents inside a project
  - whole-project search
- `message_citations` remains the persistence SSOT for assistant citation rows on the backend thread route; the local helper path projects compatible citation payloads without inventing a second response contract
- answer generation now flows through a backend adapter so provider-specific behavior can change later without changing the route contract or dropping citation traceability
- future user-supplied LLM keys or embedding providers may sit behind this adapter, but Mdtero-owned parsing, retrieval scope, and citation payloads remain the fixed contract

## Local Development

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt  # full developer install
python3 -m unittest discover -s tests -v
python3 -m uvicorn service.main:app --reload
```

Python 3.12 is the intended local baseline. Parts of the service test surface use union syntax such as `str | None`, which requires Python 3.10 or newer, so virtualenvs on Python 3.9 or older fail during test collection, before the relevant backend code runs.

Install surface guidance:

- `pip install -r requirements.txt`
  - full developer install; includes both runtime deps and local/shadow tooling
- `pip install -r requirements-prod.txt`
  - lean deployed/runtime surface only; this is what the production image installs
  - do not add local replay / browser / scraping extras here just because they are useful on a laptop
- `pip install -r requirements-local.txt`
  - local source-first and shadow add-ons such as `curl_cffi`, `pyzotero`, `paperscraper`, and `pubmed-parser`
  - these are local Python backend/tooling dependencies, not packages bundled into the npm CLI or browser extension

Typical local developer bootstrap:

```bash
pip install -r requirements.txt
```

Pytest-backed service suite:

```bash
pip install -r requirements-test.txt
./.venv/bin/pytest tests/service -q
```

Full repository pytest sweep:

```bash
./.venv/bin/pytest -q
```

### Canonical pre-launch readiness command

The older representative journey gate entrypoint has been archived and is no longer an active backend validation surface. Historical journey-gate materials now live under `archive/validation/`.

Current active verification entrypoints should be taken from the live parsing and topic-batch surfaces documented elsewhere in this README rather than from the archived journey gate.

## Maintenance Rules

- keep credentials, publisher-side helpers, and operator workflows private
- prefer tested behavior in `service/` over one-off scripts
- keep agent install docs aligned with the current beta onboarding path
- treat this repository as the release gate for production auth, billing, parsing, translation, and helper-serving behavior
