Metadata-Version: 2.4
Name: pmcdb
Version: 0.0.3
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Dist: duckdb>=1.0
Requires-Dist: irohds>=0.3
Requires-Dist: coren>=0.1
Requires-Dist: certifi
Requires-Dist: platformdirs>=4.0
Requires-Dist: maturin>=1.12.6
Requires-Dist: numpy>=1.24 ; extra == 'rlm'
Provides-Extra: rlm
Summary: PubMed database builder and query interface
Keywords: pubmed,ncbi,bioinformatics,parquet,duckdb
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://codeberg.org/chaxor/pmcdb

# pmcdb

PubMed query tool: downloads the complete NCBI PubMed corpus (~40M articles),
parses every DTD field, and produces compact, queryable Parquet tables (~34 GB).
Every user serves the data back to the scientific community via irohds P2P.

## Install

```sh
uv add pmcdb
```

The Rust parser binary is bundled in platform-specific wheels; no Rust toolchain is needed.
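
The package also declares an optional `rlm` extra (per the metadata above, it adds numpy); it installs with the usual extras syntax:

```sh
uv add "pmcdb[rlm]"   # optional extra declared in the metadata (pulls in numpy)
```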

## Usage

```python
from pmcdb import PubMed

with PubMed() as db:
    df = db.df("SELECT * FROM citation WHERE pub_year = '2024' LIMIT 10")
    print(db.tables())  # 30 tables

# Reproducible checkpoint (query-time filter only)
with PubMed(through="2024") as db:
    df = db.df("SELECT count(*) FROM citation")
```

The first call triggers a build (~2 min on a 64-core machine, ~15 min on average).
Subsequent calls are instant (local cache) or a delta-efficient P2P fetch.
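
As a rough way to observe this, a minimal sketch using only the `PubMed` API shown above (timings will vary with hardware and network):

```python
import time

from pmcdb import PubMed

def timed_open() -> float:
    """Open the database, run one query, and return elapsed seconds."""
    t0 = time.perf_counter()
    with PubMed() as db:  # first run builds or fetches; later runs hit the cache
        db.df("SELECT count(*) AS n FROM citation")
    return time.perf_counter() - t0

print(f"cold open: {timed_open():.1f}s")  # build or P2P fetch
print(f"warm open: {timed_open():.1f}s")  # local cache
```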

## CLI

```sh
python -m pmcdb                     # build + compact + serve (default)
python -m pmcdb query "SELECT ..."  # run SQL
python -m pmcdb --no-compact        # build only, skip compaction
```

## Architecture

```
pmcdb-core (Rust)   download + parse XML.gz -> per-worker Parquet
                    quick-xml, arrow/parquet, crossbeam-channel, ureq, coren
pmcdb (Python)      DuckDB query layer, compaction, irohds P2P distribution
                    @irohds.memo on create_table, coren for resource limits
```
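
As a rough sketch of the Python-side pattern: `@irohds.memo` on `create_table` comes from the diagram above, but the function name, signature, and DuckDB-based body below are illustrative assumptions, not pmcdb's actual internals:

```python
import duckdb
import irohds

@irohds.memo  # per the diagram: memoize the build and distribute the result over P2P
def create_table(name: str, shard_glob: str, sort_key: str) -> str:
    # Illustrative compaction step: merge per-worker Parquet shards
    # into a single sorted Parquet file (signature is hypothetical).
    out = f"{name}.parquet"
    duckdb.execute(
        f"COPY (SELECT * FROM read_parquet('{shard_glob}') ORDER BY {sort_key}) "
        f"TO '{out}' (FORMAT parquet)"
    )
    return out

# Hypothetical usage: compact citation shards into one sorted table.
create_table("citation", "citation_worker_*.parquet", "pmid")
```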

**30 tables:** 27 from XML corpus + 3 auxiliary (journal catalog,
deleted PMIDs, computed author clusters from NCBI).
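
For orientation, a small loop over the API from the Usage section that prints a row count per table (a sketch, assuming `tables()` yields table names as strings and `df()` returns a pandas-style DataFrame):

```python
from pmcdb import PubMed

with PubMed() as db:
    for name in db.tables():                            # 30 table names
        n = db.df(f"SELECT count(*) AS n FROM {name}")  # one-row result
        print(f"{name}: {int(n['n'].iloc[0]):,} rows")
```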

- Parse: ~70-80k records/sec (zero-copy RowWriter into Arrow builders)
- Full dataset: ~34 GB Parquet (vs. 239 GB SQLite)
- Resumes after interruption via a `_state` file
- Deterministic: the same FTP state yields byte-identical sorted Parquet (see the verification sketch below)
- Adaptive flush threshold via coren (from a 4 GB Raspberry Pi to a 512 GB HPC node)
- Mandatory compaction before P2P: one sorted file per table
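
To spot-check the determinism claim, hash the compacted Parquet files from two builds of the same FTP state. The cache location below is an assumption (pmcdb depends on platformdirs, so a user data dir is plausible; the real layout may differ):

```python
import hashlib
from pathlib import Path

import platformdirs

# ASSUMPTION: compacted Parquet lives under the platformdirs user data dir.
# Adjust the path to wherever your pmcdb cache actually is.
data_dir = Path(platformdirs.user_data_dir("pmcdb"))

for f in sorted(data_dir.rglob("*.parquet")):
    digest = hashlib.sha256(f.read_bytes()).hexdigest()
    print(f"{digest[:16]}  {f.relative_to(data_dir)}")  # compare line-by-line across builds
```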

## Development

```sh
make dev       # maturin develop into venv
make test      # cargo test + pytest
make sync      # full pipeline: build + compact + serve
```

## Publishing

Credentials are read from the standard locations:
- `~/.pypirc` (twine)
- `~/.cargo/credentials.toml` (cargo)

```sh
make pub               # bump patch, test, build host wheel, upload, tag
make pub V=0.2.0       # explicit version
make pub-all           # all platforms (needs podman for Linux wheels)
make release V=0.2.0   # tag-only (CI builds + publishes)
```

## CI

CI runs on Codeberg Forgejo Actions: Rust and Python tests on every push and PR.
The release workflow builds Linux x86_64 and aarch64 wheels on tag push and
publishes to PyPI, crates.io, and a Codeberg release.

