Metadata-Version: 2.4
Name: europepmc-bulk
Version: 0.1.1
Summary: Bulk, parallel, resumable harvester for the Europe PMC corpus
Project-URL: Documentation, https://europepmc-bulk.readthedocs.io
Project-URL: Repository, https://github.com/Tianyi-Billy-Ma/europepmc-bulk
Project-URL: Issues, https://github.com/Tianyi-Billy-Ma/europepmc-bulk/issues
Project-URL: Changelog, https://github.com/Tianyi-Billy-Ma/europepmc-bulk/blob/main/CHANGELOG.md
Author: Billy Ma
License-Expression: MIT
License-File: LICENSE
Keywords: bioinformatics,data-collection,europepmc,harvester,scientific-literature
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Text Processing :: Markup :: XML
Requires-Python: >=3.10
Requires-Dist: click>=8.1
Requires-Dist: lxml>=4.9
Requires-Dist: requests>=2.28
Requires-Dist: tqdm>=4.64
Provides-Extra: async
Requires-Dist: aiohttp>=3.8; extra == 'async'
Provides-Extra: dev
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pre-commit>=3.6; extra == 'dev'
Requires-Dist: pytest-cov>=4.1; extra == 'dev'
Requires-Dist: pytest-mock>=3.12; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: responses>=0.24; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: types-requests; extra == 'dev'
Requires-Dist: types-tqdm; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.4; extra == 'docs'
Requires-Dist: mkdocs>=1.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24; extra == 'docs'
Description-Content-Type: text/markdown

# europepmc-bulk

[![PyPI](https://img.shields.io/pypi/v/europepmc-bulk)](https://pypi.org/project/europepmc-bulk/)
[![Python](https://img.shields.io/pypi/pyversions/europepmc-bulk)](https://pypi.org/project/europepmc-bulk/)
[![License](https://img.shields.io/pypi/l/europepmc-bulk)](LICENSE)

Bulk, parallel, resumable harvester for the [Europe PMC](https://europepmc.org/) corpus.

`europepmc-bulk` complements the existing [pyeuropepmc](https://pypi.org/project/pyeuropepmc/) package. pyeuropepmc is well suited to ad-hoc search and per-article analysis, whereas **europepmc-bulk** is built for harvesting the entire 40M-article corpus with cursor pagination, atomic file writes, resume state, and threaded parallelism.

## Features

- REST search with cursor-mark pagination
- Bulk FTP/HTTPS downloads of full-text archives, text-mined CSVs, and ID mappings
- Annotations API batch collection
- OAI-PMH incremental updates
- JATS XML parsing
- Atomic file writes for crash safety
- Persistent resume state (interrupt and resume any harvest)
- Token-bucket rate limiter (default 10 req/s, configurable)
- Threaded parallel harvest with shared rate limiter
- Optional async HTTP client (`pip install "europepmc-bulk[async]"`)
- Click-based CLI that mirrors the Python API
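
To illustrate the idea behind the token-bucket limiter and the shared-limiter threading listed above, here is a minimal, standard-library-only sketch. It is not the package's implementation: the `TokenBucket` class, its `capacity` parameter, and the `fetch` and `run_demo` helpers are hypothetical names chosen for this example, and only the 10 req/s default is taken from the feature list.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TokenBucket:
    """Hypothetical thread-safe token bucket (illustration only)."""

    def __init__(self, rate: float = 10.0, capacity: float = 10.0) -> None:
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # assumed burst size; not a documented setting
        self._tokens = capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            with self._lock:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                refill = (now - self._last) * self.rate
                self._tokens = min(self.capacity, self._tokens + refill)
                self._last = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                wait = (1.0 - self._tokens) / self.rate
            # Sleep outside the lock so other threads can make progress.
            time.sleep(wait)


def fetch(limiter: TokenBucket, item: int) -> int:
    limiter.acquire()  # every worker draws from the same bucket
    return item * 2    # placeholder for a real HTTP request


def run_demo() -> tuple[list[int], float]:
    """Run 15 placeholder requests on 4 threads behind one limiter."""
    limiter = TokenBucket(rate=50.0, capacity=5.0)
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda i: fetch(limiter, i), range(15)))
    return results, time.monotonic() - start
```

Because all workers share one bucket, the aggregate request rate stays bounded no matter how many threads are started; adding threads only helps when individual requests are slow.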

## Install

```bash
pip install europepmc-bulk
# or with async client
pip install "europepmc-bulk[async]"
```

## Quick start

```python
from europepmc_bulk import Config, AbstractHarvester

config = Config(base_dir="./epmc-data")
harvester = AbstractHarvester(config)
harvester.harvest_year(2024, output_format="json")
```

```bash
# CLI equivalent
europepmc-bulk harvest-abstracts --start-year 2024 --end-year 2024 --format json
```

See the [documentation](https://europepmc-bulk.readthedocs.io) for full usage.

## License

MIT. See [LICENSE](LICENSE).

## Citing Europe PMC

If you use this package to collect data from Europe PMC, please cite:

> The Europe PMC Consortium. Europe PMC: a full-text literature database for the life sciences and platform for innovation. *Nucleic Acids Research*, 2014.
