Metadata-Version: 2.4
Name: refua-data
Version: 0.7.2
Summary: Data ingestion, caching, and parquet materialization for the Refua drug discovery ecosystem.
Author-email: JJ Ben-Joseph <jj@tensorspace.ai>
License-Expression: MIT
Project-URL: Homepage, https://agentcures.com/
Project-URL: Repository, https://github.com/agentcures/refua
Project-URL: Documentation, https://github.com/agentcures/refua#readme
Project-URL: Issues, https://github.com/agentcures/refua/issues
Keywords: drug discovery,data engineering,cheminformatics,bioinformatics,parquet,refua
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.15,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.2.2
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: requests>=2.32.3
Requires-Dist: tqdm>=4.66.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.6.0; extra == "dev"
Requires-Dist: mypy>=1.11.0; extra == "dev"
Requires-Dist: pre-commit>=4.5.1; extra == "dev"
Requires-Dist: pandas-stubs>=2.2.3.250527; extra == "dev"
Requires-Dist: types-requests>=2.32.0.20241016; extra == "dev"
Requires-Dist: build>=1.2.2; extra == "dev"
Requires-Dist: twine>=6.1.0; extra == "dev"
Dynamic: license-file

# refua-data

`refua-data` is the Refua data layer for drug discovery. It provides a curated dataset catalog, intelligent local caching, and parquet materialization optimized for downstream modeling and campaign workflows.

## What it provides

- A built-in catalog of useful drug-discovery datasets.
- Dataset-aware download pipeline with cache reuse and metadata tracking.
- Pluggable cache backend architecture (filesystem cache by default).
- API dataset ingestion for paginated JSON endpoints (for example ChEMBL and UniProt).
- HTTP conditional refresh support (`ETag` / `Last-Modified`) when enabled.
- Support for partitioned parquet bundle downloads (for example Open Targets releases).
- Native Excel (`.xlsx`) ingestion for datasets such as GDSC fitted dose-response releases.
- Incremental parquet materialization (chunked processing + partitioned parquet parts).
- CLI for listing, fetching, and materializing datasets.
- Query interface for filtered row access from materialized parquet datasets.
- Source health checks via `validate-sources` for CI and environment diagnostics.
- Rich dataset metadata snapshots (description + usage notes) persisted in cache metadata.

## Included datasets

The default catalog includes local-file/HTTP datasets plus API presets useful in drug discovery, including **ZINC**, **BindingDB**, **Open Targets**, **CancerRxGene/GDSC**, **ChEMBL**, **UniProt**, **openFDA**, and the **Human Protein Atlas**.

1. `zinc15_250k` (ZINC)
2. `zinc15_tranche_druglike_instock` (ZINC tranche)
3. `zinc15_tranche_druglike_agent` (ZINC tranche)
4. `zinc15_tranche_druglike_wait_ok` (ZINC tranche)
5. `zinc15_tranche_druglike_boutique` (ZINC tranche)
6. `zinc15_tranche_druglike_annotated` (ZINC tranche)
7. `tox21`
8. `bbbp`
9. `bace`
10. `clintox`
11. `sider`
12. `hiv`
13. `muv`
14. `esol`
15. `freesolv`
16. `lipophilicity`
17. `pcba`
18. `bindingdb_articles_affinity`
19. `openfda_drug_event_serious`
20. `proteinatlas_human_proteome`
21. `opentargets_target_prioritisation`
22. `gdsc2_fitted_dose_response`
23. `chembl_activity_ki_human`
24. `chembl_activity_ic50_human`
25. `chembl_activity_kd_human`
26. `chembl_activity_ec50_human`
27. `chembl_activity_ac50_human`
28. `chembl_assays_binding_human`
29. `chembl_assays_functional_human`
30. `chembl_assays_adme_human`
31. `chembl_targets_human_single_protein`
32. `chembl_targets_human_protein_complex`
33. `chembl_molecules_phase3plus`
34. `chembl_molecules_phase4`
35. `chembl_molecules_black_box_warning`
36. `chembl_mechanism_phase2plus`
37. `chembl_drug_indications_phase2plus`
38. `chembl_drug_indications_phase3plus`
39. `uniprot_human_reviewed`
40. `uniprot_human_receptors`
41. `uniprot_human_membrane`
42. `uniprot_human_nucleus`
43. `uniprot_human_kinases`
44. `uniprot_human_gpcr`
45. `uniprot_human_ion_channels`
46. `uniprot_human_transporters`
47. `uniprot_human_secreted`
48. `uniprot_human_transcription_factors`
49. `uniprot_human_enzymes`

Most of these are distributed through MoleculeNet/DeepChem mirrors and retain upstream licensing terms.
BindingDB is included as a versioned ZIP-backed TSV snapshot for literature-derived affinity modeling.
Open Targets is included as a versioned parquet-part bundle for target prioritisation workflows.
CancerRxGene GDSC is included as a versioned Excel-backed dose-response release for cell-line pharmacology modeling.
ChEMBL, UniProt, and openFDA presets are fetched through their public REST APIs and cached locally as JSONL.
ZINC tranche presets aggregate multiple tranche files per dataset (drug-like MW B-K and logP A-K bins,
reactivity A/B/C/E) into one cached tabular source during fetch.
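The JSONL caching of API presets described above follows a common paginate-and-append pattern. A minimal sketch of that pattern, with hypothetical names (`paginate`, `ingest_jsonl`, and the `"records"` payload key are illustrative, not the package's actual internals):

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield records from a paginated JSON endpoint until an empty page."""
    page = 0
    while True:
        payload = fetch_page(page)
        records = payload.get("records", [])
        if not records:
            break
        yield from records
        page += 1


def ingest_jsonl(fetch_page: Callable[[int], dict], dest: Path) -> int:
    """Append every record to a JSONL file; return the row count."""
    count = 0
    with dest.open("w", encoding="utf-8") as fh:
        for record in paginate(fetch_page):
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Because the fetcher is passed in as a callable, the same loop serves ChEMBL-style offset pagination and UniProt-style cursor pagination by swapping the page function.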

## Install

```bash
cd refua-data
pip install -e .
```

## CLI quickstart

List datasets:

```bash
refua-data list
```

Validate all dataset sources:

```bash
refua-data validate-sources
```

Validate a subset and fail CI on probe failures:

```bash
refua-data validate-sources chembl_activity_ki_human uniprot_human_kinases --fail-on-error
```

JSON output for automation:

```bash
refua-data validate-sources --json --fail-on-error
```

For datasets with multiple mirrors, source validation succeeds when at least one configured source
is reachable. Failed fallback attempts are included in the result details.
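The fallback semantics can be sketched as a try-each-mirror loop that succeeds on the first reachable source while still recording every failed attempt (the function and probe shapes here are illustrative, not the package's actual implementation):

```python
from typing import Callable, Sequence


def validate_sources(probes: Sequence[Callable[[], None]]) -> dict:
    """Succeed if any probe passes; collect failure details for the rest.

    Each probe raises on failure (e.g. an HTTP error). One reachable
    mirror is enough for the dataset to validate.
    """
    failures: list[str] = []
    for probe in probes:
        try:
            probe()
            return {"ok": True, "failures": failures}
        except Exception as exc:
            failures.append(str(exc))
    return {"ok": False, "failures": failures}
```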

Fetch raw data with cache:

```bash
refua-data fetch zinc15_250k
```

Fetch API-based presets:

```bash
refua-data fetch chembl_activity_ki_human
refua-data fetch uniprot_human_kinases
```

Materialize parquet:

```bash
refua-data materialize zinc15_250k
```

Query materialized parquet rows:

```bash
refua-data query zinc15_250k --columns smiles,logP --filters '{"logP":{"lt":2.5}}' --limit 50
```

Refresh against remote metadata:

```bash
refua-data fetch zinc15_250k --refresh
```

For API datasets, `--refresh` re-runs the API query, sending conditional headers on the first page when cached validators are available.
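Conditional refresh works by echoing the cached `ETag` / `Last-Modified` validators back to the server, which can then answer `304 Not Modified` instead of resending the payload. A sketch of building those headers from cached metadata (the `etag` / `last_modified` metadata keys and the helper name are illustrative assumptions):

```python
def conditional_headers(meta: dict) -> dict:
    """Build If-None-Match / If-Modified-Since headers from cached metadata.

    A 304 response to a request carrying these headers means the cached
    copy is still fresh and no re-download is needed.
    """
    headers: dict[str, str] = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers
```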

## Cache layout

By default, the cache root is:

- `~/.cache/refua-data`

Override with:

- `REFUA_DATA_HOME=/custom/path`

Layout:

- `raw/<dataset>/<version>/...` downloaded source files
- `_meta/raw/<dataset>/<version>/...json` raw metadata (`etag`, `sha256`, API request signature, rows/pages, dataset description/usage metadata)
- `parquet/<dataset>/<version>/part-*.parquet` materialized parquet parts
- `_meta/parquet/<dataset>/<version>/manifest.json` parquet manifest metadata with dataset snapshot
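Resolving the cache root from `REFUA_DATA_HOME` with a fallback to the default location can be sketched as follows (`cache_root` is a hypothetical helper name, not the package's API):

```python
import os
from pathlib import Path


def cache_root() -> Path:
    """Resolve the cache root: REFUA_DATA_HOME wins, else ~/.cache/refua-data."""
    override = os.environ.get("REFUA_DATA_HOME")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "refua-data"
```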

## Python API

```python
from refua_data import DatasetManager

manager = DatasetManager()
manager.fetch("zinc15_250k")
manager.fetch("chembl_activity_ki_human")
result = manager.materialize("zinc15_250k")
print(result.parquet_dir)
```

`DataCache` is the default cache backend. Storage is pluggable: you can pass any object that
implements the same interface (`ensure`, `raw_file`, `raw_meta`, `parquet_dir`,
`parquet_manifest`, `read_json`, `write_json`) in its place.
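One way to express that interface for type checking is a structural `Protocol`. The method names below come from the list above, but the signatures are guesses for illustration; check the package source for the real ones:

```python
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CacheBackend(Protocol):
    """Structural type for cache backends; signatures are assumptions."""

    def ensure(self, dataset: str, version: str) -> None: ...
    def raw_file(self, dataset: str, version: str, name: str) -> Path: ...
    def raw_meta(self, dataset: str, version: str) -> Path: ...
    def parquet_dir(self, dataset: str, version: str) -> Path: ...
    def parquet_manifest(self, dataset: str, version: str) -> Path: ...
    def read_json(self, path: Path) -> Any: ...
    def write_json(self, path: Path, payload: Any) -> None: ...
```

Any class providing these methods (an in-memory cache, an S3-backed cache) satisfies the protocol without inheriting from it.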

## Licensing notes

- `refua-data` package code is MIT licensed.
- Dataset content licenses are dataset-specific and controlled by upstream providers.
- Always verify dataset licensing and allowed use before redistribution or commercial deployment.
