Metadata-Version: 2.4
Name: raeh-data
Version: 0.1.0
Summary: Shared data-access library for RAEH biomedical signal datasets
Project-URL: Homepage, https://raeh.io
Author-email: RAEH <hemant@raeh.io>
Maintainer-email: RAEH <hemant@raeh.io>
License: Copyright (c) 2026 RAEH. All rights reserved.
        
        This software is proprietary and confidential. The package is distributed
        publicly on the Python Package Index solely to simplify installation for
        authorized users; publication does NOT grant any license to use, copy,
        modify, or distribute the software except as expressly permitted in writing
        by RAEH.
        
        The library is an access client for RAEH's private biomedical signal
        datasets stored in access-controlled cloud storage. Installing this package
        does not grant access to any data: a valid set of RAEH-issued cloud
        credentials is required, and data use is governed separately by the
        applicable data-use agreements.
        
        Permitted use: authorized RAEH personnel and collaborators who have been
        granted credentials and have accepted the relevant data-use agreements may
        use this software to access data for which they are authorized.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
        FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
        IN THE SOFTWARE.
License-File: LICENSE
Keywords: benchmark,biomedical,dataset,ecg,human-activity-recognition,ppg,signals,wearable
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: boto3>=1.34
Requires-Dist: duckdb>=1.0
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.2
Requires-Dist: pyarrow>=15.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: scipy>=1.12
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Description-Content-Type: text/markdown

# raeh-data

Shared data-access library for RAEH biomedical signal datasets.

With a single install, every RAEH project (algorithm validation, SQI audits, RR/BP estimation, foundation-model pretraining, …) reads from `s3://raeh-datasets/` the same way. The library returns plain `pandas.DataFrame` / `numpy.ndarray` objects, so there is no framework lock-in.

## Status

Layer 1 (data access) and Layer 2 (signal-processing ops) are implemented, and canonical metadata is populated on S3 for all datasets (see [Datasets Reference](docs/datasets.md)).

## Install

```bash
pip install raeh-data
```

That's it — no SSH key, no GitHub access, no `git` required. Python ≥ 3.11.

Pin a version for reproducibility:

```bash
pip install raeh-data==0.1.0
```

Or as a dependency in a consumer project's `requirements.txt` / `pyproject.toml`:

```
raeh-data>=0.1
```

> The package is public on PyPI for install convenience, but it is an
> **access client for RAEH's private datasets**. Installing it does not grant
> data access — you also need RAEH-issued AWS credentials (below) and must be
> covered by the relevant data-use agreements. See `LICENSE`.

### For contributors

```bash
git clone git@github.com:<org>/raeh-data.git
cd raeh-data
pip install -e ".[dev]"     # editable install with test/lint/build deps
```

### AWS credentials

The datasets live in a private bucket, so every data call needs valid AWS
credentials. `raeh-data` authenticates with **any standard AWS credential
source** (boto3's default provider chain), so use whichever your team has set
up. In rough order of preference:

**1. AWS SSO / IAM Identity Center (recommended — short-lived, nothing to leak):**
```bash
aws sso login --profile raeh        # once per session
export AWS_PROFILE=raeh             # or set profile in your shell rc
```
First-time setup (`aws configure sso`) and the admin-side org configuration are
in **[AWS SSO setup](docs/aws-sso-setup.md)**.

**2. A named profile in `~/.aws/credentials`:**
```bash
export AWS_PROFILE=raeh
```

**3. On AWS compute (EC2 / ECS / Lambda):** nothing to do — the instance/task
role is picked up automatically.

**4. Long-lived keys via environment variables or a `.env` at your project root**
(simplest, but avoid on shared machines, since these keys don't expire):
```env
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=ap-south-1
S3_BUCKET_NAME=raeh-datasets
```

The bucket (`raeh-datasets`) and region (`ap-south-1`) have sensible defaults;
override via `.env`, env vars, or `raeh_data.configure(...)` only if needed.

Without any working credentials you'll get `StorageUnavailable: HTTP 403
Forbidden` on the first data call.

## Quick example

```python
from raeh_data import datasets, ops

# Browse what's available
print(datasets.list())

# Load one subject's PPG + ground truth
sig = datasets.load("ppg_dalia", "S01", signal="ppg")
gt = datasets.ground_truth("ppg_dalia", "S01")

# Apply a signal-processing pipeline
sig = ops.bandpass(sig, 0.5, 8.0, fs=64)
sig = ops.zscore(sig)

# Iterate windows for a reproducible benchmark
for sig_df, gt_df, meta in datasets.iter_benchmark("ppg_dalia", "ppg"):
    # meta.subject_id, meta.window_idx, meta.sample_rate
    # ... predict, compare to gt_df ...
    pass
```
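
The windowed iteration in the last step can be pictured with a small standalone sketch. This is not how `iter_benchmark` is implemented; `iter_windows`, the 8-second window, and the 2-second stride are illustrative assumptions, and the actual protocol for each dataset is in the Datasets Reference.

```python
from collections.abc import Iterator, Sequence


def iter_windows(
    samples: Sequence[float], fs: int, window_s: float, stride_s: float
) -> Iterator[tuple[int, Sequence[float]]]:
    """Yield (window_idx, window) pairs of fixed-length, possibly overlapping windows."""
    win = int(round(window_s * fs))
    step = int(round(stride_s * fs))
    if win <= 0 or step <= 0:
        raise ValueError("window_s and stride_s must be positive")
    # Stop at the last start position that still leaves a full window.
    for idx, start in enumerate(range(0, len(samples) - win + 1, step)):
        yield idx, samples[start : start + win]


# 20 s of placeholder samples at 64 Hz (assumed values, not real data).
signal = [0.0] * (20 * 64)
windows = list(iter_windows(signal, fs=64, window_s=8.0, stride_s=2.0))
print(len(windows))
```

Trailing samples that do not fill a complete window are dropped, which keeps every window the same length.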

## Documentation

- **[Usage Guide](docs/usage.md)** — concepts, recipes, common patterns.
- **[API Reference](docs/api.md)** — every public function and class.
- **[Datasets Reference](docs/datasets.md)** — per-dataset info, sample rates, benchmark protocols.
- **[AWS SSO setup](docs/aws-sso-setup.md)** — access with short-lived credentials via IAM Identity Center, with no long-lived keys to manage (admin + user onboarding).
- **[Troubleshooting](docs/troubleshooting.md)** — common errors and fixes.
- **[Design Doc](docs/design.md)** — internal architecture and design decisions (for contributors).

## Run the demo

```bash
PYTHONPATH=src python examples/demo_ppg_dalia.py
```

End-to-end walkthrough on the PPG-DALIA dataset — catalog, load, ops chain, windowed iteration, benchmark mode.

## Run the tests

```bash
pytest                    # unit tests (default; integration skipped)
pytest -m integration     # live-S3 integration tests (requires creds)
```

## Project layout

```
raeh-data/
├── pyproject.toml
├── docs/                  ← documentation
├── examples/              ← runnable demo scripts
├── scripts/               ← admin scripts (e.g., metadata rewriter)
├── src/raeh_data/
│   ├── datasets.py        ← Layer 1 — public data-access API
│   ├── ops/               ← Layer 2 — signal-processing ops
│   ├── cache.py           ← local Parquet cache
│   ├── _core.py           ← internal: DataStore (S3 + DuckDB)
│   ├── _config.py         ← env var loading + configure()
│   ├── _schemas.py        ← DatasetMetadata, YieldMetadata
│   └── exceptions.py      ← public exception hierarchy
└── tests/                 ← unit tests + live-S3 integration tests
```
