Metadata-Version: 2.4
Name: ds5
Version: 0.1.0
Summary: A Python package for handling and processing drug screening data in HDF5 format
Author-email: Huiyi Yang <cathy.Yang@utah.edu>
License: MIT
Project-URL: Homepage, https://gitlab.com/qiao-lab/ds5
Project-URL: Repository, https://gitlab.com/qiao-lab/ds5
Project-URL: Issues, https://gitlab.com/qiao-lab/ds5/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.3.0
Requires-Dist: numpy>=2.2.0
Requires-Dist: h5py>=3.14.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: scipy>=1.15.0
Requires-Dist: matplotlib>=3.10.0
Requires-Dist: requests>=2.32.0
Requires-Dist: seaborn>=0.13.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ipython>=8.0.0; extra == "dev"
Requires-Dist: jupyter>=1.0.0; extra == "dev"
Provides-Extra: web
Requires-Dist: streamlit>=1.30.0; extra == "web"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=2.0; extra == "docs"
Requires-Dist: myst-parser>=3.0; extra == "docs"
Dynamic: license-file

# DS5

A Python package for drug sensitivity screening data analysis. DS5 handles the full pipeline from raw plate-reader data to drug sensitivity metrics (IC50, EC50, Emax, DSS) with built-in quality control, DMSO normalization, and reporting.

## Installation

```bash
# From the project root
pip install -e .

# With dev dependencies (pytest, ipython, jupyter)
pip install -e ".[dev]"
```

Requires Python 3.11 or later; 3.11 and 3.12 are the versions declared in the package classifiers.

## Quick start

```python
import DS5

# 1. Create a new HDF5 file
DS5.gen_new_HDF5("experiment.h5")

# 2. Load plate-reader data from Excel
DS5.load_excel_to_h5(
    "experiment.h5",
    well_read_file_name="plate_reads.xlsx",
    well_read_sheet_name="Sheet1",
    plate_map_file_name="plate_map.xlsx",
    plate_map_sheet_name="Sheet1",
    patient_id="HCI001",
    test_id="set1",
)

# 3. Preprocess (outlier removal)
DS5.preprocess_data("experiment.h5")

# 4. Analyze a single drug
ic50 = DS5.analyze_drug_ic50("experiment.h5", "HCI001", "set1", "Doxorubicin")
print(f"IC50 = {ic50['ic50']['value']}")

# 5. Summarize all drugs in one table
summary = DS5.summarize_test_results("experiment.h5", "HCI001", "set1")
print(summary)

# 6. Batch process and cache results
DS5.process_ds5("experiment.h5")

# 7. Extract data for custom analysis
df = DS5.get_data("experiment.h5", "HCI001_set1", data_type="normalized")
```

## API overview

### Data I/O

| Function | Description |
|---|---|
| `gen_new_HDF5(file_name)` | Create empty DS5-format HDF5 file |
| `load_excel_to_h5(...)` | Load plate-reader Excel + plate map into HDF5 |
| `export_h5_to_excel(h5, output)` | Export HDF5 contents to Excel workbook |
| `load_GDSC_to_h5(csv, ...)` | Load GDSC-format CSV into HDF5 |
| `load_all_GDSC_to_h5(csv, ...)` | Batch-load all experiments from GDSC CSV |
| `generate_GDSC_screen_list(csv, ...)` | List available screens in a GDSC CSV |
| `get_data(h5, screen, data_type)` | Extract data as DataFrame (`intensity`, `normalized`, etc.) |

### Preprocessing & QC

| Function | Description |
|---|---|
| `preprocess_data(h5, qc_para_file=None)` | Apply outlier removal to all screens |
| `check_preprocess(h5, patient, test, drug)` | Visualize preprocessing effect on a drug |
| `QC_visual(h5, screen, qc_para_file)` | Generate before/after QC comparison plots |

### Drug analysis

| Function | Description |
|---|---|
| `analyze_dmso_controls(h5, patient, test)` | DMSO control statistics and boxplot |
| `analyze_all_dmso(h5, patient=None)` | DMSO analysis across all screens |
| `analyze_drug_ic50(h5, patient, test, drug)` | IC50 via 4-parameter logistic fit |
| `analyze_drug_ec50(h5, patient, test, drug)` | EC50 (50% absolute inhibition) |
| `analyze_drug_emax(h5, patient, test, drug, mode)` | Maximum inhibition (supports multiple Emax modes) |
| `calculate_DSS(h5, patient, test, drug)` | DSS1, DSS2, DSS3 drug sensitivity scores |
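
The IC50 entry above refers to a 4-parameter logistic (4PL) fit. As a rough illustration only, the sketch below shows one common 4PL parameterization for a response that rises with dose. The function name `four_pl` and the parameter names `bottom`, `top`, `ic50`, and `hill` are assumptions made for this example; DS5's internal parameterization may differ.

```python
def four_pl(conc, bottom, top, ic50, hill):
    """Response at concentration `conc` under a 4PL model (illustrative only).

    `conc` and `ic50` must be positive and expressed in the same units.
    """
    # At conc == ic50 the ratio is 1, so the response is halfway
    # between `bottom` and `top`; as conc grows it approaches `top`.
    return bottom + (top - bottom) / (1.0 + (ic50 / conc) ** hill)


# Halfway point of a 0-100% inhibition curve with IC50 = 1 µM
print(four_pl(1.0, bottom=0.0, top=100.0, ic50=1.0, hill=1.0))  # 50.0
```

In this form the IC50 is the concentration at the curve's midpoint, which is why it is defined relative to the fitted `bottom` and `top` rather than to an absolute 50% level.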

### Emax modes

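The Emax mode controls how Emax is computed. `analyze_drug_emax` takes it as `mode`; `process_ds5`, `summarize_test_results`, and `compare_metrics` take it as `emax_mode`. All of these functions support the following modes: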

| Mode | Definition | Requires curve fit |
|---|---|---|
| `observed_best` (default) | Highest mean inhibition at any tested concentration | No |
| `observed_highest_dose` | Mean inhibition at the highest tested concentration | No |
| `fitted_highest_dose` | 4PL model-predicted response at the highest tested concentration | Yes (falls back to `observed_best`) |
| `e_inf` | Fitted 4PL asymptote, must be in [-10, 200]% | Yes (falls back to `observed_best`) |

```python
# Single drug analysis with Emax mode
emax = DS5.analyze_drug_emax("experiment.h5", "HCI001", "set1", "Doxorubicin", mode="e_inf")

# Batch processing with Emax mode
DS5.process_ds5("experiment.h5", emax_mode="fitted_highest_dose")

# Summary and comparison with Emax mode
summary = DS5.summarize_test_results("experiment.h5", "HCI001", "set1", emax_mode="e_inf")
comparison = DS5.compare_metrics("experiment.h5", emax_mode="e_inf")
```

DSS2 always uses the fitted Emax from the 4PL curve regardless of `emax_mode`.
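
To make the mode definitions concrete, here is a minimal sketch of the two observed modes and of the `e_inf` range check, written against a plain dictionary of replicate inhibition values. The helper names `observed_emax` and `e_inf_or_fallback` and the input layout are hypothetical, and treating an out-of-range asymptote as a trigger for the fallback is an assumption; this is not DS5's implementation.

```python
from statistics import mean


def observed_emax(inhibition_by_conc, mode="observed_best"):
    """inhibition_by_conc maps concentration -> list of replicate % inhibition."""
    means = {conc: mean(values) for conc, values in inhibition_by_conc.items()}
    if mode == "observed_best":
        # Highest mean inhibition at any tested concentration
        return max(means.values())
    if mode == "observed_highest_dose":
        # Mean inhibition at the highest tested concentration
        return means[max(means)]
    raise ValueError(f"unsupported mode: {mode}")


def e_inf_or_fallback(fitted_asymptote, fallback):
    """Use the fitted asymptote only if it lies in [-10, 200]% (assumed rule)."""
    # fitted_asymptote is None when no curve fit is available (assumption)
    if fitted_asymptote is not None and -10.0 <= fitted_asymptote <= 200.0:
        return fitted_asymptote
    return fallback


example = {0.1: [5.0, 7.0], 1.0: [60.0, 64.0], 10.0: [55.0, 57.0]}
print(observed_emax(example, "observed_best"))          # 62.0
print(observed_emax(example, "observed_highest_dose"))  # 56.0
```

The two observed modes differ whenever the response is not monotonic in dose, as in the example data where inhibition peaks at the middle concentration.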

### Summary & comparison

| Function | Description |
|---|---|
| `summarize_test_results(h5, patient, test, emax_mode)` | All metrics for all drugs in one DataFrame |
| `process_ds5(input_h5, output_h5=None, emax_mode)` | Batch-process and cache summary tables |
| `compare_metrics(h5, patient=None, emax_mode)` | Cross-screen metric comparison |
| `generate_report(h5, test_name)` | HTML report with heatmaps and top drug picks |

### Drug name standardization & plugins

| Function | Description |
|---|---|
| `standardize_drug_name(name)` | Resolve via RxNorm/PubChem → `rx:12345`, `pc:6789`, or `raw:name` |
| `register_metric(name, func)` | Register an external metric plugin |
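
Downstream code may need to tell the three identifier forms apart. The sketch below splits a standardized ID into its prefix and value; the helper `split_standard_id` is hypothetical and not part of DS5, and reading `rx` as RxNorm and `pc` as PubChem is inferred from the description above.

```python
KNOWN_PREFIXES = {"rx", "pc", "raw"}


def split_standard_id(std_id):
    """Split e.g. 'rx:12345' into ('rx', '12345')."""
    # Split on the first colon only, so raw names containing ':' survive intact
    prefix, sep, value = std_id.partition(":")
    if not sep or prefix not in KNOWN_PREFIXES:
        raise ValueError(f"not a standardized drug ID: {std_id!r}")
    return prefix, value


print(split_standard_id("pc:6789"))  # ('pc', '6789')
```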

## HDF5 schema

DS5 stores all data in a single HDF5 file. See [docs/HDF5_SCHEMA.md](docs/HDF5_SCHEMA.md) for full details.

```
/patients/
  /{patient_id}/
    /{test_id}/
      data                 # Raw plate-reader values (byte-string array)
      plate_map            # Well identifiers: "DrugName concentration" or "DMSO"
      preprocessed_data    # (optional) Float array with outliers set to NaN
      summary_table        # (optional) Cached metric summary from process_ds5
/drug_standardization_table  # (optional) Maps raw drug names ↔ rx:/pc: IDs
```
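
As an illustration of how this layout can be navigated directly with `h5py`, the sketch below builds dataset paths and lists the `(patient_id, test_id)` pairs in a file. Both helpers (`dataset_path`, `list_screens`) are hypothetical and assume the file follows the layout shown above; for normal use, `get_data` is the supported way to read data.

```python
def dataset_path(patient_id, test_id, name="data"):
    """Build the HDF5 path of a dataset under /patients (layout assumed as above)."""
    return f"/patients/{patient_id}/{test_id}/{name}"


def list_screens(h5_file_name):
    """Return every (patient_id, test_id) pair found under /patients."""
    import h5py  # imported lazily so dataset_path works without h5py installed

    screens = []
    with h5py.File(h5_file_name, "r") as f:
        for patient_id, patient_group in f["patients"].items():
            for test_id in patient_group:
                screens.append((patient_id, test_id))
    return screens


print(dataset_path("HCI001", "set1", "plate_map"))
# /patients/HCI001/set1/plate_map
```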

## Plate map format

The plate map Excel file must have row labels (A, B, C, ...) and column labels (1, 2, 3, ...) matching a standard microplate layout. Each cell contains either:

- `DMSO` — marks a DMSO control well
- `DrugName concentration` — e.g., `Doxorubicin 0.1` (drug name, space, concentration in µM)
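
The cell convention above can be checked before loading a plate map. The following is a minimal sketch of a parser for a single cell; `parse_plate_map_cell` is a hypothetical helper, and splitting on the last run of whitespace (so that drug names containing spaces still parse) is an assumption about how such cells should be read, not DS5's documented behavior.

```python
def parse_plate_map_cell(cell):
    """Return ('DMSO', None) or (drug_name, concentration_in_uM)."""
    text = str(cell).strip()
    if text == "DMSO":
        return ("DMSO", None)
    # Split once from the right: everything before the last token is the name
    parts = text.rsplit(None, 1)
    if len(parts) != 2:
        raise ValueError(f"expected 'DrugName concentration', got {text!r}")
    name, conc = parts
    return (name, float(conc))  # float() raises ValueError on a bad number


print(parse_plate_map_cell("Doxorubicin 0.1"))  # ('Doxorubicin', 0.1)
print(parse_plate_map_cell("DMSO"))             # ('DMSO', None)
```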

## QC configuration

Preprocessing is controlled by a `QC_para.txt` file with key=value pairs:

```
# QC_para.txt example
left_percentile = 1
right_percentile = 99
dmso_use_mad = true
drug_outlier_threshold = 5
```

| Parameter | Default | Description |
|---|---|---|
| `left_percentile` | 0 | Lower percentile cutoff for global outlier removal |
| `right_percentile` | 0 | Upper percentile cutoff for global outlier removal |
| `dmso_use_mad` | true | Use MAD-based (true) or IQR-based (false) DMSO outlier removal |
| `drug_outlier_threshold` | 5 | Median-ratio threshold for per-drug outlier removal |

If no QC file is provided, defaults are used (minimal filtering).
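
For reference, a key=value file of this shape can be read with a few lines of Python. The sketch below is not DS5's parser: `parse_qc_text` is a hypothetical helper, the defaults are copied from the table above, and the handling of `#` comments and of the `true`/`false` spelling is an assumption.

```python
QC_DEFAULTS = {
    "left_percentile": 0.0,
    "right_percentile": 0.0,
    "dmso_use_mad": True,
    "drug_outlier_threshold": 5.0,
}


def parse_qc_text(text):
    """Parse key=value lines, starting from the documented defaults."""
    params = dict(QC_DEFAULTS)
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key = value, got {raw_line!r}")
        key, value = key.strip(), value.strip()
        if key == "dmso_use_mad":
            params[key] = value.lower() == "true"
        else:
            params[key] = float(value)
    return params


sample = """
# QC_para.txt example
left_percentile = 1
right_percentile = 99
dmso_use_mad = true
drug_outlier_threshold = 5
"""
print(parse_qc_text(sample))
```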

## External metrics plugin

You can extend DS5 with custom metrics:

```python
import DS5
from DS5 import register_metric

def compute_my_metric(h5_file_name, patient_id, test_id, drug_name, **kwargs):
    """Must return a dict of {column_name: value}."""
    # ... your computation ...
    return {
        "MY_SCORE": 42.0,
        "__meta__": {"prefer_higher": True},  # optional: controls ranking direction
    }

register_metric("my_metric", compute_my_metric)

# Now use it in summarize_test_results
summary = DS5.summarize_test_results(
    "experiment.h5", "HCI001", "set1",
    use_external_metrics=True,
    external_metrics=["my_metric"],
)
# summary DataFrame will include a MY_SCORE column
```

See `external_metrics/calculate_metric_max_viability.py` for a complete example.

## Running tests

```bash
pytest tests/ -v -m "not network"
```

The `-m "not network"` option deselects tests carrying the `network` marker. Test data lives in `tests/fixtures/`: `synthetic.h5` contains a 9×6 plate with 3 drugs at 5 concentrations plus DMSO controls. Expected metric outputs are recorded in `golden_values.json`. Tests use a 10% relative tolerance, so minor algorithm changes pass but large regressions fail.
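
The 10% relative tolerance can be pictured as the comparison below. This is only a sketch of the idea: `within_tolerance` is a hypothetical helper, and measuring the tolerance relative to the golden value is an assumption about how the test suite applies it.

```python
def within_tolerance(actual, expected, rel_tol=0.10):
    """True if `actual` is within rel_tol (10% by default) of `expected`."""
    return abs(actual - expected) <= rel_tol * abs(expected)


print(within_tolerance(1.05, 1.0))  # True: 5% off the golden value
print(within_tolerance(1.20, 1.0))  # False: 20% off the golden value
```

In a pytest suite the same check is commonly written as `assert actual == pytest.approx(expected, rel=0.1)`.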

## License

MIT
