Metadata-Version: 2.4
Name: cumulus-library-data-metrics
Version: 10.0.0
Summary: Data quality and characterization metrics for Cumulus
Requires-Python: >= 3.11
Description-Content-Type: text/markdown
License-Expression: Apache-2.0
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries :: Python Modules
License-File: LICENSE
Requires-Dist: cumulus-library >= 6, < 7
Requires-Dist: pre-commit ; extra == "dev"
Requires-Dist: ruff < 0.15 ; extra == "dev"
Requires-Dist: ddt ; extra == "tests"
Requires-Dist: pytest ; extra == "tests"
Project-URL: Documentation, https://docs.smarthealthit.org/cumulus/
Project-URL: Home, https://smarthealthit.org/cumulus/
Project-URL: Source, https://github.com/smart-on-fhir/cumulus-library-data-metrics
Provides-Extra: dev
Provides-Extra: tests

# Data Metrics

A Cumulus-based implementation of the [qualifier metrics](https://github.com/sync-for-science/qualifier/blob/master/metrics.md).

## Implemented Metrics

The following qualifier metrics are implemented (per December 2025 qualifier definitions).

- [c_attachment_count](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_attachment_count)
- [c_content_type_use](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_content_type_use)
- [c_pt_count](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_pt_count)
- [c_pt_deceased_count](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_pt_deceased_count)
- [c_resource_count](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_resource_count)
- [c_resources_per_pt](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_resources_per_pt)
- [c_system_use](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_system_use)
- [c_us_core_v6_count](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#c_us_core_v6_count) *
- [q_date_recent](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#q_date_recent)
- [q_ref_target_pop](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#q_ref_target_pop)
- [q_ref_target_valid](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#q_ref_target_valid)
- [q_system_use](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#q_system_use)
- [q_valid_us_core_v6](https://github.com/sync-for-science/qualifier/blob/master/metrics.md#q_valid_us_core_v6) *

\* These are US Core profile-based metrics, and the following profiles are not yet implemented:
  - Implantable Device (due to the difficulty in identifying implantable records)
  - Some Observation profiles, as well as the various Vital Signs sub-profiles like Blood Pressure (just haven't gotten around to them yet)

## Installing

```sh
pip install cumulus-library-data-metrics
```

## Running the Metrics

These metrics are designed as a
[Cumulus Library](https://docs.smarthealthit.org/cumulus/library/)
study and are run using the `cumulus-library` command.

### Local NDJSON
Let's say you have a collection of FHIR-formatted NDJSON files.
They can all be in one folder or in organized subfolders.

Here's a sample command to run against that pile of NDJSON data:
```sh
cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics
```

You can then load `output-tables.db` in a DuckDB session to see the results,
or read below to export the counts tables.

#### Visualization
The metrics can also be reviewed in an interactive web interface by installing and running the open source [Cumulus Data Metrics Reporting Tool](https://github.com/smart-on-fhir/cumulus-data-metrics-reporting). When generating a metrics file for this view, pass `--option output-mode:aggregate` to the build step. For example:

```sh
cumulus-library build \
  --option output-mode:aggregate \
  --option min-bucket-size:0 \
  --db-type duckdb \
  --database src/data/metrics.duckdb \
  --target data_metrics \
  --load-ndjson-dir path/to/ndjson/root
```

### Athena
Here's a sample command to run against your Cumulus data in Athena:
```sh
cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics
```

You can then see the resulting tables in Athena,
or read below to export the counts tables.

### Exporting Counts

For the metrics that have exportable counts (mostly the characterization metrics),
you can export them with Cumulus Library
by replacing `build` in the above commands with `export ./output-folder`,
like so:

```sh
cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics
```

#### Aggregate counts

This study generates `CUBE` output by default.
If it's easier to work with simple aggregate counts of every value combination
(that is, without the partial value combinations that `CUBE()` generates),
run the build step with `--option output-mode:aggregate`.

That is, run it like:
```sh
cumulus-library build --option output-mode:aggregate ...
```
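
To make the difference between the two output modes concrete, here is a minimal Python sketch of what `CUBE` grouping adds on top of plain aggregate counts. It is an illustration only, not this study's implementation: the sample rows, the column meanings, and the function names `aggregate_counts` and `cube_counts` are all hypothetical, and `None` stands in for the `NULL` that SQL shows for a rolled-up column.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample rows: (gender, age at death). Not real data.
rows = [
    ("male", 55),
    ("male", 55),
    ("female", 55),
    ("female", 70),
]


def aggregate_counts(rows):
    """Count each full value combination, like GROUP BY gender, age."""
    return Counter(rows)


def cube_counts(rows):
    """Count every full and partial combination, like GROUP BY CUBE(gender, age).

    None marks a column that was rolled up (SQL would show NULL there).
    """
    counts = Counter()
    width = len(rows[0])
    for row in rows:
        # Visit every subset of columns to keep, from none to all of them.
        for size in range(width + 1):
            for kept in combinations(range(width), size):
                key = tuple(v if i in kept else None for i, v in enumerate(row))
                counts[key] += 1
    return counts


aggregate = aggregate_counts(rows)
cube = cube_counts(rows)
```

In this sketch, `aggregate` holds only the three full combinations, while `cube` also holds partial ones such as `(None, 55)` (everyone who died at 55, regardless of gender) and `(None, None)` (the grand total).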

#### Bucket sizes

To help preserve privacy, this study drops any count result below ten.

For example, if there are only two male patients who died at age 55,
that combination of male & 55 will be dropped from the `c_pt_deceased_count` table.

This makes it easier to share the count results with other institutions.
But if that's not a concern and you want the fine-grained details,
you can run the build step with `--option min-bucket-size:0` to turn this feature off.
Or use another value to change the bucket threshold (the default value is 10).
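
The thresholding described above can be sketched in a few lines of Python. This is a hedged illustration of the idea rather than the study's actual code: the function name `apply_min_bucket_size`, the dictionary layout, and the sample counts are assumptions made for the example.

```python
def apply_min_bucket_size(counts, min_bucket_size=10):
    """Keep only buckets whose count meets the threshold.

    A threshold of 0 keeps every bucket, mirroring min-bucket-size:0.
    """
    return {key: n for key, n in counts.items() if n >= min_bucket_size}


# Hypothetical counts keyed by (gender, age at death). Not real data.
counts = {
    ("male", 55): 2,
    ("male", 70): 14,
    ("female", 70): 10,
}

shared = apply_min_bucket_size(counts)  # default threshold of 10
everything = apply_min_bucket_size(counts, min_bucket_size=0)
```

With the default threshold, the `("male", 55)` bucket of two patients is dropped and the other two remain; with a threshold of 0, all three buckets are kept.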

