Metadata-Version: 2.4
Name: zedstat
Version: 0.0.149
Summary: Statistics tools for ML models and deployment
Home-page: https://github.com/zeroknowledgediscovery/zedstat
Download-URL: https://github.com/zeroknowledgediscovery/zedstat/archive/0.0.149.tar.gz
Author: zed.uchicago.edu
Author-email: ishanu@uchicago.edu
License: LICENSE
Keywords: machine learning,statistics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.6
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
License-File: LICENSE
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: download-url
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

===============
zedstat
===============

.. image:: https://zed.uchicago.edu/logo/logo_zedstat.png
   :height: 150px
   :align: center 

.. image:: https://zenodo.org/badge/529991779.svg
   :target: https://zenodo.org/badge/latestdoi/529991779

.. class:: no-web no-pdf

:Author: ZeD@UChicago <zed.uchicago.edu>
:Description: Tools for ML statistics 
:Documentation: https://zeroknowledgediscovery.github.io/zedstat/
:Example: https://github.com/zeroknowledgediscovery/zedstat/blob/master/examples/example1.ipynb
		
Additional usage examples
=========================

**1. Export operating characteristics at a chosen prevalence**

.. code-block:: python

   import pandas as pd
   from zedstat import zedstat

   roc_df = pd.read_csv("roc.csv")

   zt = zedstat.processRoc(
       df=roc_df,
       order=3,
       total_samples=100000,
       positive_samples=100,
       alpha=0.01,
       prevalence=0.002,
   )

   zt.smooth(STEP=0.001)
   zt.allmeasures(interpolate=True)
   zt.usample(precision=3)
   zt.getBounds()

   out = zt.get().join(zt.df_lim["U"], rsuffix="_upper").join(zt.df_lim["L"], rsuffix="_lower")
   out.to_csv("roc_operating_characteristics.csv")


**2. Retrieve threshold-level PPV from a score**

This uses the ROC-derived operating characteristics and prevalence to estimate the
positive predictive value associated with using a score as a decision threshold.

.. code-block:: python

   example_scores = [0.10, 0.20, 0.30]

   ppv_at_threshold = zt.score_to_threshold_ppv(
       example_scores,
       regen=True,
       STEP=0.001,
       precision=3,
       interpolate=True,
       convexify=False,
   )

   threshold_ppv_df = pd.DataFrame({
       "score": example_scores,
       "threshold_ppv": ppv_at_threshold,
   })

   display(threshold_ppv_df)


**3. Held-out calibration with isotonic regression**

The calibration module is separate from the ROC processing utilities and is imported as:

.. code-block:: python

   from zedstat import calibration

   res = calibration.heldout_isotonic_calibration_with_bootstrap(
       df,
       score_col="predicted_risk",
       label_col="target",
       test_size=0.25,
       random_state=4,
       lower_score_is_risk=False,
       target_prevalence=None,
       n_bins=100,
       n_boot=1000,
       calibration_df_path="calibration_df_SISA.csv",
       plot="calibration_SISA.pdf",
   )

   print(res["summary"])
   display(res["calibration_table"])


**4. Convert raw scores to calibrated probabilities**

The held-out calibration routine returns the fitted isotonic regression model in
``res["iso_model"]``. This can be applied to any score vector.

.. code-block:: python

   import numpy as np
   import pandas as pd

   example_scores = [0.10, 0.20, 0.30]
   example_scores_arr = np.asarray(example_scores, dtype=float)

   calibrated_probs = res["iso_model"].predict(example_scores_arr)

   calibrated_df = pd.DataFrame({
       "score": example_scores,
       "calibrated_probability": np.asarray(calibrated_probs, dtype=float),
   })

   display(calibrated_df)


**5. Format calibration summary for a manuscript table**

.. code-block:: python

   def format_calibration_summary_df(summary):
       import numpy as np
       import pandas as pd

       summary = pd.Series(summary)

       ci_map = {
           "auc_raw_test": ("auc_raw_ci_low", "auc_raw_ci_high"),
           "auc_calibrated_test": ("auc_calibrated_ci_low", "auc_calibrated_ci_high"),
           "brier_raw_test": ("brier_raw_ci_low", "brier_raw_ci_high"),
           "brier_calibrated_test": ("brier_calibrated_ci_low", "brier_calibrated_ci_high"),
           "calibration_intercept_test": ("calibration_intercept_ci_low", "calibration_intercept_ci_high"),
           "calibration_slope_test": ("calibration_slope_ci_low", "calibration_slope_ci_high"),
       }

       skip_keys = {
           "auc_raw_ci_low", "auc_raw_ci_high",
           "auc_calibrated_ci_low", "auc_calibrated_ci_high",
           "brier_raw_ci_low", "brier_raw_ci_high",
           "brier_calibrated_ci_low", "brier_calibrated_ci_high",
           "calibration_intercept_ci_low", "calibration_intercept_ci_high",
           "calibration_slope_ci_low", "calibration_slope_ci_high",
       }

       rows = []
       for key, val in summary.items():
           if key in skip_keys:
               continue

           value_str = "" if pd.isna(val) else f"{float(val):.3f}"

           if key in ci_map:
               lo_key, hi_key = ci_map[key]
               lo = summary.get(lo_key, np.nan)
               hi = summary.get(hi_key, np.nan)
               if pd.notna(lo) and pd.notna(hi):
                   value_str = f"{float(val):.3f} ({float(lo):.3f}, {float(hi):.3f})"

           rows.append({"variable": str(key), "value": value_str})

       return pd.DataFrame(rows, columns=["variable", "value"])

   summary_df = format_calibration_summary_df(res["summary"])
   display(summary_df)


Feature explanations
====================

**processRoc**

``processRoc`` is the main ROC post-processing class. It takes an empirical ROC
curve with false positive rate and true positive rate, augments it with operating
metrics, and allows interpolation, confidence bounds, and interpretation at chosen
operating points.

**smooth(STEP, interpolate, convexify)**

This function regularizes the empirical ROC curve. If ``convexify=True``, the upper
ROC hull is computed so that dominated operating points are removed. If
``interpolate=True``, the curve is resampled on a uniform false positive rate grid.

Let the ROC curve be represented as points

.. math::

   \{(f_i, t_i)\}_{i=1}^m

where :math:`f_i` is the false positive rate and :math:`t_i` is the true positive
rate. After optional convexification and interpolation, these are resampled onto a
uniform grid in :math:`f`.

**allmeasures(prevalence)**

This computes threshold-level operating measures from sensitivity, specificity, and
prevalence.

Let

.. math::

   \mathrm{TPR}(t) = P(\hat Y_t = 1 \mid Y=1), \qquad
   \mathrm{FPR}(t) = P(\hat Y_t = 1 \mid Y=0),

and let prevalence be

.. math::

   \pi = P(Y=1).

Then:

.. math::

   \mathrm{Specificity}(t) = 1 - \mathrm{FPR}(t)

.. math::

   \mathrm{PPV}(t) =
   \frac{\mathrm{TPR}(t)\pi}
        {\mathrm{TPR}(t)\pi + \mathrm{FPR}(t)(1-\pi)}

.. math::

   \mathrm{NPV}(t) =
   \frac{(1-\mathrm{FPR}(t))(1-\pi)}
        {(1-\mathrm{FPR}(t))(1-\pi) + (1-\mathrm{TPR}(t))\pi}

.. math::

   \mathrm{Accuracy}(t) =
   \pi \mathrm{TPR}(t) + (1-\pi)(1-\mathrm{FPR}(t))

.. math::

   \mathrm{LR}^+(t) = \frac{\mathrm{TPR}(t)}{\mathrm{FPR}(t)}

.. math::

   \mathrm{LR}^-(t) = \frac{1-\mathrm{TPR}(t)}{1-\mathrm{FPR}(t)}

These are threshold-level decision quantities. They describe the performance of
classifying everyone whose score crosses the threshold.

**usample(precision)**

This resamples the metric tables on a uniform false positive rate grid, typically
for downstream lookup and plotting. The grid spacing is controlled by the decimal
precision.

**getBounds(total_samples, positive_samples, alpha)**

This computes pointwise confidence bounds for the operating measures using Wilson
intervals for sensitivity and specificity, then propagates these to PPV, NPV,
accuracy, and likelihood ratios.

If :math:`n_1` is the number of positive cases and :math:`n_0` is the number of
negative cases, then Wilson intervals are first computed for

.. math::

   \mathrm{TPR}(t) \quad \text{using } n_1

and for

.. math::

   \mathrm{Specificity}(t) = 1-\mathrm{FPR}(t) \quad \text{using } n_0.

The derived measures are then bounded by substituting lower and upper values of
sensitivity and specificity into the formulas above.

**auc()**

The area under the ROC curve is

.. math::

   \mathrm{AUC} = \int_0^1 \mathrm{TPR}(f)\, df

where :math:`f` is false positive rate. Numerically this is computed by trapezoidal
integration over the processed ROC curve. Confidence intervals can be estimated
either analytically or by bootstrap when raw scores and labels are available.

**operating_zone(LRplus, LRminus)**

This identifies practically useful threshold regions subject to likelihood-ratio
constraints, for example high precision or high sensitivity operating points.
Internally, the method searches the set of thresholds satisfying

.. math::

   \mathrm{LR}^+(t) > c_1, \qquad \mathrm{LR}^-(t) < c_2

for user-specified constants :math:`c_1` and :math:`c_2`.

**interpret(fpr, number_of_positives, five_yr_survival, factor)**

This converts operating characteristics into expected counts for an interpretable
hypothetical population.

If :math:`P` is the chosen number of true positives in the target population and
prevalence is :math:`\pi`, then the implied number of negatives is

.. math::

   N = P \frac{1-\pi}{\pi}.

Given sensitivity and PPV at the selected operating point, the method estimates
true positives, false positives, false negatives, total flags, and number needed to
screen.

Calibration
===========

The calibration utilities are provided in ``zedstat.calibration``.

**Held-out isotonic calibration**

Given a score :math:`S` and binary label :math:`Y \in \{0,1\}`, the calibration
routine splits the data into train and test subsets. On the training subset,
isotonic regression fits a monotone mapping

.. math::

   g(s) \approx P(Y=1 \mid S=s).

The fitted function :math:`g` is then applied to the held-out test scores to obtain
calibrated probabilities.

**Calibrated probability**

The calibrated probability at score :math:`s` is the local conditional event
probability

.. math::

   m(s) = P(Y=1 \mid S=s).

For continuous scores this can be interpreted as

.. math::

   m(s) = \frac{f_{S,Y=1}(s)}{f_S(s)}
        = \frac{\pi f_{S \mid Y=1}(s)}{f_S(s)}.

This is a local score-level quantity.

**Threshold PPV versus calibrated probability**

These are different objects.

Threshold PPV at threshold :math:`t` is

.. math::

   \mathrm{PPV}(t) = P(Y=1 \mid S \ge t)

when higher scores indicate higher risk. This is a tail-average quantity:

.. math::

   \mathrm{PPV}(t)
   =
   \frac{\int_t^\infty m(s) f_S(s)\, ds}
        {\int_t^\infty f_S(s)\, ds}
   =
   E[m(S)\mid S\ge t].

Thus, calibrated probability is local, while threshold PPV is cumulative over all
subjects beyond the threshold.

**Brier score**

The Brier score evaluates probability accuracy:

.. math::

   \mathrm{Brier} = \frac{1}{n}\sum_{i=1}^n (p_i - y_i)^2

where :math:`p_i` is the predicted probability and :math:`y_i` is the observed
binary outcome. Lower is better, with 0 indicating perfect probabilistic
prediction.

**Calibration intercept and slope**

A well-calibrated model should satisfy

.. math::

   P(Y=1 \mid \hat p = p) = p.

A practical assessment regresses the observed outcome on the logit of the predicted
probability:

.. math::

   \logit P(Y=1) = a + b \logit(\hat p).

Here:

* :math:`a` is the calibration intercept. Ideal value is 0.
* :math:`b` is the calibration slope. Ideal value is 1.

An intercept above 0 indicates underprediction on average. An intercept below 0
indicates overprediction on average. A slope below 1 indicates overly extreme
predictions; a slope above 1 indicates predictions compressed toward the center.

**Calibration table and reliability diagram**

Predicted probabilities are grouped into bins, and for each bin the observed event
rate is estimated. If a bin contains :math:`k` events among :math:`n` subjects, the
observed rate is

.. math::

   \hat p_{\mathrm{obs}} = \frac{k}{n}.

Wilson confidence intervals are computed for each bin and displayed as vertical
error bars in the reliability diagram.

Sample size planning
====================

The sample size utilities in ``zedstat`` use AUC-based approximations. Let the
target AUC be :math:`A`. Define

.. math::

   Q_1 = \frac{A}{2-A}, \qquad
   Q_2 = \frac{2A^2}{1+A}.

In the balanced-design approximation, the required sample size per class for
resolving an AUC tolerance :math:`\delta` at confidence level :math:`1-\alpha` is

.. math::

   n \approx \frac{z_{1-\alpha/2}^2 \, c}{\delta^2},

where

.. math::

   c = A(1-A) - A^2 + Q_1 + Q_2.

When prevalence is specified, the code uses a prevalence-aware total sample size
formula derived from the Hanley-McNeil variance approximation.

Remarks
=======

``zedstat`` separates two different but complementary notions of risk:

* threshold-level decision utility, such as PPV, NPV, and likelihood ratios,
  derived from the ROC curve and prevalence;
* score-level probability interpretation, obtained through calibration.

The first is useful for screening policy and operating-point selection. The second
is useful when the score must be interpreted as an individual event probability.
