Tutorial


Summary

Powerful biomarkers are important tools in diagnostic, clinical and research settings. In the area of diagnostic medicine, a biomarker is often used as a tool to identify subjects with a disease, or at high risk of developing a disease. Moreover, it can be used to foresee the most likely outcome of the disease, monitor its progression and predict the response to a given therapy. Diagnostic accuracy can be improved considerably by combining multiple markers, whose performance in identifying diseased subjects is usually assessed via receiver operating characteristic (ROC) curves. The CombiROC tool was originally designed as an easy-to-use R-Shiny web application to determine optimal combinations of markers from diverse complex omics data (Mazzara et al. 2017); such an implementation is easy to use but has limited features, and further limitations arise from the machine it is deployed on. The CombiROC package is the natural evolution of the CombiROC tool and allows the researcher/analyst to freely use the method and further build on it.

The complete workflow

This Python package was developed as a porting of the popular CombiROC workflow; all the code and this tutorial were developed with that work in mind.

The aim of this document is to show the whole CombiROC workflow to get you up and running as quickly as possible with this package. To do so we’re going to use the proteomic dataset from Zingaretti et al. 2012, containing multi-marker signatures for Autoimmune Hepatitis (AIH) for samples clinically diagnosed as “abnormal” (class A) or “normal” (class B). The goal of the workflow is to first find the marker combinations, and then to assess their performance in classifying the samples of the dataset.

Note: if you use CombiROC in your research, please cite: Mazzara S. et al., CombiROC: an interactive web tool for selecting accurate marker combinations of omics data, Scientific Reports 7, 45477 (2017).

Required data format

The dataset to be analysed should be in text format, separated by commas, tabs or semicolons. The columns must be formatted as follows (a minimal example file is shown after this list):

  • The 1st column must contain unique patient/sample IDs.

  • The 2nd column must contain the class to which each sample belongs.

  • The classes must be exactly TWO and they must be labelled, as character type, with “A” (usually the cases) and “B” (usually the controls).

  • From the 3rd column on, the dataset must contain numerical values that represent the signal corresponding to the markers’ abundance in each sample (marker-related columns).

  • The headers of the marker-related columns can be generic (‘Marker1, Marker2, Marker3, …’) or can directly be the gene/protein names. Please note that “-” (dash) is not allowed in the column names.
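For instance, a minimal semicolon-separated file complying with the rules above might look like this (sample names and values are purely illustrative):

Patient.ID;Class;Marker1;Marker2
SAMPLE01;A;438;187
SAMPLE02;A;345;293
SAMPLE03;B;12;154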

Data loading

The load_data function uses a customized table-reading routine that checks the conformity of the dataset format. If all the checks are passed, marker-related columns are reordered alphabetically by marker name (this is necessary for a proper computation of combinations), and Class is imposed as the name of the second column. The loaded dataset is here assigned to the data object.

Please note that load_data takes the semicolon (“;”) as the default separator: if the dataset to be loaded has a different separator, e.g. a comma (“,”), it is necessary to specify it in the sep argument.

The code below shows how to load a dataset contained in the “data” folder (remember to adjust the path according to your current working directory):

import os
import pandas as pd
import numpy as np
import itertools

import matplotlib
from matplotlib import colors
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
from matplotlib import pyplot as plt
import plotly.figure_factory as ff
import seaborn as sns

import statsmodels.api as sm # 0.9.0
import statsmodels.formula.api as smf
import sklearn
from sklearn import metrics, preprocessing

from scipy import stats

import scanpy as sc

import pycroc as pcr
# Example: load a dataset from a local "data" folder (adjust path and separator)
#data = pcr.load_data('./data/demo_data.csv', sep="\t")

To follow the tutorial you can load the example dataset as follows:

data = pcr.datasets.Zingaretti_2012()

Look at the dataframe structure:

data.head()
Patient.ID Class Marker1 Marker2 Marker3 Marker4 Marker5
0 AIH1 A 438 187 197 298 139
1 AIH2 A 345 293 134 523 335
2 AIH3 A 903 392 300 1253 0
3 AIH4 A 552 267 296 666 22
4 AIH5 A 1451 760 498 884 684

Data exploration

It is usually a good idea to visually explore your data with at least a few plots. To do this, pycroc provides the Distr class, which initializes an object directly from the data once the case class is specified.

distr = pcr.Distr(data, case_class='A')

Box plots are a nice option to observe the distribution of measurements in each sample; the user can of course also plot the data as she/he wishes using any preferred function.

distr.markers_boxplot(ylim=(0,2000))
tutorial/tutorial_files/tutorial_15_0.png

The ROC curve for all markers and its coordinates

The ROC curve shows how many real positive samples would be found positive (sensitivity, or SE) and how many real negative samples would be found negative (specificity, or SP) as a function of the signal threshold. Please note that the False Positive Rate (i.e. 1 - specificity) is plotted on the x-axis. These SE and SP values refer to the signal intensity threshold considering all the markers together; they are not the SE and SP of a single marker/combination, which are computed later in the workflow (see the Combinatorial analysis section).
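To make the threshold-dependent definition of SE and SP concrete, here is a minimal sketch of how such coordinates can be computed on the pooled signal of all markers with sklearn; this is an illustrative reimplementation under that pooling assumption, not the package’s internal code, and the variable names are ours:

import numpy as np
from sklearn import metrics

# Pool all marker values and label each value with its sample's class
values = data.iloc[:, 2:].to_numpy().ravel()
labels = np.repeat((data['Class'] == 'A').to_numpy(), data.shape[1] - 2)

# ROC coordinates over signal thresholds: SE is the true positive rate,
# SP is 1 minus the false positive rate
fpr, tpr, thresholds = metrics.roc_curve(labels, values)
se, sp = tpr * 100, (1 - fpr) * 100  # percentages, as in distr.coord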

#initialize minimal values for SE and SP
distr.markers_coord(min_SE=40, min_SP=80)
print(distr.min_SE, distr.min_SP)
40 80
distr.markers_ROC()
tutorial/tutorial_files/tutorial_18_0.png

Users who want more control over the plotting process can access a long-format dataframe through the object attribute data_long.

distr.data_long.head()
Patient.ID Class Markers Values
0 AIH1 A Marker1 438
1 AIH2 A Marker1 345
2 AIH3 A Marker1 903
3 AIH4 A Marker1 552
4 AIH5 A Marker1 1451

The same is true for the computed threshold coordinates, available through the attribute coord.

distr.coord.head()
threshold SE SP Youden
0 0.0 100.0 0.00 0.00
1 1.0 99.0 11.38 0.10
2 2.5 99.0 11.85 0.11
3 3.5 99.0 12.00 0.11
4 4.5 99.0 12.31 0.11

The density plot and suggested signal threshold

The markers_density method shows the distribution of the signal intensity values for both classes.

In addition, the function allows the user to set both the ylim and xlim values to customize the visualization.

One important feature of the density plot is that it calculates a possible signal intensity threshold: in case of lack of a priori knowledge of the threshold, the user can set the argument signalthr_prediction=True.

In this way the function calculates a “suggested signal threshold” that corresponds to the median of the signal threshold values (in coord) at which SE and SP are greater than or equal to their set minimal values (min_SE and min_SP). This threshold is added to the plot as a dashed black line and a number. The use of the median allows to pick a threshold whose SE/SP are not too close to the limits (min_SE and min_SP), but it is recommended to always inspect coord and choose the most appropriate signal threshold by considering SP, SE and the Youden index.

This suggested signal threshold can be used as the signalthr argument of the Combinations class further in the workflow.
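As a sketch, the suggested threshold can be reproduced by hand from the coord table, using the min_SE/min_SP values set earlier in this tutorial:

# Thresholds at which both SE and SP satisfy the minimal values set above;
# the suggested threshold is their median
sub_coord = distr.coord[(distr.coord['SE'] >= distr.min_SE) &
                        (distr.coord['SP'] >= distr.min_SP)]
suggested_thr = sub_coord['threshold'].median()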

distr.markers_density(ylim=(0,0.002), xlim=(0,4000), signalthr_prediction=True)
tutorial/tutorial_files/tutorial_25_0.png

Combinatorial analysis

Pycroc has a second class, named Combinations, used to compute the marker combinations and to count positive samples for each class (once thresholds are selected).

A sample, to be considered positive for a given combination, must have a value higher than a given signal threshold (signalthr) for at least a given number of markers composing that combination (combithr).
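In other words, the positivity rule for a single combination boils down to the following sketch (is_positive is a hypothetical helper written for illustration, not part of the package):

# A sample is positive for a combination if at least `combithr` of the
# combination's markers exceed `signalthr`
def is_positive(sample_row, combination_markers, signalthr=450, combithr=1):
    return (sample_row[combination_markers] > signalthr).sum() >= combithr

# e.g. is_positive(data.iloc[0], ['Marker1', 'Marker3'])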

As mentioned before, the signalthr parameter should be set depending on the guidelines and characteristics of the methodology used for the assay, or by an accurate inspection of the signal intensity distribution. In the absence of specific guidelines, signalthr can be set to the value suggested by the markers_density method described in the previous section.

In this tutorial, we set signalthr equal to 450 instead of 407.5 in order to reproduce the results reported in Mazzara et al. 2017 (the original CombiROC paper) and in Bombaci & Rossi 2019, as well as in the tutorial of the web app with default thresholds.

combithr, instead, should be set exclusively depending on the needed stringency: 1 is the least stringent and most common choice (meaning that at least one marker in a combination needs to reach the threshold). The resulting tab attribute is a dataframe of all the combinations obtained with the chosen parameters.

combi = pcr.Combinations(data, case_class='A', signalthr=450, combithr=1)
combi.tab.head()
Markers #Positives A #Positives B SE SP # of markers
index
Marker1 Marker1 26 6 65.0 95.4 1
Marker2 Marker2 19 2 47.5 98.5 1
Marker3 Marker3 8 1 20.0 99.2 1
Marker4 Marker4 26 48 65.0 63.1 1
Marker5 Marker5 23 15 57.5 88.5 1

Selection of combinations

The marker combinations can now be ranked and selected. Given the case class specified earlier (“A” in this case), the ranked_combs method ranks the combinations by Youden index, showing the combinations with the highest SE (of cases) and SP (of controls) at the top and facilitating the selection of the best ones. The Youden index (J) is calculated as:

\[J = SE + SP - 1\]

The user can also set (not mandatory) a minimal value of SE and/or SP that a combination must have to be selected, i.e. to be considered a “gold” combination.

combi.ranked_combs()
combi.ranked
SE SP # of markers Youden
index
Combination 11 77.5 95.4 3 0.729
Combination 22 87.5 83.8 4 0.713
Combination 1 72.5 95.4 2 0.679
Combination 2 72.5 95.4 2 0.679
Combination 13 82.5 83.8 3 0.663
Combination 15 82.5 83.8 3 0.663
Combination 18 75.0 86.9 3 0.619
Marker1 65.0 95.4 1 0.604
Combination 4 75.0 83.8 2 0.588
Combination 7 70.0 86.9 2 0.569
Combination 9 67.5 87.7 2 0.552
Combination 26 95.0 56.2 5 0.512
Combination 5 52.5 98.5 2 0.510
Combination 25 92.5 57.7 4 0.502
Combination 21 90.0 60.0 4 0.500
Combination 17 85.0 61.5 3 0.465
Combination 24 90.0 56.2 4 0.462
Combination 23 90.0 56.2 4 0.462
Marker5 57.5 88.5 1 0.460
Marker2 47.5 98.5 1 0.460
Combination 19 87.5 57.7 3 0.452
Combination 14 85.0 60.0 3 0.450
Combination 12 85.0 60.0 3 0.450
Combination 20 85.0 58.5 3 0.435
Combination 6 80.0 61.5 2 0.415
Combination 16 82.5 56.2 3 0.387
Combination 3 77.5 60.0 2 0.375
Combination 10 77.5 59.2 2 0.367
Combination 8 72.5 62.3 2 0.348
Marker4 65.0 63.1 1 0.281
Marker3 20.0 99.2 1 0.192
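For reference, this ranking can be reproduced by hand from the combinations table; the following is a simplified sketch that skips the optional min_SE/min_SP filtering:

# Youden index from the SE/SP percentages in combi.tab, sorted descending
ranked_by_hand = (combi.tab
                  .assign(Youden=(combi.tab['SE'] + combi.tab['SP']) / 100 - 1)
                  .sort_values('Youden', ascending=False))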

One way to get an overview of how single markers and combinations are distributed in the SE/SP space is to plot them with the bubble_chart method. The bigger the bubble, the more markers are in the combination: looking at the size and distribution of bubbles across SE and SP values is useful to anticipate how effective the combinations will be in the ranking. Setting no cutoffs (i.e. SE = 0 and SP = 0), all single markers and combinations (all bubbles) will be considered as gold combinations and ranked in the next step.

In the example below the minimal values of SE and SP are set, respectively, to 40 and 80, in order to reproduce the gold combinations selection reported in Mazzara et al. 2017. The values of the selected combinations, ranked according to the Youden index, are stored in the ranked attribute.

combi.bubble_chart(min_SE=40, min_SP=80)
tutorial/tutorial_files/tutorial_32_0.png

ROC curves

To allow an objective comparison of combinations, pycroc fits a generalized linear model (GLM) to each selected combination.
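As a sketch, the fitted models are binomial GLMs with a logit link on log-transformed marker signals, consistent with the statsmodels summary shown further below; train_models builds the formula for each combination internally, while here it is written out by hand for a hypothetical two-marker combination:

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Binomial GLM (logit link) on log-transformed signals
glm = smf.glm('Class ~ np.log(Marker1 + 1) + np.log(Marker2 + 1)',
              data=data, family=sm.families.Binomial()).fit()
glm.summary()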

# data for the combinations were already stored in the combi object
combi.train_models(combinations=['Marker1', 'Combination 11', 'Combination 15', 'Combination 22'])

Model results are stored in the models attribute of the combi object.

combi.models
{'Marker1': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f40250>,
 'Combination 11': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f41150>,
 'Combination 15': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f13ca0>,
 'Combination 22': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f5eda0>}

The resulting predictions are then used to compute the ROC curves and their corresponding metrics, which are stored in the reports attribute of the combi object. This step relies on what the Combinations object already holds:

  • the data loaded at the beginning of the workflow;

  • the table of combinations with the corresponding positive sample counts (the tab attribute).

In addition, the user has to specify the case class, and the single markers and/or combinations that she/he wants to be displayed, via the train_models arguments. In the example below a single marker (Marker1) and three combinations (numbers 11, 15 and 22) were chosen.

# combi.models['Combination 11'].predict()  # combi score, a.k.a. p(x)
combi.roc_curves(case_class='A')
tutorial/tutorial_files/tutorial_39_0.png
combi.models['Combination 11'].summary()
Generalized Linear Model Regression Results
Dep. Variable: ['Class[A]', 'Class[B]'] No. Observations: 170
Model: GLM Df Residuals: 166
Model Family: Binomial Df Model: 3
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -43.745
Date: Tue, 17 May 2022 Deviance: 87.490
Time: 10:52:39 Pearson chi2: 161.
No. Iterations: 7 Pseudo R-squ. (CS): 0.4382
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -17.0128 2.957 -5.753 0.000 -22.809 -11.217
np.log(Marker1 + 1) 1.5378 0.473 3.252 0.001 0.611 2.465
np.log(Marker2 + 1) 0.9176 0.468 1.960 0.050 9.88e-05 1.835
np.log(Marker3 + 1) 0.5706 0.251 2.269 0.023 0.078 1.063
combi.reports
AUC SE SP OptCutoff Youden ACC TN TP FN FP
Marker1 0.909519 0.90 0.807692 0.218699 0.707692 0.829412 105.0 36.0 4.0 25.0
Combination 11 0.941538 0.95 0.869231 0.216361 0.819231 0.888235 113.0 38.0 2.0 17.0
Combination 15 0.935192 0.90 0.853846 0.247679 0.753846 0.864706 111.0 36.0 4.0 19.0
Combination 22 0.945000 0.95 0.861538 0.219362 0.811538 0.882353 112.0 38.0 2.0 18.0
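The fitted models can also be applied to a dataset with the combi_score function, which returns the predicted score, i.e. p(x), of each sample for each trained marker/combination; when called with classify=True together with the reports and the class labels, it instead returns the predicted class of each sample, using the optimal cutoffs stored in reports.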
pcr.combi_score(data, combi.models)
Patient.ID Marker1 Combination 11 Combination 15 Combination 22
0 AIH1 0.472599 0.541461 0.609072 0.528845
1 AIH2 0.344255 0.497969 0.497881 0.544820
2 AIH3 0.819448 0.899583 0.673739 0.680834
3 AIH4 0.600797 0.746098 0.653317 0.626728
4 AIH5 0.929360 0.978461 0.977912 0.983694
... ... ... ... ... ...
165 no AIH126 0.396019 0.153715 0.145528 0.188304
166 no AIH127 0.207736 0.196093 0.315590 0.207712
167 no AIH128 0.618847 0.254666 0.543287 0.309190
168 no AIH129 0.259477 0.044433 0.036875 0.058829
169 no AIH130 0.060561 0.147093 0.205254 0.203920

170 rows × 5 columns

pcr.combi_score(data, combi.models, classify=True, reports=combi.reports, case_class='A', control_class='B')
Patient.ID Marker1 Combination 11 Combination 15 Combination 22
0 AIH1 A A A A
1 AIH2 A A A A
2 AIH3 A A A A
3 AIH4 A A A A
4 AIH5 A A A A
... ... ... ... ... ...
165 no AIH126 A B B B
166 no AIH127 B B A B
167 no AIH128 A A A A
168 no AIH129 A B B B
169 no AIH130 B B B B

170 rows × 5 columns

Single-cell applications

pycroc can also be applied to single-cell data. The snippet below takes the reduced PBMC dataset shipped with scanpy, selects the top-ranked genes of the ‘CD56+ NK’ cluster and converts the AnnData object into the tabular format required by the workflow:

adata = sc.datasets.pbmc68k_reduced()
adata

# Top-ranked genes for the 'CD56+ NK' cluster
gene_list = adata.uns['rank_genes_groups']['names']['CD56+ NK'][0:29]

# Convert the AnnData object to the pycroc input format, using the
# 'bulk_labels' annotation to define the case class
pcr.scanpy_to_pycroc(AnnData=adata, gene_list=gene_list, obs_name='bulk_labels',
                     case_class=['CD56+ NK'], case_label='NK')