Tutorial
Summary
Powerful biomarkers are important tools in diagnostic, clinical and research settings. In the area of diagnostic medicine, a biomarker is often used to identify subjects with a disease, or at high risk of developing it. Moreover, it can be used to predict the most likely outcome of the disease, monitor its progression and predict the response to a given therapy. Diagnostic accuracy can be improved considerably by combining multiple markers, whose performance in identifying diseased subjects is usually assessed via receiver operating characteristic (ROC) curves. The CombiROC tool was originally designed as an easy-to-use R-Shiny web application to determine optimal combinations of markers from diverse complex omics data (Mazzara et al. 2017); such an implementation is easy to use, but its features are limited and constrained by the machine it is deployed on. The CombiROC package is the natural evolution of the CombiROC tool and allows the researcher/analyst to freely use the method and further build on it.
The complete workflow
This Python package was developed as a port of the popular CombiROC workflow; all the code and this tutorial were developed with that work in mind.
The aim of this document is to show the whole CombiROC workflow to get you up and running as quickly as possible with this package. To do so we’re going to use the proteomic dataset from Zingaretti et al. 2012, containing multi-marker signatures for Autoimmune Hepatitis (AIH), with samples clinically diagnosed as “abnormal” (class A) or “normal” (class B). The goal of the workflow is first to find the marker combinations, then to assess their performance in classifying samples of the dataset.
Note: if you use CombiROC in your research, please cite Mazzara et al. 2017 (the original CombiROC paper).
Required data format
The dataset to be analysed should be in text format, with values separated by commas, tabs or semicolons. The columns must be formatted as follows:
The 1st column must contain unique patient/sample IDs.
The 2nd column must contain the class to which each sample belongs.
The classes must be exactly TWO and they must be labelled with character format with “A” (usually the cases) and “B” (usually the controls).
From the 3rd column on, the dataset must contain numerical values that represent the signal corresponding to the markers abundance in each sample (marker-related columns).
The headers of the marker-related columns can be generic (‘Marker1, Marker2, Marker3, …’) or the actual gene/protein names. Please note that “-” (dash) is not allowed in column names.
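For instance, a minimal comma-separated file respecting this format (with made-up IDs and values) could look like this:
Patient.ID,Class,Marker1,Marker2
CASE1,A,512,231
CASE2,A,389,170
CTRL1,B,24,11
CTRL2,B,57,36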
Data loading
The load_data function uses a customized table-reading routine that checks the conformity of the dataset format. If all the checks are passed, marker-related columns are reordered alphabetically by marker name (this is necessary for a proper computation of combinations), and Class is imposed as the name of the second column. The loaded dataset is here assigned to the data object.
Please note that load_data takes the semicolon (“;”) as default separator: if the dataset to be loaded has a different separator, e.g. a comma (“,”), it is necessary to specify it in the sep argument.
The code below shows how to load a data set contained in the “data” folder (remember to adjust the path according to your current working directory):
import os
import pandas as pd
import numpy as np
import itertools
import matplotlib
from matplotlib import colors
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
from matplotlib import pyplot as plt
import plotly.figure_factory as ff
import seaborn as sns
import statsmodels.api as sm # 0.9.0
import statsmodels.formula.api as smf
import sklearn
from sklearn import metrics, preprocessing
from scipy import stats
import scanpy as sc
import pycroc as pcr
#data = pcr.load_data('./data/demo_data.csv', sep="\t")
To follow the tutorial, you can load the example dataset as follows:
data = pcr.datasets.Zingaretti_2012()
Look at the dataframe structure:
data.head()
  | Patient.ID | Class | Marker1 | Marker2 | Marker3 | Marker4 | Marker5
---|---|---|---|---|---|---|---
0 | AIH1 | A | 438 | 187 | 197 | 298 | 139 |
1 | AIH2 | A | 345 | 293 | 134 | 523 | 335 |
2 | AIH3 | A | 903 | 392 | 300 | 1253 | 0 |
3 | AIH4 | A | 552 | 267 | 296 | 666 | 22 |
4 | AIH5 | A | 1451 | 760 | 498 | 884 | 684 |
Data exploration
It is usually a good thing to visually explore your data with at least a few plots. To do this, pycroc provides the Distr class, which initializes an object directly from the data once the case class is specified:
distr = pcr.Distr(data, case_class='A')
Box plots are a nice option to observe the distribution of measurements in each sample; users can also plot the data as they wish with their preferred functions.
distr.markers_boxplot(ylim=(0,2000))

The ROC curve for all markers and its coordinates
The ROC curve shows how many real positive samples would be found positive (sensitivity, or SE) and how many real negative samples would be found negative (specificity, or SP) as a function of the signal threshold. Please note that the false positive rate (i.e. 1 - specificity) is plotted on the x-axis. These SE and SP values refer to the signal intensity threshold considering all the markers together; they are not the SE and SP of a single marker/combination computed by the se_sp() function further discussed in the Sensitivity and specificity paragraph.
#initialize minimal values for SE and SP
distr.markers_coord(min_SE=40, min_SP=80)
print(distr.min_SE, distr.min_SP)
40 80
distr.markers_ROC()

Users who want more control over the plotting process can access a long-format dataframe via the data_long attribute:
distr.data_long.head()
  | Patient.ID | Class | Markers | Values
---|---|---|---|---
0 | AIH1 | A | Marker1 | 438 |
1 | AIH2 | A | Marker1 | 345 |
2 | AIH3 | A | Marker1 | 903 |
3 | AIH4 | A | Marker1 | 552 |
4 | AIH5 | A | Marker1 | 1451 |
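Since data_long is a plain pandas dataframe, any plotting library can be used on it; for instance, a minimal sketch of a custom box plot with seaborn (already imported above):
# custom box plot of marker values per class, built directly from data_long
ax = sns.boxplot(data=distr.data_long, x='Markers', y='Values', hue='Class')
ax.set_ylim(0, 2000)
plt.show()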
The same is true for the computed ROC coordinates and candidate thresholds, available via the coord attribute:
distr.coord.head()
  | threshold | SE | SP | Youden
---|---|---|---|---
0 | 0.0 | 100.0 | 0.00 | 0.00 |
1 | 1.0 | 99.0 | 11.38 | 0.10 |
2 | 2.5 | 99.0 | 11.85 | 0.11 |
3 | 3.5 | 99.0 | 12.00 | 0.11 |
4 | 4.5 | 99.0 | 12.31 | 0.11 |
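For example, coord can be used to pick the threshold that maximizes the Youden index (an illustrative pandas query, not a package feature):
# row of coord with the highest Youden index
best = distr.coord.loc[distr.coord['Youden'].idxmax()]
print(best['threshold'], best['SE'], best['SP'])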
The density plot and suggested signal threshold
The markers_density method shows the distribution of the signal intensity values for both classes. In addition, it allows the user to set both the ylim and xlim values to customize the visualization.
One important feature of the density plot is that it can calculate a possible signal intensity threshold: in case of lack of a priori knowledge of the threshold, the user can set the argument signalthr_prediction=True.
In this way the method calculates a “suggested signal threshold” that corresponds to the median of the signal threshold values (in coord) at which SE and SP are greater than or equal to their set minimal values (min_SE and min_SP). This threshold is added to the plot as a dashed black line and a number. The use of the median allows picking a threshold whose SE/SP are not too close to the limits (min_SE and min_SP), but it is recommended to always inspect coord and choose the most appropriate signal threshold by considering SP, SE and the Youden index.
This suggested signal threshold can be used as the signalthr argument of the Combinations class further on in the workflow.
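The same suggestion can be reproduced by hand from the coord attribute; a minimal sketch of the underlying logic, assuming min_SE and min_SP were set as above:
# keep only coordinates meeting the minimal SE/SP requirements...
sub_coord = distr.coord[(distr.coord['SE'] >= distr.min_SE) & (distr.coord['SP'] >= distr.min_SP)]
# ...and take the median of the eligible thresholds as the suggestion
print(sub_coord['threshold'].median())  # 407.5 for this dataset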
distr.markers_density(ylim=(0,0.002), xlim=(0,4000), signalthr_prediction=True)

Combinatorial analysis
Pycroc has a second class, named Combinations, used to compute the marker combinations and count the positive samples for each class (once thresholds are selected).
To be considered positive for a given combination, a sample must have a value higher than a given signal threshold (signalthr) for at least a given number of the markers composing that combination (combithr); a minimal sketch of this rule is given below.
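As an illustration (not the package’s internal code), the positivity rule for a two-marker combination, using the thresholds adopted later in this tutorial, could be written as:
# a sample is positive for the combination if at least combithr
# of its markers exceed signalthr
markers = ['Marker1', 'Marker2']
signalthr, combithr = 450, 1
positive = (data[markers] > signalthr).sum(axis=1) >= combithr
print(data.loc[positive, 'Class'].value_counts())  # positives per class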
As mentioned before, the signalthr parameter should be set depending on the guidelines and characteristics of the methodology used for the assay, or by an accurate inspection of the signal intensity distribution. In case of lack of specific guidelines, one should set signalthr to the value suggested by the markers_density method described in the previous section.
In this tutorial, we set signalthr equal to 450 instead of 407.5 in order to reproduce the results reported in Mazzara et al. 2017 (the original CombiROC paper) and in Bombaci & Rossi 2019, as well as in the tutorial of the web app with default thresholds.
combithr, instead, should be set exclusively depending on the needed stringency: 1 is the least stringent and most common choice (meaning that at least one marker in a combination needs to reach the threshold). The obtained tab attribute is a dataframe of all the combinations computed with the chosen parameters.
combi = pcr.Combinations(data, case_class='A', signalthr=450, combithr=1)
combi.tab.head()
index | Markers | #Positives A | #Positives B | SE | SP | # of markers
---|---|---|---|---|---|---
Marker1 | Marker1 | 26 | 6 | 65.0 | 95.4 | 1 |
Marker2 | Marker2 | 19 | 2 | 47.5 | 98.5 | 1 |
Marker3 | Marker3 | 8 | 1 | 20.0 | 99.2 | 1 |
Marker4 | Marker4 | 26 | 48 | 65.0 | 63.1 | 1 |
Marker5 | Marker5 | 23 | 15 | 57.5 | 88.5 | 1 |
Selection of combinations
The marker combinations can now be ranked and selected. Having specified the case class (“A” in this case) when building the Combinations object, the ranked_combs method ranks the combinations by the Youden index, showing the combinations with the highest SE (of cases) and SP (of controls) at the top and facilitating the selection of the best ones. The Youden index (J) is calculated as:
J = SE + SP - 1
with SE and SP expressed as fractions.
The user can also set (not mandatory) a minimal value of SE and/or SP that a combination must have in order to be selected, i.e. to be considered a “gold” combination.
combi.ranked_combs()
combi.ranked
index | SE | SP | # of markers | Youden
---|---|---|---|---
Combination 11 | 77.5 | 95.4 | 3 | 0.729 |
Combination 22 | 87.5 | 83.8 | 4 | 0.713 |
Combination 1 | 72.5 | 95.4 | 2 | 0.679 |
Combination 2 | 72.5 | 95.4 | 2 | 0.679 |
Combination 13 | 82.5 | 83.8 | 3 | 0.663 |
Combination 15 | 82.5 | 83.8 | 3 | 0.663 |
Combination 18 | 75.0 | 86.9 | 3 | 0.619 |
Marker1 | 65.0 | 95.4 | 1 | 0.604 |
Combination 4 | 75.0 | 83.8 | 2 | 0.588 |
Combination 7 | 70.0 | 86.9 | 2 | 0.569 |
Combination 9 | 67.5 | 87.7 | 2 | 0.552 |
Combination 26 | 95.0 | 56.2 | 5 | 0.512 |
Combination 5 | 52.5 | 98.5 | 2 | 0.510 |
Combination 25 | 92.5 | 57.7 | 4 | 0.502 |
Combination 21 | 90.0 | 60.0 | 4 | 0.500 |
Combination 17 | 85.0 | 61.5 | 3 | 0.465 |
Combination 24 | 90.0 | 56.2 | 4 | 0.462 |
Combination 23 | 90.0 | 56.2 | 4 | 0.462 |
Marker5 | 57.5 | 88.5 | 1 | 0.460 |
Marker2 | 47.5 | 98.5 | 1 | 0.460 |
Combination 19 | 87.5 | 57.7 | 3 | 0.452 |
Combination 14 | 85.0 | 60.0 | 3 | 0.450 |
Combination 12 | 85.0 | 60.0 | 3 | 0.450 |
Combination 20 | 85.0 | 58.5 | 3 | 0.435 |
Combination 6 | 80.0 | 61.5 | 2 | 0.415 |
Combination 16 | 82.5 | 56.2 | 3 | 0.387 |
Combination 3 | 77.5 | 60.0 | 2 | 0.375 |
Combination 10 | 77.5 | 59.2 | 2 | 0.367 |
Combination 8 | 72.5 | 62.3 | 2 | 0.348 |
Marker4 | 65.0 | 63.1 | 1 | 0.281 |
Marker3 | 20.0 | 99.2 | 1 | 0.192 |
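Since ranked is a plain dataframe, the gold combinations can also be extracted by hand; for example, with the SE/SP cutoffs used below (an illustrative pandas filter, not a package feature):
# keep only combinations with SE >= 40 and SP >= 80 ("gold" combinations)
gold = combi.ranked[(combi.ranked['SE'] >= 40) & (combi.ranked['SP'] >= 80)]
print(gold)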
A possibility to get an overview of how single markers and all combinations are distributed in the SE/SP space is to plot them with the bubble_chart method. The bigger the bubble, the more markers are in the combination: looking at the size and distribution of bubbles across SE and SP values is useful to anticipate how effective the combinations will be in the ranking. Setting no cutoffs (i.e. SE = 0 and SP = 0), all single markers and combinations (all bubbles) will be considered as gold combinations and ranked in the next step.
In the example below the minimal values of SE and SP are set, respectively, to 40 and 80, in order to reproduce the gold combinations selection reported in Mazzara et al. 2017. The combinations meeting these cutoffs, ranked according to the Youden index, are stored in the ranked attribute.
combi.bubble_chart(min_SE=40, min_SP=80)

ROC curves
To allow an objective comparison of the combinations, pycroc fits a generalised linear model (GLM) for each selected combination.
# data for the combinations were already stored in the combi object
combi.train_models(combinations=['Marker1', 'Combination 11', 'Combination 15', 'Combination 22'])
Model results are stored in the models attribute of the combi object.
combi.models
{'Marker1': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f40250>,
'Combination 11': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f41150>,
'Combination 15': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f13ca0>,
'Combination 22': <statsmodels.genmod.generalized_linear_model.GLMResultsWrapper at 0x7f6157f5eda0>}
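Each entry is a fitted statsmodels GLM. Judging from the summary shown below, the models are binomial regressions with a logit link on log-transformed marker signals; an equivalent model for 'Combination 11' could be refitted by hand roughly as follows (the marker set is read off that summary):
# binomial GLM with logit link on log-transformed marker values,
# mirroring the formula visible in the summary below
refit = smf.glm('Class ~ np.log(Marker1 + 1) + np.log(Marker2 + 1) + np.log(Marker3 + 1)',
                data=data, family=sm.families.Binomial()).fit()
print(refit.summary())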
The resulting predictions are then used to compute the ROC curves and their corresponding metrics, which are stored in the reports attribute of the combi object. Since the data and the trained models are already stored in the Combinations object, the roc_curves method only requires the case class to be specified; the single markers and/or combinations displayed are those previously selected with train_models. In the example below a single marker (Marker1) and three combinations (numbers 11, 15 and 22) are shown.
#combi.models['Combination 11'].predict() # combi score a.k.a p(x)
combi.roc_curves(case_class='A')

combi.models['Combination 11'].summary()
Dep. Variable: | ['Class[A]', 'Class[B]'] | No. Observations: | 170 |
---|---|---|---|
Model: | GLM | Df Residuals: | 166 |
Model Family: | Binomial | Df Model: | 3 |
Link Function: | Logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -43.745 |
Date: | Tue, 17 May 2022 | Deviance: | 87.490 |
Time: | 10:52:39 | Pearson chi2: | 161. |
No. Iterations: | 7 | Pseudo R-squ. (CS): | 0.4382 |
Covariance Type: | nonrobust |
  | coef | std err | z | P>|z| | [0.025 | 0.975
---|---|---|---|---|---|---
Intercept | -17.0128 | 2.957 | -5.753 | 0.000 | -22.809 | -11.217 |
np.log(Marker1 + 1) | 1.5378 | 0.473 | 3.252 | 0.001 | 0.611 | 2.465 |
np.log(Marker2 + 1) | 0.9176 | 0.468 | 1.960 | 0.050 | 9.88e-05 | 1.835 |
np.log(Marker3 + 1) | 0.5706 | 0.251 | 2.269 | 0.023 | 0.078 | 1.063 |
The performance metrics of each model (AUC, SE, SP, optimal cutoff, Youden index, accuracy and confusion matrix counts) are collected in the reports attribute:
combi.reports
  | AUC | SE | SP | OptCutoff | Youden | ACC | TN | TP | FN | FP
---|---|---|---|---|---|---|---|---|---|---
Marker1 | 0.909519 | 0.90 | 0.807692 | 0.218699 | 0.707692 | 0.829412 | 105.0 | 36.0 | 4.0 | 25.0 |
Combination 11 | 0.941538 | 0.95 | 0.869231 | 0.216361 | 0.819231 | 0.888235 | 113.0 | 38.0 | 2.0 | 17.0 |
Combination 15 | 0.935192 | 0.90 | 0.853846 | 0.247679 | 0.753846 | 0.864706 | 111.0 | 36.0 | 4.0 | 19.0 |
Combination 22 | 0.945000 | 0.95 | 0.861538 | 0.219362 | 0.811538 | 0.882353 | 112.0 | 38.0 | 2.0 | 18.0 |
The combi_score function applies the trained models to a dataset and returns, for each sample, the predicted probability (the combi score) of belonging to the case class:
pcr.combi_score(data, combi.models)
  | Patient.ID | Marker1 | Combination 11 | Combination 15 | Combination 22
---|---|---|---|---|---
0 | AIH1 | 0.472599 | 0.541461 | 0.609072 | 0.528845 |
1 | AIH2 | 0.344255 | 0.497969 | 0.497881 | 0.544820 |
2 | AIH3 | 0.819448 | 0.899583 | 0.673739 | 0.680834 |
3 | AIH4 | 0.600797 | 0.746098 | 0.653317 | 0.626728 |
4 | AIH5 | 0.929360 | 0.978461 | 0.977912 | 0.983694 |
... | ... | ... | ... | ... | ... |
165 | no AIH126 | 0.396019 | 0.153715 | 0.145528 | 0.188304 |
166 | no AIH127 | 0.207736 | 0.196093 | 0.315590 | 0.207712 |
167 | no AIH128 | 0.618847 | 0.254666 | 0.543287 | 0.309190 |
168 | no AIH129 | 0.259477 | 0.044433 | 0.036875 | 0.058829 |
169 | no AIH130 | 0.060561 | 0.147093 | 0.205254 | 0.203920 |
170 rows × 5 columns
With classify=True, the combi scores are converted into class labels using each model’s optimal cutoff stored in reports:
pcr.combi_score(data, combi.models, classify=True, reports=combi.reports, case_class='A', control_class='B')
  | Patient.ID | Marker1 | Combination 11 | Combination 15 | Combination 22
---|---|---|---|---|---
0 | AIH1 | A | A | A | A |
1 | AIH2 | A | A | A | A |
2 | AIH3 | A | A | A | A |
3 | AIH4 | A | A | A | A |
4 | AIH5 | A | A | A | A |
... | ... | ... | ... | ... | ... |
165 | no AIH126 | A | B | B | B |
166 | no AIH127 | B | B | A | B |
167 | no AIH128 | A | A | A | A |
168 | no AIH129 | A | B | B | B |
169 | no AIH130 | B | B | B | B |
170 rows × 5 columns
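As a sanity check, the predicted labels can be compared with the true classes; for instance, the accuracy of 'Combination 11' computed by hand should match the ACC column of reports (an illustrative check, not a package feature):
# fraction of samples whose predicted label matches the true class
predicted = pcr.combi_score(data, combi.models, classify=True,
                            reports=combi.reports, case_class='A', control_class='B')
print((predicted['Combination 11'].values == data['Class'].values).mean())  # ≈ 0.888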
Single-cell applications
Pycroc can also be applied to single-cell RNA-seq data. The example below loads scanpy’s reduced PBMC 68k dataset, takes the top-ranked genes for the ‘CD56+ NK’ cluster as candidate markers, and converts the AnnData object into a CombiROC-formatted dataframe in which cells of that cluster form the case class (here assuming the converter is exposed as pcr.scanpy_to_pycroc).
import scanpy as sc
adata=sc.datasets.pbmc68k_reduced()
adata
gene_list=adata.uns['rank_genes_groups']['names']['CD56+ NK'][0:29]
data_sc = pcr.scanpy_to_pycroc(AnnData=adata, gene_list=gene_list, obs_name='bulk_labels', case_class=['CD56+ NK'], case_label='NK')
The resulting dataframe can then enter the same workflow shown above (Distr, Combinations, and so on).