dandelion.preprocessing.calculate_threshold

dandelion.preprocessing.calculate_threshold(self, manual_threshold=None, model=None, normalize_method=None, threshold_method=None, edge=None, cross=None, subsample=None, threshold_model=None, cutoff=None, sensitivity=None, specificity=None, ncpu=None, plot=True, plot_group=None, figsize=(4.5, 2.5), *args)[source]

Calculating nearest neighbor distances for tuning clonal assignment with shazam.

Runs the following:

distToNearest

Get non-zero distance of every heavy chain (IGH) sequence (as defined by sequenceColumn) to its nearest sequence in a partition of heavy chains sharing the same V gene, J gene, and junction length (VJL), or in a partition of single cells with heavy chains sharing the same heavy chain VJL combination, or of single cells with heavy and light chains sharing the same heavy chain VJL and light chain VJL combinations.

findThreshold

Automatically determines an optimal threshold for clonal assignment of Ig sequences using a vector of nearest neighbor distances. It provides two alternative methods, using either a Gamma/Gaussian mixture model fit (threshold_method="gmm") or a kernel density fit (threshold_method="density").
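The two steps above can be illustrated with a small pure-Python sketch. It is not shazam's implementation: the input format (tuples of V gene, J gene, and junction), the plain Hamming distance, the fixed kernel bandwidth, and the helper names `dist_to_nearest` and `density_threshold` are all assumptions made for illustration. It only shows the idea of partitioning by V gene, J gene, and junction length, taking the nearest non-zero length-normalized distance, and then placing a threshold in the valley between the two modes of the distance distribution.

```python
import math
from collections import defaultdict


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def dist_to_nearest(records):
    """Nearest non-zero, length-normalized Hamming distance per sequence.

    records: list of (v_gene, j_gene, junction) tuples (toy input format).
    Returns a list aligned with records; None marks sequences that have
    no non-identical partner in their V gene / J gene / length partition.
    """
    partitions = defaultdict(list)
    for idx, (v, j, junction) in enumerate(records):
        partitions[(v, j, len(junction))].append(idx)

    nearest = [None] * len(records)
    for (_, _, length), members in partitions.items():
        if length == 0:
            continue  # nothing to compare, and avoids dividing by zero
        for i in members:
            dists = [hamming(records[i][2], records[k][2]) / length
                     for k in members if k != i]
            nonzero = [d for d in dists if d > 0]
            if nonzero:
                nearest[i] = min(nonzero)
    return nearest


def density_threshold(distances, bandwidth=0.03, grid_size=301):
    """Lowest point of a Gaussian kernel density between its two tallest peaks.

    Returns None when the density has fewer than two peaks.
    """
    values = [d for d in distances if d is not None]
    if len(values) < 2:
        return None
    # Pad the grid so that peaks sitting at the data extremes stay interior.
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + step * g for g in range(grid_size)]
    dens = [sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
            for x in grid]
    peaks = [g for g in range(1, grid_size - 1)
             if dens[g - 1] < dens[g] >= dens[g + 1]]
    if len(peaks) < 2:
        return None
    # Keep the two tallest peaks, ordered left to right, and find the valley.
    left, right = sorted(sorted(peaks, key=lambda g: dens[g])[-2:])
    valley = min(range(left, right + 1), key=lambda g: dens[g])
    return grid[valley]
```

In this sketch, distances below the returned value would be treated as within-clone and distances above it as between-clone. The real function delegates both steps to shazam, which supports other distance models and a mixture-model fit in addition to the density approach.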

Parameters
  • self (Dandelion, DataFrame, str) – Dandelion object, pandas DataFrame in changeo/airr format, or file path to changeo/airr file after clones have been determined.

  • manual_threshold (float, optional) – value to manually plot in histogram.

  • model (str, optional) – underlying SHM model, which must be one of “ham”, “aa”, “hh_s1f”, “hh_s5f”, “mk_rs1nf”, “hs1f_compat”, or “m1n_compat”.

  • normalize_method (str, optional) – method of normalization. The default is “len”, which divides the distance by the length of the sequence group. If “none”, no normalization is performed.

  • threshold_method (str, optional) – string defining the method to use for determining the optimal threshold. One of “gmm” or “density”.

  • edge (float, optional) – upper range as a fraction of the data density to rule initialization of Gaussian fit parameters. Default value is 0.9 (that is, 90%). Applies only when threshold_method="density".

  • cross (Sequence, optional) – supplementary nearest neighbor distance vector output from distToNearest for initialization of the Gaussian fit parameters. Applies only when threshold_method="gmm".

  • subsample (int, optional) – maximum number of distances to subsample to before threshold detection.

  • threshold_model (str, optional) – allows the user to choose among four possible combinations of fitting curves: “norm-norm”, “norm-gamma”, “gamma-norm”, and “gamma-gamma”. Applies only when threshold_method="gmm".

  • cutoff (str, optional) – method to use for threshold selection: the optimal threshold (“optimal”), the intersection point of the two fitted curves (“intersect”), or a user-defined value for either the sensitivity or the specificity (“user”). Applies only when threshold_method="gmm".

  • sensitivity (float, optional) – sensitivity required. Applies only when threshold_method="gmm" and cutoff="user".

  • specificity (float, optional) – specificity required. Applies only when threshold_method="gmm" and cutoff="user".

  • ncpu (int, optional) – number of cpus for parallelization. Default is all available cpus.

  • plot (bool) – whether or not to return plot.

  • plot_group (str, optional) – determines the fill color and facets.

  • figsize (Tuple[Union[int,float], Union[int,float]]) – size of plot. Default is (4.5, 2.5).

  • *args – passed to shazam’s distToNearest.
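Several of the parameters above only apply to one threshold method or one cutoff mode. The following is a hypothetical pre-flight check, not part of dandelion or shazam, that encodes those documented constraints so that inapplicable combinations are caught before a call. The helper name `check_threshold_kwargs` and the choice to raise `ValueError` are assumptions; the library itself may simply ignore options that do not apply.

```python
THRESHOLD_MODELS = {"norm-norm", "norm-gamma", "gamma-norm", "gamma-gamma"}
CUTOFFS = {"optimal", "intersect", "user"}
GMM_ONLY = {"cross", "threshold_model", "cutoff", "sensitivity", "specificity"}
DENSITY_ONLY = {"edge"}


def check_threshold_kwargs(threshold_method, **options):
    """Validate option combinations and return the keyword arguments to pass.

    Hypothetical helper: options left as None are dropped, and options that
    the documentation says do not apply to the chosen method are rejected.
    """
    given = {k: v for k, v in options.items() if v is not None}
    if threshold_method not in ("gmm", "density"):
        raise ValueError("threshold_method must be 'gmm' or 'density'")

    # Options documented for the other method only.
    not_applicable = GMM_ONLY if threshold_method == "density" else DENSITY_ONLY
    clash = sorted(not_applicable.intersection(given))
    if clash:
        raise ValueError(f"{clash} do not apply to {threshold_method!r}")

    if "threshold_model" in given and given["threshold_model"] not in THRESHOLD_MODELS:
        raise ValueError("unknown threshold_model")

    cutoff = given.get("cutoff")
    if cutoff is not None and cutoff not in CUTOFFS:
        raise ValueError("cutoff must be 'optimal', 'intersect' or 'user'")

    # sensitivity / specificity are documented only for cutoff="user".
    targets = {"sensitivity", "specificity"}.intersection(given)
    if cutoff == "user" and not targets:
        raise ValueError("cutoff='user' needs sensitivity or specificity")
    if targets and cutoff != "user":
        raise ValueError("sensitivity/specificity require cutoff='user'")

    return {"threshold_method": threshold_method, **given}


# Possible use, assuming `vdj` is a Dandelion object with clones defined:
# from dandelion.preprocessing import calculate_threshold
# kwargs = check_threshold_kwargs("gmm", cutoff="user", specificity=0.995)
# calculate_threshold(vdj, **kwargs)
```

Failing early in Python is a design choice for this sketch; the R-side computation only starts once the arguments reach shazam.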

Return type

Dandelion

Returns

  • Dandelion object with distance threshold value in .threshold.

  • If plot=True, a plotnine plot showing a histogram of the length-normalized ham model distances with the threshold.