Provides concrete implementations of statistical techniques, as well as all associated objects (selectors, plots, etc.).
Provides concrete implementations of statistical techniques that use the l1l2py library.
Wrapper job function for l1l2py.algorithms.ridge_regression(). This function transposes the data subset if its dimensions do not match the labels vector, performs the call, and calculates the classification error with the supplied error function. See the l1l2py documentation for more details.
Parameters:
    data : numpy.ndarray
    labels : numpy.ndarray
    error_func : callable
Returns:
    beta : numpy.ndarray
    error : float
    labels : numpy.ndarray
    labels_predicted : numpy.ndarray
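The wrapper's transpose-and-evaluate behavior can be sketched roughly as follows. This is a minimal illustrative stand-in, not the actual KDVS code: the function name and the pseudo-inverse solver are assumptions, exploiting the fact that ridge regression with mu=0.0 reduces to ordinary least squares.

```python
import numpy as np

def ridge_job_sketch(data, labels, error_func):
    # Hypothetical sketch of the wrapper's behavior, not the actual KDVS code.
    # Transpose the data subset when its dimensions do not match the labels vector.
    if data.shape[0] != labels.shape[0]:
        data = data.T
    # Stand-in for l1l2py.algorithms.ridge_regression(data, labels, mu=0.0):
    # with mu=0.0, ridge regression reduces to ordinary least squares.
    beta = np.linalg.pinv(data).dot(labels)
    labels_predicted = np.sign(data.dot(beta))
    error = error_func(labels, labels_predicted)
    return beta, error, labels, labels_predicted
```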
Bases: kdvs.fw.Stat.Technique
Classifier based on l1l2py.algorithms.ridge_regression() (called with mu=0.0). It uses l1l2py v1.0.5. It can be configured with the following parameters:
- ‘error_func’ (callable) – callable used as error function
- ‘return_predictions’ (boolean) – if predictions shall be returned
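A hypothetical configuration for this technique might look as follows; the error function below is a simple sign-mismatch rate written here for illustration (l1l2py ships its own error functions, which would normally be used instead):

```python
import numpy as np

def classification_error(labels, predictions):
    # Fraction of sign mismatches; an illustrative stand-in for the
    # error functions provided by l1l2py.
    return float(np.mean(np.sign(labels) != np.sign(predictions)))

# Hypothetical configuration dictionary for the technique.
technique_config = {
    'error_func': classification_error,
    'return_predictions': True,
}
```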
The configuration parameters are interpreted once, during initialization. This technique uses single virtual degree of freedom (DOF). The following Results elements are produced:
- ‘Classification Error’ (float) – classification error
- ‘Beta’ (numpy.ndarray) – solution obtained from ridge regression
- ‘CM MCC’ (tuple: (int, int, int, int, float)) – the number of true positives, true negatives, false positives, and false negatives, and the Matthews Correlation Coefficient
- ‘Predictions’ (dict) – if ‘return_predictions’ was True, it contains the dictionary:
  { ‘orig_labels’ : iterable of original labels in numerical format (numpy.sign), ‘pred_labels’ : iterable of predicted labels in numerical format (numpy.sign), ‘orig_samples’ : iterable of original samples }
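The ‘CM MCC’ tuple can be illustrated with a small sketch (a hypothetical helper written for illustration; labels are assumed to be in numerical sign format):

```python
import math

def cm_mcc(labels, predictions):
    # Hypothetical sketch of the 'CM MCC' element:
    # (TP, TN, FP, FN, Matthews Correlation Coefficient).
    tp = sum(1 for y, p in zip(labels, predictions) if y > 0 and p > 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y < 0 and p < 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y < 0 and p > 0)
    fn = sum(1 for y, p in zip(labels, predictions) if y > 0 and p < 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return (tp, tn, fp, fn, mcc)
```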
This technique creates an empty ‘Selection’ Results element. This technique produces a single Job instance. This technique does not generate any plots.
See also
Parameters:
    kwargs : dict
Raises:
    Error
Create a single Job that wraps an l1l2_ols_job_wrapper() call together with the proper arguments and associated additional data.
Parameters:
    ssname : string
    data : numpy.ndarray
    labels : numpy.ndarray
    additionalJobData : dict
Returns:
    (jID, job) : string, Job
Notes
Proper order of data and labels must be ensured for the technique to work. Typically, subsets are generated according to the sample order specified within the primary data set; labels must be in the same order. By design, KDVS does not check this.
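For illustration, aligning labels to the sample order of a subset could be done like this (the sample names and label values are hypothetical):

```python
# Sample order used by the data subset (hypothetical names).
samples = ['S3', 'S1', 'S2']
# Labels keyed by sample name, e.g. loaded from a label file.
label_map = {'S1': 1, 'S2': -1, 'S3': 1}
# Reorder the labels to match the subset's sample order.
labels = [label_map[s] for s in samples]
```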
Produce a Results instance for exactly one job result.
Parameters:
    ssname : string
    jobs : iterable of Job
    runtime_data : dict
Returns:
    final_results : Results
Raises:
    Error
Wrapper job function for l1l2py.algorithms.ridge_regression(). This function transposes the data subset if its dimensions do not match the labels vector, creates training and test splits based on the supplied index sets, performs the call for each value of the lambda parameter, and calculates errors on training and test splits with the supplied error function. See the l1l2py documentation for more details.
Parameters:
    data : numpy.ndarray
    labels : numpy.ndarray
    calls : dict
Returns:
    results : dict
Bases: kdvs.fw.Stat.Technique
Classifier based on l1l2py.algorithms.ridge_regression(), called with a range of mu values (also referred to as ‘lambda’ here). It uses l1l2py v1.0.5. This technique produces training and test splits. It can be configured with the following parameters:
- ‘error_func’ (callable) – callable used as error function
- ‘return_predictions’ (boolean) – if predictions shall be returned
- ‘external_k’ (int) – number of splits to make
- ‘lambda_min’ (float) – minimum value of lambda parameter range
- ‘lambda_max’ (float) – maximum value of lambda parameter range
- ‘lambda_number’ (int) – number of values of lambda parameter range
- ‘lambda_range_type’ (string) – type of lambda parameter range to generate (‘geometric’/’linear’)
- ‘lambda_range’ (iterable of float) – custom lambda parameter values to use (None if not used)
- ‘data_normalizer’ (callable) – callable used to normalize data subset
- ‘labels_normalizer’ (callable) – callable used to normalize label values
- ‘ext_split_sets’ (None) – placeholder parameter for pre-computed splits (if any)
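The lambda range parameters can be illustrated with a short sketch (a hypothetical helper; l1l2py provides its own range generators, which would normally be used):

```python
import numpy as np

def make_lambda_range(lambda_min, lambda_max, lambda_number, range_type='geometric'):
    # Hypothetical sketch of how 'lambda_min', 'lambda_max', 'lambda_number'
    # and the range type could translate into actual lambda values.
    if range_type == 'geometric':
        return np.geomspace(lambda_min, lambda_max, lambda_number)
    return np.linspace(lambda_min, lambda_max, lambda_number)
```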
The configuration parameters are interpreted once, during initialization. This technique uses single virtual degree of freedom (DOF). The following Results elements are produced:
- ‘Classification Error’ (float) – classification error for all DOFs (based on ‘Avg Err TS’)
- ‘Avg Err TS’ (numpy.ndarray) – mean of error obtained on test splits
- ‘Std Err TS’ (numpy.ndarray) – standard deviation of error obtained on test splits
- ‘Med Err TS’ (numpy.ndarray) – median of error obtained on test splits
- ‘Var Err TS’ (numpy.ndarray) – variance of error obtained on test splits
- ‘Avg Err TR’ (numpy.ndarray) – mean of error obtained on training splits
- ‘Std Err TR’ (numpy.ndarray) – standard deviation of error obtained on training splits
- ‘Med Err TR’ (numpy.ndarray) – median of error obtained on training splits
- ‘Var Err TR’ (numpy.ndarray) – variance of error obtained on training splits
- ‘Err TS’ (numpy.ndarray) – individual errors obtained for all test splits
- ‘Err TR’ (numpy.ndarray) – individual errors obtained for all training splits
- ‘CM MCC’ (tuple: (int, int, int, int, float)) – the number of true positives, true negatives, false positives, and false negatives, and the Matthews Correlation Coefficient
- ‘DofExt’ (dict) – many result bits depending on the individual DOF, including: predictions, model, numbers of true positives, true negatives, false positives, false negatives, and the Matthews Correlation Coefficient (some of them are exposed separately)
- ‘Predictions’ (dict) – if ‘return_predictions’ was True, the dictionary:
  { DOF_name: { ‘orig_labels’ : iterable of original labels in numerical format (numpy.sign), ‘pred_labels’ : iterable of predicted labels in numerical format (numpy.sign), ‘orig_samples’ : iterable of original samples } }
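The relation between the per-split errors (‘Err TS’) and their summary statistics can be sketched as follows (the error values are made up for illustration):

```python
import numpy as np

# Made-up per-split test errors: rows are splits, columns are lambda values.
err_ts = np.array([[0.10, 0.20],
                   [0.30, 0.40],
                   [0.20, 0.30]])

avg_err_ts = err_ts.mean(axis=0)        # 'Avg Err TS'
std_err_ts = err_ts.std(axis=0)         # 'Std Err TS'
med_err_ts = np.median(err_ts, axis=0)  # 'Med Err TS'
var_err_ts = err_ts.var(axis=0)         # 'Var Err TS'
```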
This technique creates an empty ‘Selection’ Results element. This technique produces a single Job instance. This technique generates two boxplot error plots (via L1L2ErrorBoxplotMuGraph) for training and test splits.
See also
Parameters:
    kwargs : dict
Raises:
    Error
Create a single Job instance. Splits are created during job execution. The number of splits is taken from the value of the parameter ‘external_k’. If the value of the parameter ‘ext_split_sets’ is not None, the index sets specified there will be used across all splits. Otherwise, new index sets will be generated for each split with l1l2py.tools.stratified_kfold_splits().
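A stratified k-fold split can be sketched in plain Python as follows (a simplified, hypothetical stand-in for l1l2py.tools.stratified_kfold_splits(), for illustration only):

```python
import random

def stratified_kfold_indices(labels, k, seed=0):
    # Simplified stand-in for l1l2py.tools.stratified_kfold_splits():
    # returns k (train_idx, test_idx) pairs preserving class proportions.
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return [(sorted(i for f in folds[:j] + folds[j + 1:] for i in f),
             sorted(folds[j])) for j in range(k)]
```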
Parameters:
    ssname : string
    data : numpy.ndarray
    labels : numpy.ndarray
    additionalJobData : dict
Returns:
    (jID, job) : string, Job
Notes
Proper order of data and labels must be ensured for the technique to work. Typically, subsets are generated according to the sample order specified within the primary data set; labels must be in the same order. By design, KDVS does not check this.
Produce a Results instance for exactly one job result.
Parameters:
    ssname : string
    jobs : iterable of Job
    runtime_data : dict
Returns:
    final_results : Results
Raises:
    Error
Wrapper job function for l1l2py.model_selection(). This function passes all properly prepared arguments directly to the call. See l1l2py documentation for more details.
Bases: kdvs.fw.Stat.Technique
Classifier and variable selector that enforces model sparsity, based on l1l2py.model_selection(), utilizing three parameter ranges: ‘tau’, ‘lambda’, and ‘mu’. The ‘tau’ and ‘lambda’ parameters are used internally to control the selection of the best statistical model. The ‘tau’ and ‘mu’ values are not specified directly (they must be found computationally), but the user controls how they scale across the requested range. The ‘lambda’ values may be specified directly. The values of the ‘mu’ parameter are the degrees of freedom (DOFs) for this technique (the actual DOF identifiers used are configurable). It uses l1l2py v1.0.5. This technique produces two layers of training and test splits: internal ones (hidden), and external ones (managed by this technique directly). It can be configured with the following parameters:
- ‘external_k’ (int) – number of external splits to make (within the control of the technique)
- ‘internal_k’ (int) – number of internal splits to make (beyond the control of the technique, implemented in l1l2py)
- ‘tau_min_scale’ (float) – minimum scaling factor for the tau parameter range
- ‘tau_max_scale’ (float) – maximum scaling factor for the tau parameter range
- ‘tau_number’ (int) – number of values of tau parameter range
  (‘geometric’/’linear’)
- ‘mu_scaling_factor_min’ (float) – minimum scaling factor for the mu parameter range
- ‘mu_scaling_factor_max’ (float) – maximum scaling factor for the mu parameter range
- ‘mu_number’ (int) – number of values of mu parameter range
  (‘geometric’/’linear’)
- ‘lambda_min’ (float) – minimum value of lambda parameter range
- ‘lambda_max’ (float) – maximum value of lambda parameter range
- ‘lambda_number’ (int) – number of values of lambda parameter range
- ‘lambda_range_type’ (string) – type of lambda parameter range to generate (‘geometric’/’linear’)
- ‘lambda_range’ (iterable of float) – custom lambda parameter values to use (None if not used)
- ‘error_func’ (callable) – callable used as error function for external splits
- ‘cv_error_func’ (callable) – callable used as error function for internal splits
- ‘sparse’ (boolean) – if the sparsest solution shall be enforced (exclusive with ‘regularized’)
- ‘regularized’ (boolean) – if the most regularized solution shall be enforced (exclusive with ‘sparse’)
- ‘return_predictions’ (boolean) – if predictions shall be returned
- ‘data_normalizer’ (callable) – callable used to normalize data subset
- ‘labels_normalizer’ (callable) – callable used to normalize label values
- ‘ext_split_sets’ (None) – placeholder parameter for pre-computed splits (if any)
The configuration parameters are interpreted once, during initialization.
The following Results elements are produced:
- ‘Classification Error’ (float) – classification error for all DOFs (i.e. ‘mu’ values) (based on ‘Avg Err TS’)
- ‘Avg Err TS’ (numpy.ndarray) – mean of error obtained on external test splits
- ‘Std Err TS’ (numpy.ndarray) – standard deviation of error obtained on external test splits
- ‘Med Err TS’ (numpy.ndarray) – median of error obtained on external test splits
- ‘Var Err TS’ (numpy.ndarray) – variance of error obtained on external test splits
- ‘Avg Err TR’ (numpy.ndarray) – mean of error obtained on external training splits
- ‘Std Err TR’ (numpy.ndarray) – standard deviation of error obtained on external training splits
- ‘Err TS’ (numpy.ndarray) – all individual errors obtained for all external and internal test splits
- ‘Err TR’ (numpy.ndarray) – all individual errors obtained for all external and internal training splits
- ‘MuExt’ (dict) – many result bits depending on the individual DOF (i.e. ‘mu’ value) and external split, including: predictions, model, numbers of true positives, true negatives, false positives, false negatives, and the Matthews Correlation Coefficient (some of them are exposed separately)
- ‘Calls’ (dict) – all information used to prepare individual l1l2py calls (exposed for debug purposes)
- ‘Avg Kfold Err TS’ (numpy.ndarray) – mean of errors obtained for internal test splits
- ‘Avg Kfold Err TR’ (numpy.ndarray) – mean of errors obtained for internal training splits
- ‘Max Tau’ (int) – index of the ‘tau’ value in the tau parameter range that was used to produce the best solution
- ‘CM MCC’ (tuple: (int, int, int, int, float)) – the number of true positives, true negatives, false positives, and false negatives, and the Matthews Correlation Coefficient
- ‘Predictions’ (dict) – if ‘return_predictions’ was True, the dictionary:
  { DOF_name: { ‘orig_labels’ : iterable of original labels in numerical format (numpy.sign), ‘pred_labels’ : iterable of predicted labels in numerical format (numpy.sign), ‘orig_samples’ : iterable of original samples } }
This technique creates an empty ‘Selection’ Results element. This technique produces as many Job instances as the number of external splits (each external split is processed in a separate Job).
This technique generates the following plots:
- for each DOF (i.e. ‘mu’ value), a surface of error for external test splits in (tau, lambda) parameter space (via L1L2KfoldErrorsGraph)
- a single surface of average error across all external and internal test splits in (tau, lambda) parameter space (via L1L2KfoldErrorsGraph)
- a single boxplot of error obtained on external test splits for all ‘mu’ values (via L1L2ErrorBoxplotMuGraph)
- a single boxplot of error obtained on external training splits for all ‘mu’ values (via L1L2ErrorBoxplotMuGraph)
See also
l1l2py.algorithms.l1_bound
Parameters:
    kwargs : dict
Raises:
    Error
Create as many Job instances as the number of external splits (each external split is processed in a separate Job). Internal splits are created from external splits during job execution. The number of external splits is taken from the value of the parameter ‘external_k’. If the value of the parameter ‘ext_split_sets’ is not None, the index sets specified there will be used across all splits. Otherwise, new index sets will be generated for each split with l1l2py.tools.stratified_kfold_splits(). All Job instances produced here may be associated together as a ‘job group’ managed by JobGroupManager; this approach is used in the ‘experiment’ application to keep splits together.
Parameters:
    ssname : string
    data : numpy.ndarray
    labels : numpy.ndarray
    additionalJobData : dict
Returns:
    (jID, job) : string, Job
Notes
Proper order of data and labels must be ensured for the technique to work. Typically, subsets are generated according to the sample order specified within the primary data set; labels must be in the same order. By design, KDVS does not check this.
Produce a single Results instance for job results coming from all external splits performed on the same data subset. To produce the final Results instance, all job results must be completed correctly; sometimes this may not be the case (see the l1l2py documentation for more details).
Parameters:
    ssname : string
    jobs : iterable of Job
    runtime_data : dict
Returns:
    final_results : Results
Raises:
    Error
    Warn
Bases: kdvs.fw.impl.stat.Plot.MatplotlibPlot
Specialized subclass that plots the surface of selected errors in (tau, lambda) parameter space. The exact values to plot are configurable. This plotter is tailored for l1l2py-related techniques that use splits. This plotter accepts no additional configuration parameters.
See also
Configure this plotter. The following configuration options can be used:
- ‘backend’ (string) – physical matplotlib backend
- ‘driver’ (string) – physical matplotlib driver
Refer to matplotlib documentation for more details. In addition, initialize proper plotting toolkits (matplotlib.cm for color management, mpl_toolkits.mplot3d.axes3d.Axes3D for 3D plotting).
Parameters:
    kwargs : dict
Create plot environment and set all necessary parameters according to content parameters provided. The following content parameters are available:
- ‘ranges’ (tuple of numpy.ndarray) – data ranges to be used
- ‘labels’ (tuple of string) – label ranges to be plotted
- ‘ts_errors’ (tuple of numpy.ndarray) – test errors to be plotted
- ‘tr_errors’ (tuple of numpy.ndarray) – training errors to be plotted
- ‘plot_title’ (string) – title of this plot
Parameters:
    kwargs : dict
Produce actual plot and return its content.
Bases: kdvs.fw.impl.stat.Plot.MatplotlibPlot
Specialized subclass that produces boxplots of the requested data for all given parameter values (i.e. DOF values). The exact values to plot are configurable. This plotter is tailored for l1l2py-related techniques that use splits and DOFs. This plotter accepts no additional configuration parameters.
See also
Configure this plotter. The following configuration options can be used:
- ‘backend’ (string) – physical matplotlib backend
- ‘driver’ (string) – physical matplotlib driver
Refer to matplotlib documentation for more details.
Parameters:
    kwargs : dict
Create plot environment and set all necessary parameters according to content parameters provided. The following content parameters are available:
- ‘errors’ (tuple of numpy.ndarray) – data to be plotted
- ‘mu_range’ (tuple of float) – range of values of ‘mu’ parameter
- ‘plot_title’ (string) – title of this plot
Parameters:
    kwargs : dict
Produce actual plot and return its content.
Provides concrete implementations for various standard selectors. Selectors are closely tied to statistical techniques and reporters. The selectors implemented here were designed to work with techniques that use the l1l2py library, but can be adapted for other purposes with some care. NOTE: this part of the API has only been sketched; the selectors implemented here are not freely portable and many details are still hard-coded.
Bases: kdvs.fw.Stat.Selector
Base class for ‘outer selectors’. ‘Outer selection’ refers to the prior knowledge concepts that can be ‘selected’ if they pass certain criteria. For instance, if a concrete statistical technique performs classification on a data subset, the associated prior knowledge concept can be ‘selected’ by passing certain criteria related to classification error; e.g. if the individual classification error for that data subset is below the median classification error computed across all considered data subsets, it may be marked as ‘selected’. The concrete implementation accepts an iterable of Results instances, and can compute whatever passing criteria it sees fit. It must fill the Results element ‘Selection’, subdictionary ‘outer’, for reporting; later, the associated Reporter must interpret it correctly. Very often, an outer selector is closely tied to an inner selector.
Perform ‘outer selection’ based on the input Results instances. The individual prior knowledge concepts (for which Results were produced) can be marked as SELECTED or NOTSELECTED (SELECTIONERROR may be used in dubious cases). The typical way to mark ‘selection’ is to fill the standard Results element ‘Selection’, subdictionary ‘outer’, with details understood by the associated Reporter. Also, the implementation shall return an iterable of objects that contain the individual selection markings stored in the ‘Selection’->’outer’ Results element. By default, this method does nothing.
Parameters:
    indResultIter : iterable of Results
Bases: kdvs.fw.Stat.Selector
Base class for ‘inner selectors’. The ‘inner selection’ refers to the individual variables of data subset that can be ‘selected’ if they pass certain criteria. For instance, if concrete statistical technique performs variable selection (in machine learning sense) on data subset, selected variables can be simply marked by the selector as ‘selected’ (in KDVS sense). Or, if the technique uses more complicated scheme, such as counting frequencies of selected variables across all splits (as some l1l2py–related techniques do), the selector can mark certain variables as ‘selected’ depending on some frequency threshold, etc. The concrete implementation accepts the following:
- results of outer selection
- iterable of individual Results instances
- information about data subsets (to retrieve specific variables)
and can compute whatever passing criteria it sees fit. It must fill the Results element ‘Selection’, subdictionary ‘inner’, for reporting; later, the associated Reporter must interpret it correctly. Very often, an inner selector is closely tied to an outer selector.
Perform ‘inner selection’ based on the input Results instances and information already produced by some outer selection. Very often, only variables that come from ‘properly selected’ data subsets can be considered ‘properly selected’ in the KDVS sense. The additional information about subsets is used to retrieve specific variables for reporting. The individual variables can be marked as SELECTED or NOTSELECTED (SELECTIONERROR may be used in dubious cases). The typical way to mark ‘selection’ is to fill the standard Results element ‘Selection’, subdictionary ‘inner’, with details understood by the associated Reporter. Also, the implementation shall return a dictionary of objects that contain the individual selection markings stored in the ‘Selection’->’inner’ Results element (keyed by subset ID). By default, this method does nothing.
Parameters:
    outerSelectionResults : object
    ssIndResults : dict of Results
    subsetsDict : dict
Bases: kdvs.fw.impl.stat.PKCSelector.OuterSelector
Outer selector that marks PKC as ‘selected’ if associated classification error (that must be present in Results instance as ‘Classification Error’) is below (<) the configurable threshold. The error threshold is specified during initialization and interpreted once.
Parameters:
    kwargs : dict
Mark each individual Results instance as SELECTED or NOTSELECTED, depending on the classification error. The value of the classification error is compared to the error threshold; if it is smaller (<), the PKC is marked as SELECTED, and as NOTSELECTED otherwise. Note that for a technique that uses multiple non-null degrees of freedom (DOFs), each DOF can be associated with a different classification error; this is often the case when the statistical technique is regularized and certain parameter values can drastically alter the results it produces. It uses the following selection marking for a single Results instance: a dictionary
- {DOF : selection_status}
It fills the ‘Selection’->’outer’ Results element, which must be present (may be empty).
Parameters:
    indResultIter : iterable of Results
Returns:
    cres : iterable of dict
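The per-DOF marking described above can be sketched as follows (a hypothetical helper and status constants; the real selector fills the ‘Selection’->’outer’ Results element instead of returning a plain dictionary):

```python
# Hypothetical status constants for illustration.
SELECTED, NOTSELECTED = 'SELECTED', 'NOTSELECTED'

def mark_outer(errors_per_dof, error_threshold):
    # Sketch of the criterion: classification error < threshold => SELECTED.
    return {dof: SELECTED if err < error_threshold else NOTSELECTED
            for dof, err in errors_per_dof.items()}
```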
Bases: kdvs.fw.impl.stat.PKCSelector.InnerSelector
Inner selector that marks all variables coming from ‘properly selected’ data subsets (by outer selection) as ‘properly selected’. IMPORTANT NOTE: when the variable comes from ‘not selected’ data subset, it is marked as ‘not selected’ regardless of any proper variable selection scheme used; this ignores any variable selection (in machine learning sense) performed for such data subset. This selector accepts no configuration parameters. NOTE: currently, this selector produces only selection markings, and does not use selection constants.
Parameters:
    kwargs : dict
Perform ‘inner selection’ on the individual variables according to the Results instances. Inner selection is based solely on the outer selection results. If a data subset was marked during outer selection as SELECTED, all its variables are marked as ‘selected’ as well, and as ‘not selected’ otherwise. Currently, it understands outer selection results produced by OuterSelector_ClassificationErrorThreshold. It fills the ‘Selection’->’inner’ Results element, which must be present (may be empty).
Parameters:
    outerSelectionResults : object
    ssIndResults : dict of Results
    subsetsDict : dict
Returns:
    vres : dict of dict
Bases: kdvs.fw.impl.stat.PKCSelector.InnerSelector
Inner selector closely tied with techniques that use l1l2py library, that marks variables coming from ‘properly selected’ data subsets (by outer selection) as ‘properly selected’, if they appeared frequently enough among selected variables (in variable selection sense) across test splits. This selector accepts the following configuration parameters:
- ‘frequency_threshold’ (float) – the threshold on the frequency of appearance across splits
- ‘pass_variables_for_nonselected_pkcs’ (boolean) – for compatibility with KDVS v1.0; if True, the variables that come from ‘not selected’ data subsets will be counted as ‘not selected’ (as expected, introduced in KDVS v2.0); if False, such variables will be ignored and not counted anywhere (replicating the behavior of KDVS v1.0)
NOTE: currently, this selector produces only selection markings, and does not use selection constants.
Parameters:
    kwargs : dict
Perform ‘inner selection’ on the individual variables according to the Results instances. Inner selection results are based on the outer selection results, as well as on raw results from the statistical technique that uses l1l2 regularization (see the l1l2py documentation for more details). If a data subset was marked during outer selection as SELECTED, the selector scans the frequencies (stored in the ‘MuExt’ Results element) and marks individual variables if they pass the frequency threshold criterion (freq > thr). Currently, it understands outer selection results produced by OuterSelector_ClassificationErrorThreshold. It fills the ‘Selection’->’inner’ Results element, which must be present (may be empty).
Parameters:
    outerSelectionResults : object
    ssIndResults : dict of Results
    subsetsDict : dict
Returns:
    vres : dict of dict
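The frequency criterion (freq > thr) can be sketched as follows (a hypothetical helper; the real selector reads the frequencies from the ‘MuExt’ Results element):

```python
def select_by_frequency(frequencies, threshold):
    # Hypothetical sketch: a variable passes if its selection frequency
    # across test splits strictly exceeds the threshold (freq > thr).
    return {var: freq > threshold for var, freq in frequencies.items()}
```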
Provides a base plot class that uses matplotlib, if installed. It can be subclassed to create a specific (single) plot. By default, plots can be generated as PDF or PNG.
Default configuration options for PDF plotting. The physical matplotlib backend used is ‘pdf’ (see matplotlib.backends.backend_pdf). The physical driver is matplotlib.pyplot.
Default configuration options for PNG plotting. The physical matplotlib backend used is ‘agg’ (see the matplotlib documentation for a detailed explanation). The physical driver is matplotlib.pyplot.
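For illustration, the two default configurations could be expressed as dictionaries like these (the keys and constant names are assumptions based on the options described in this module, not the verbatim KDVS constants):

```python
# Assumed shape of the default plot configurations described above.
PDF_PLOT_CONFIG = {'backend': 'pdf', 'format': 'pdf'}
PNG_PLOT_CONFIG = {'backend': 'agg', 'format': 'png'}
```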
Bases: kdvs.fw.Stat.Plot
Base plot class that uses matplotlib as the plot generator. During initialization, it is verified (via verifyDepModule()) that the matplotlib library is present.
Configure matplotlib plotter. The following configuration options can be used:
- ‘backend’ (string) – physical matplotlib backend
- ‘driver’ (string) – physical matplotlib driver
See matplotlib documentation for more details.
Parameters:
    kwargs : dict
This method needs to be re-implemented to produce the actual plot. It should obtain a reference to the physical matplotlib driver (via self.driver) and set up all plot parameters as done normally with matplotlib. There is no need to physically save the plot afterwards, e.g. with matplotlib.pyplot.savefig(); it will be done later automatically. By default, this method does nothing.
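A subclass sketch might look like the following. The base class is replaced here by a minimal stub so the example stays self-contained; the attribute name self.driver follows the description above, while everything else (class names, plot content) is an assumption:

```python
class MatplotlibPlotStub:
    # Minimal stand-in for kdvs.fw.impl.stat.Plot.MatplotlibPlot, for illustration.
    def __init__(self, driver):
        self.driver = driver  # expected to expose the matplotlib.pyplot API

class HistogramPlot(MatplotlibPlotStub):
    def plot(self, **kwargs):
        # Set up the plot through the driver; no savefig() call is needed,
        # since KDVS stores the plot content automatically afterwards.
        self.driver.figure()
        self.driver.hist(kwargs['values'], bins=10)
        self.driver.title(kwargs.get('plot_title', ''))
```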
Parameters:
    kwargs : dict
Finalize the created plot and return its content (i.e. the actual physical file content) to be saved by the KDVS StorageManager in the proper location. The following configuration options can be used:
- ‘format’ (string) – physical file format that is supported by the backend
See matplotlib documentation for more details.
Parameters:
    kwargs : dict
Returns:
    content : string