hidimstat.BlockBasedImportance

class hidimstat.BlockBasedImportance(estimator='DNN', importance_estimator='sampling_RF', coffeine_transformer=None, do_hypertuning=True, dict_hypertuning=None, problem_type='regression', encoding_input=True, sampling_with_repetition=True, split_percentage=0.8, conditional=True, variables_categories=None, residuals_sampling=False, n_permutations=50, n_jobs=1, verbose=0, groups=None, group_stacking=False, sub_groups=None, k_fold=2, prop_out_subLayers=0, iteration_index=None, random_state=2023, do_compute_importance=True, group_fold=None)

This class implements Block-Based Importance (BBI), a framework for variable importance computation with statistical guarantees. It consists of two blocks of estimators: a learner block (predicting on the data) and an importance block (resampling the variable/group of interest to assess its impact on the loss). For the single-level approach see Chamma et al. [1]; for the group-level approach see Chamma et al. [2].
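A minimal usage sketch for the regression case, assuming the data is provided as a pandas DataFrame and that the dictionary returned by compute_importance exposes per-variable importance scores and p-values (key names may differ across versions):

import numpy as np
import pandas as pd

from hidimstat import BlockBasedImportance

# Toy regression data: only the first two variables carry signal.
rng = np.random.RandomState(2023)
X = pd.DataFrame(rng.randn(200, 5), columns=[f"x{i}" for i in range(5)])
y = X["x0"].values + 0.5 * X["x1"].values + rng.randn(200)

bbi = BlockBasedImportance(
    estimator="RF",                      # Random Forest learner block
    importance_estimator="sampling_RF",  # sampling Random Forest importance block
    problem_type="regression",
    conditional=True,                    # conditional sampling rather than plain permutation
    n_permutations=50,
    k_fold=2,
    random_state=2023,
)

bbi.fit(X, y)
results = bbi.compute_importance(X, y)
print(results)  # importance scores and statistical guarantees per variable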

Parameters:
estimator : str or sklearn.base.BaseEstimator, default="DNN"

The provided estimator for the learner block. The default estimator is a custom Multi-Layer Perceptron (MLP) learner.

  • String options include:
    • “DNN” for the Multi-Layer Perceptron

    • “RF” for the Random Forest

  • Other options include:
    • any scikit-learn compatible estimator (sklearn.base.BaseEstimator); see the sketch after this list
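As an illustrative sketch of the non-string option, a scikit-learn estimator instance can be passed directly (hyperparameter tuning is disabled here since no tuning dictionary is supplied):

from sklearn.ensemble import GradientBoostingRegressor

from hidimstat import BlockBasedImportance

# Use a scikit-learn estimator instance instead of the "DNN"/"RF" shortcuts.
bbi = BlockBasedImportance(
    estimator=GradientBoostingRegressor(random_state=2023),
    do_hypertuning=False,  # no dict_hypertuning provided, so skip tuning
    problem_type="regression",
)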

importance_estimator : str or sklearn.base.BaseEstimator, default="sampling_RF"

The provided estimator for the importance block. The default is the sampling Random Forest, where the sampling for each instance is performed within its neighbors in the corresponding leaf.

  • String options include:
    • “sampling_RF” for the sampling Random Forest

    • “residuals_RF” for the Random Forest along with the residuals path for importance computation

  • Other options include:
    • sklearn.base.BaseEstimator

coffeine_transformer : tuple, default=None

Apply coffeine's pipeline for filterbank models on electrophysiological data. The tuple consists of (coffeine pipeline, new number of variables) or (coffeine pipeline, new number of variables, list of variables to keep after variable selection).

do_hypertuning : bool, default=True

Whether to tune the hyperparameters of the provided estimator.

dict_hypertuning : dict, default=None

The dictionary of hyperparameters to tune, depending on the provided inference estimator.

problem_type : str, default='regression'

Whether the task is a classification or a regression problem.

encoding_input : bool, default=True

Whether to one-hot or ordinal encode the nominal and ordinal input variables.

sampling_with_repetition : bool, default=True

Whether to sample, with repetition, the training part of the train/validation scheme within the training set. The number of samples drawn equals the number of instances in the training set.

split_percentage : float, default=0.8

The train/validation split proportion for the provided data.

conditional : bool, default=True

Whether to use the conditional sampling approach (True) or the permutation approach (False).

variables_categories : dict, default=None

The dictionary of binary, nominal and ordinal variables.

residuals_sampling : bool, default=False

Whether to use random sampling (True) or permutations (False) of the residuals under the conditional sampling approach.

n_permutations : int, default=50

The number of permutations/random samplings for each column.

n_jobs : int, default=1

The number of workers for parallel processing.

verbose : int, default=0

If verbose > 0, the fitted iterations will be printed.

groups : dict, default=None

The knowledge-driven/data-driven grouping of the variables if provided.

group_stacking : bool, default=False

Whether to apply the stacking-based method to the provided groups.
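A group-level sketch, assuming groups maps group names to lists of column names of the input DataFrame (the accepted formats should be checked against the installed version):

from hidimstat import BlockBasedImportance

# Illustrative, knowledge-driven grouping of the input columns.
groups = {
    "demographics": ["age", "sex"],
    "biomarkers": ["marker_1", "marker_2", "marker_3"],
}

bbi = BlockBasedImportance(
    estimator="DNN",      # the MLP learner supports the stacking sub-layers
    groups=groups,
    group_stacking=True,  # one linear sub-layer per group before the learner block
    problem_type="regression",
)
# X is assumed to be a pandas DataFrame containing the columns listed above;
# compute_importance then returns scores and p-values per group.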

sub_groups : dict, default=None

The list of indices of the provided variables to condition on per variable/group of interest (defaults to all the remaining variables).

k_fold : int, default=2

The number of folds for k-fold cross fitting.

prop_out_subLayers : int, default=0

If group_stacking is True, the proportion of outputs for the linear sub-layers per group.

iteration_index : int, default=None

The index of the currently processed iteration.

random_state : int, default=2023

Fixes the seed of the random number generator.

do_compute_importance : bool, default=True

Whether to compute the importance scores.

group_fold : list, default=None

The list of group labels used to perform GroupKFold, keeping subjects within the same training or test set.
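When samples are repeated measurements of the same subjects, group_fold can carry one label per sample so that cross-fitting never splits a subject across training and test sets (a sketch, with hypothetical subject identifiers):

from hidimstat import BlockBasedImportance

# One label per row of X, e.g. the subject each sample belongs to.
subject_ids = ["s01", "s01", "s02", "s02", "s03", "s03"]

bbi = BlockBasedImportance(
    estimator="RF",
    k_fold=2,
    group_fold=subject_ids,  # GroupKFold keeps each subject in a single fold
)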

References

[1] Chamma, A., Engemann, D. A., & Thirion, B. (2023). Statistically Valid Variable Importance Assessment through Conditional Permutations. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

[2] Chamma, A., Thirion, B., & Engemann, D. (2024). Variable importance in high-dimensional settings requires grouping. Proceedings of the AAAI Conference on Artificial Intelligence, 38.

__init__(estimator='DNN', importance_estimator='sampling_RF', coffeine_transformer=None, do_hypertuning=True, dict_hypertuning=None, problem_type='regression', encoding_input=True, sampling_with_repetition=True, split_percentage=0.8, conditional=True, variables_categories=None, residuals_sampling=False, n_permutations=50, n_jobs=1, verbose=0, groups=None, group_stacking=False, sub_groups=None, k_fold=2, prop_out_subLayers=0, iteration_index=None, random_state=2023, do_compute_importance=True, group_fold=None)

Methods

__init__([estimator, importance_estimator, ...])

compute_importance([X, y])

This function computes the importance scores and the statistical guarantees per variable/group of interest.

fit(X[, y])

Build the provided estimator with the training set (X, y).

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict([X])

This function predicts the regression target for the input samples X.

predict_proba([X])

This function predicts the class probabilities for the input samples X.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.
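A classification-flavoured sketch tying the main methods together; the exact content of the dictionary returned by compute_importance is an assumption to verify against the installed version:

import numpy as np
import pandas as pd

from hidimstat import BlockBasedImportance

# Toy binary classification data driven by the first column.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(150, 5), columns=[f"x{i}" for i in range(5)])
y = (X["x0"].values + 0.1 * rng.randn(150) > 0).astype(int)

bbi = BlockBasedImportance(
    estimator="RF",
    problem_type="classification",
    k_fold=2,
    random_state=0,
)

bbi.fit(X, y)
proba = bbi.predict_proba(X)            # class probabilities for the input samples
results = bbi.compute_importance(X, y)  # importance scores and p-values per variable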