Knockoff Feature Statistics API Reference
-
class knockpy.knockoff_stats.DeepPinkStatistic(model=None)
Methods
cv_score_model(features, y, cv_score[, …]): Similar to score_model, but uses cross-validated scoring if cv_score=True.
fit(X, Xk, y[, groups, feature_importance, …]): Wraps the FeatureStatistic class using DeepPINK to generate variable importances.
score_model(features, y[, y_dist]): Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
swap_feature_importances(features, y): Given a model of the features and y, calculates feature importances as follows.
swap_path_feature_importances(features, y[, …]): Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
fit(X, Xk, y, groups=None, feature_importance='deeppink', antisym='cd', group_agg='sum', cv_score=False, train_kwargs={'verbose': False}, **kwargs)
Wraps the FeatureStatistic class using DeepPINK to generate variable importances.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- feature_importance : str
Specifies how to create feature importances from model. Four options:
  - “deeppink”: Use the deeppink feature importance defined in https://arxiv.org/abs/1809.01185
  - “unweighted”: Use the Z weights from the deeppink paper without weighting them using the layers from the MLP. Deeppink usually outperforms this feature importance (but not always).
  - “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
  - “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
Defaults to “deeppink”, which is often both the most powerful and the most computationally efficient.
- antisym : str
The antisymmetric function used to create (ungrouped) feature statistics. Three options:
  - “CD” (difference of absolute values of coefficients)
  - “SM” (signed maximum)
  - “SCD” (simple difference of coefficients; NOT recommended)
- group_agg : str
For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
- cv_score : bool
If true, score the feature statistic’s predictive accuracy using cross validation. This can be expensive, since the model is refit for each fold.
- kwargs : dict
Extra kwargs to pass to the underlying DeepPINK model class.
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
-
-
class knockpy.knockoff_stats.FeatureStatistic(model=None)
The base knockoff feature statistic class, which can wrap any generic prediction algorithm.
- Parameters
- model :
An instance of a class with a “train” or “fit” method and a “predict” method.
- Attributes
- model :
A (predictive) model class underlying the variable importance measures.
- inds : np.ndarray
(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting the model.
- rev_inds : np.ndarray
Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M == M[:, inds][:, rev_inds]`.
- score : float
A metric of the model’s performance, as defined by score_type.
- score_type : str
One of MSE, CVMSE, accuracy, or cvaccuracy. (cv stands for cross-validated)
- Z : np.ndarray
a 2p-dimensional array of feature and knockoff importances. The first p coordinates correspond to features, the last p correspond to knockoffs.
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
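The relationship between inds and rev_inds can be checked with plain NumPy. The sketch below is not taken from knockpy: it assumes the permutation is drawn from a NumPy random generator and builds the inverse with np.argsort, which is one standard way to invert a permutation. The array M is only a stand-in for the concatenation [X, Xk].

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3

# Stand-in for the (n, 2p) concatenation [X, Xk].
M = rng.standard_normal((n, 2 * p))

# A random permutation of the 2p columns, playing the role of `inds`.
inds = rng.permutation(2 * p)

# The argsort of a permutation is its inverse, playing the role of `rev_inds`.
rev_inds = np.argsort(inds)

# Permuting and then un-permuting the columns recovers the original array.
M_roundtrip = M[:, inds][:, rev_inds]

# Importances computed on permuted columns can be mapped back the same way.
Z_permuted = np.arange(2 * p, dtype=float)[inds]
Z = Z_permuted[rev_inds]
```

After un-permuting, the first p entries of Z again line up with the features and the last p with the knockoffs.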
Methods
cv_score_model(features, y, cv_score[, …]): Similar to score_model, but uses cross-validated scoring if cv_score=True.
fit(X, Xk, y[, groups, feature_importance, …]): Trains the model and creates feature importances.
score_model(features, y[, y_dist]): Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
swap_feature_importances(features, y): Given a model of the features and y, calculates feature importances as follows.
swap_path_feature_importances(features, y[, …]): Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
cv_score_model(features, y, cv_score, logistic_flag=False)
Similar to score_model, but uses cross-validated scoring if cv_score=True.
-
fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='avg', **kwargs)
Trains the model and creates feature importances.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- feature_importance : str
Specifies how to create feature importances from model. Two options:
  - “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf. These are good measures of feature importance but slightly slower.
  - “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
Defaults to “swap”.
- antisym : str
The antisymmetric function used to create (ungrouped) feature statistics. Three options:
  - “CD” (difference of absolute values of coefficients)
  - “SM” (signed maximum)
  - “SCD” (simple difference of coefficients; NOT recommended)
- group_agg : str
For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
- **kwargs : dict
kwargs to pass to the ‘train’ or ‘fit’ method of the model.
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
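Since the wrapped model only needs a “train” or “fit” method and a “predict” method, a very small class is enough to satisfy the interface. The sketch below is a hypothetical example written for this page, not a class shipped with knockpy. It uses ordinary least squares through np.linalg.lstsq, and the names LeastSquaresModel, features and beta are all illustrative assumptions.

```python
import numpy as np


class LeastSquaresModel:
    """Hypothetical model with a fit/predict interface; not part of knockpy."""

    def fit(self, features, y):
        # Ordinary least squares on the array of features and knockoffs.
        self.coef_, *_ = np.linalg.lstsq(features, y, rcond=None)
        return self

    def predict(self, features):
        return features @ self.coef_


rng = np.random.default_rng(0)
features = rng.standard_normal((50, 4))  # stands in for the permuted [X, Xk]
beta = np.array([1.5, 0.0, -2.0, 0.0])
y = features @ beta                      # noise-free response for the demo

model = LeastSquaresModel().fit(features, y)
preds = model.predict(features)
```

An instance of such a class is the kind of object the model parameter describes. Any keyword arguments given to fit would be forwarded to the model's own fit method, so they must match its signature.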
-
score_model(features, y, y_dist=None)
Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
- Returns
- loss : float
Either the MSE or one minus the accuracy of the model, depending on whether y is continuous or binary.
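The two losses can be illustrated with a short standalone function. This is a sketch under stated assumptions rather than the library's code: the helper name sketch_loss is hypothetical, it takes predictions directly instead of calling self.model, and it assumes that a response with exactly two distinct values counts as binary when y_dist is not given.

```python
import numpy as np


def sketch_loss(y, preds, y_dist=None):
    """MSE for a continuous response, 1 - accuracy for a binary one."""
    y = np.asarray(y)
    preds = np.asarray(preds)
    if y_dist is None:
        # Assumption: exactly two distinct response values means binary.
        y_dist = "binomial" if np.unique(y).size == 2 else "gaussian"
    if y_dist == "binomial":
        return 1.0 - np.mean(preds == y)
    return np.mean((y - preds) ** 2)


# Continuous response: squared errors are 0, 1 and 4.
mse = sketch_loss([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])

# Binary response: three of the four predictions are correct.
err = sketch_loss([0, 1, 1, 0], [0, 1, 0, 0])
```

In both cases a smaller value means a better fit, which is what lets the two losses share one code path.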
-
swap_feature_importances(features, y)
Given a model of the features and y, calculates feature importances as follows.
For feature i, replace the feature with its knockoff and calculate the relative increase in the loss. Similarly, for knockoff i, replace the knockoff with its feature and calculate the relative increase in the loss.
- Parameters
- features : np.ndarray
(n, 2p)-shaped array of concatenated features and knockoffs, which must be permuted by self.inds.
- y : np.ndarray
(n,)-shaped response vector
- Returns
- Z_swap : np.ndarray
(2p,)-shaped array of variable importances such that Z_swap is in the same permuted order as features initially was.
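The replacement procedure described above can be sketched in a few lines. This is an illustration, not knockpy's implementation, and it makes several simplifying assumptions: the columns are in the unpermuted [X, Xk] order, the loss is mean-squared error, “relative increase” means the change divided by the baseline loss, and the model is a fixed prediction function. The names swap_importances_sketch and predict are hypothetical, and Xk is random noise standing in for real knockoffs.

```python
import numpy as np


def swap_importances_sketch(predict, features, y):
    """Relative increase in MSE when column j is replaced by its partner."""
    two_p = features.shape[1]
    p = two_p // 2
    base_loss = np.mean((y - predict(features)) ** 2)
    Z = np.zeros(two_p)
    for j in range(two_p):
        # Feature j pairs with knockoff j + p, and vice versa.
        partner = j + p if j < p else j - p
        replaced = features.copy()
        replaced[:, j] = features[:, partner]
        new_loss = np.mean((y - predict(replaced)) ** 2)
        Z[j] = (new_loss - base_loss) / base_loss
    return Z


def predict(F):
    # A fixed predictor that only uses the first feature column.
    return 2.0 * F[:, 0]


rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
Xk = rng.standard_normal((n, p))  # stand-in knockoffs, independent of X
features = np.hstack([X, Xk])
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

Z_swap = swap_importances_sketch(predict, features, y)
```

Only the first feature matters to this predictor, so replacing it with its knockoff raises the loss, while replacing any other column leaves the loss unchanged.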
-
swap_path_feature_importances(features, y, step_size=0.5, max_lambda=5)
Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
class knockpy.knockoff_stats.LassoStatistic
Lasso Statistic wrapper class
Methods
cv_score_model(features, y, cv_score[, …]): Similar to score_model, but uses cross-validated scoring if cv_score=True.
fit(X, Xk, y[, groups, zstat, use_pyglm, …]): Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.
score_model(features, y[, y_dist]): Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
swap_feature_importances(features, y): Given a model of the features and y, calculates feature importances as follows.
swap_path_feature_importances(features, y[, …]): Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
fit(X, Xk, y, groups=None, zstat='coef', use_pyglm=True, group_lasso=False, antisym='cd', group_agg='avg', cv_score=False, debias=False, Ginv=None, **kwargs)
Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- zstat : str
Two options for the variable importance measure:
  - If ‘coef’, uses the cross-validated (group) lasso coefficients.
  - If ‘lars_path’, uses the lambda value where each feature/knockoff enters the lasso path (meaning becomes nonzero).
This defaults to ‘coef’.
- use_pyglm : bool
When fitting the group lasso, use the pyglm package if True (default). Else, use the group_lasso package.
- y_dist : str
One of “binomial” or “gaussian”
- group_lasso : bool
If True, use a true group lasso. Else just use the sklearn ungrouped lasso.
- antisym : str
The antisymmetric function used to create (ungrouped) feature statistics. Three options:
  - “CD” (difference of absolute values of coefficients)
  - “SM” (signed maximum)
  - “SCD” (simple difference of coefficients; NOT recommended)
- group_agg : str
For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
- cv_score : bool
If true, score the feature statistic’s predictive accuracy using cross validation.
- debias : bool
If true, debias the lasso. See https://arxiv.org/abs/1508.02757
- Ginv : np.ndarray
(2p, 2p)-shaped precision matrix for the feature-knockoff covariate distribution. This must be specified if debias=True.
- kwargs : dict
Extra kwargs to pass to underlying Lasso classes
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
-
-
class knockpy.knockoff_stats.MargCorrStatistic
Marginal correlation statistic wrapper class
Methods
cv_score_model(features, y, cv_score[, …]): Similar to score_model, but uses cross-validated scoring if cv_score=True.
fit(X, Xk, y[, groups]): Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.
score_model(features, y[, y_dist]): Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
swap_feature_importances(features, y): Given a model of the features and y, calculates feature importances as follows.
swap_path_feature_importances(features, y[, …]): Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
fit(X, Xk, y, groups=None, **kwargs)
Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- kwargs : dict
Extra kwargs to pass to the underlying combine_Z_stats.
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
-
-
class knockpy.knockoff_stats.OLSStatistic
OLS statistic wrapper class
Methods
cv_score_model(features, y, cv_score[, …]): Similar to score_model, but uses cross-validated scoring if cv_score=True.
fit(X, Xk, y[, groups, cv_score]): Wraps the FeatureStatistic class with OLS coefs as variable importances.
score_model(features, y[, y_dist]): Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
swap_feature_importances(features, y): Given a model of the features and y, calculates feature importances as follows.
swap_path_feature_importances(features, y[, …]): Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
fit(X, Xk, y, groups=None, cv_score=False, **kwargs)
Wraps the FeatureStatistic class with OLS coefs as variable importances.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- cv_score : bool
If true, score the feature statistic’s predictive accuracy using cross validation.
- kwargs : dict
Extra kwargs to pass to combine_Z_stats.
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
-
-
class knockpy.knockoff_stats.RandomForestStatistic(model=None)
Methods
cv_score_model(features, y, cv_score[, …]): Similar to score_model, but uses cross-validated scoring if cv_score=True.
fit(X, Xk, y[, groups, feature_importance, …]): Wraps the FeatureStatistic class using a Random Forest to generate variable importances.
score_model(features, y[, y_dist]): Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
swap_feature_importances(features, y): Given a model of the features and y, calculates feature importances as follows.
swap_path_feature_importances(features, y[, …]): Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='sum', cv_score=False, **kwargs)
Wraps the FeatureStatistic class using a Random Forest to generate variable importances.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- feature_importance : str
Specifies how to create feature importances from model. Three options:
  - “sklearn”: Use sklearn feature importances. These are very poor measures of feature importance, but very fast.
  - “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf. These are good measures of feature importance but slightly slower.
  - “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
Defaults to “swap”.
- antisym : str
The antisymmetric function used to create (ungrouped) feature statistics. Three options:
  - “CD” (difference of absolute values of coefficients)
  - “SM” (signed maximum)
  - “SCD” (simple difference of coefficients; NOT recommended)
- group_agg : str
For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
- cv_score : bool
If true, score the feature statistic’s predictive accuracy using cross validation. This is extremely expensive for random forests.
- kwargs : dict
Extra kwargs to pass to underlying RandomForest class
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
-
-
class knockpy.knockoff_stats.RidgeStatistic
Ridge statistic wrapper class
Methods
cv_score_model(features, y, cv_score[, …]): Similar to score_model, but uses cross-validated scoring if cv_score=True.
fit(X, Xk, y[, groups, antisym, group_agg, …]): Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.
score_model(features, y[, y_dist]): Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.
swap_feature_importances(features, y): Given a model of the features and y, calculates feature importances as follows.
swap_path_feature_importances(features, y[, …]): Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
-
fit(X, Xk, y, groups=None, antisym='cd', group_agg='avg', cv_score=False, **kwargs)
Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- antisym : str
The antisymmetric function used to create (ungrouped) feature statistics. Three options:
  - “CD” (difference of absolute values of coefficients)
  - “SM” (signed maximum)
  - “SCD” (simple difference of coefficients; NOT recommended)
- group_agg : str
For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
- cv_score : bool
If true, score the feature statistic’s predictive accuracy using cross validation.
- kwargs : dict
Extra kwargs to pass to underlying Ridge classes
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
-
-
knockpy.knockoff_stats.calc_lars_path(X, Xk, y, groups=None, **kwargs)
Calculates locations at which X/knockoffs enter lasso model when regressed on y.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- **kwargs
kwargs for sklearn.linear_model.lars_path
- Returns
- Z : np.ndarray
(2p,)-shaped array indicating the lasso path statistic for each variable. (This means the maximum lambda such that the lasso coefficient on variable j is nonzero.)
-
knockpy.knockoff_stats.combine_Z_stats(Z, groups=None, antisym='cd', group_agg='sum')
Given Z scores (variable importances), returns (grouped) feature statistics
- Parameters
- Z : np.ndarray
(2p,)-shaped numpy array of Z-statistics. The first p values correspond to true features, and the last p correspond to knockoffs (in the same order as the true features).
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- antisym : str
The antisymmetric function used to create (ungrouped) feature statistics. Three options:
  - “CD” (difference of absolute values of coefficients)
  - “SM” (signed maximum)
  - “SCD” (simple difference of coefficients; NOT recommended)
- group_agg : str
For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.
- Returns
- W : np.ndarray
an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
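The three antisymmetric functions and the two group aggregations can be written out explicitly. The function below is a sketch that follows the descriptions on this page, not knockpy's source. The name combine_sketch is hypothetical, “signed maximum” is taken to mean the larger absolute value carrying the sign of the coefficient difference, and every group label from 1 to num_groups is assumed to occur at least once.

```python
import numpy as np


def combine_sketch(Z, groups=None, antisym="cd", group_agg="sum"):
    """Turn (2p,) importances into W statistics, optionally per group."""
    Z = np.asarray(Z, dtype=float)
    p = Z.size // 2
    feat, knock = np.abs(Z[:p]), np.abs(Z[p:])
    antisym = antisym.lower()
    if antisym == "cd":      # difference of absolute values
        W = feat - knock
    elif antisym == "sm":    # signed maximum
        W = np.sign(feat - knock) * np.maximum(feat, knock)
    elif antisym == "scd":   # simple difference, signs retained
        W = Z[:p] - Z[p:]
    else:
        raise ValueError(f"unknown antisym: {antisym}")
    if groups is None:
        return W
    # Groups are labelled 1..num_groups, so shift to 0-based bins.
    bins = np.asarray(groups) - 1
    num_groups = bins.max() + 1
    sums = np.bincount(bins, weights=W, minlength=num_groups)
    if group_agg == "sum":
        return sums
    if group_agg == "avg":
        return sums / np.bincount(bins, minlength=num_groups)
    raise ValueError(f"unknown group_agg: {group_agg}")


Z = [3.0, -1.0, 0.5, 1.0, -2.0, 0.5]  # p = 3 features, then 3 knockoffs
W_cd = combine_sketch(Z, antisym="cd")
W_sm = combine_sketch(Z, antisym="sm")
W_scd = combine_sketch(Z, antisym="scd")
W_group = combine_sketch(Z, groups=[1, 1, 2], group_agg="avg")
```

In this example the second feature has a coefficient of -1 against a knockoff coefficient of -2. “CD” and “SM” give it a negative statistic, while “SCD” gives it a positive one because it ignores magnitudes, which illustrates why “SCD” is not recommended.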
-
knockpy.knockoff_stats.data_dependent_threshhold(W, fdr=0.1, offset=1)
Calculate data-dependent threshold given W statistics.
- Parameters
- W : np.ndarray
p-length numpy array of feature statistics OR (p, batch_length) shaped array.
- fdr : float
desired level of false discovery rate control
- offset : int
If offset = 0, controls the modified FDR. If offset = 1 (default), controls the FDR exactly.
- Returns
- T : float or np.ndarray
The data-dependent threshold. Either a float or a (batch_length,) dimensional array.
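For a single vector W, the threshold can be computed by scanning candidate values. The sketch below uses the standard knockoff filter rule: take the smallest t among the nonzero values of |W| for which (offset + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) is at most fdr. It is an illustration, not the library's code. The name threshold_sketch is hypothetical, the batched (p, batch_length) case is not handled, and returning infinity when no candidate qualifies is an assumption of this sketch.

```python
import numpy as np


def threshold_sketch(W, fdr=0.1, offset=1):
    """Smallest t whose estimated false discovery proportion is <= fdr."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: distinct nonzero magnitudes, in ascending order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        num_neg = np.sum(W <= -t)  # proxy for the number of false discoveries
        num_pos = np.sum(W >= t)   # number of features that would be selected
        if (offset + num_neg) / max(1, num_pos) <= fdr:
            return float(t)
    # Assumption: no threshold qualifies, so nothing is selected.
    return np.inf


W = np.array([-1.0, -1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
T = threshold_sketch(W, fdr=0.25)
selected = np.flatnonzero(W >= T)

# With only a few weak statistics, no threshold meets the target level.
T_none = threshold_sketch([1.0, -1.0, 2.0], fdr=0.1)
```

Features with W at or above T are the ones selected. With offset=1, at least 1/fdr positive statistics are needed before any threshold can qualify, which is why the second call selects nothing.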
-
knockpy.knockoff_stats.fit_group_lasso(X, Xk, y, groups, use_pyglm=True, y_dist=None, group_lasso=True, **kwargs)
Fits cross-validated (group) lasso on [X, Xk] and y.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- groups : np.ndarray
For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- use_pyglm : bool
When fitting the group lasso, use the pyglm package if True (default). Else, use the group_lasso package.
- y_dist : str
One of “binomial” or “gaussian”
- group_lasso : bool
If True, use a true group lasso. Else just use the sklearn ungrouped lasso.
- **kwargs
kwargs for eventual (group) lasso model.
- Returns
- gl : pyglm/sklearn/group_lasso model
The model fit through cross-validation; one of many types.
- inds : np.ndarray
(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.
- rev_inds : np.ndarray
Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M == M[:, inds][:, rev_inds]`.
-
knockpy.knockoff_stats.fit_lasso(X, Xk, y, y_dist=None, use_lars=False, **kwargs)
Fits cross-validated lasso on [X, Xk] and y.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- y_dist : str
One of “binomial” or “gaussian”
- use_lars : bool
If True, uses a LARS-based solver for Gaussian data. If False, uses a gradient based solver (default).
- **kwargs
kwargs for sklearn model.
- Returns
- gl : sklearn.linear_model.LassoCV/LassoLarsCV/LogisticRegressionCV
The sklearn model fit through cross-validation.
- inds : np.ndarray
(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.
- rev_inds : np.ndarray
Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M == M[:, inds][:, rev_inds]`.
-
knockpy.knockoff_stats.fit_ridge(X, Xk, y, y_dist=None, **kwargs)
Fits cross-validated ridge on [X, Xk] and y.
- Parameters
- X : np.ndarray
the (n, p)-shaped design matrix
- Xk : np.ndarray
the (n, p)-shaped matrix of knockoffs
- y : np.ndarray
(n,)-shaped response vector
- y_dist : str
One of “binomial” or “gaussian”
- **kwargs
kwargs for sklearn model.
- Returns
- gl : sklearn.linear_model.RidgeCV/LogisticRegressionCV
The sklearn model fit through cross-validation.
- inds : np.ndarray
(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.
- rev_inds : np.ndarray
Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M == M[:, inds][:, rev_inds]`.