Knockoff Feature Statistics API Reference

class knockpy.knockoff_stats.DeepPinkStatistic(model=None)[source]

Methods

cv_score_model(features, y, cv_score[, …])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, feature_importance, …])

Wraps the FeatureStatistic class using DeepPINK to generate variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, …])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, feature_importance='deeppink', antisym='cd', group_agg='sum', cv_score=False, train_kwargs={'verbose': False}, **kwargs)[source]

Wraps the FeatureStatistic class using DeepPINK to generate variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importance : str

Specifies how to create feature importances from model. Four options:

  • “deeppink”: Use the deeppink feature importance defined in https://arxiv.org/abs/1809.01185
  • “unweighted”: Use the Z weights from the deeppink paper without weighting them using the layers from the MLP. Deeppink usually outperforms this feature importance (but not always).
  • “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf
  • “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to deeppink, which is often both the most powerful and the most computationally efficient.

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “CD” (Difference of absolute vals of coefficients)
  • “SM” (signed maximum)
  • “SCD” (Simple difference of coefficients - NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation. This can be very expensive, since it requires refitting the model.

kwargs : dict

Extra kwargs to pass to the underlying DeepPINK model.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.FeatureStatistic(model=None)[source]

The base knockoff feature statistic class, which can wrap any generic prediction algorithm.

Parameters
model :

An instance of a class with a “train” or “fit” method and a “predict” method.
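As a rough illustration of the interface described above, the sketch below defines a small model class exposing `fit` and `predict`. The class name `TinyRidge` and its `alpha` parameter are hypothetical and not part of knockpy; the point is only to show the shape of an object that could plausibly be passed as `model`.

```python
import numpy as np


class TinyRidge:
    """Hypothetical closed-form ridge regressor with a fit/predict interface."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.coef_ = None

    def fit(self, features, y):
        # Solve (F^T F + alpha * I) beta = F^T y for beta.
        d = features.shape[1]
        gram = features.T @ features + self.alpha * np.eye(d)
        self.coef_ = np.linalg.solve(gram, features.T @ y)
        return self

    def predict(self, features):
        return features @ self.coef_


rng = np.random.default_rng(0)
features = rng.standard_normal((50, 4))
y = features[:, 0] + 0.1 * rng.standard_normal(50)

model = TinyRidge(alpha=0.5).fit(features, y)
preds = model.predict(features)
# Assumed usage with knockpy (not run here): FeatureStatistic(model=TinyRidge())
```

The commented last line is an assumption about how such an object would be handed to the class; it is not executed in this sketch.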

Attributes
model :

A (predictive) model class underlying the variable importance measures.

inds : np.ndarray

(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting the model.

rev_inds : np.ndarray

Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

score : float

A metric of the model’s performance, as defined by score_type.

score_type : str

One of MSE, CVMSE, accuracy, or cvaccuracy. (cv stands for cross-validated)

Z : np.ndarray

a 2p-dimensional array of feature and knockoff importances. The first p coordinates correspond to features, the last p correspond to knockoffs.

W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

Methods

cv_score_model(features, y, cv_score[, …])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, feature_importance, …])

Trains the model and creates feature importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, …])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

cv_score_model(features, y, cv_score, logistic_flag=False)[source]

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='avg', **kwargs)[source]

Trains the model and creates feature importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importance : str

Specifies how to create feature importances from model. Two options:

  • “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf. These are good measures of feature importance but slightly slower.
  • “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to ‘swap’

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “CD” (Difference of absolute vals of coefficients)
  • “SM” (signed maximum)
  • “SCD” (Simple difference of coefficients - NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

**kwargs : dict

kwargs to pass to the ‘train’ or ‘fit’ method of the model.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

score_model(features, y, y_dist=None)[source]

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

Returns
loss : float

Either the MSE or one minus the accuracy of the model, depending on whether y is continuous or binary.
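The scoring rule above can be sketched in a few lines. This is an illustrative stand-in, not the library's code: it takes predictions directly rather than calling `self.model`, and it assumes that "binary" means the response takes exactly two distinct values. The names `is_binary` and `score_sketch` are hypothetical.

```python
import numpy as np


def is_binary(y):
    """Assumption: y counts as binary when it has exactly two distinct values."""
    return np.unique(y).shape[0] == 2


def score_sketch(preds, y):
    """MSE for a nonbinary response, one minus accuracy for a binary one."""
    preds = np.asarray(preds)
    y = np.asarray(y)
    if is_binary(y):
        return float(1.0 - np.mean(preds == y))
    return float(np.mean((preds - y) ** 2))


# Continuous response: squared errors are 0, 0 and 4.
loss_cont = score_sketch([1.0, 2.0, 5.0], [1.0, 2.0, 3.0])
# Binary response: one mistake out of four predictions.
loss_bin = score_sketch([0, 1, 1, 1], [0, 1, 0, 1])
```

In both cases a lower value indicates a better fit, which is why the binary case reports one minus the accuracy rather than the accuracy itself.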

swap_feature_importances(features, y)[source]

Given a model of the features and y, calculates feature importances as follows.

For feature i, replace the feature with its knockoff and calculate the relative increase in the loss. Similarly, for knockoff i, replace the knockoff with its feature and calculate the relative increase in the loss.

Parameters
features : np.ndarray

(n, 2p)-shaped array of concatenated features and knockoffs, which must be permuted by self.inds.

y : np.ndarray

(n,)-shaped response vector

Returns
Z_swap : np.ndarray

(2p,)-shaped array of variable importances such that Z_swap is in the same permuted order as features initially were.
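The replacement procedure described above can be sketched as follows. This is a simplified illustration under several assumptions: the columns are kept in plain [X, Xk] order rather than permuted by `self.inds`, the loss is MSE, "relative increase" is taken to mean (new loss - base loss) / base loss, and `LeastSquares`, `mse` and `swap_importances_sketch` are hypothetical names rather than knockpy functions.

```python
import numpy as np


class LeastSquares:
    """Hypothetical stand-in model with fit/predict, used only for this sketch."""

    def fit(self, features, y):
        self.coef_ = np.linalg.lstsq(features, y, rcond=None)[0]
        return self

    def predict(self, features):
        return features @ self.coef_


def mse(model, features, y):
    return float(np.mean((model.predict(features) - y) ** 2))


def swap_importances_sketch(model, features, y):
    """features is the (n, 2p) array [X, Xk], here in unpermuted order."""
    p = features.shape[1] // 2
    base_loss = mse(model, features, y)
    Z = np.zeros(2 * p)
    for j in range(p):
        # Replace feature j by its knockoff and record the relative loss increase.
        replaced = features.copy()
        replaced[:, j] = features[:, j + p]
        Z[j] = (mse(model, replaced, y) - base_loss) / base_loss
        # Replace knockoff j by its feature and do the same.
        replaced = features.copy()
        replaced[:, j + p] = features[:, j]
        Z[j + p] = (mse(model, replaced, y) - base_loss) / base_loss
    return Z


rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
Xk = rng.standard_normal((n, p))  # independent noise standing in for real knockoffs
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

features = np.concatenate([X, Xk], axis=1)
model = LeastSquares().fit(features, y)
Z_swap = swap_importances_sketch(model, features, y)
```

Only the first feature drives the response in this toy data, so replacing it by its knockoff should hurt the fit far more than any other replacement.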

swap_path_feature_importances(features, y, step_size=0.5, max_lambda=5)[source]

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

class knockpy.knockoff_stats.LassoStatistic[source]

Lasso Statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, …])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, zstat, use_pyglm, …])

Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, …])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, zstat='coef', use_pyglm=True, group_lasso=False, antisym='cd', group_agg='avg', cv_score=False, debias=False, Ginv=None, **kwargs)[source]

Wraps the FeatureStatistic class but uses cross-validated Lasso coefficients or Lasso path statistics as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

zstat : str

Two options for the variable importance measure:

  • If ‘coef’, uses the cross-validated (group) lasso coefficients.
  • If ‘lars_path’, uses the lambda value where each feature/knockoff enters the lasso path (meaning becomes nonzero).

This defaults to coef.

use_pyglm : bool

When fitting the group lasso, use the pyglm package if True (default). Else, use the group_lasso package.

y_dist : str

One of “binomial” or “gaussian”

group_lasso : bool

If True, use a true group lasso. Else just use the sklearn ungrouped lasso.

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “CD” (Difference of absolute vals of coefficients)
  • “SM” (signed maximum)
  • “SCD” (Simple difference of coefficients - NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation.

debias : bool

If true, debias the lasso. See https://arxiv.org/abs/1508.02757

Ginv : np.ndarray

(2p, 2p)-shaped precision matrix for the feature-knockoff covariate distribution. This must be specified if debias=True.

kwargs : dict

Extra kwargs to pass to underlying Lasso classes

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.MargCorrStatistic[source]

Marginal correlation statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, …])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups])

Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, …])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, **kwargs)[source]

Wraps the FeatureStatistic class using marginal correlations between X, Xk and y as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

kwargs : dict

Extra kwargs to pass to underlying combine_Z_stats

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
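A minimal sketch of the idea, for the ungrouped case: compute the correlation of each column of [X, Xk] with y, then combine feature and knockoff importances with the “CD” antisymmetric function. Whether the library uses Pearson correlations or unnormalized inner products is not stated here, so the Pearson form below is an assumption, and `marginal_corr_sketch` is a hypothetical name.

```python
import numpy as np


def marginal_corr_sketch(X, Xk, y):
    """Return (Z, W): 2p marginal correlations and p CD feature statistics."""
    features = np.concatenate([X, Xk], axis=1)
    centered = features - features.mean(axis=0)
    y_centered = y - y.mean()
    # Pearson correlation of each column with y (assumed definition).
    Z = (centered.T @ y_centered) / (
        np.linalg.norm(centered, axis=0) * np.linalg.norm(y_centered)
    )
    p = X.shape[1]
    # "CD": difference of absolute values, features minus knockoffs.
    W = np.abs(Z[:p]) - np.abs(Z[p:])
    return Z, W


rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.standard_normal((n, p))
Xk = rng.standard_normal((n, p))  # independent noise standing in for real knockoffs
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

Z, W = marginal_corr_sketch(X, Xk, y)
```

A large positive W[j] indicates that feature j is more strongly correlated with y than its knockoff is.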

class knockpy.knockoff_stats.OLSStatistic[source]

OLS statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, …])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, cv_score])

Wraps the FeatureStatistic class with OLS coefs as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, …])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class with OLS coefs as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation.

kwargs : dict

Extra kwargs to pass to combine_Z_stats.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.RandomForestStatistic(model=None)[source]

Methods

cv_score_model(features, y, cv_score[, …])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, feature_importance, …])

Wraps the FeatureStatistic class using a Random Forest to generate variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, …])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, feature_importance='swap', antisym='cd', group_agg='sum', cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class using a Random Forest to generate variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

feature_importance : str

Specifies how to create feature importances from model. Three options:

  • “sklearn”: Use sklearn feature importances. These are very poor measures of feature importance, but very fast.
  • “swap”: The default swap-statistic from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf. These are good measures of feature importance but slightly slower.
  • “swapint”: The swap-integral defined from http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

Defaults to ‘swap’

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “CD” (Difference of absolute vals of coefficients)
  • “SM” (signed maximum)
  • “SCD” (Simple difference of coefficients - NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation. This is extremely expensive for random forests.

kwargs : dict

Extra kwargs to pass to underlying RandomForest class

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

class knockpy.knockoff_stats.RidgeStatistic[source]

Ridge statistic wrapper class

Methods

cv_score_model(features, y, cv_score[, …])

Similar to score_model, but uses cross-validated scoring if cv_score=True.

fit(X, Xk, y[, groups, antisym, group_agg, …])

Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.

score_model(features, y[, y_dist])

Computes mean-squared error of self.model on (features, y) when y is nonbinary, and computes 1 - accuracy otherwise.

swap_feature_importances(features, y)

Given a model of the features and y, calculates feature importances as follows.

swap_path_feature_importances(features, y[, …])

Similar to swap_feature_importances; see http://proceedings.mlr.press/v89/gimenez19a/gimenez19a.pdf

fit(X, Xk, y, groups=None, antisym='cd', group_agg='avg', cv_score=False, **kwargs)[source]

Wraps the FeatureStatistic class but uses cross-validated Ridge coefficients as variable importances.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “CD” (Difference of absolute vals of coefficients)
  • “SM” (signed maximum)
  • “SCD” (Simple difference of coefficients - NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

cv_score : bool

If true, score the feature statistic’s predictive accuracy using cross validation.

kwargs : dict

Extra kwargs to pass to the underlying ridge classes

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.

knockpy.knockoff_stats.calc_lars_path(X, Xk, y, groups=None, **kwargs)[source]

Calculates locations at which X/knockoffs enter lasso model when regressed on y.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

**kwargs

kwargs for sklearn.linear_model.lars_path

Returns
Z : np.ndarray

(2p,)-shaped array indicating the lasso path statistic for each variable. (This means the maximum lambda such that the lasso coefficient on variable j is nonzero.)
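To make the path statistic concrete, the sketch below computes it from an already-computed coefficient path. The array layout is assumed to follow the (n_features, n_alphas) coefficient array returned by sklearn.linear_model.lars_path; the path values themselves are made up for illustration, and `entry_lambdas` is a hypothetical helper, not a knockpy function.

```python
import numpy as np


def entry_lambdas(alphas, coefs):
    """Largest penalty value at which each variable has a nonzero coefficient.

    alphas: (n_alphas,) nonnegative penalty values along the path.
    coefs:  (2p, n_alphas) coefficients, one row per feature or knockoff.
    Variables that never enter the path get a statistic of 0.
    """
    nonzero = coefs != 0
    # Keep alphas where the coefficient is nonzero, zero elsewhere, then take the max.
    return np.max(np.where(nonzero, alphas[np.newaxis, :], 0.0), axis=1)


# Made-up path with four variables and four penalty values.
alphas = np.array([1.0, 0.5, 0.2, 0.1])
coefs = np.array([
    [0.0, 0.3, 0.5, 0.6],   # enters at alpha = 0.5
    [0.0, 0.0, 0.1, 0.2],   # enters at alpha = 0.2
    [0.0, 0.0, 0.0, 0.0],   # never enters
    [0.0, 0.0, 0.0, -0.4],  # enters at alpha = 0.1
])
Z = entry_lambdas(alphas, coefs)
```

Variables that enter the path earlier, at a larger penalty, receive a larger statistic.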

knockpy.knockoff_stats.calc_mse(model, X, y)[source]

Gets MSE of a model

knockpy.knockoff_stats.combine_Z_stats(Z, groups=None, antisym='cd', group_agg='sum')[source]

Given Z scores (variable importances), returns (grouped) feature statistics

Parameters
Z : np.ndarray

(2p,)-shaped numpy array of Z-statistics. The first p values correspond to true features, and the last p correspond to knockoffs (in the same order as the true features).

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

antisym : str

The antisymmetric function used to create (ungrouped) feature statistics. Three options:

  • “CD” (Difference of absolute vals of coefficients)
  • “SM” (signed maximum)
  • “SCD” (Simple difference of coefficients - NOT recommended)

group_agg : str

For group knockoffs, specifies how to turn individual feature statistics into grouped feature statistics. Two options: “sum” and “avg”.

Returns
W : np.ndarray

an array of feature statistics. This is (p,)-dimensional for regular knockoffs and (num_groups,)-dimensional for group knockoffs.
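The three antisymmetric functions and the two aggregation modes can be sketched as below. This is an illustration, not the library's implementation: in particular, it assumes that grouping aggregates the absolute importances within each group before the antisymmetric function is applied. The names `antisym_sketch` and `combine_sketch` are hypothetical.

```python
import numpy as np


def antisym_sketch(Z, antisym="cd"):
    """Combine 2p importances [features, knockoffs] into p statistics."""
    p = Z.shape[0] // 2
    feat, knock = Z[:p], Z[p:]
    antisym = antisym.lower()
    if antisym == "cd":   # difference of absolute values
        return np.abs(feat) - np.abs(knock)
    if antisym == "sm":   # signed maximum
        sign = np.sign(np.abs(feat) - np.abs(knock))
        return sign * np.maximum(np.abs(feat), np.abs(knock))
    if antisym == "scd":  # simple difference
        return feat - knock
    raise ValueError(f"unknown antisym: {antisym}")


def combine_sketch(Z, groups=None, antisym="cd", group_agg="sum"):
    p = Z.shape[0] // 2
    if groups is None:
        return antisym_sketch(Z, antisym)
    groups = np.asarray(groups)
    num_groups = int(groups.max())
    aggregate = np.sum if group_agg == "sum" else np.mean
    # Assumption: aggregate absolute importances per group, then combine.
    agg = np.zeros(2 * num_groups)
    for g in range(1, num_groups + 1):  # groups are labeled 1..num_groups
        members = groups == g
        agg[g - 1] = aggregate(np.abs(Z[:p][members]))
        agg[num_groups + g - 1] = aggregate(np.abs(Z[p:][members]))
    return antisym_sketch(agg, antisym)


Z = np.array([3.0, -1.0, 0.5, 2.0, -1.0, 2.0, 1.0, -4.0])
W_cd = combine_sketch(Z, antisym="cd")
W_sm = combine_sketch(Z, antisym="sm")
W_scd = combine_sketch(Z, antisym="scd")
W_sum = combine_sketch(Z, groups=[1, 1, 2, 2], group_agg="sum")
W_avg = combine_sketch(Z, groups=[1, 1, 2, 2], group_agg="avg")
```

With “CD” and “SM”, swapping a feature with its knockoff flips the sign of the statistic. “SCD” only has this property when the importances are not sign-sensitive, which may be one reason it is not recommended.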

knockpy.knockoff_stats.data_dependent_threshhold(W, fdr=0.1, offset=1)[source]

Calculate data-dependent threshold given W statistics.

Parameters
W : np.ndarray

p-length numpy array of feature statistics OR (p, batch_length) shaped array.

fdr : float

desired level of false discovery rate control

offset : int

If offset = 0, controls the modified FDR. If offset = 1 (default), controls the FDR exactly.

Returns
T : float or np.ndarray

The data-dependent threshold. Either a float or a (batch_length,) dimensional array.
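For a single vector of statistics, the usual knockoff filter rule picks the smallest t > 0 among the values |W_j| such that (offset + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= fdr. The sketch below implements that rule for the 1-D case only; it does not handle the batched input, and returning infinity when no threshold qualifies is an assumption about the convention. `threshold_sketch` is a hypothetical name.

```python
import numpy as np


def threshold_sketch(W, fdr=0.1, offset=1):
    """Smallest candidate t whose estimated false discovery proportion is <= fdr."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: the distinct nonzero |W_j|, in increasing order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        num_neg = np.sum(W <= -t)
        num_pos = max(1, np.sum(W >= t))
        if (offset + num_neg) / num_pos <= fdr:
            return float(t)
    return np.inf  # assumed convention: no threshold achieves the target level


W = np.array([5, 4, 3, -1, 2, 6, -0.5, 7, 8, 9, 10, 1.5])
T = threshold_sketch(W, fdr=0.25, offset=1)
selected = np.flatnonzero(W >= T)  # indices of selected features
```

Features are selected when their statistic is at least T, so an infinite threshold selects nothing.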

knockpy.knockoff_stats.fit_group_lasso(X, Xk, y, groups, use_pyglm=True, y_dist=None, group_lasso=True, **kwargs)[source]

Fits a cross-validated (group) lasso on [X, Xk] and y.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

use_pyglm : bool

When fitting the group lasso, use the pyglm package if True (default). Else, use the group_lasso package.

y_dist : str

One of “binomial” or “gaussian”

group_lasso : bool

If True, use a true group lasso. Else just use the sklearn ungrouped lasso.

**kwargs

kwargs for eventual (group) lasso model.

Returns
gl : pyglm/sklearn/group_lasso model

The model fit through cross-validation; one of many types.

inds : np.ndarray

(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.

rev_inds : np.ndarray

Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

knockpy.knockoff_stats.fit_lasso(X, Xk, y, y_dist=None, use_lars=False, **kwargs)[source]

Fits cross-validated lasso on [X, Xk] and y.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix.

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

y_dist : str

One of “binomial” or “gaussian”

use_lars : bool

If True, uses a LARS-based solver for Gaussian data. If False, uses a gradient based solver (default).

**kwargs

kwargs for sklearn model.

Returns
gl : sklearn.linear_model.LassoCV/LassoLarsCV/LogisticRegressionCV

The sklearn model fit through cross-validation.

inds : np.ndarray

(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.

rev_inds : np.ndarray

Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`
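The relationship between inds and rev_inds can be sketched with plain NumPy. The example permutes the columns of [X, Xk], fits a model on the permuted columns, and maps the coefficients back to the original order. Ordinary least squares is used here purely as a stand-in for the cross-validated lasso, and computing rev_inds with `np.argsort` is an assumption about one way to obtain the inverse permutation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.standard_normal((n, p))
Xk = rng.standard_normal((n, p))  # noise standing in for real knockoffs
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(n)

features = np.concatenate([X, Xk], axis=1)  # (n, 2p)
inds = rng.permutation(2 * p)               # random column order
rev_inds = np.argsort(inds)                 # inverse permutation

# Identity from the text: permuting and then un-permuting recovers the columns.
roundtrip_ok = np.array_equal(features, features[:, inds][:, rev_inds])

# Stand-in for the cross-validated lasso: least squares on the permuted columns.
coef_permuted = np.linalg.lstsq(features[:, inds], y, rcond=None)[0]
coef = coef_permuted[rev_inds]              # back to [X, Xk] order
Z_features, Z_knockoffs = coef[:p], coef[p:]
```

After indexing with rev_inds, the first p coefficients belong to the features and the last p to the knockoffs, matching the layout of Z described for FeatureStatistic.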

knockpy.knockoff_stats.fit_ridge(X, Xk, y, y_dist=None, **kwargs)[source]

Fits cross-validated ridge on [X, Xk] and y.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

y : np.ndarray

(n,)-shaped response vector

y_dist : str

One of “binomial” or “gaussian”

**kwargs

kwargs for sklearn model.

Returns
gl : sklearn.linear_model.RidgeCV/LogisticRegressionCV

The sklearn model fit through cross-validation.

inds : np.ndarray

(2p,)-dimensional array of indices representing the random permutation applied to the concatenation of [X, Xk] before fitting gl.

rev_inds : np.ndarray

Indices which reverse the effect of inds. In particular, if M is any (n, 2p)-dimensional array, then `M==M[:, inds][:, rev_inds]`

knockpy.knockoff_stats.parse_logistic_flag(kwargs)[source]

Checks whether y_dist is binomial

knockpy.knockoff_stats.parse_y_dist(y)[source]

Checks if y is binary; else assumes it is continuous

knockpy.knockoff_stats.use_reg_lasso(groups)[source]

Parses whether or not to use group lasso