Knockoff Sampler API Reference

class knockpy.knockoffs.FXSampler(X, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)[source]

Samples fixed-X (FX) knockoffs. See the GaussianSampler documentation for a description of the arguments.
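A minimal usage sketch (all values illustrative, not taken from the knockpy documentation). Fixed-X knockoffs require a tall design matrix, so n is taken well above p here:

import numpy as np
from knockpy.knockoffs import FXSampler

np.random.seed(0)
n, p = 500, 50
X = np.random.randn(n, p)            # tall-and-thin design, as FX knockoffs require

sampler = FXSampler(X=X)             # S-matrix construction method left at its default
Xk = sampler.sample_knockoffs()      # (n, p) knockoff matrix
S = sampler.fetch_S()                # S rescaled to the scale of the original X
print(Xk.shape, S.shape)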

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Rescales S to the same scale as the initial X input.

many_ks_tests(sample1s, sample2s)

Runs KS tests on each pair of arrays in sample1s and sample2s to get p-values, then applies a multiple testing correction.

sample_knockoffs()

Samples knockoffs.

fetch_S()[source]

Rescales S to the same scale as the initial X input.

sample_knockoffs()[source]

Samples knockoffs. Returns an (n, p)-shaped knockoff matrix.

class knockpy.knockoffs.GaussianSampler(X, mu=None, Sigma=None, invSigma=None, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)[source]

Samples MX Gaussian (group) knockoffs.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

mu : np.ndarray

(p, )-shaped mean of the features. If None, this defaults to the empirical mean of the features.

Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated using the utilities.estimate_covariance function.

groups : np.ndarray

For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).

S : np.ndarray

the (p, p)-shaped knockoff S-matrix used to generate knockoffs. This is defined such that Cov(X, tilde(X)) = Sigma - S. When None, S is constructed by the knockoff generator. Defaults to None.

method : str

Specifies how to construct the S-matrix. This is ignored if S is not None. There are several options:

  • ‘mvr’: Minimum Variance-Based Reconstructability knockoffs.

  • ‘mmi’: Minimizes the mutual information between X and the knockoffs.

  • ‘ci’: Conditional independence knockoffs.

  • ‘sdp’: Minimizes the mean absolute covariance (MAC) between the features and the knockoffs.

  • ‘equicorrelated’: Minimizes the MAC under the constraint that the correlation between each feature and its knockoff is the same.

The default is to use mvr for non-group knockoffs and the group-SDP for grouped knockoffs (the implementation of group mvr knockoffs is currently fairly slow). In both cases, a block-diagonal approximation is used if the number of features is greater than 1000.

objective : str

How to optimize the S-matrix if using the SDP for group knockoffs. There are several options:

  • ‘abs’: Minimizes sum(abs(Sigma - S)) between groups and the group knockoffs.

  • ‘pnorm’: Minimizes the Lp-th matrix norm. Equivalent to ‘abs’ when p = 1.

  • ‘norm’: Minimizes a different type of matrix norm (see norm_type below).

sample_tol : float

Minimum eigenvalue allowed for the feature-knockoff covariance matrix. Keep this small but nonzero (1e-5) to prevent numerical errors.

verbose : bool

If True, prints progress over time.

rec_prop : float

The proportion of knockoffs to recycle (see Barber and Candes 2018, https://arxiv.org/abs/1602.03574). If method = ‘mvr’, then S-matrix generation takes this into account, which should increase the power of recycled knockoffs in sparsely-correlated, high-dimensional settings.

kwargs : dict

Other kwargs for the S-matrix solvers.
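A minimal usage sketch (all values illustrative) using an AR(1)-style covariance:

import numpy as np
from knockpy.knockoffs import GaussianSampler

np.random.seed(0)
n, p, rho = 300, 20, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = np.random.multivariate_normal(np.zeros(p), Sigma, size=n)

sampler = GaussianSampler(X=X, mu=np.zeros(p), Sigma=Sigma, method='mvr')
Xk = sampler.sample_knockoffs()   # (n, p) knockoff matrix
S = sampler.fetch_S()             # (p, p) S-matrix used to generate the knockoffs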

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)

Runs KS tests on each pair of arrays in sample1s and sample2s to get p-values, then applies a multiple testing correction.

sample_knockoffs()

Samples knockoffs.

fetch_S()[source]

Fetches knockoff S-matrix.

sample_knockoffs()[source]

Samples knockoffs. Returns an (n, p)-shaped knockoff matrix.

class knockpy.knockoffs.KnockoffSampler[source]

Base class for sampling knockoffs.

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)

Runs KS tests on each pair of arrays in sample1s and sample2s to get p-values, then applies a multiple testing correction.

sample_knockoffs

check_PSD_condition(Sigma, S)[source]

Checks that the feature-knockoff cov matrix is PSD.

Parameters
Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated using the shrinkage option. This is ignored for fixed-X knockoffs.

S : np.ndarray

the (p, p)-shaped knockoff S-matrix used to generate knockoffs.

Raises
Raises an error if S is not PSD or 2 Sigma - S is not PSD.
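The condition being verified can be written out by hand with numpy; this is a sketch of the check, not the library's internal implementation:

import numpy as np

def is_psd(A, tol=-1e-10):
    # Smallest eigenvalue of a symmetric matrix; tolerate tiny negative values
    # caused by floating-point error.
    return np.linalg.eigvalsh(A).min() >= tol

def valid_feature_knockoff_cov(Sigma, S):
    # Both S and 2*Sigma - S must be PSD for the joint feature-knockoff
    # covariance matrix to be PSD.
    return is_psd(S) and is_psd(2 * Sigma - S)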
check_xk_validity(X, Xk, testname='', alpha=0.001)[source]

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X. Uses the BHQ adjustment for multiple testing.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

testname : str

a test name that shows up in the error message

alpha : float

The significance level. Defaults to 0.001.

fetch_S()[source]

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)[source]

sample1s and sample2s are lists of arrays. Gets p-values by running KS tests on each pair of arrays and then applies a multiple testing correction.

knockpy.knockoffs.produce_FX_knockoffs(X, invSigma, S, copies=1)[source]

Produces fixed-X knockoffs; see equation (1.4) of https://arxiv.org/pdf/1404.5609.pdf.

The knockpy.metro module implements a metropolized knockoff sampler for an arbitrary probability density and graphical structure using covariance-guided proposals.

See https://arxiv.org/abs/1903.00434 for a description of the algorithm and proof of validity and runtime.

This code was based on initial code written by Stephen Bates in October 2019, which was released in combination with https://arxiv.org/abs/1903.00434.

class knockpy.metro.ARTKSampler(X, Sigma, df_t, **kwargs)[source]

Samples knockoffs for autoregressive T-distributed designs. (Hence, ARTK). See https://arxiv.org/pdf/1903.00434.pdf for details.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. The first off-diagonal should contain the pairwise correlations which define the Markov chain.

df_t : float

The degrees of freedom for the t-distributions.

kwargs : dict

kwargs to pass to the constructor method of the generic MetropolizedKnockoffSampler class.
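A hedged usage sketch: the data-generating step below is purely illustrative (knockpy.dgp provides its own generators), but the constructor call follows the documented signature:

import numpy as np
from knockpy.metro import ARTKSampler

np.random.seed(0)
n, p, rho, df_t = 200, 15, 0.5, 5

# Illustrative AR(1)-style t Markov chain: each column depends on the previous one.
X = np.zeros((n, p))
X[:, 0] = np.random.standard_t(df_t, size=n)
for j in range(1, p):
    X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho ** 2) * np.random.standard_t(df_t, size=n)

# Covariance whose first off-diagonal holds the chain's pairwise correlations.
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

sampler = ARTKSampler(X=X, Sigma=Sigma, df_t=df_t)
Xk = sampler.sample_knockoffs()   # (n, p) knockoff matrix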

Methods

cache_conditional_proposal_params([verbose, …])

Caches some of the conditional means for Xjstar | Xtemp.

center(M[, active_inds])

Centers an n x j matrix M.

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

compute_F(x_flags, j)

Computes the F function from page 33 of the paper: Pr(tildeXj=tildexj, Xjstar=xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.

compute_acc_prob(x_flags, j[, log_q1, …])

Computes acceptance probability for variable j given a particular rejection pattern x_flags.

create_proposal_params(**kwargs)

Constructs the covariance-guided proposal.

fetch_S()

Fetches knockoff S-matrix.

fetch_cached_proposal_params(Xtemp, x_flags, j)

Same as above, but uses caching to speed up computation.

fetch_proposal_params(X, prev_proposals)

Returns mean and variance of proposal j given X and previous proposals.

lf(X)

Reordered likelihood function

lf_ratio(X, Xjstar, j)

Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X.

log_q12(x_flags, j)

Computes q1 and q2 as specified by page 33 of the paper.

many_ks_tests(sample1s, sample2s)

Runs KS tests on each pair of arrays in sample1s and sample2s to get p-values, then applies a multiple testing correction.

q_ll(Xjstar, X, prev_proposals[, cond_mean, …])

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.

sample_knockoffs([clip, cache])

Samples knockoffs using the metropolized knockoff sampler.

sample_proposals(X, prev_proposals[, …])

Samples a continuous or discrete proposal given the design matrix and the previous proposals.

class knockpy.metro.BlockTSampler(X, Sigma, df_t, **kwargs)[source]

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Fetches knockoff S-matrix.

many_ks_tests(sample1s, sample2s)

Runs KS tests on each pair of arrays in sample1s and sample2s to get p-values, then applies a multiple testing correction.

sample_knockoffs(**kwargs)

Actually samples knockoffs sequentially for each block.

fetch_S()[source]

Fetches knockoff S-matrix.

sample_knockoffs(**kwargs)[source]

Actually samples knockoffs sequentially for each block.

Parameters
kwargs : dict

kwargs for the MetropolizedKnockoffSampler.sample_knockoffs call for each block.

Returns
Xk : np.ndarray

An (n, p)-shaped knockoff matrix in the original order the variables were passed in.
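A hedged usage sketch: BlockTSampler shares the ARTKSampler constructor signature, and the data here is a placeholder rather than a draw from the block-t model:

import numpy as np
from knockpy.metro import BlockTSampler

np.random.seed(0)
n, p, df_t = 200, 10, 5
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = np.random.standard_t(df_t, size=(n, p))   # placeholder data, illustrative only

sampler = BlockTSampler(X=X, Sigma=Sigma, df_t=df_t)
Xk = sampler.sample_knockoffs()               # (n, p), in the original variable order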

class knockpy.metro.GibbsGridSampler(X, gibbs_graph, Sigma, Q=None, mu=None, max_width=6, **kwargs)[source]

Samples knockoffs for a discrete gibbs grid using the divide-and-conquer algorithm plus metropolized knockoff sampling.

Parameters
X : np.ndarray

the (n, p)-shaped design matrix

gibbs_graph : np.ndarray

(p, p)-shaped matrix specifying the distribution of the gibbs grid: see knockpy.dgp.sample_gibbs. This must correspond to a grid-like undirected graphical model.

Sigma : np.ndarray

(p, p)-shaped estimated covariance matrix of the data.

max_width : int

The maximum treewidth to allow in the divide-and-conquer algorithm.

Notes

Unlike the attributes of a MetropolizedKnockoffSampler class, the attributes of a GibbsGridSampler class are stored in the same order that the design matrix is initially passed in. E.g., self.Xk corresponds with self.X.
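A hedged sketch: the gibbs_graph below is built by hand as a 4 x 4 lattice rather than taken from knockpy.dgp.sample_gibbs (whose return format is not shown here), and the data is a placeholder, so treat this as illustrative only:

import numpy as np
from knockpy.metro import GibbsGridSampler

np.random.seed(0)
side = 4
p = side * side
temp = 0.5   # assumed interaction strength between neighbouring sites

# Grid-like undirected graphical model: nonzero entries connect lattice neighbours.
gibbs_graph = np.zeros((p, p))
for i in range(side):
    for j in range(side):
        v = i * side + j
        if j + 1 < side:
            gibbs_graph[v, v + 1] = gibbs_graph[v + 1, v] = temp
        if i + 1 < side:
            gibbs_graph[v, v + side] = gibbs_graph[v + side, v] = temp

n = 200
X = np.random.choice([-1.0, 1.0], size=(n, p))   # placeholder discrete data
Sigma = np.cov(X.T) + 1e-3 * np.eye(p)           # crude covariance estimate

sampler = GibbsGridSampler(X=X, gibbs_graph=gibbs_graph, Sigma=Sigma)
Xk = sampler.sample_knockoffs()                  # (n, p), in the original order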

Methods

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

fetch_S()

Returns None because the divide-and-conquer approach means there is no one S-matrix.

many_ks_tests(sample1s, sample2s)

Runs KS tests on each pair of arrays in sample1s and sample2s to get p-values, then applies a multiple testing correction.

sample_knockoffs(**kwargs)

Samples knockoffs using divide-and-conquer approach.

coords2num

num2coords

fetch_S()[source]

Returns None because the divide-and-conquer approach means there is no one S-matrix.

sample_knockoffs(**kwargs)[source]

Samples knockoffs using divide-and-conquer approach.

Parameters
kwargs : dict

Keyword args for MetropolizedKnockoffSampler.sample_knockoffs.

Returns
Xk : np.ndarray

An (n, p)-shaped knockoff matrix in the original order the variables were passed in.

class knockpy.metro.MetropolizedKnockoffSampler(lf, X, mu=None, Sigma=None, undir_graph=None, order=None, active_frontier=None, gamma=0.999, metro_verbose=False, cliques=None, log_potentials=None, buckets=None, **kwargs)[source]

A metropolized knockoff sampler for arbitrary random variables using covariance-guided proposals.

Group knockoffs are not yet supported.

Parameters
lf : function

log-probability density. This function should take an (n, p)-shaped numpy array (n independent samples of a p-dimensional vector) and return an (n,)-shaped array of log-probabilities. This can also be supplied as None if cliques and log_potentials are supplied.

X : np.ndarray

the (n, p)-shaped design matrix

Xk : np.ndarray

the (n, p)-shaped matrix of knockoffs

mu : np.ndarray

The (estimated) mean of X. Exact FDR control is maintained even when this vector is incorrect. Defaults to the mean of X, e.g., X.mean(axis=0).

Sigma : np.ndarray

(p, p)-shaped covariance matrix of the features. If None, this is estimated from the data using a naive method to ensure compatibility with the proposals. Exact FDR control is maintained even when Sigma is incorrect.

undir_graph : np.ndarray or nx.Graph

An undirected graph specifying the conditional independence structure of the data-generating process. This must be specified if either the order or active_frontier params are not specified. One of two options:

  • A networkx undirected graph object

  • A (p, p)-shaped numpy array, where nonzero elements represent edges

order : np.ndarray

A p-length numpy array specifying the ordering to sample the variables. Should be a vector with unique entries 0,…,p-1.

active_frontier : list of lists

A list of lists of length p where entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of the distribution given by lf. See page 34 of the paper.

gamma : float

A tuning parameter to increase / decrease the acceptance ratio. See appendix F.2. Defaults to 0.999.

buckets : np.ndarray or list

A list or array of discrete values that X can take. Covariance-guided proposals will be rounded to these values. If None, Metro assumes the domain of each feature is all real numbers.

kwargs : dict

kwargs to pass to the smatrix.compute_smatrix method for sampling proposals.

Notes

All attributes of the MetropolizedKnockoffSampler are stored in the order that knockoffs are sampled, NOT the order that variables are initially passed in. For example, the X attribute will not necessarily equal the X argument: instead, self.X = X[:, self.order]. To reorder attributes to the initial order of the X argument, use the syntax self.attribute[:, self.inv_order].
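A hedged end-to-end sketch (a Gaussian log-density and an AR(1) conditional independence graph, all values illustrative) that also demonstrates the reordering described above:

import numpy as np
from knockpy.metro import MetropolizedKnockoffSampler

np.random.seed(0)
n, p, rho = 200, 10, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
invSigma = np.linalg.inv(Sigma)
X = np.random.multivariate_normal(np.zeros(p), Sigma, size=n)

def lf(X):
    # Unnormalized multivariate normal log-density, evaluated row by row.
    return -0.5 * np.sum((X @ invSigma) * X, axis=1)

# AR(1) conditional independence structure: edges only between adjacent variables.
undir_graph = (np.abs(np.subtract.outer(np.arange(p), np.arange(p))) == 1).astype(float)

sampler = MetropolizedKnockoffSampler(
    lf=lf, X=X, mu=np.zeros(p), Sigma=Sigma, undir_graph=undir_graph
)
Xk = sampler.sample_knockoffs()                   # returned in the original variable order
Xk_reordered = sampler.Xk[:, sampler.inv_order]   # attributes are stored in sampling order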

Attributes
order : np.ndarray

(p,)-shaped array of indices which reorders X into the order for sampling knockoffs.

inv_order : np.ndarray

(p,)-shaped array of indices which takes a set of variables which have been reordered for metropolized sampling and returns them to their initial order. For example, X == X[:, self.order][:, self.inv_order].

X : np.ndarray

(n, p)-shaped design matrix reordered according to the order for sampling knockoffs

X_prop : np.ndarray

(n, p)-shaped array of knockoff proposals

Xk : np.ndarray

the (n, p)-shaped array of knockoffs

acceptances : np.ndarray

a (n, p)-shaped boolean array where acceptances[i, j] == 1 indicates that X_prop[i, j] was accepted.

final_acc_probs : np.ndarray

a (n, p)-shaped array where final_acc_probs[i, j] is the acceptance probability for X_prop[i, j].

Sigma : np.ndarray

the (p, p)-shaped estimated covariance matrix of X. The class constructor guarantees this is compatible with the conditional independence structure of the data.

S : np.ndarray

the (p, p)-shaped knockoff S-matrix used to generate the covariance-guided proposals.

Methods

cache_conditional_proposal_params([verbose, …])

Caches some of the conditional means for Xjstar | Xtemp.

center(M[, active_inds])

Centers an n x j matrix M.

check_PSD_condition(Sigma, S)

Checks that the feature-knockoff cov matrix is PSD.

check_xk_validity(X, Xk[, testname, alpha])

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.

compute_F(x_flags, j)

Computes the F function from page 33 of the paper: Pr(tildeXj=tildexj, Xjstar=xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.

compute_acc_prob(x_flags, j[, log_q1, …])

Computes acceptance probability for variable j given a particular rejection pattern x_flags.

create_proposal_params(**kwargs)

Constructs the covariance-guided proposal.

fetch_S()

Fetches knockoff S-matrix.

fetch_cached_proposal_params(Xtemp, x_flags, j)

Same as above, but uses caching to speed up computation.

fetch_proposal_params(X, prev_proposals)

Returns mean and variance of proposal j given X and previous proposals.

lf(X)

Reordered likelihood function

lf_ratio(X, Xjstar, j)

Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X.

log_q12(x_flags, j)

Computes q1 and q2 as specified by page 33 of the paper.

many_ks_tests(sample1s, sample2s)

Runs KS tests on each pair of arrays in sample1s and sample2s to get p-values, then applies a multiple testing correction.

q_ll(Xjstar, X, prev_proposals[, cond_mean, …])

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.

sample_knockoffs([clip, cache])

Samples knockoffs using the metropolized knockoff sampler.

sample_proposals(X, prev_proposals[, …])

Samples a continuous or discrete proposal given the design matrix and the previous proposals.

cache_conditional_proposal_params(verbose=False, expensive_cache=True)[source]

Caches some of the conditional means for Xjstar | Xtemp. If expensive_cache = True, this will be quite memory-intensive in order to achieve a 2-3x speedup. Otherwise, it achieves a 20-30% speedup at a more modest memory cost.

center(M, active_inds=None)[source]

Centers an n x j matrix M. When mu = 0, this computation is skipped, since it is a bottleneck for large n and p.

compute_F(x_flags, j)[source]

Computes the F function from page 33 of the paper: Pr(tildeXj=tildexj, Xjstar=xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.

compute_acc_prob(x_flags, j, log_q1=None, log_q2=None, Xtemp=None)[source]

Computes acceptance probability for variable j given a particular rejection pattern x_flags.

Mathematically, this is: Pr(tildeXj = Xjstar | Xtemp, Xtilde_{1:j-1}, Xstar_{1:j})

create_proposal_params(**kwargs)[source]

Constructs the covariance-guided proposal.

Parameters
kwargs : dict

kwargs for the smatrix.compute_smatrix function

fetch_S()[source]

Fetches knockoff S-matrix.

fetch_cached_proposal_params(Xtemp, x_flags, j)[source]

Same as above, but uses caching to speed up computation. This caching can be cheap (if self.cache is False) or extremely expensive (if self.cache is True) in terms of memory.

fetch_proposal_params(X, prev_proposals)[source]

Returns mean and variance of proposal j given X and previous proposals. Both X and prev_proposals must be in the order used to sample knockoff variables.

lf(X)[source]

Reordered likelihood function

lf_ratio(X, Xjstar, j)[source]

Calculates the log of the likelihood ratio between two observations: X where X[:,j] is replaced with Xjstar, divided by the likelihood of X. This is equivalent to (but often faster) than:

>>> ld_obs = self.lf(X)
>>> Xnew = X.copy()
>>> Xnew[:, j] = Xjstar
>>> ld_prop = self.lf(Xnew)
>>> ld_ratio = ld_prop - ld_obs

When node potentials have been passed, this is much faster than calculating the log-likelihood function and subtracting.

Parameters
  • X – an n x p matrix of observations

  • Xjstar – New observations for column j of X

  • j – an int between 0 and p - 1, telling us which column to replace

log_q12(x_flags, j)[source]

Computes q1 and q2 as specified by page 33 of the paper.

q_ll(Xjstar, X, prev_proposals, cond_mean=None, cond_var=None)[source]

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.

Parameters
Xjstar : np.ndarray

(n,)-shaped numpy array of values to evaluate the proposal likelihood at.

X : np.ndarray

(n, p)-shaped array of observed data, in the order used to sample knockoff variables.

prev_proposals : np.ndarray

(n, j-1)-shaped array of previous proposals, in the order used to sample knockoff variables. If None, assumes j = 0.

sample_knockoffs(clip=1e-05, cache=None)[source]

Samples knockoffs using the metropolized knockoff sampler.

Returns
Xk : np.ndarray

An (n, p)-shaped knockoff matrix in the original order the variables were passed in.

sample_proposals(X, prev_proposals, cond_mean=None, cond_var=None)[source]

Samples a continuous or discrete proposal given the design matrix and the previous proposals. Can pass in the conditional mean and variance of the new proposals, if cached, to save computation.

knockpy.metro.gaussian_log_likelihood(X, mu, var)[source]

Computes the Gaussian log-likelihood of X with mean mu and variance var. Somehow this is faster than scipy.
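A quick sanity-check sketch comparing it against scipy on the same inputs (assuming, as the one-line docstring suggests, that it computes the same normalized log-density):

import numpy as np
from scipy import stats
from knockpy.metro import gaussian_log_likelihood

x = np.random.randn(1000)
mu, var = 0.3, 2.0
ll_knockpy = gaussian_log_likelihood(x, mu, var)
ll_scipy = stats.norm.logpdf(x, loc=mu, scale=np.sqrt(var))
print(np.max(np.abs(ll_knockpy - ll_scipy)))   # should be near zero if they agree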

knockpy.metro.get_ordering(T)[source]

Takes a junction tree and returns a variable ordering for the metro knockoff sampler. The code from this function is adapted from the code distributed with https://arxiv.org/abs/1903.00434.

Parameters
T : nx.Graph

A networkx graph that is a junction tree. Nodes must be sets with elements 0,…,p-1.

Returns
order : np.ndarray

a numpy array with unique elements 0,…,p-1

active_frontier : list of lists

a list of length p where entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of a joint covariate distribution. See page 34 of https://arxiv.org/abs/1903.00434.
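A hedged sketch: a hand-built junction tree for a p-variable Markov chain, where clique j is {j, j+1}. That frozenset nodes satisfy the "nodes must be sets" requirement is an assumption:

import networkx as nx
from knockpy.metro import get_ordering

p = 6
cliques = [frozenset({j, j + 1}) for j in range(p - 1)]   # cliques of a chain graph

T = nx.Graph()
T.add_nodes_from(cliques)
T.add_edges_from(zip(cliques[:-1], cliques[1:]))          # the junction tree is itself a chain

order, active_frontier = get_ordering(T)
print(order)             # a permutation of 0, ..., p-1
print(active_frontier)   # list of p lists; entry j holds the later variables in V_j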

knockpy.metro.t_log_likelihood(X, df_t)[source]

UNNORMALIZED t log-likelihood. This is also faster than scipy.

knockpy.metro.t_markov_loglike(X, rhos, df_t=3)[source]

Calculates the log-likelihood for the Markov chain specified in https://arxiv.org/pdf/1903.00434.pdf.

knockpy.metro.t_mvn_loglike(X, invScale, mu=None, df_t=3)[source]

Calculates the multivariate t log-likelihood up to a normalizing constant.

Parameters
X : np.ndarray

n x p array of data

invScale : np.ndarray

p x p array, inverse multivariate t scale matrix

mu : np.ndarray

p-length array, location parameter

df_t : float

degrees of freedom