Knockoff Sampler API Reference
class knockpy.knockoffs.FXSampler(X, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)

Samples fixed-X (FX) knockoffs. See the GaussianSampler documentation for a description of the arguments.

Methods

- check_PSD_condition(Sigma, S): Checks that the feature-knockoff covariance matrix is PSD.
- check_xk_validity(X, Xk[, testname, alpha]): Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
- fetch_S(): Rescales S to the same scale as the initial X input.
- many_ks_tests(sample1s, sample2s): Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.
- sample_knockoffs(): Samples knockoffs.
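Example

A minimal sketch (not from the knockpy docs) of sampling fixed-X knockoffs for a synthetic design; the assumption that sample_knockoffs() returns the (n, p) knockoff matrix is ours, and the fixed-X construction generally requires n >= 2p.

    import numpy as np
    from knockpy.knockoffs import FXSampler

    rng = np.random.default_rng(42)
    n, p = 500, 50
    X = rng.standard_normal((n, p))   # tall design: n >= 2p

    sampler = FXSampler(X)            # the S-matrix is constructed automatically
    Xk = sampler.sample_knockoffs()   # assumed to return the (n, p) knockoffs
    S = sampler.fetch_S()             # S rescaled to the scale of the original X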
class knockpy.knockoffs.GaussianSampler(X, mu=None, Sigma=None, invSigma=None, groups=None, sample_tol=1e-05, S=None, method=None, verbose=False, **kwargs)

Samples model-X (MX) Gaussian (group) knockoffs.

Parameters

- X : np.ndarray
  The (n, p)-shaped design matrix.
- mu : np.ndarray
  The (p,)-shaped mean of the features. If None, this defaults to the empirical mean of the features.
- Sigma : np.ndarray
  The (p, p)-shaped covariance matrix of the features. If None, this is estimated using the utilities.estimate_covariance function.
- groups : np.ndarray
  For group knockoffs, a p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None (regular knockoffs).
- S : np.ndarray
  The (p, p)-shaped knockoff S-matrix used to generate knockoffs, defined such that Cov(X, tilde(X)) = Sigma - S. If None, S is constructed by the knockoff generator. Defaults to None.
- method : str
  Specifies how to construct the S-matrix. This is ignored if S is not None. There are several options:
  - 'mvr': minimum variance-based reconstructability (MVR) knockoffs.
  - 'mmi': minimizes the mutual information between X and the knockoffs.
  - 'ci': conditional independence knockoffs.
  - 'sdp': minimizes the mean absolute covariance (MAC) between the features and the knockoffs.
  - 'equicorrelated': minimizes the MAC under the constraint that the correlation between each feature and its knockoff is the same.
  The default is to use 'mvr' for non-group knockoffs and the group SDP for grouped knockoffs (the implementation of group MVR knockoffs is currently fairly slow). In both cases, a block-diagonal approximation is used if the number of features exceeds 1000.
- objective : str
  How to optimize the S-matrix if using the SDP for group knockoffs. There are several options:
  - 'abs': minimize sum(abs(Sigma - S)) between groups and the group knockoffs.
  - 'pnorm': minimize the Lp-th matrix norm. Equivalent to 'abs' when p = 1.
  - 'norm': minimize a different type of matrix norm (see norm_type below).
- sample_tol : float
  Minimum eigenvalue allowed for the feature-knockoff covariance matrix. Keep this small but nonzero (e.g., 1e-5) to prevent numerical errors.
- verbose : bool
  If True, prints progress over time.
- rec_prop : float
  The proportion of knockoffs to recycle (see Barber and Candes 2018, https://arxiv.org/abs/1602.03574). If method == 'mvr', S-matrix generation takes this into account, which should increase the power of recycled knockoffs in sparsely correlated, high-dimensional settings.
- kwargs : dict
  Other kwargs for the S-matrix solvers.

Methods

- check_PSD_condition(Sigma, S): Checks that the feature-knockoff covariance matrix is PSD.
- check_xk_validity(X, Xk[, testname, alpha]): Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
- fetch_S(): Fetches the knockoff S-matrix.
- many_ks_tests(sample1s, sample2s): Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.
- sample_knockoffs(): Samples knockoffs.
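Example

As a quick illustration of the arguments above, this hedged sketch (not from the knockpy docs) samples MX Gaussian knockoffs for a synthetic AR(1) design and selects the S-matrix with the 'mvr' solver; the assumption that sample_knockoffs() returns the (n, p) knockoff matrix is ours.

    import numpy as np
    from knockpy.knockoffs import GaussianSampler

    rng = np.random.default_rng(0)
    n, p, rho = 300, 100, 0.5
    # AR(1) covariance: Sigma[i, j] = rho ** |i - j|
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)

    sampler = GaussianSampler(X=X, Sigma=Sigma, method='mvr')
    Xk = sampler.sample_knockoffs()          # assumed to return the (n, p) knockoffs
    S = sampler.fetch_S()                    # the (p, p) S-matrix that was solved for
    sampler.check_PSD_condition(Sigma, S)    # raises if S or 2*Sigma - S is not PSD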
class knockpy.knockoffs.KnockoffSampler

Base class for sampling knockoffs.

Methods

- check_PSD_condition(Sigma, S): Checks that the feature-knockoff covariance matrix is PSD.
- check_xk_validity(X, Xk[, testname, alpha]): Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
- fetch_S(): Fetches the knockoff S-matrix.
- many_ks_tests(sample1s, sample2s): Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.
- sample_knockoffs: Samples knockoffs; implemented by subclasses.
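Example

A purely illustrative sketch of subclassing the base class (this recipe is not part of knockpy): when the features are known to be i.i.d. N(0, 1), independent N(0, 1) draws are themselves valid model-X knockoffs, so a minimal subclass only needs to implement fetch_S and sample_knockoffs.

    import numpy as np
    from knockpy.knockoffs import KnockoffSampler

    class IndependentGaussianSampler(KnockoffSampler):
        """Hypothetical sampler assuming X has i.i.d. N(0, 1) entries, so S = I."""

        def __init__(self, X):
            self.X = X
            self.S = np.eye(X.shape[1])

        def fetch_S(self):
            return self.S

        def sample_knockoffs(self):
            # Independent copies are valid knockoffs only because the features
            # are mutually independent in this toy setting.
            return np.random.randn(*self.X.shape)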
check_PSD_condition(Sigma, S)

Checks that the feature-knockoff covariance matrix is PSD.

Parameters

- Sigma : np.ndarray
  The (p, p)-shaped covariance matrix of the features. If None, this is estimated using the shrinkage option. This is ignored for fixed-X knockoffs.
- S : np.ndarray
  The (p, p)-shaped knockoff S-matrix used to generate knockoffs.

Raises

Raises an error if S is not PSD or if 2*Sigma - S is not PSD.
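Example

A hedged sketch of the PSD check on toy matrices (not from the knockpy docs): with Sigma = I, a diagonal S with entries in [0, 2] keeps both S and 2*Sigma - S PSD, while larger entries should trigger the documented error.

    import numpy as np
    from knockpy.knockoffs import GaussianSampler

    p = 10
    X = np.random.randn(200, p)
    sampler = GaussianSampler(X=X, Sigma=np.eye(p))

    sampler.check_PSD_condition(np.eye(p), 0.5 * np.eye(p))      # passes: S and 2I - S are PSD
    try:
        sampler.check_PSD_condition(np.eye(p), 3.0 * np.eye(p))  # 2I - S is not PSD
    except Exception as err:  # the exact exception type is not documented here
        print("PSD check failed as expected:", err)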
check_xk_validity(X, Xk, testname='', alpha=0.001)

Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X. Uses the Benjamini-Hochberg (BHq) adjustment for multiple testing.

Parameters

- X : np.ndarray
  The (n, p)-shaped design matrix.
- Xk : np.ndarray
  The (n, p)-shaped matrix of knockoffs.
- testname : str
  A name for the test that appears in any error message.
- alpha : float
  The significance level. Defaults to 0.001.
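Example

A hedged sketch (not from the knockpy docs) of using the diagnostic on freshly sampled Gaussian knockoffs; at the default alpha = 0.001 a correct sampler should pass the corrected KS tests on most runs.

    import numpy as np
    from knockpy.knockoffs import GaussianSampler

    n, p = 1000, 20
    X = np.random.randn(n, p)

    sampler = GaussianSampler(X=X, Sigma=np.eye(p))
    Xk = sampler.sample_knockoffs()                        # assumed to return (n, p) knockoffs
    sampler.check_xk_validity(X, Xk, testname="gaussian")  # raises if the corrected KS tests reject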
knockpy.knockoffs.produce_FX_knockoffs(X, invSigma, S, copies=1)

Produces fixed-X knockoffs; see equation (1.4) of https://arxiv.org/pdf/1404.5609.pdf.
The knockpy.metro module implements the metropolized knockoff sampler for an arbitrary probability density and graphical structure using covariance-guided proposals. See https://arxiv.org/abs/1903.00434 for a description of the algorithm and a proof of validity and runtime. This code is based on initial code written by Stephen Bates in October 2019, which was released in combination with https://arxiv.org/abs/1903.00434.
class knockpy.metro.ARTKSampler(X, Sigma, df_t, **kwargs)

Samples knockoffs for autoregressive t-distributed designs (hence "ARTK"). See https://arxiv.org/pdf/1903.00434.pdf for details.

Parameters

- X : np.ndarray
  The (n, p)-shaped design matrix.
- Sigma : np.ndarray
  The (p, p)-shaped covariance matrix of the features. The first off-diagonal should contain the pairwise correlations which define the Markov chain.
- df_t : float
  The degrees of freedom for the t-distributions.
- kwargs : dict
  kwargs to pass to the constructor of the generic MetropolizedKnockoffSampler class.

Methods

- cache_conditional_proposal_params([verbose, ...]): Caches some of the conditional means for Xjstar | Xtemp.
- center(M[, active_inds]): Centers an n x j matrix M.
- check_PSD_condition(Sigma, S): Checks that the feature-knockoff covariance matrix is PSD.
- check_xk_validity(X, Xk[, testname, alpha]): Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
- compute_F(x_flags, j): Computes the F function from page 33 of the paper: Pr(tildeXj = tildexj, Xjstar = xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are not inputs because they do not change during the junction tree DP process.
- compute_acc_prob(x_flags, j[, log_q1, ...]): Computes the acceptance probability for variable j given a particular rejection pattern x_flags.
- create_proposal_params(**kwargs): Constructs the covariance-guided proposal.
- fetch_S(): Fetches the knockoff S-matrix.
- fetch_cached_proposal_params(Xtemp, x_flags, j): Same as fetch_proposal_params, but uses caching to speed up computation.
- fetch_proposal_params(X, prev_proposals): Returns the mean and variance of proposal j given X and the previous proposals.
- lf(X): Reordered likelihood function.
- lf_ratio(X, Xjstar, j): Calculates the log of the likelihood ratio between X with X[:, j] replaced by Xjstar and X itself.
- log_q12(x_flags, j): Computes q1 and q2 as specified on page 33 of the paper.
- many_ks_tests(sample1s, sample2s): Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.
- q_ll(Xjstar, X, prev_proposals[, cond_mean, ...]): Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.
- sample_knockoffs([clip, cache]): Samples knockoffs using the metropolized knockoff sampler.
- sample_proposals(X, prev_proposals[, ...]): Samples a continuous or discrete proposal given the design matrix and the previous proposals.
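Example

A hedged sketch (not from the knockpy docs): Sigma is an AR(1) correlation matrix whose first off-diagonal holds the chain correlations. For brevity the design below is only a Gaussian stand-in; in practice X should be drawn from the t-Markov chain that Sigma and df_t describe.

    import numpy as np
    from knockpy.metro import ARTKSampler

    rng = np.random.default_rng(3)
    n, p, rho, df_t = 200, 30, 0.5, 5
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # stand-in data

    sampler = ARTKSampler(X=X, Sigma=Sigma, df_t=df_t)
    Xk = sampler.sample_knockoffs()   # metropolized sampling; assumed to return (n, p) knockoffs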
class knockpy.metro.BlockTSampler(X, Sigma, df_t, **kwargs)

Methods

- check_PSD_condition(Sigma, S): Checks that the feature-knockoff covariance matrix is PSD.
- check_xk_validity(X, Xk[, testname, alpha]): Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
- fetch_S(): Fetches the knockoff S-matrix.
- many_ks_tests(sample1s, sample2s): Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.
- sample_knockoffs(**kwargs): Samples knockoffs sequentially for each block.
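Example

The same pattern is assumed to apply to BlockTSampler; the block-diagonal, equicorrelated Sigma below reflects our reading of "block" designs and should be checked against the knockpy source, and the design matrix is again only a Gaussian stand-in.

    import numpy as np
    from scipy.linalg import block_diag
    from knockpy.metro import BlockTSampler

    rng = np.random.default_rng(4)
    n, p, block_size, rho, df_t = 200, 30, 5, 0.5, 5
    block = (1 - rho) * np.eye(block_size) + rho * np.ones((block_size, block_size))
    Sigma = block_diag(*([block] * (p // block_size)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # stand-in data

    sampler = BlockTSampler(X=X, Sigma=Sigma, df_t=df_t)
    Xk = sampler.sample_knockoffs()   # samples knockoffs block by block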
class knockpy.metro.GibbsGridSampler(X, gibbs_graph, Sigma, Q=None, mu=None, max_width=6, **kwargs)

Samples knockoffs for a discrete Gibbs grid using the divide-and-conquer algorithm plus metropolized knockoff sampling.

Parameters

- X : np.ndarray
  The (n, p)-shaped design matrix.
- gibbs_graph : np.ndarray
  The (p, p)-shaped matrix specifying the distribution of the Gibbs grid; see knockpy.dgp.sample_gibbs. This must correspond to a grid-like undirected graphical model.
- Sigma : np.ndarray
  The (p, p)-shaped estimated covariance matrix of the data.
- max_width : int
  The maximum treewidth to allow in the divide-and-conquer algorithm.

Notes

Unlike the attributes of a MetropolizedKnockoffSampler class, the attributes of this sampler are stored in the same order that the design matrix is initially passed in; e.g., self.Xk corresponds with self.X.

Methods

- check_PSD_condition(Sigma, S): Checks that the feature-knockoff covariance matrix is PSD.
- check_xk_validity(X, Xk[, testname, alpha]): Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
- fetch_S(): Returns None, because the divide-and-conquer approach means there is no single S-matrix.
- many_ks_tests(sample1s, sample2s): Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.
- sample_knockoffs(**kwargs): Samples knockoffs using the divide-and-conquer approach.
- coords2num
- num2coords
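Example

A heavily hedged sketch (not from the knockpy docs): the gibbs_graph argument is meant to come from knockpy.dgp.sample_gibbs as referenced above, but the signature and return values of sample_gibbs used below are assumptions on our part; check the knockpy.dgp documentation.

    import numpy as np
    from knockpy import dgp
    from knockpy.metro import GibbsGridSampler

    n, p = 200, 25                               # p = 25 gives a 5 x 5 grid
    X, gibbs_graph = dgp.sample_gibbs(n=n, p=p)  # assumed signature and return values
    Sigma = np.cov(X.T)                          # crude covariance estimate

    sampler = GibbsGridSampler(X=X, gibbs_graph=gibbs_graph, Sigma=Sigma, max_width=5)
    Xk = sampler.sample_knockoffs()              # divide-and-conquer metropolized sampling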
class knockpy.metro.MetropolizedKnockoffSampler(lf, X, mu=None, Sigma=None, undir_graph=None, order=None, active_frontier=None, gamma=0.999, metro_verbose=False, cliques=None, log_potentials=None, buckets=None, **kwargs)

A metropolized knockoff sampler for arbitrary random variables using covariance-guided proposals.

Group knockoffs are not yet supported.

Parameters

- lf : function
  The log-probability density. This function should take an (n, p)-shaped numpy array (n independent samples of a p-dimensional vector) and return an (n,)-shaped array of log-probabilities. This can also be supplied as None if cliques and log_potentials are supplied.
- X : np.ndarray
  The (n, p)-shaped design matrix.
- Xk : np.ndarray
  The (n, p)-shaped matrix of knockoffs.
- mu : np.ndarray
  The (estimated) mean of X. Exact FDR control is maintained even when this vector is incorrect. Defaults to the mean of X, e.g., X.mean(axis=0).
- Sigma : np.ndarray
  The (p, p)-shaped covariance matrix of the features. If None, this is estimated from the data using a naive method to ensure compatibility with the proposals. Exact FDR control is maintained even when Sigma is incorrect.
- undir_graph : np.ndarray or nx.Graph
  An undirected graph specifying the conditional independence structure of the data-generating process. This must be specified if either of the order or active_frontier params is not specified. One of two options: a networkx undirected graph object, or a (p, p)-shaped numpy array where nonzero elements represent edges.
- order : np.ndarray
  A p-length numpy array specifying the order in which to sample the variables. Should be a vector with unique entries 0, ..., p-1.
- active_frontier : list of lists
  A list of lists of length p, where entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of the distribution given by lf. See page 34 of the paper.
- gamma : float
  A tuning parameter to increase or decrease the acceptance ratio. See Appendix F.2. Defaults to 0.999.
- buckets : np.ndarray or list
  A list or array of discrete values that X can take. Covariance-guided proposals will be rounded to these values. If None, Metro assumes the domain of each feature is all real numbers.
- kwargs : dict
  kwargs to pass to the smatrix.compute_smatrix method for sampling proposals.

Notes

All attributes of the MetropolizedKnockoffSampler are stored in the order that knockoffs are sampled, NOT the order in which variables are initially passed in. For example, the X attribute will not necessarily equal the X argument: instead, self.X = X[:, self.order]. To reorder attributes to the initial order of the X argument, use the syntax self.attribute[:, self.inv_order].

Attributes

- order : np.ndarray
  A (p,)-shaped array of indices which reorders X into the order used for sampling knockoffs.
- inv_order : np.ndarray
  A (p,)-shaped array of indices which takes a set of variables that have been reordered for metropolized sampling and returns them to their initial order. For example, X == X[:, self.order][:, self.inv_order].
- X : np.ndarray
  The (n, p)-shaped design matrix, reordered according to the order used for sampling knockoffs.
- X_prop : np.ndarray
  The (n, p)-shaped array of knockoff proposals.
- Xk : np.ndarray
  The (n, p)-shaped array of knockoffs.
- acceptances : np.ndarray
  An (n, p)-shaped boolean array where acceptances[i, j] == 1 indicates that X_prop[i, j] was accepted.
- final_acc_probs : np.ndarray
  An (n, p)-shaped array where final_acc_probs[i, j] is the acceptance probability for X_prop[i, j].
- Sigma : np.ndarray
  The (p, p)-shaped estimated covariance matrix of X. The class constructor guarantees this is compatible with the conditional independence structure of the data.
- S : np.ndarray
  The (p, p)-shaped knockoff S-matrix used to generate the covariance-guided proposals.

Methods

- cache_conditional_proposal_params([verbose, ...]): Caches some of the conditional means for Xjstar | Xtemp.
- center(M[, active_inds]): Centers an n x j matrix M.
- check_PSD_condition(Sigma, S): Checks that the feature-knockoff covariance matrix is PSD.
- check_xk_validity(X, Xk[, testname, alpha]): Runs a variety of KS tests on X and Xk to (informally) check that Xk are valid knockoffs for X.
- compute_F(x_flags, j): Computes the F function from page 33 of the paper: Pr(tildeXj = tildexj, Xjstar = xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}). Note that tildexj and xjstar are not inputs because they do not change during the junction tree DP process.
- compute_acc_prob(x_flags, j[, log_q1, ...]): Computes the acceptance probability for variable j given a particular rejection pattern x_flags.
- create_proposal_params(**kwargs): Constructs the covariance-guided proposal.
- fetch_S(): Fetches the knockoff S-matrix.
- fetch_cached_proposal_params(Xtemp, x_flags, j): Same as fetch_proposal_params, but uses caching to speed up computation.
- fetch_proposal_params(X, prev_proposals): Returns the mean and variance of proposal j given X and the previous proposals.
- lf(X): Reordered likelihood function.
- lf_ratio(X, Xjstar, j): Calculates the log of the likelihood ratio between X with X[:, j] replaced by Xjstar and X itself.
- log_q12(x_flags, j): Computes q1 and q2 as specified on page 33 of the paper.
- many_ks_tests(sample1s, sample2s): Given lists of arrays sample1s and sample2s, computes p-values by running KS tests and then applies a multiple-testing correction.
- q_ll(Xjstar, X, prev_proposals[, cond_mean, ...]): Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.
- sample_knockoffs([clip, cache]): Samples knockoffs using the metropolized knockoff sampler.
- sample_proposals(X, prev_proposals[, ...]): Samples a continuous or discrete proposal given the design matrix and the previous proposals.
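Example

The hedged sketch below (not from the knockpy docs) wires the constructor arguments together for a toy multivariate Gaussian density: lf maps an (n, p) array to (n,) log-probabilities, the undirected graph is passed as a dense (p, p) adjacency matrix, and the assumption that sample_knockoffs() returns the knockoff matrix is ours. In practice the class is intended for non-Gaussian lf; a Gaussian target just keeps the example self-contained.

    import numpy as np
    from scipy import stats
    from knockpy.metro import MetropolizedKnockoffSampler

    rng = np.random.default_rng(5)
    n, p, rho = 200, 5, 0.5
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    def lf(X):
        """(n, p) array -> (n,) array of log-densities."""
        return stats.multivariate_normal(mean=np.zeros(p), cov=Sigma).logpdf(X)

    # Dense graph (no conditional independences assumed); nonzero entries are edges.
    undir_graph = np.ones((p, p)) - np.eye(p)

    sampler = MetropolizedKnockoffSampler(
        lf=lf, X=X, mu=np.zeros(p), Sigma=Sigma, undir_graph=undir_graph,
    )
    Xk = sampler.sample_knockoffs()  # assumed to return the (n, p) knockoffs

    # Per the Notes above, stored attributes are in sampling order; use inv_order
    # to map e.g. sampler.Xk back to the original column order.
    Xk_original_order = sampler.Xk[:, sampler.inv_order]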
cache_conditional_proposal_params(verbose=False, expensive_cache=True)

Caches some of the conditional means for Xjstar | Xtemp. If expensive_cache is True, this is quite memory-intensive in order to achieve a 2-3x speedup. Otherwise, it achieves a 20-30% speedup at a more modest memory cost.
center(M, active_inds=None)

Centers an n x j matrix M. When mu = 0, this computation is skipped, since it is actually a bottleneck for large n and p.
compute_F(x_flags, j)

Computes the F function from page 33 of the paper:
Pr(tildeXj = tildexj, Xjstar = xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j-1}).
Note that tildexj and xjstar are NOT inputs because they do NOT change during the junction tree DP process.
compute_acc_prob(x_flags, j, log_q1=None, log_q2=None, Xtemp=None)

Computes the acceptance probability for variable j given a particular rejection pattern x_flags. Mathematically, this is:
Pr(tildeXj = Xjstar | Xtemp, tildeX_{1:j-1}, Xjstar_{1:j})
create_proposal_params(**kwargs)

Constructs the covariance-guided proposal.

Parameters

- kwargs : dict
  kwargs for the smatrix.compute_smatrix function.
fetch_cached_proposal_params(Xtemp, x_flags, j)

Same as fetch_proposal_params, but uses caching to speed up computation. This caching can be cheap (if self.cache is False) or extremely expensive in terms of memory (if self.cache is True).
fetch_proposal_params(X, prev_proposals)

Returns the mean and variance of proposal j given X and the previous proposals. Both X and prev_proposals must be in the order used to sample knockoff variables.
lf_ratio(X, Xjstar, j)

Calculates the log of the likelihood ratio between two observations: X with X[:, j] replaced by Xjstar, divided by the likelihood of X. This is equivalent to (but often faster than):

>>> ld_obs = self.lf(X)
>>> Xnew = X.copy()
>>> Xnew[:, j] = Xjstar
>>> ld_prop = self.lf(Xnew)
>>> ld_ratio = ld_prop - ld_obs

When node potentials have been passed, this is much faster than calculating the log-likelihood function and subtracting.

Parameters

- X : np.ndarray
  An n x p matrix of observations.
- Xjstar : np.ndarray
  New observations for column j of X.
- j : int
  An integer between 0 and p - 1 indicating which column to replace.
q_ll(Xjstar, X, prev_proposals, cond_mean=None, cond_var=None)

Calculates the log-likelihood of a proposal Xjstar given X and the previous proposals.

Parameters

- Xjstar : np.ndarray
  An (n,)-shaped numpy array of values at which to evaluate the proposal likelihood.
- X : np.ndarray
  An (n, p)-shaped array of observed data, in the order used to sample knockoff variables.
- prev_proposals : np.ndarray
  An (n, j-1)-shaped array of previous proposals, in the order used to sample knockoff variables. If None, assumes j = 0.
knockpy.metro.get_ordering(T)

Takes a junction tree and returns a variable ordering for the metro knockoff sampler. The code for this function is adapted from the code distributed with https://arxiv.org/abs/1903.00434.

Parameters

- T : networkx.Graph
  A networkx graph that is a junction tree. Nodes must be sets with elements 0, ..., p-1.

Returns

- order : np.ndarray
  A numpy array with unique elements 0, ..., p-1.
- active_frontier : list of lists
  A list of length p where entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of a joint covariate distribution. See page 34 of https://arxiv.org/abs/1903.00434.
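Example

A hedged sketch (not from the knockpy docs): the junction tree of a length-4 Markov chain (edges 0-1, 1-2, 2-3) is built by hand, using frozensets so that the nodes are hashable sets of {0, ..., p-1}, and then passed to get_ordering. The resulting order and active_frontier can be fed to MetropolizedKnockoffSampler.

    import networkx as nx
    from knockpy import metro

    cliques = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})]
    T = nx.Graph()
    T.add_nodes_from(cliques)
    T.add_edge(cliques[0], cliques[1])
    T.add_edge(cliques[1], cliques[2])

    order, active_frontier = metro.get_ordering(T)
    # order: a permutation of 0..3; active_frontier: a list of 4 lists (see above)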
knockpy.metro.t_log_likelihood(X, df_t)

Unnormalized t log-likelihood. This is also faster than the scipy implementation.
knockpy.metro.t_markov_loglike(X, rhos, df_t=3)

Calculates the log-likelihood for the Markov chain specified in https://arxiv.org/pdf/1903.00434.pdf.
knockpy.metro.t_mvn_loglike(X, invScale, mu=None, df_t=3)

Calculates the multivariate t log-likelihood up to a normalizing constant.

Parameters

- X : np.ndarray
  An n x p array of data.
- invScale : np.ndarray
  A p x p array, the inverse of the multivariate t scale matrix.
- mu : np.ndarray
  A p-length array, the location parameter.
- df_t : float
  The degrees of freedom.
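Example

A small sketch (not from the knockpy docs) exercising the three t-likelihood helpers on synthetic data. The calls follow the signatures documented above; the shape of rhos and the return shapes are assumptions on our part and should be checked against the source.

    import numpy as np
    from knockpy import metro

    rng = np.random.default_rng(6)
    n, p, rho, df_t = 100, 5, 0.5, 5
    X = rng.standard_normal((n, p))   # stand-in data; any (n, p) array works numerically

    ll_iid = metro.t_log_likelihood(X, df_t=df_t)               # unnormalized t log-likelihood

    rhos = np.full(p - 1, rho)                                  # assumed: successive chain correlations
    ll_markov = metro.t_markov_loglike(X, rhos=rhos, df_t=df_t)

    Scale = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) scale matrix
    ll_mvt = metro.t_mvn_loglike(X, invScale=np.linalg.inv(Scale), df_t=df_t)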