Utilities API Reference
A collection of functions for generating synthetic datasets.
-
knockpy.dgp.AR1(p=30, a=1, b=1, tol=0.001, rho=None)
Generates a p-dimensional correlation matrix for an AR(1) Gaussian process, where successive correlations are drawn independently from Beta(a, b). If rho is specified, then the process is stationary with correlation rho.
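As a rough illustration of the structure described above, the following NumPy sketch builds an AR(1) correlation matrix from a vector of successive correlations. The helper name `ar1_corr_sketch` is hypothetical and this is not knockpy's implementation; in particular, the role of `tol` is not modelled here.

```python
import numpy as np

def ar1_corr_sketch(rhos):
    """Hypothetical helper: AR(1) correlation matrix from neighbor correlations.

    rhos[k] is the correlation between variables k and k+1, so the
    correlation between variables i < j is the product rhos[i] * ... * rhos[j-1].
    """
    rhos = np.asarray(rhos, dtype=float)
    p = rhos.shape[0] + 1
    Sigma = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            Sigma[i, j] = Sigma[j, i] = np.prod(rhos[i:j])
    return Sigma

# Stationary case: every successive correlation equals rho.
Sigma_stationary = ar1_corr_sketch(np.full(3, 0.5))

# Random case: successive correlations drawn independently from Beta(a, b).
rng = np.random.default_rng(0)
Sigma_random = ar1_corr_sketch(rng.beta(1, 1, size=4))
```

In the stationary case the entries reduce to rho raised to the power |i - j|.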
-
class knockpy.dgp.DGP(mu=None, Sigma=None, invSigma=None, beta=None, gibbs_graph=None)
A utility class which creates a (random) data-generating process for a design matrix X and a response y. If the parameters are not specified, they will be randomly generated when the sample_data method is called.
- Parameters
- mu : np.ndarray
(p,)-shaped mean vector for X data
- Sigma : np.ndarray
(p, p)-shaped covariance matrix for X data
- invSigma : np.ndarray
(p, p)-shaped precision matrix for X data
- beta : np.ndarray
Coefficients used to generate y from X in a single index or sparse additive model.
- gibbs_graph : np.ndarray
(p, p)-shaped matrix of coefficients for the Gibbs grid method.
- Attributes
- mu : np.ndarray
See above
- Sigma : np.ndarray
See above
- invSigma : np.ndarray
See above
- beta : np.ndarray
See above
- gibbs_graph : np.ndarray
See above
- X : np.ndarray
(n, p)-shaped design matrix
- y : np.ndarray
(n,)-shaped response vector
Methods
sample_data([p, n, x_dist, method, y_dist, …])
(Possibly) generates random data-generating parameters and then samples data using those parameters.
-
sample_data(p=100, n=50, x_dist='gaussian', method='AR1', y_dist='gaussian', cond_mean='linear', coeff_size=1, coeff_dist=None, sparsity=0.5, groups=None, sign_prob=0.5, iid_signs=True, corr_signals=False, df_t=3, **kwargs)
(Possibly) generates random data-generating parameters and then samples data using those parameters. By default, (X, y) are jointly Gaussian and y is a linear response to X.
- Parameters
- n : int
The number of data points
- p : int
The dimensionality of the data
- x_dist : str
Specifies the distribution of X. One of “gaussian”, “blockt”, “ar1t”, or “gibbs”.
- method : str
How to generate the covariance matrix of X. One of “AR1”, “NestedAR1”, “partialcorr”, “factor”, “blockequi”, “ver”, “qer”, “dirichlet”, or “uniformdot”. See the docs for each of these methods.
- y_dist : str
Specifies the distribution of y. One of “gaussian” or “binomial”.
- cond_mean : str
How to calculate the conditional mean of y given X. Defaults to “linear”. See knockpy.dgp.sample_response for more options.
- coeff_size : float
Size of non-zero coefficients
- coeff_dist : str
Specifies the distribution of nonzero coefficients. Three options:
  - None: all coefficients have absolute value coeff_size.
  - normal: all nonzero coefficients are drawn from Normal(coeff_size, 1).
  - uniform: nonzero coefficients are drawn from Unif(coeff_size/2, coeff_size).
- sparsity : float
Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.
- groups : np.ndarray
A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. Else, floor(sparsity * num_groups) groups will be chosen to be non-null, where all elements of each group are non-null.
- sign_prob : float
The probability that each nonzero coefficient will be positive.
- iid_signs : bool
If True, the signs of the coefficients are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.
- corr_signals : bool
If True, all of the nonzero coefficients will lie in a consecutive block.
- df_t : float
If the X variables are marginally t-distributed, the degrees of freedom.
- kwargs : dict
Keyword arguments to pass to the method for generating the covariance matrix.
-
knockpy.dgp.DirichletCorr(p=100, temp=1, tol=1e-06)
Generates a correlation matrix by sampling p eigenvalues from a Dirichlet distribution whose p parameters are i.i.d. Uniform[tol, temp] and generating a random covariance matrix with those eigenvalues.
-
knockpy.dgp.ErdosRenyi(p=300, delta=0.2, lower=0.1, upper=1, tol=0.1)
Randomly samples Bernoulli flags as well as values for partial correlations to generate sparse square matrices. Follows https://arxiv.org/pdf/1908.11611.pdf.
-
knockpy.dgp.FactorModel(p=500, rank=2)
Generates a random correlation matrix from a factor model with dimension p and rank rank.
-
knockpy.dgp.NestedAR1(p=500, a=7, b=1, tol=0.001, num_nests=5, nest_size=2)
Generates a correlation matrix for an AR(1) Gaussian process with hierarchical correlation structure.
-
knockpy.dgp.PartialCorr(p=300, rho=0.3)
Creates a correlation matrix of dimension p with partial correlation rho.
-
knockpy.dgp.UniformDot(d=100, p=100, tol=0.01)
Let U be a random d x p matrix with i.i.d. uniform entries. Then Sigma = cov2corr(U^T U).
-
knockpy.dgp.Wishart(d=100, p=100, tol=0.01)
Let W be a random d x p matrix with i.i.d. Gaussian entries. Then Sigma = cov2corr(W^T W).
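The UniformDot and Wishart constructions above can be sketched in plain NumPy as follows. The helper `cov2corr_sketch` is a hypothetical stand-in for the rescaling step, and the `tol` argument of the knockpy functions is not modelled, so treat this as an illustration of the formulas rather than the library's code.

```python
import numpy as np

def cov2corr_sketch(V):
    """Hypothetical helper: rescale a covariance matrix to unit diagonal."""
    scale = np.sqrt(np.diag(V))
    return V / np.outer(scale, scale)

rng = np.random.default_rng(0)
d, p = 100, 10

# Wishart-style construction: Gaussian entries.
W = rng.standard_normal((d, p))
Sigma_wishart = cov2corr_sketch(W.T @ W)

# UniformDot-style construction: uniform entries on [0, 1).
U = rng.uniform(size=(d, p))
Sigma_uniform = cov2corr_sketch(U.T @ U)
```

Because uniform entries have a nonzero mean, the UniformDot-style matrix tends to have much larger off-diagonal correlations than the Wishart-style one.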
-
knockpy.dgp.block_equi_graph(n=3000, p=1000, block_size=5, sparsity=0.1, rho=0.5, gamma=0, coeff_size=3.5, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, beta=None, mu=None, **kwargs)
Samples data according to a block-equicorrelated Gaussian design, where rho is the within-block correlation and gamma * rho is the between-block correlation.
- Parameters
- n : int
The number of data points
- p : int
The dimensionality of the data
- block_size : int
The size of blocks. Defaults to 5.
- sparsity : float
The proportion of groups which are non-null. Defaults to 0.1.
- rho : float
The within-group correlation
- gamma : float
The between-group correlation is gamma * rho
- beta : np.ndarray
If supplied, the (p,)-shaped set of coefficients. Else, will be generated by calling knockpy.dgp.sample_sparse_coefficients.
- mu : np.ndarray
The p-dimensional mean of the covariates. Defaults to 0.
- kwargs : dict
Args passed to knockpy.dgp.sample_response.
Notes
This defaults to the same data-generating process as Dai and Barber 2016 (see https://arxiv.org/abs/1602.03589).
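To make the block-equicorrelated design concrete, here is a minimal NumPy sketch of the covariance matrix implied by the description: correlation rho within a block and gamma * rho between blocks. The function name `block_equi_cov_sketch` is hypothetical, and the sketch assumes blocks are made of consecutive variables.

```python
import numpy as np

def block_equi_cov_sketch(p, block_size, rho, gamma):
    """Hypothetical helper: block-equicorrelated correlation matrix.

    Assumes variables 0..block_size-1 form the first block, and so on.
    """
    block_ids = np.arange(p) // block_size
    same_block = block_ids[:, None] == block_ids[None, :]
    Sigma = np.where(same_block, float(rho), float(gamma) * float(rho))
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

Sigma_blocks = block_equi_cov_sketch(p=6, block_size=3, rho=0.5, gamma=0)
```

With gamma = 0 the matrix is block-diagonal, which matches the default setting in the signature above.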
-
knockpy.dgp.construct_gibbs_grid(n, p, temp=1)
Creates a gridlike gibbs_graph parameter for sample_gibbs. See sample_gibbs.
-
knockpy.dgp.coords2num(l, w, gridwidth=10)
Takes the coordinates of a variable in a Gibbs grid and returns its position.
-
knockpy.dgp.cov2blocks(V, tol=1e-05)
Decomposes a PREORDERED block-diagonal matrix V into its blocks.
-
knockpy.dgp.create_correlation_tree(corr_matrix, method='average')
Creates a hierarchical clustering (correlation tree) from a correlation matrix.
- Parameters
- corr_matrix : np.ndarray
(p, p)-shaped correlation matrix
- method : str
The method of hierarchical clustering: ‘single’, ‘average’, ‘fro’, or ‘complete’
- Returns
- link : np.ndarray
The link of the correlation tree, as in scipy
-
knockpy.dgp.create_sparse_coefficients(p, sparsity=0.5, groups=None, coeff_size=1, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, n=None)
Generate a set of sparse coefficients for single index or sparse additive models.
- Parameters
- p : int
Dimensionality of coefficients
- sparsity : float
Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.
- coeff_size : float
Size of non-zero coefficients
- groups : np.ndarray
A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. Else, floor(sparsity * num_groups) groups will be chosen to be non-null, where all elements of each group are non-null.
- sign_prob : float
The probability that each nonzero coefficient will be positive.
- iid_signs : bool
If True, the signs of the coefficients are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.
- coeff_dist : str
Specifies the distribution of nonzero coefficients. Three options:
  - None: all coefficients have absolute value coeff_size.
  - normal: all nonzero coefficients are drawn from Normal(coeff_size, 1).
  - uniform: nonzero coefficients are drawn from Unif(coeff_size/2, coeff_size).
- corr_signals : bool
If True, all of the nonzero coefficients will lie in a consecutive block.
- Returns
- beta : np.ndarray
(p,)-shaped array of sparse coefficients
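The simplest configuration described above (coeff_dist=None, no groups, i.i.d. signs) can be sketched with NumPy as follows. The function `sparse_coefficients_sketch` and its `seed` argument are hypothetical; this only illustrates the documented behavior and is not knockpy's implementation.

```python
import numpy as np

def sparse_coefficients_sketch(p, sparsity=0.5, coeff_size=1.0,
                               sign_prob=0.5, seed=0):
    """Hypothetical helper covering only coeff_dist=None with i.i.d. signs."""
    rng = np.random.default_rng(seed)
    # floor(p * sparsity) coordinates are non-null.
    num_nonnull = int(np.floor(p * sparsity))
    support = rng.choice(p, size=num_nonnull, replace=False)
    # Each sign is +1 with probability sign_prob, independently.
    signs = np.where(rng.random(num_nonnull) < sign_prob, 1.0, -1.0)
    beta = np.zeros(p)
    beta[support] = coeff_size * signs
    return beta

beta_example = sparse_coefficients_sketch(p=10, sparsity=0.35, coeff_size=2.0)
```

Here floor(10 * 0.35) = 3 coordinates are non-null, each with absolute value 2.0.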
-
knockpy.dgp.graph2cliques(Q)
Turns graph Q of connections into binary cliques for Gibbs grid
-
knockpy.dgp.num2coords(i, gridwidth=10)
Coordinates of variable i in a Gibbs grid
- Parameters
- i : int
Position of variable in ordering
- gridwidth : int
Width of the grid
- Returns
- length_coord, width_coord (coordinates)
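The mapping between a position in the ordering and grid coordinates can be illustrated with integer division. The sketch below assumes one plausible convention (position = width_coord * gridwidth + length_coord); the reference above does not state which convention knockpy uses, so both function names and the ordering are assumptions.

```python
def num2coords_sketch(i, gridwidth=10):
    """Hypothetical helper: position -> (length_coord, width_coord).

    Assumes positions run along the length axis first.
    """
    width_coord, length_coord = divmod(i, gridwidth)
    return length_coord, width_coord

def coords2num_sketch(l, w, gridwidth=10):
    """Hypothetical inverse of num2coords_sketch under the same convention."""
    return w * gridwidth + l

coords = num2coords_sketch(23, gridwidth=10)
position = coords2num_sketch(*coords, gridwidth=10)
```

Whatever the convention, the two functions should be inverses of each other for every position on the grid.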
-
knockpy.dgp.sample_ar1t(rhos, n=50, df_t=3)
Samples t-distributed variables according to a Markov chain.
- Parameters
- rhos : np.ndarray
(p-1,)-length array of correlations between consecutive variables.
- n : int
The number of data points to sample
- df_t : float
The degrees of freedom of the t variables
- Returns
- X : np.ndarray
(n, p)-shaped design matrix
-
knockpy.dgp.sample_block_tmvn(blocks, n=50, df_t=3)
Samples blocks of multivariate-t distributed variables according to a list of covariance matrices called blocks.
- Parameters
- blocks : list
A list of square, symmetric numpy arrays. These are the covariance matrices for each block of variables.
- n : int
The number of data points to sample
- df_t : float
The degrees of freedom of the t-distribution
Notes
The variables are scaled such that the marginal variance of each variable equals a diagonal element in one of the blocks.
-
knockpy.dgp.sample_gibbs(n, p, gibbs_graph=None, temp=1, num_iter=15, K=20, max_val=2.5)
Samples from a discrete Gibbs measure using a Gibbs sampler.
The joint likelihood for the p-dimensional data X1 through Xp is the product of all terms of the following form:
`np.exp(-1*gibbs_graph[i,j]*np.abs(Xi - Xj))`
where gibbs_graph is assumed to be symmetric (and is usually sparse). Each feature takes values on an evenly spaced grid from -1*max_val to max_val with K values.
- Parameters
- n : int
How many data points to sample
- p : int
Dimensionality of the data
- gibbs_graph : np.ndarray
A symmetric (p, p)-shaped array. See the likelihood equation. By default, this corresponds to an undirected graphical model with a square grid (like an Ising model) with nonzero entries set to 1 or -1 with equal probability.
- temp : float
Governs the strength of interactions between features; see the likelihood equation.
- num_iter : int
Number of iterations in the Gibbs sampler; defaults to 15.
- K : int
Number of discrete values each sampled feature can take.
- max_val : float
The maximum absolute value each feature can take.
- Returns
- X : np.ndarray
(n, p)-shaped array of data
- gibbs_graph : np.ndarray
The generated gibbs_graph.
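The likelihood and the value grid described above can be written out in NumPy as a small sketch. The function `gibbs_log_likelihood_sketch` is hypothetical; it assumes each unordered pair (i, j) contributes one term, ignores `temp`, and returns an unnormalized log-likelihood. It does not implement the sampler itself.

```python
import numpy as np

def gibbs_log_likelihood_sketch(x, gibbs_graph):
    """Hypothetical helper: log of the product of exp(-G[i,j] * |x_i - x_j|).

    Assumes every unordered pair i < j is counted exactly once.
    """
    x = np.asarray(x, dtype=float)
    gibbs_graph = np.asarray(gibbs_graph, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])
    upper = np.triu_indices(x.shape[0], k=1)
    return float(-(gibbs_graph[upper] * diffs[upper]).sum())

# Evenly spaced grid of K values from -max_val to max_val.
K, max_val = 20, 2.5
value_grid = np.linspace(-max_val, max_val, K)

# A chain graph on three variables: 0 -- 1 -- 2.
G = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
log_lik = gibbs_log_likelihood_sketch([0.0, 1.0, 3.0], G)
```

Positive entries of gibbs_graph penalize differences between neighboring features, while negative entries reward them.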
-
knockpy.dgp.sample_response(X, beta, cond_mean='linear', y_dist='gaussian')
Given a design matrix X and coefficients beta, samples a response y.
- Parameters
- X : np.ndarray
(n, p)-shaped design matrix
- beta : np.ndarray
(p,)-shaped coefficient vector
- cond_mean : str
How to calculate the conditional mean of y given X, denoted mu(X). Six options:
1. “linear” denotes np.dot(X, beta)
2. “cubic” denotes np.dot(X**3, beta) - np.dot(X, beta)
3. “trunclinear” denotes ((X * beta >= 1).sum(axis = 1)). Stands for truncated linear.
4. “pairint”: pairs up non-null coefficients according to the order of beta, multiplies them and their beta values, then sums. “pairint” stands for pairwise interactions.
5. “cos”: mu(X) = sign(beta) * (beta != 0) * np.cos(X)
6. “quadratic”: mu(X) = np.dot(np.power(X, 2), beta)
- y_dist : str
If “gaussian”, y is the conditional mean plus gaussian noise. If “binomial”, Pr(y=1) = softmax(cond_mean).
- Returns
- y : np.ndarray
(n,)-shaped response vector
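Three of the conditional means listed above translate directly into NumPy. The sketch below evaluates the “linear”, “cubic”, and “quadratic” formulas exactly as written in this reference; the function name `cond_mean_sketch` is hypothetical, and the remaining options and the noise model are deliberately left out.

```python
import numpy as np

def cond_mean_sketch(X, beta, cond_mean="linear"):
    """Hypothetical helper evaluating three of the documented formulas."""
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if cond_mean == "linear":
        return np.dot(X, beta)
    if cond_mean == "cubic":
        return np.dot(X**3, beta) - np.dot(X, beta)
    if cond_mean == "quadratic":
        return np.dot(np.power(X, 2), beta)
    raise ValueError(f"cond_mean={cond_mean!r} is not covered by this sketch")

X_small = np.array([[1.0, 2.0],
                    [0.0, -1.0]])
beta_small = np.array([1.0, 0.5])
mu_linear = cond_mean_sketch(X_small, beta_small, "linear")
mu_cubic = cond_mean_sketch(X_small, beta_small, "cubic")
mu_quadratic = cond_mean_sketch(X_small, beta_small, "quadratic")
```

Each call returns an (n,)-shaped vector, one conditional mean per row of X.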
-
knockpy.utilities.apply_pool(func, constant_inputs={}, num_processes=1, **kwargs)
Spawns num_processes processes to apply func to many different arguments. This wraps the multiprocessing.pool object plus the functools partial function.
- Parameters
- func : function
An arbitrary function
- constant_inputs : dictionary
A dictionary of arguments to func which do not change in each of the processes spawned, defaults to {}.
- num_processes : int
The maximum number of processes spawned, defaults to 1.
- kwargs : dict
Each key should correspond to an argument to func and should map to a list of different arguments.
- Returns
- outputs : list
List of outputs for each input, in the order of the inputs.
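The pattern described above (fix the constant arguments with functools.partial, then map over the varying ones with a process pool) might look roughly like the sketch below. `apply_pool_sketch`, `_call_with_kwargs`, and `scaled_sum` are hypothetical names, and the serial shortcut for num_processes=1 is an assumption of this sketch rather than documented behavior.

```python
from functools import partial
from multiprocessing import Pool

def _call_with_kwargs(func, kwargs):
    # Top-level helper so it can be pickled and sent to worker processes.
    return func(**kwargs)

def apply_pool_sketch(func, constant_inputs=None, num_processes=1, **kwargs):
    """Hypothetical helper: apply func to each set of varying arguments."""
    constant_inputs = constant_inputs or {}
    partial_func = partial(func, **constant_inputs)
    keys = list(kwargs)
    # The i-th call uses the i-th element of every list in kwargs.
    calls = [dict(zip(keys, values)) for values in zip(*kwargs.values())]
    if num_processes == 1:
        # Assumption: run serially rather than spawning a single worker.
        return [partial_func(**call) for call in calls]
    with Pool(num_processes) as pool:
        return pool.starmap(_call_with_kwargs,
                            [(partial_func, call) for call in calls])

def scaled_sum(a, b, scale=1):
    return (a + b) * scale

outputs = apply_pool_sketch(scaled_sum, constant_inputs={"scale": 10},
                            num_processes=1, a=[1, 2], b=[3, 4])
```

When num_processes is greater than 1, func must be defined at module top level so that it can be pickled, and on platforms that spawn processes the call should sit under an `if __name__ == "__main__":` guard.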
-
knockpy.utilities.blockdiag_to_blocks(M, groups)
Given a square array M, returns a list of diagonal blocks of M as specified by groups.
- Returns
- blocks : list
A list of square np.ndarrays. blocks[i] corresponds to the group identified by the ith smallest unique value of groups.
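One way to picture the return value is the NumPy sketch below, which pulls out one square submatrix per unique group label in increasing order. The name `blockdiag_to_blocks_sketch` is hypothetical and the code is an illustration of the documented behavior, not knockpy's implementation.

```python
import numpy as np

def blockdiag_to_blocks_sketch(M, groups):
    """Hypothetical helper: one diagonal block of M per unique group label."""
    M = np.asarray(M)
    groups = np.asarray(groups)
    blocks = []
    # np.unique returns the labels sorted from smallest to largest.
    for label in np.unique(groups):
        idx = np.where(groups == label)[0]
        blocks.append(M[np.ix_(idx, idx)])
    return blocks

M_example = np.arange(16).reshape(4, 4)
blocks_example = blockdiag_to_blocks_sketch(M_example, groups=[1, 2, 1, 2])
```

Note that the members of a group do not need to be adjacent; `np.ix_` selects the matching rows and columns wherever they are.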
-
knockpy.utilities.calc_group_sizes(groups)
Given a list of groups, finds the sizes of the groups.
- Parameters
- groups : np.ndarray
(p,)-shaped array which takes m integer values from 1 to m. If groups[i] == j, this indicates that coordinate i belongs to group j.
- Returns
- sizes : np.ndarray
(m,)-length array of group sizes.
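For group labels running from 1 to m, counting group sizes amounts to a bin count. The sketch below shows this with `np.bincount`; the function name `group_sizes_sketch` is hypothetical and the code is only an illustration of the documented input and output.

```python
import numpy as np

def group_sizes_sketch(groups):
    """Hypothetical helper: sizes[j-1] is the number of members of group j."""
    groups = np.asarray(groups, dtype=int)
    # Shift labels 1..m to 0..m-1 so they can be used as bin indices.
    return np.bincount(groups - 1, minlength=int(groups.max()))

sizes_example = group_sizes_sketch([1, 1, 2, 3, 3, 3])
```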
-
knockpy.utilities.estimate_covariance(X, tol=0.0001, shrinkage='ledoitwolf')
Estimates covariance matrix of X.
- Parameters
- X : np.ndarray
(n, p)-shaped design matrix
- shrinkage : str
The type of shrinkage to apply during estimation. One of “ledoitwolf”, “graphicallasso”, or None (no shrinkage).
- tol : float
If shrinkage is None but the minimum eigenvalue of the MLE is below tol, apply LedoitWolf shrinkage anyway.
- Returns
- Sigma : np.ndarray
(p, p)-shaped estimated covariance matrix of X
- invSigma : np.ndarray
(p, p)-shaped estimated precision matrix of X
-
knockpy.utilities.fetch_group_nonnulls(non_nulls, groups)
Combines feature-level null hypotheses into group-level hypotheses.
-
knockpy.utilities.permute_matrix_by_groups(groups)
Create indices which permute a (covariance) matrix according to a list of groups.
-
knockpy.utilities.preprocess_groups(groups)
Maps the m unique elements of a 1D “groups” array to the integers from 1 to m.
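A relabeling of this kind can be sketched with `np.unique`. The sketch assumes that labels are assigned in sorted order of the original values, which this reference does not specify, and the name `preprocess_groups_sketch` is hypothetical.

```python
import numpy as np

def preprocess_groups_sketch(groups):
    """Hypothetical helper: relabel the m unique values as 1..m.

    Assumes the smallest original value is mapped to 1, the next to 2, etc.
    """
    groups = np.asarray(groups).ravel()
    _, inverse = np.unique(groups, return_inverse=True)
    return inverse + 1

relabeled = preprocess_groups_sketch([10, 3, 3, 7])
```

Here the unique values 3, 7, and 10 become 1, 2, and 3 respectively.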
-
knockpy.utilities.random_permutation_inds(length)
Returns indices which will randomly permute a numpy array of length length, along with indices which will undo this permutation.
- Returns
- inds : np.ndarray
(length,)-shaped ndarray corresponding to a random permutation from 0 to length-1.
- rev_inds : np.ndarray
(length,)-shaped ndarray such that for any (length,)-shaped array called x, x[inds][rev_inds] equals x.
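The relationship between inds and rev_inds can be reproduced with a permutation and its argsort. This is a minimal sketch with a hypothetical function name and `seed` argument, not the library's code.

```python
import numpy as np

def random_permutation_inds_sketch(length, seed=None):
    """Hypothetical helper: a random permutation and its inverse."""
    rng = np.random.default_rng(seed)
    inds = rng.permutation(length)
    # argsort of a permutation is its inverse: inds[rev_inds] == arange(length).
    rev_inds = np.argsort(inds)
    return inds, rev_inds

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
inds, rev_inds = random_permutation_inds_sketch(len(x), seed=0)
x_roundtrip = x[inds][rev_inds]
```

This is useful when a matrix must be shuffled for a computation and the result mapped back to the original variable order afterwards.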