Utilities API Reference

A collection of functions for generating synthetic datasets.

knockpy.dgp.AR1(p=30, a=1, b=1, tol=0.001, rho=None)

Generates a p-dimensional correlation matrix for an AR(1) Gaussian process, where successive correlations are drawn independently from Beta(a, b). If rho is specified, the process is stationary with correlation rho.
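
As a minimal sketch of the structure described above (not knockpy's implementation), the correlation between variables i and j in such a chain is the product of the successive correlations lying between them. The helper name ar1_corr_sketch is hypothetical, and the sketch assumes every successive correlation is strictly positive, as Beta draws are.

```python
import numpy as np

def ar1_corr_sketch(rhos):
    """Correlation matrix of a Markov chain with successive correlations rhos.

    Hypothetical helper for illustration; assumes every entry of rhos is > 0.
    """
    rhos = np.asarray(rhos, dtype=float)
    # Cumulative sums of log-correlations, with a leading zero for variable 0.
    cumlog = np.concatenate([[0.0], np.cumsum(np.log(rhos))])
    # Entry (i, j) is the product of the correlations between positions i and j.
    return np.exp(-np.abs(cumlog[:, None] - cumlog[None, :]))

# Stationary case: every successive correlation equals rho.
Sigma_stationary = ar1_corr_sketch(np.full(4, 0.5))

# Non-stationary case: successive correlations drawn from Beta(a, b).
rng = np.random.default_rng(0)
Sigma_random = ar1_corr_sketch(rng.beta(1, 1, size=4))
```

In the stationary case this reduces to the familiar form in which entry (i, j) equals rho raised to the power |i - j|.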

class knockpy.dgp.DGP(mu=None, Sigma=None, invSigma=None, beta=None, gibbs_graph=None)

A utility class which creates a (random) data-generating process for a design matrix X and a response y. If the parameters are not specified, they will be randomly generated during the sample_data method.

Parameters
mu : np.ndarray

(p,)-shaped mean vector for X data

Sigma : np.ndarray

(p, p)-shaped covariance matrix for X data

invSigma : np.ndarray

(p, p)-shaped precision matrix for X data

beta : np.ndarray

Coefficients used to generate y from X in a single index or sparse additive model.

gibbs_graph : np.ndarray

(p, p)-shaped matrix of coefficients for the Gibbs grid method.

Attributes
mu : np.ndarray

See above

Sigma : np.ndarray

See above

invSigma : np.ndarray

See above

beta : np.ndarray

See above

gibbs_graph : np.ndarray

See above

X : np.ndarray

(n, p)-shaped design matrix

y : np.ndarray

(n, )-shaped response vector

Methods

sample_data([p, n, x_dist, method, y_dist, …])

(Possibly) generates random data-generating parameters and then samples data using those parameters.

sample_data(p=100, n=50, x_dist='gaussian', method='AR1', y_dist='gaussian', cond_mean='linear', coeff_size=1, coeff_dist=None, sparsity=0.5, groups=None, sign_prob=0.5, iid_signs=True, corr_signals=False, df_t=3, **kwargs)

(Possibly) generates random data-generating parameters and then samples data using those parameters. By default, (X, y) are jointly Gaussian and y is a linear response to X.

Parameters
n : int

The number of data points

p : int

The dimensionality of the data

x_dist : str

Specifies the distribution of X. One of “gaussian”, “blockt”, “ar1t”, or “gibbs”.

method : str

How to generate the covariance matrix of X. One of “AR1”, “NestedAR1”, “partialcorr”, “factor”, “blockequi”, “ver”, “qer”, “dirichlet”, or “uniformdot”. See the docs for each of these methods.

y_dist : str

Specifies the distribution of y. One of “gaussian” or “binomial”.

cond_mean : str

How to calculate the conditional mean of y given X. Defaults to “linear”. See knockpy.dgp.sample_response for more options.

coeff_size : float

Size of non-zero coefficients

coeff_dist : str

Specifies the distribution of nonzero coefficients. Three options:

  - None: all coefficients have absolute value coeff_size.

  - normal: all nonzero coefficients are drawn from Normal(coeff_size, 1).

  - uniform: nonzero coefficients are drawn from Unif(coeff_size/2, coeff_size).

sparsity : float

Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.

groups : np.ndarray

A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. If provided, floor(sparsity * num_groups) groups will be chosen to be non-null, and all elements of each chosen group are non-null.

sign_prob : float

The probability that each nonzero coefficient will be positive.

iid_signs : bool

If True, the signs of the coefficients are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.

corr_signals : bool

If True, all of the nonzero coefficients will lie in a consecutive block.

df_t : float

The degrees of freedom, used when the X variables are marginally t-distributed.

kwargs : dict

Keyword arguments to pass to the method for generating the covariance matrix.

knockpy.dgp.DirichletCorr(p=100, temp=1, tol=1e-06)

Generates a correlation matrix by sampling p eigenvalues from a Dirichlet distribution whose p parameters are i.i.d. Uniform[tol, temp], then generating a random covariance matrix with those eigenvalues.

knockpy.dgp.ErdosRenyi(p=300, delta=0.2, lower=0.1, upper=1, tol=0.1)

Randomly samples Bernoulli flags as well as values for partial correlations to generate sparse square matrices. Follows https://arxiv.org/pdf/1908.11611.pdf.

knockpy.dgp.FactorModel(p=500, rank=2)

Generates a random correlation matrix from a factor model with dimension p and the given rank.

knockpy.dgp.NestedAR1(p=500, a=7, b=1, tol=0.001, num_nests=5, nest_size=2)

Generates correlation matrix for AR(1) Gaussian process with hierarchical correlation structure.

knockpy.dgp.PartialCorr(p=300, rho=0.3)

Creates a correlation matrix of dimension p with partial correlation rho.

knockpy.dgp.UniformDot(d=100, p=100, tol=0.01)

Let U be a random d x p matrix with i.i.d. uniform entries. Then Sigma = cov2corr(U^T U).

knockpy.dgp.Wishart(d=100, p=100, tol=0.01)

Let W be a random d x p matrix with i.i.d. Gaussian entries. Then Sigma = cov2corr(W^T W).
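
The construction above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than knockpy's code: cov2corr_sketch is a hypothetical stand-in for cov2corr, and the role of the tol argument is not covered here.

```python
import numpy as np

def cov2corr_sketch(M):
    """Rescale a covariance matrix so that its diagonal equals one."""
    scale = np.sqrt(np.diag(M))
    return M / np.outer(scale, scale)

rng = np.random.default_rng(0)
d, p = 100, 5

# d x p matrix with i.i.d. standard Gaussian entries.
W = rng.standard_normal((d, p))

# Gram matrix W^T W, rescaled into a correlation matrix.
Sigma = cov2corr_sketch(W.T @ W)
```

Replacing the Gaussian draw with rng.uniform(size=(d, p)) gives the analogous sketch for the UniformDot construction.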

knockpy.dgp.block_equi_graph(n=3000, p=1000, block_size=5, sparsity=0.1, rho=0.5, gamma=0, coeff_size=3.5, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, beta=None, mu=None, **kwargs)

Samples data according to a block-equicorrelated Gaussian design, where rho is the within-block correlation and gamma * rho is the between-block correlation.

Parameters
n : int

The number of data points

p : int

The dimensionality of the data

block_size : int

The size of blocks. Defaults to 5.

sparsity : float

The proportion of groups which are non-null. Defaults to 0.1.

rho : float

The within-group correlation

gamma : float

The between-group correlation is gamma * rho

beta : np.ndarray

If supplied, the (p,)-shaped set of coefficients. Else, will be generated by calling knockpy.dgp.create_sparse_coefficients.

mu : np.ndarray

The p-dimensional mean of the covariates. Defaults to 0.

kwargs : dict

Args passed to knockpy.dgp.sample_response.

Notes

This defaults to the same data-generating process as Dai and Barber 2016 (see https://arxiv.org/abs/1602.03589).
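
To make the covariance structure concrete, here is a small sketch of a block-equicorrelated matrix with unit variances. It is only an illustration: the function name block_equi_cov_sketch is hypothetical, and it assumes that blocks consist of consecutive variables.

```python
import numpy as np

def block_equi_cov_sketch(p, block_size=5, rho=0.5, gamma=0.0):
    """Block-equicorrelated covariance matrix with unit variances.

    Hypothetical helper; assumes blocks are made of consecutive variables.
    """
    # Block label of each variable: 0, 0, ..., 1, 1, ...
    blocks = np.arange(p) // block_size
    same_block = blocks[:, None] == blocks[None, :]
    # rho within a block, gamma * rho between blocks.
    Sigma = np.where(same_block, rho, gamma * rho)
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

Sigma = block_equi_cov_sketch(p=6, block_size=3, rho=0.5, gamma=0.2)
```

With the default gamma=0 the blocks are mutually uncorrelated, so the matrix is block-diagonal.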

knockpy.dgp.construct_gibbs_grid(n, p, temp=1)

Creates gridlike gibbs_graph parameter for sample_gibbs. See sample_gibbs.

knockpy.dgp.coords2num(l, w, gridwidth=10)

Takes the coordinates of a variable in a Gibbs grid and returns its position.
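
One plausible way to map between a position and grid coordinates, together with the inverse mapping performed by num2coords, is sketched here. The convention that the length coordinate varies fastest is an assumption made for illustration, and both function names are hypothetical; the library's actual ordering may differ.

```python
def num2coords_sketch(i, gridwidth=10):
    """Position -> (length_coord, width_coord) under the assumed ordering."""
    return i % gridwidth, i // gridwidth

def coords2num_sketch(l, w, gridwidth=10):
    """(length_coord, width_coord) -> position; inverse of num2coords_sketch."""
    return w * gridwidth + l

# Round trip for one variable on a grid of width 10.
l, w = num2coords_sketch(23)
position = coords2num_sketch(l, w)
```

Whatever the ordering, the two functions should be inverses of each other for a fixed gridwidth.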

knockpy.dgp.cov2blocks(V, tol=1e-05)

Decomposes a PREORDERED block-diagonal matrix V into its blocks.

knockpy.dgp.create_correlation_tree(corr_matrix, method='average')

Creates a hierarchical clustering (correlation tree) from a correlation matrix.

Parameters
corr_matrix : np.ndarray

(p, p)-shaped correlation matrix

method : str

The method of hierarchical clustering: ‘single’, ‘average’, ‘fro’, or ‘complete’.

Returns
link : np.ndarray

The link of the correlation tree, as in scipy.

knockpy.dgp.create_sparse_coefficients(p, sparsity=0.5, groups=None, coeff_size=1, coeff_dist=None, sign_prob=0.5, iid_signs=True, corr_signals=False, n=None)

Generate a set of sparse coefficients for single index or sparse additive models.

Parameters
p : int

Dimensionality of coefficients

sparsity : float

Proportion of non-null coefficients. Generates np.floor(p*sparsity) non-nulls.

coeff_size : float

Size of non-zero coefficients

groups : np.ndarray

A p-length array of integers from 1 to num_groups such that groups[j] == i indicates that variable j is a member of group i. Defaults to None. If provided, floor(sparsity * num_groups) groups will be chosen to be non-null, and all elements of each chosen group are non-null.

sign_prob : float

The probability that each nonzero coefficient will be positive.

iid_signs : bool

If True, the signs of the coefficients are assigned independently. Else, exactly sign_prob*sparsity*p coefficients will be positive.

coeff_dist : str

Specifies the distribution of nonzero coefficients. Three options:

  - None: all coefficients have absolute value coeff_size.

  - normal: all nonzero coefficients are drawn from Normal(coeff_size, 1).

  - uniform: nonzero coefficients are drawn from Unif(coeff_size/2, coeff_size).

corr_signals : bool

If True, all of the nonzero coefficients will lie in a consecutive block.

Returns
beta : np.ndarray

(p,)-shaped array of sparse coefficients
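
For intuition, the simplest configuration (no groups, coeff_dist=None, independent signs) might look roughly like the following sketch. The function name and the seed argument are hypothetical, and the sketch does not attempt to reproduce the other options or the library's exact sampling order.

```python
import numpy as np

def sparse_coefficients_sketch(p, sparsity=0.5, coeff_size=1.0, sign_prob=0.5, seed=0):
    """Sparse coefficient vector with floor(p * sparsity) non-nulls.

    Hypothetical helper covering only groups=None, coeff_dist=None, iid_signs=True.
    """
    rng = np.random.default_rng(seed)
    num_nonnull = int(np.floor(p * sparsity))
    beta = np.zeros(p)
    # Choose which coordinates are non-null, without replacement.
    nonnull_inds = rng.choice(p, size=num_nonnull, replace=False)
    # Each non-null coefficient is positive with probability sign_prob.
    signs = np.where(rng.random(num_nonnull) < sign_prob, 1.0, -1.0)
    beta[nonnull_inds] = coeff_size * signs
    return beta

beta = sparse_coefficients_sketch(p=11, sparsity=0.5, coeff_size=2.0)
```

Here p=11 and sparsity=0.5 give floor(5.5) = 5 non-null coefficients, each with absolute value 2.0.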

knockpy.dgp.graph2cliques(Q)

Turns a graph Q of connections into binary cliques for a Gibbs grid.

knockpy.dgp.num2coords(i, gridwidth=10)

Returns the coordinates of variable i in a Gibbs grid.

Parameters
i : int

Position of variable in ordering

gridwidth : int

Width of the grid

Returns
length_coord, width_coord (coordinates)

knockpy.dgp.sample_ar1t(rhos, n=50, df_t=3)

Samples t-distributed variables according to a Markov chain.

Parameters
rhos : np.ndarray

(p-1, )-length array of correlations between consecutive variables.

n : int

The number of data points to sample

df_t : float

The degrees of freedom of the t variables

Returns
X : np.ndarray

(n, p)-shaped design matrix

knockpy.dgp.sample_block_tmvn(blocks, n=50, df_t=3)

Samples blocks of multivariate-t distributed variables according to a list of covariance matrices called blocks.

Parameters
blocks : list

A list of square, symmetric numpy arrays. These are the covariance matrices for each block of variables.

n : int

The number of data points to sample

df_t : float

The degrees of freedom of the t-distribution

Notes

The variables are scaled such that the marginal variance of each variable equals a diagonal element in one of the blocks.

knockpy.dgp.sample_gibbs(n, p, gibbs_graph=None, temp=1, num_iter=15, K=20, max_val=2.5)

Samples from a discrete Gibbs measure using a Gibbs sampler.

The joint likelihood for the p-dimensional data X1 through Xp is the product of all terms of the following form:

`np.exp(-1*gibbs_graph[i,j]*np.abs(Xi - Xj))`

where gibbs_graph is assumed to be symmetric (and is usually sparse). Each feature takes values on an evenly spaced grid from -1*max_val to max_val with K values.
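
The likelihood and the value grid described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the likelihood is left unnormalized, and the sketch assumes each unordered pair (i, j) contributes exactly one term to the product.

```python
import numpy as np

def gibbs_log_potential_sketch(x, gibbs_graph):
    """Unnormalized log-likelihood of a single observation x.

    Sums -gibbs_graph[i, j] * |x_i - x_j| over unordered pairs i < j (assumed).
    """
    x = np.asarray(x, dtype=float)
    abs_diffs = np.abs(x[:, None] - x[None, :])
    # The upper triangle keeps each unordered pair exactly once.
    return -np.sum(np.triu(gibbs_graph * abs_diffs, k=1))

def value_grid_sketch(K=20, max_val=2.5):
    """Evenly spaced grid of K values from -max_val to max_val."""
    return np.linspace(-max_val, max_val, K)

# Chain graph on three variables with interaction strengths 1 and 2.
graph = np.zeros((3, 3))
graph[0, 1] = graph[1, 0] = 1.0
graph[1, 2] = graph[2, 1] = 2.0
log_potential = gibbs_log_potential_sketch([0.0, 1.0, 3.0], graph)
grid = value_grid_sketch()
```

Exponentiating the returned value gives the product of the np.exp terms for that observation.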

Parameters
n : int

How many data points to sample

p : int

Dimensionality of the data

gibbs_graph : np.ndarray

A symmetric (p, p)-shaped array. See the likelihood equation. By default, this corresponds to an undirected graphical model with a square grid (like an Ising model) with nonzero entries set to 1 or -1 with equal probability.

temp : float

Governs the strength of interactions between features; see the likelihood equation.

num_iter : int

Number of iterations in the Gibbs sampler; defaults to 15.

K : int

Number of discrete values each sampled feature can take.

max_val : float

The maximum absolute value each feature can take.

Returns
X : np.ndarray

(n, p)-shaped array of data

gibbs_graph : np.ndarray

The generated gibbs_graph.

knockpy.dgp.sample_response(X, beta, cond_mean='linear', y_dist='gaussian')

Given a design matrix X and coefficients beta, samples a response y.

Parameters
X : np.ndarray

(n, p)-shaped design matrix

beta : np.ndarray

(p, )-shaped coefficient vector

cond_mean : str

How to calculate the conditional mean of y given X, denoted mu(X). Six options:

  1. “linear” denotes np.dot(X, beta)

  2. “cubic” denotes np.dot(X**3, beta) - np.dot(X, beta)

  3. “trunclinear” denotes ((X * beta >= 1).sum(axis = 1)). Stands for truncated linear.

  4. “pairint” pairs up non-null coefficients according to the order of beta, multiplies them and their beta values, then sums. Stands for pairwise interactions.

  5. “cos” denotes mu(X) = sign(beta) * (beta != 0) * np.cos(X)

  6. “quadratic” denotes mu(X) = np.dot(np.power(X, 2), beta)

y_dist : str

If “gaussian”, y is the conditional mean plus Gaussian noise. If “binomial”, Pr(y=1) = softmax(cond_mean).

Returns
y : np.ndarray

(n,)-shaped response vector
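
Three of the conditional means listed above translate directly into NumPy, as in this sketch. It is not the library's implementation: cond_mean_sketch is a hypothetical name, the remaining options are omitted, and the unit noise variance used for the Gaussian response is an assumption.

```python
import numpy as np

def cond_mean_sketch(X, beta, cond_mean="linear"):
    """Conditional mean mu(X) for three of the documented options."""
    if cond_mean == "linear":
        return np.dot(X, beta)
    if cond_mean == "cubic":
        return np.dot(X**3, beta) - np.dot(X, beta)
    if cond_mean == "quadratic":
        return np.dot(np.power(X, 2), beta)
    raise ValueError(f"cond_mean={cond_mean!r} is not covered by this sketch")

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
beta = np.array([1.0, 0.0, -2.0])

# Gaussian response: conditional mean plus noise (unit variance is assumed).
mu = cond_mean_sketch(X, beta, cond_mean="cubic")
y = mu + rng.standard_normal(X.shape[0])
```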

knockpy.utilities.apply_pool(func, constant_inputs={}, num_processes=1, **kwargs)

Spawns num_processes processes to apply func to many different arguments. This wraps the multiprocessing.pool object plus the functools partial function.

Parameters
func : function

An arbitrary function

constant_inputs : dictionary

A dictionary of arguments to func which do not change in each of the processes spawned, defaults to {}.

num_processes : int

The maximum number of processes spawned, defaults to 1.

kwargs : dict

Each key should correspond to an argument to func and should map to a list of different arguments.

Returns
outputs : list

List of outputs for each input, in the order of the inputs.
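
The way the keyword lists might be combined with the constant inputs can be sketched serially, without spawning any processes. The name apply_serial_sketch is hypothetical, and the assumption that the i-th call receives the i-th element of every list is an interpretation of the description above, not something the source states outright.

```python
from functools import partial

def apply_serial_sketch(func, constant_inputs=None, **kwargs):
    """Serial stand-in: call func once per position in the kwargs lists."""
    constant_inputs = constant_inputs or {}
    # Bind the arguments that stay the same for every call.
    partial_func = partial(func, **constant_inputs)
    keys = list(kwargs)
    # Assumed semantics: the i-th call uses the i-th element of each list.
    per_call_values = zip(*(kwargs[key] for key in keys))
    return [partial_func(**dict(zip(keys, values))) for values in per_call_values]

def add(a, b, c):
    return a + b + c

outputs = apply_serial_sketch(add, constant_inputs={"c": 10}, a=[1, 2], b=[3, 4])
```

A parallel version would hand the same per-call argument dictionaries to a multiprocessing pool instead of the list comprehension.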

knockpy.utilities.blockdiag_to_blocks(M, groups)

Given a square array M, returns a list of diagonal blocks of M as specified by groups.

Returns
blocks : list

A list of square np.ndarrays. blocks[i] corresponds to group identified by the ith smallest unique value of groups.
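
A minimal sketch of this extraction, assuming only the behavior described above (one square block per unique group value, in increasing order), could look like this. The function name is hypothetical.

```python
import numpy as np

def blockdiag_to_blocks_sketch(M, groups):
    """Return the diagonal blocks of M, one per unique value of groups."""
    groups = np.asarray(groups)
    blocks = []
    # np.unique returns the unique group labels in sorted order.
    for label in np.unique(groups):
        idx = np.flatnonzero(groups == label)
        # np.ix_ selects the rows and columns belonging to this group.
        blocks.append(M[np.ix_(idx, idx)])
    return blocks

M = np.arange(16).reshape(4, 4)
blocks = blockdiag_to_blocks_sketch(M, groups=[1, 2, 1, 2])
```

Note that the groups need not be contiguous: here variables 0 and 2 form the first block.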

knockpy.utilities.calc_group_sizes(groups)

Given a list of groups, finds the sizes of the groups.

Parameters
groups : np.ndarray

(p, )-shaped array which takes m integer values from 1 to m. If groups[i] == j, this indicates that coordinate i belongs to group j.

Returns
sizes : np.ndarray

(m, )-length array of group sizes.

knockpy.utilities.chol2inv(X)

Uses a Cholesky decomposition to compute the inverse of a matrix.
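
The general technique can be sketched as follows for a symmetric positive-definite matrix. This illustrates the idea rather than the library's code, and the name chol2inv_sketch is hypothetical.

```python
import numpy as np

def chol2inv_sketch(X):
    """Invert a symmetric positive-definite matrix via its Cholesky factor."""
    # X = L @ L.T with L lower triangular.
    L = np.linalg.cholesky(X)
    # Solve L @ L_inv = I to obtain the inverse of the factor.
    L_inv = np.linalg.solve(L, np.eye(X.shape[0]))
    # X^{-1} = L^{-T} @ L^{-1}.
    return L_inv.T @ L_inv

X = np.array([[4.0, 2.0], [2.0, 3.0]])
X_inv = chol2inv_sketch(X)
```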

knockpy.utilities.cov2corr(M)

Rescales a p x p covariance matrix M to be a correlation matrix.

knockpy.utilities.estimate_covariance(X, tol=0.0001, shrinkage='ledoitwolf')

Estimates covariance matrix of X.

Parameters
X : np.ndarray

(n, p)-shaped design matrix

shrinkage : str

The type of shrinkage to apply during estimation. One of “ledoitwolf”, “graphicallasso”, or None (no shrinkage).

tol : float

If shrinkage is None but the minimum eigenvalue of the MLE is below tol, apply LedoitWolf shrinkage anyway.

Returns
Sigma : np.ndarray

(p, p)-shaped estimated covariance matrix of X

invSigma : np.ndarray

(p, p)-shaped estimated precision matrix of X

knockpy.utilities.fetch_group_nonnulls(non_nulls, groups)

Combines feature-level null hypotheses into group-level hypotheses.

knockpy.utilities.permute_matrix_by_groups(groups)

Create indices which permute a (covariance) matrix according to a list of groups.

knockpy.utilities.preprocess_groups(groups)

Maps the m unique elements of a 1D “groups” array to the integers from 1 to m.
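
One way to perform such a relabeling is sketched here with np.unique. The function name is hypothetical, and the choice to number groups by the sorted order of their original labels is an assumption; the library may assign the integers differently.

```python
import numpy as np

def preprocess_groups_sketch(groups):
    """Relabel the m unique values of a 1D array as the integers 1..m."""
    # return_inverse gives, for each element, the index of its unique value.
    _, inverse = np.unique(np.asarray(groups), return_inverse=True)
    return inverse + 1

new_groups = preprocess_groups_sketch([10, 3, 3, 7])
```

Here the labels 3, 7, and 10 become groups 1, 2, and 3 respectively.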

knockpy.utilities.random_permutation_inds(length)

Returns indices which randomly permute a numpy array of the given length, along with indices which undo this permutation.

Returns
inds : np.ndarray

(length,)-shaped ndarray containing a random permutation of the integers from 0 to length-1.

rev_inds : np.ndarray

(length,)-shaped ndarray such that for any (length,)-shaped array called x, x[inds][rev_inds] equals x.
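
The relationship between the two index arrays can be illustrated with a short sketch. The function name and the seed argument are hypothetical; the key fact is that the inverse of a permutation is its argsort.

```python
import numpy as np

def random_permutation_inds_sketch(length, seed=0):
    """Random permutation indices and the indices that undo them."""
    rng = np.random.default_rng(seed)
    inds = rng.permutation(length)
    # argsort of a permutation is its inverse permutation.
    rev_inds = np.argsort(inds)
    return inds, rev_inds

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
inds, rev_inds = random_permutation_inds_sketch(len(x))
restored = x[inds][rev_inds]
```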

knockpy.utilities.scale_until_PSD(Sigma, S, tol, num_iter)

Perform a binary search to find the largest gamma such that the minimum eigenvalue of 2*Sigma - gamma*S is at least tol.

Returns
gamma * S : np.ndarray

See description.

gamma : float

See description.
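
A binary search of this kind might be sketched as follows. The sketch is hedged on several assumptions that the description does not state: gamma is searched over the interval [0, 1], gamma = 0 is feasible, and S is positive semidefinite so that feasibility is monotone in gamma. The function name is hypothetical.

```python
import numpy as np

def scale_until_psd_sketch(Sigma, S, tol=1e-5, num_iter=30):
    """Binary search for the largest feasible gamma in [0, 1] (assumed range)."""
    lower, upper = 0.0, 1.0
    for _ in range(num_iter):
        gamma = (lower + upper) / 2
        # eigvalsh returns the eigenvalues of a symmetric matrix.
        min_eig = np.linalg.eigvalsh(2 * Sigma - gamma * S).min()
        if min_eig >= tol:
            lower = gamma  # feasible: try a larger gamma
        else:
            upper = gamma  # infeasible: shrink gamma
    # lower is the largest gamma verified to be feasible.
    return lower * S, lower

Sigma = np.eye(2)
S = 4 * np.eye(2)
scaled_S, gamma = scale_until_psd_sketch(Sigma, S)
```

In this toy case 2*Sigma - gamma*S equals (2 - 4*gamma) times the identity, so the search converges to a gamma just under 0.5.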

knockpy.utilities.shift_until_PSD(M, tol)

Adds the identity until the p x p matrix M has eigenvalues of at least tol.