sobolev_alignment package

Submodules

sobolev_alignment.SobolevAlignment module

Sobolev Alignment implementation.

Main class for Sobolev Alignment, which wraps the different operations of the Sobolev Alignment procedure:

- Model selection (scVI and KRR).
- scVI model training.
- Synthetic sample generation.
- KRR approximation.
- Alignment of KRR models.
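
A minimal end-to-end sketch of this workflow is given below. The input files are placeholders and the default constructor settings are assumptions for illustration, assuming a scikit-learn-like fit entry point:

    import anndata as ad
    from sobolev_alignment import SobolevAlignment

    # Placeholder input files (assumptions for illustration).
    X_source = ad.read_h5ad("source_cell_lines.h5ad")
    X_target = ad.read_h5ad("target_tumors.h5ad")

    # Default settings assumed; the class wraps model selection, scVI training,
    # synthetic sample generation, KRR approximation and KRR model alignment.
    clf = SobolevAlignment()
    clf.fit(X_source=X_source, X_target=X_target)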

sobolev_alignment.feature_analysis module

Feature analysis.

@author: Soufiane Mourragui

This module contains the code used in the Taylor expansion of the Gaussian/Matérn kernel.

sobolev_alignment.feature_analysis.basis(x, k, gamma)

Compute the basis function for a single gene, excluding the offset term.

Parameters:
x: np.array

Column vector (each row corresponds to a sample).

k: int

Order to compute.

gamma: float

Parameter of the Matérn kernel.

Returns:
np.array

Value of the higher order feature.
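
For the Gaussian kernel, $k(x, y) = e^{-\gamma (x - y)^2}$ factorises as $e^{-\gamma x^2} e^{-\gamma y^2} \sum_k \frac{(2\gamma)^k}{k!} x^k y^k$, so the order-$k$ basis function (offset excluded) is $\sqrt{(2\gamma)^k / k!}\, x^k$. A minimal NumPy sketch under this assumption, not the package implementation itself:

    import numpy as np
    from scipy.special import factorial

    def basis_sketch(x: np.ndarray, k: int, gamma: float) -> np.ndarray:
        # Order-k Taylor feature of the Gaussian kernel; the offset term
        # e^{-gamma x^2} is excluded (sketch, assumed expansion).
        return np.sqrt((2 * gamma) ** k / factorial(k)) * x ** k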

sobolev_alignment.feature_analysis.combinatorial_product(x, idx, gamma)

Compute the product of basis functions over a combination of genes, excluding the offset term.

Parameters:
x: np.array

Data matrix with samples in the rows and genes in the columns.

idx: tuple

Combination, i.e. tuple of feature indices to take into account.

gamma: float

Parameter of the Matérn kernel.

Returns:
scipy.sparse.csc_matrix

Values of the higher order feature.
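
Under the same expansion, a higher-order feature for a combination of genes is the product of per-gene basis functions. A dense sketch reusing basis_sketch from the previous sketch (the package returns a sparse matrix instead):

    from collections import Counter

    import numpy as np

    def combinatorial_product_sketch(x: np.ndarray, idx: tuple, gamma: float) -> np.ndarray:
        # One basis function per distinct gene in idx, with order equal to
        # the gene's multiplicity; basis_sketch as defined above.
        return np.prod(
            [basis_sketch(x[:, i], k, gamma) for i, k in Counter(idx).items()],
            axis=0,
        )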

sobolev_alignment.feature_analysis.higher_order_contribution(d: int, data: array, sample_offset: array, gene_names: list, gamma: float, n_jobs: int = 1, return_matrix: bool = False)

Compute the features corresponding to the Taylor expansion of the kernel.

Compute the features corresponding to the Taylor expansion of the kernel, i.e. $x_j \exp(-\gamma x x^T)$ for linear features. Returns a sparse pandas DataFrame containing all the features (columns) by samples (rows). We here critically rely on the sparsity of the data matrix to speed up computations. The current implementation is relevant in two cases:

- when dimensionality is small;
- when data is sparse.

High-dimensional and dense data matrices would lead to significant overhead without computational gains, and could benefit from another implementation strategy.

Parameters:
d: int

Order of the features to compute, e.g. 1 for linear, 2 for interaction terms.

data: np.array

Data to compute features on, samples in the rows and genes (features) in the columns.

sample_offset: np.array

Offset term of each sample in data.

gene_names: list

Names of the columns in data; used to name the features.

gamma: float

Value of the gamma parameter for the Matérn kernel.

n_jobs: int, default to 1

Number of concurrent threads to use. -1 will use all available CPU cores. WARNING: for d >= 2 and a large number of genes, the routine can be memory-intensive and a high n_jobs can lead to a crash.

return_matrix: bool, default to False

If True, returns only the feature matrix, without feature names. When feature names are not needed (e.g. when computing the proportion of non-linearities), return_matrix=True can speed up the process.

Returns:
pd.DataFrame

Sparse dataframe with samples in the rows and named features in the columns. For instance, when d=1, returns each column of data scaled by the RKHS normalisation factor and multiplied by the offset value.
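
A hypothetical call on random sparse data; the sample_offset computation shown here (the $e^{-\gamma x x^T}$ factor of each sample) is an assumption inferred from the basis-function documentation:

    import numpy as np
    import scipy.sparse
    from sobolev_alignment.feature_analysis import higher_order_contribution

    data = scipy.sparse.random(100, 20, density=0.1, format="csr", random_state=0)
    gamma = 0.05
    gene_names = [f"gene_{i}" for i in range(20)]
    # Assumed offset: e^{-gamma ||x||^2} for each sample (row).
    sample_offset = np.exp(-gamma * np.asarray(data.power(2).sum(axis=1)).ravel())

    df = higher_order_contribution(
        d=2,                      # interaction terms
        data=data.toarray(),
        sample_offset=sample_offset,
        gene_names=gene_names,
        gamma=gamma,
        n_jobs=1,
    )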

sobolev_alignment.generate_artificial_sample module

Generate artificial samples.

@author: Soufiane Mourragui

Generate samples using the scVI decoder from multivariate Gaussian noise. This module generates the training data used to approximate the VAE encoding functions by Matérn kernel machines.

sobolev_alignment.generate_artificial_sample.generate_samples(sample_size: int, batch_names: list, covariates_values: list, lib_size: dict, model: SCVI, batch_key_dict: dict, return_dist: bool = False)

Generate artificial gene expression profiles.

Note to developers: this method has been designed to be used with scvi-tools classes. Other VAE implementations may break here.

Parameters:
sample_size: int

Number of samples to generate.

batch_names: list or np.ndarray, default to None

List or array with sample_size str values indicating the batch of each sample.

covariates_values: list or np.ndarray, default to None

List or array with sample_size float values indicating the covariate values of each sample to generate (as for training scVI model).

lib_size

Dictionary of mean library size per batch.

model

scVI model whose decoder is used to generate samples.

batch_key_dict

Dictionary linking batch values to the batch key used in scVI.

return_dist: bool, default to False

Whether to return the distribution parameters (True) or samples from this distribution (False).

Returns:
If return_dist is False, torch.Tensor (on CPU) with artificial samples in the rows.
If return_dist is True, one torch.Tensor with distribution parameters (following scVI order) and one torch.Tensor with artificial samples in the rows (on CPU).

sobolev_alignment.generate_artificial_sample.parallel_generate_samples(sample_size, batch_names, covariates_values, lib_size, model, batch_key_dict: Optional[dict] = None, return_dist: bool = False, batch_size=1000, n_jobs=1)

Generate artificial gene expression profiles.

Parallelization wrapper of generate_samples, running several threads in parallel. Note to developers: this function needs to be changed if applied to VAE models other than scVI.

Parameters:
sample_size: int

Number of samples to generate.

batch_names: list or np.ndarray, default to None

List or array with sample_size str values indicating the batch of each sample.

covariates_values: list or np.ndarray, default to None

List or array with sample_size float values indicating the covariate values of each sample to generate (as for training scVI model).

lib_size

Dictionary of mean library size per batch.

model

scVI model whose decoder is used to generate samples.

batch_key_dict

Dictionary linking batch values to the batch key used in scVI.

return_dist: bool, default to False

Whether to return the distribution parameters (True) or samples from this distribution (False).

batch_size: int, default to 10**3

Number of samples to generate per batch.

n_jobs: int, default to 1

Number of threads to launch. n_jobs=-1 will launch as many threads as there are CPUs available.

Returns:
If return_dist is False, torch.Tensor (on CPU) with artificial samples in the rows.
If return_dist is True, one torch.Tensor with distribution parameters (following scVI order) and one torch.Tensor with artificial samples in the rows (on CPU).
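
A hypothetical call, assuming scvi_model is a trained scvi.model.SCVI instance fitted without batch or continuous covariates; the lib_size keys and values are assumptions:

    from sobolev_alignment.generate_artificial_sample import parallel_generate_samples

    # scvi_model: a trained scvi.model.SCVI instance (assumed available).
    artificial_samples = parallel_generate_samples(
        sample_size=10_000,
        batch_names=None,           # no batch covariate in this sketch
        covariates_values=None,     # no continuous covariates
        lib_size={None: 10_000.0},  # assumed: mean library size keyed by batch
        model=scvi_model,
        batch_size=1000,
        n_jobs=4,
    )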

sobolev_alignment.interpolated_features module

Compute interpolated features.

@author: Soufiane Mourragui

sobolev_alignment.interpolated_features.compute_optimal_tau(PV_number, pv_projections, principal_angles, n_interpolation=100)

Compute the optimal interpolation step for each PV (Grassmann interpolation).

sobolev_alignment.interpolated_features.project_on_interpolate_PV(angle, PV_number, tau_step, pv_projections)

Project data on interpolated PVs.
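
The docstrings above do not spell out the interpolation formula. A common choice, assumed here, is the spherical geodesic between matched source and target principal vectors, parameterised by the step $\tau \in [0, 1]$:

    import numpy as np

    def interpolate_pv_sketch(u_src: np.ndarray, u_tgt: np.ndarray, tau: float) -> np.ndarray:
        # Geodesic (slerp) interpolation between two matched unit PVs;
        # an assumption, not necessarily the package's exact formula.
        theta = np.arccos(np.clip(u_src @ u_tgt, -1.0, 1.0))  # principal angle
        if np.isclose(theta, 0.0):
            return u_src  # identical PVs: nothing to interpolate
        return (np.sin((1 - tau) * theta) * u_src
                + np.sin(tau * theta) * u_tgt) / np.sin(theta)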

sobolev_alignment.kernel_operations module

Kernel operations.

@author: Soufiane Mourragui

Custom scripts for specific matrix operations.

sobolev_alignment.kernel_operations.mat_inv_sqrt(M, threshold=1e-06)

Compute the inverse square root of a symmetric matrix M by SVD.
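
A minimal sketch of the documented operation, dropping singular values below the threshold to stabilise the inversion:

    import numpy as np

    def mat_inv_sqrt_sketch(M: np.ndarray, threshold: float = 1e-6) -> np.ndarray:
        # Inverse square root of a symmetric PSD matrix via SVD (sketch).
        U, s, Vt = np.linalg.svd(M)   # for symmetric PSD M, U equals Vt.T
        keep = s > threshold          # discard near-null directions
        return U[:, keep] @ np.diag(1.0 / np.sqrt(s[keep])) @ Vt[keep, :]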

sobolev_alignment.krr_approx module

Encoder approximation by Kernel Ridge Regression.

@author: Soufiane Mourragui

This module trains a Kernel Ridge Regression (KRR) on a pair of samples (x_hat) and embeddings (z_hat) using two possible implementations:

- scikit-learn: deterministic, but limited in memory and time efficiency.
- Falkon: stochastic Nyström approximation, faster in both memory and computation time, optimised for multiple GPUs.

References

Mourragui et al., 2022.
Meanti et al., Kernel methods through the roof: handling billions of points efficiently, NeurIPS, 2020.
Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 2011.

class sobolev_alignment.krr_approx.KRRApprox(method: str = 'sklearn', kernel: str = 'rbf', M: int = 100, kernel_params: Optional[dict] = None, penalization: float = 1e-06, maxiter: int = 20, falkon_options: Optional[dict] = None, mean_center: bool = False, unit_std: bool = False)

Bases: object

Kernel Ridge Regression approximation.

This class contains the functions used to approximate the encoding functions of a Variational Auto-Encoder (VAE) by kernel machines by means of Kernel Ridge Regression (KRR). This class takes training data as input and executes the learning process. The generation of artificial samples and the subsequent computation of embeddings are not part of this class.

Methods

anchors()

Return anchor points used in KRR.

fit(X, y)

Train a regression model (KRR) between X and all columns of y.

load()

Load a KRRApprox instance.

save([folder])

Save the instance.

transform(X)

Apply the trained KRR models to a given data.

anchors()

Return anchor points used in KRR.

fit(X: Tensor, y: Tensor)

Train a regression model (KRR) between X and all columns of y.

Parameters:
X: torch.Tensor

Tensor containing the artificial input (x_hat), with samples in the rows.

y: torch.Tensor

Tensor containing the artificial embedding (z_hat). Called y for compliance with sklearn functions.

Returns:
self: fitted KRRApprox instance.
load()

Load a KRRApprox instance.

Parameters:
folder: str, default to ‘.’

Folder path where the instance is located.

Returns:
KRRApprox: instance saved at the folder location.
save(folder: str = '.')

Save the instance.

Parameters:
folder: str, default to ‘.’

Folder path to use for saving the instance.

Returns:
True if the instance was properly saved.
transform(X: Tensor)

Apply the trained KRR models to a given data.

This corresponds to the out-of-sample extension.

Parameters:
X: torch.Tensor

Tensor containing gene expression profiles with samples in the rows. WARNING: genes (features) need to follow the same order as in the training data.

Returns:
torch.Tensor with predicted values for each of the encoding functions.
Samples are in the rows and encoding functions (embedding) in the columns.
default_kernel_params = {'falkon': {'gaussian': {'sigma': 1}, 'laplacian': {'sigma': 1}, 'matern': {'nu': 0.5, 'sigma': 1}, 'rbf': {'sigma': 1}}, 'sklearn': {'gaussian': {}, 'laplacian': {}, 'matern': {}, 'rbf': {}}}
falkon_kernel = {'gaussian': None, 'laplacian': None, 'matern': None, 'rbf': None}
sklearn_kernel = {'gaussian': 'wrapper', 'laplacian': 'wrapper', 'matern': <class 'sklearn.gaussian_process.kernels.Matern'>, 'rbf': 'wrapper'}
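
A minimal fit/transform sketch using the signature above; the tensors are random stand-ins for artificial samples (x_hat) and embeddings (z_hat), and the kernel_params content is an assumption:

    import torch
    from sobolev_alignment.krr_approx import KRRApprox

    x_hat = torch.randn(500, 50)   # stand-in for artificial expression profiles
    z_hat = torch.randn(500, 10)   # stand-in for the matching scVI embeddings

    krr = KRRApprox(
        method="sklearn",
        kernel="matern",
        kernel_params={"nu": 0.5},  # assumed keys, forwarded to the kernel
        penalization=1e-6,
    )
    krr.fit(x_hat, z_hat)
    z_pred = krr.transform(x_hat)   # out-of-sample extension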

sobolev_alignment.krr_model_selection module

Kernel Ridge Regression (KRR) model search.

@author: Soufiane Mourragui

Pipeline to perform model selection for the Kernel Ridge Regression (KRR) models, employing the protocol presented in the paper, i.e.:

- Selecting sigma as the value yielding an average Gaussian kernel value of 0.5 (a sketch is given below).
- Selecting the model with the lowest training error on input data (trained on artificial data).
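
A sketch of the first step; the bisection and the 0.5 target over pairwise kernel values are assumptions about how the criterion is evaluated, not the package's exact routine:

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances

    def select_sigma_sketch(X: np.ndarray, target: float = 0.5,
                            lo: float = 1e-3, hi: float = 1e3, iters: int = 60) -> float:
        # Bisect sigma so that mean exp(-||x - y||^2 / (2 sigma^2)) ~= target.
        sq_dists = euclidean_distances(X, squared=True)
        for _ in range(iters):
            sigma = np.sqrt(lo * hi)  # geometric midpoint
            if np.exp(-sq_dists / (2 * sigma**2)).mean() < target:
                lo = sigma            # kernel values too small: widen the kernel
            else:
                hi = sigma
        return sigma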

sobolev_alignment.krr_model_selection.model_alignment_penalization(X_data: AnnData, data_source: str, sobolev_alignment_clf, sigma: float, optimal_nu: float, M: int = 250)

Penalization selection given $\sigma$ and $\nu$.

Select the optimal penalization parameter given $\sigma$ and $\nu$ by aligning the data_source model to itself and measuring the principal angles. Intuitively, aligning the model to itself must yield principal angle cosines close to one; low values indicate over-fitting of the KRR.

Parameters:
X_data: AnnData

Dataset to employ.

data_source: str, ‘source’ or ‘target’

Name of the data stream in SobolevAlignment parameters.

sobolev_alignment_clf: SobolevAlignment

SobolevAlignment instance with scVI models trained. Used to find the optimal $\nu$ parameter for the KRR regression step.

sigma: float

$\sigma$ parameter in KRR.

optimal_nu: float

Value of $\nu$ (Falkon) to be used in the optimization. Can be established using model_selection_nu.

M: int, default to 250

Number of anchor points to use in the KRR approximation. A larger M typically improves the prediction, but at the cost of longer compute time and memory cost.

Returns:
DataFrame with principal angles between the same models.
sobolev_alignment.krr_model_selection.model_selection_nu(X_source: AnnData, X_target: AnnData, sobolev_alignment_clf, sigma: float, M: int = 250, test_error_size: int = -1)

Select the optimal $\nu$ parameter.

Select the optimal $\nu$ parameter (Matérn kernel) by measuring the Spearman correlation for different values of $\nu$ and penalization, and selecting the $\nu$ with the highest correlation.

Parameters:
X_source: AnnData

Source dataset.

X_target: AnnData

Target dataset.

sobolev_alignment_clf: SobolevAlignment

SobolevAlignment instance with scVI models trained. Used to find the optimal $\nu$ parameter for the KRR regression step.

sigma: float

$\sigma$ parameter in KRR.

M: int, default to 250

Number of anchor points to use in the KRR approximation. A larger M typically improves the prediction, but at the cost of longer compute time and memory cost.

test_error_size: int, default to -1

Number of input points to consider when computing the error. The input data (X_source and X_target) are not used to train the KRR (artificial points are) and act as a proxy for a validation set. Setting test_error_size=-1 leads to using the complete input data.

Returns:
DataFrame with Spearman correlation on source and target data for various hyper-parameter values.
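
A hypothetical composition of the two routines above, in pipeline order (select $\nu$ first, then the penalization); clf, sigma, X_source, X_target and the DataFrame post-processing are assumptions for illustration:

    from sobolev_alignment.krr_model_selection import (
        model_alignment_penalization,
        model_selection_nu,
    )

    # clf: SobolevAlignment instance with trained scVI models (assumed).
    nu_df = model_selection_nu(
        X_source=X_source, X_target=X_target,
        sobolev_alignment_clf=clf, sigma=sigma, M=250,
    )
    # Assumed post-processing: keep the nu with the highest Spearman correlation.
    optimal_nu = nu_df.mean(axis=1).idxmax()

    angles_df = model_alignment_penalization(
        X_data=X_source, data_source="source",
        sobolev_alignment_clf=clf, sigma=sigma,
        optimal_nu=optimal_nu, M=250,
    )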

sobolev_alignment.multi_krr_approx module

Multi KRR approximation.

@author: Soufiane Mourragui

Scripts supporting the naive integration of several KRR models. No gain is provided by such an approach.

class sobolev_alignment.multi_krr_approx.MultiKRRApprox

Bases: object

Multi Kernel Ridge Regression approximation.

This class contains a wrapper around KRRApprox to serialise the approximation of latent factors. Several experiments show that such an approach does not yield any advantage.

Methods

add_clf(clf)

Add a classifier.

anchors()

Return anchors.

predict(X)

Predict latent factor values given a tensor.

process_clfs()

Process the different classifiers.

transform(X)

Predict latent factor values given a tensor.

add_clf(clf)

Add a classifier.

anchors()

Return anchors.

predict(X: Tensor)

Predict latent factor values given a tensor.

process_clfs()

Process the different classifiers.

transform(X: Tensor)

Predict latent factor values given a tensor.

Module contents