sobolev_alignment package

Submodules

sobolev_alignment.data_normalisation module

sobolev_alignment.feature_analysis module

Feature analysis.

@author: Soufiane Mourragui

This module contains all the code used in the Taylor expansion of the Gaussian/Matérn kernel.

sobolev_alignment.feature_analysis.basis(x, k, gamma)

Compute the basis function for a single gene, except offset term.

Parameters:
x: np.array

Column vector (each row corresponds to a sample).

k: int

Order to compute.

gamma: float

Parameter of Matérn kernel.

Returns:
np.array

Value of the higher order feature.
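
As an illustration, the one-dimensional Taylor expansion behind such a basis can be sketched as below for a Gaussian kernel exp(-gamma (x - y)^2). The helper names taylor_basis and offset are hypothetical, and the normalisation sqrt((2 gamma)^k / k!) is the textbook one, assumed here rather than taken from the package source.

```python
import math

import numpy as np


def taylor_basis(x, k, gamma):
    """Order-k Taylor feature of a 1D Gaussian kernel, without the offset term (assumed form)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k


def offset(x, gamma):
    """Offset term exp(-gamma x^2), kept separate from the basis."""
    return np.exp(-gamma * np.asarray(x, dtype=float) ** 2)


gamma = 0.5
x = np.array([[0.3], [1.2]])   # column vector, one row per sample
y = np.array([[0.7], [-0.4]])

# Summing the products of features over the orders recovers the kernel value.
approx = offset(x, gamma) * offset(y, gamma) * sum(
    taylor_basis(x, k, gamma) * taylor_basis(y, k, gamma) for k in range(20)
)
exact = np.exp(-gamma * (x - y) ** 2)
```

Truncating the sum at a low order keeps only the linear or low-order interaction terms, which is what makes the individual features interpretable.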

sobolev_alignment.feature_analysis.combinatorial_product(x, idx, gamma)

Compute the higher-order feature for a combination of genes, except offset term.

Parameters:
x: np.array

Data matrix with samples in the rows and genes in the columns

idx: tuple

Combinations, i.e. tuple of features to take into account.

gamma: float

Parameter of Matérn kernel.

Returns:
scipy.sparse.csc_matrix

Values of the higher order feature.
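
A minimal sketch of what a product over a combination of genes can look like, assuming the textbook Taylor features of the Gaussian kernel. The function product_feature is hypothetical and returns a dense array, whereas the routine documented here returns a sparse matrix.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

import numpy as np


def product_feature(x, idx, gamma):
    """Product of per-gene Taylor features for the genes listed in idx (assumed form)."""
    x = np.asarray(x, dtype=float)
    out = np.ones(x.shape[0])
    # A gene repeated p times in idx contributes an order-p term.
    for gene, power in Counter(idx).items():
        out *= np.sqrt((2.0 * gamma) ** power / math.factorial(power)) * x[:, gene] ** power
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # 4 samples, 3 genes
gamma = 0.1

# All order-2 features: every multiset of two genes.
features = np.column_stack(
    [product_feature(x, idx, gamma) for idx in combinations_with_replacement(range(3), 2)]
)

# Their Gram matrix equals the order-2 term of the expansion of exp(2 gamma x.y).
gram = features @ features.T
expected = (2.0 * gamma * (x @ x.T)) ** 2 / math.factorial(2)
```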

sobolev_alignment.feature_analysis.higher_order_contribution(d: int, data: array, sample_offset: array, gene_names: list, gamma: float, n_jobs: int = 1, return_matrix: bool = False)

Compute the features corresponding to the Taylor expansion of the kernel.

Compute the features corresponding to the Taylor expansion of the kernel, i.e. $x_j \exp(-\gamma x x^T)$ for linear features. Returns a sparse pandas DataFrame containing all the features (columns) by samples (rows). The implementation relies critically on the sparsity of the data matrix to speed up computations. It is relevant in two cases:
- When dimensionality is small.
- When data is sparse.

High-dimensional and dense data matrices would lead to a significant overhead without computational gains, and could benefit from another implementation strategy.

Parameters:
d: int

Order of the features to compute, e.g. 1 for linear, 2 for interaction terms.

data: np.array

Data to compute features on, samples in the rows and genes (features) in the columns.

sample_offset: np.array

Offset of each sample from data.

gene_names: list

Names of each column in data; corresponds to feature naming.

gamma: float

Value of the gamma parameter for Matérn kernel.

n_jobs: int, default to 1

Number of concurrent threads to use. -1 will use all available CPU cores. WARNING: for d >= 2 and a large number of genes, the routine can be memory-intensive and a high n_jobs could lead to a crash.

return_matrix: bool, default to False

If True, returns only the feature matrix, without feature naming. When feature names are not relevant (e.g. computing the proportion of non-linearities), return_matrix=True can help speed up the process.

Returns:
pd.DataFrame

Sparse dataframe with samples in the rows and named features in the columns. For instance, when d=1, returns each column of data scaled by RKHS normalisation factor and multiplied by offset value.
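
The d=1 case described above can be sketched in plain NumPy as follows. The offset is assumed to be exp(-gamma ||x||^2) and the normalisation factor sqrt(2 gamma); both are assumptions based on the standard Gaussian-kernel expansion, and the gene names are hypothetical.

```python
import numpy as np

gamma = 0.05
gene_names = ["gene_a", "gene_b", "gene_c"]   # hypothetical names
data = np.array([[0.0, 2.0, 0.0],
                 [1.0, 0.0, 3.0]])            # samples in rows, genes in columns

# Per-sample offset (assumed form): exp(-gamma * squared norm of the sample).
sample_offset = np.exp(-gamma * np.sum(data ** 2, axis=1))

# Linear features: each column scaled by the normalisation factor and the offset.
linear_features = np.sqrt(2.0 * gamma) * data * sample_offset[:, None]

# Map feature names to columns, as a DataFrame would.
named = dict(zip(gene_names, linear_features.T))
```

Zero entries of the data stay zero in the linear features, which is why a sparse representation pays off for this order.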

sobolev_alignment.generate_artificial_sample module

Generate artificial samples.

@author: Soufiane Mourragui

Generate samples using the scVI decoder from multivariate Gaussian noise. This module generates the training data used to approximate the VAE encoding functions by Matérn kernel machines.
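
The idea can be sketched without scVI as below. Everything in this block is a stand-in: toy_decoder is a hypothetical fixed linear map followed by a softmax, not a trained scVI decoder, and a Poisson likelihood is assumed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_size, n_latent, n_genes = 5, 2, 4

# Stand-in for a trained decoder: a fixed random linear map (hypothetical).
decoder_weights = rng.normal(size=(n_latent, n_genes))


def toy_decoder(z):
    """Map latent points to gene proportions that sum to one per sample."""
    logits = z @ decoder_weights
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    expr = np.exp(logits)
    return expr / expr.sum(axis=1, keepdims=True)


# 1. Sample latent points from multivariate Gaussian noise.
z = rng.standard_normal(size=(sample_size, n_latent))

# 2. Decode and scale by a library size to obtain distribution parameters.
lib_size = 1000.0
rate = lib_size * toy_decoder(z)

# 3. Sample artificial expression profiles from the distribution.
counts = rng.poisson(rate)
```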

sobolev_alignment.generate_artificial_sample.generate_samples(sample_size: int, batch_names: list, covariates_values: list, lib_size: dict, model: SCVI, batch_key_dict: dict, return_dist: bool = False)

Generate artificial gene expression profiles.

Note to developers: this method has been designed to be used with scvi-tools classes. Other VAE implementations may break here.

Parameters:
sample_size: int

Number of samples to generate.

batch_names: list or np.ndarray, default to None

List or array with sample_size str values indicating the batch of each sample.

covariates_values: list or np.ndarray, default to None

List or array with sample_size float values indicating the covariate values of each sample to generate (as for training scVI model).

lib_size

Dictionary of mean library size per batch.

model

scVI model whose decoder is used to generate samples.

batch_key_dict

Dictionary linking the values of the batch (scVI) and the key used in scVI.

return_dist: bool, default to False

Whether to return the distribution parameters (True) or samples from this distribution (False).

Returns:
If return_dist is False, torch.Tensor (on CPU) with artificial samples in the rows.
If return_dist is True, torch.Tensor with distribution parameters (following scVI
order) and one torch.Tensor with artificial samples in the rows (CPU).
sobolev_alignment.generate_artificial_sample.parallel_generate_samples(sample_size, batch_names, covariates_values, lib_size, model, batch_key_dict: Optional[dict] = None, return_dist: bool = False, batch_size=1000, n_jobs=1)

Generate artificial gene expression profiles.

Parallelized wrapper around generate_samples, running several threads in parallel. Note to developers: this function needs to be changed if applied to a VAE model other than scVI.

Parameters:
sample_size: int

Number of samples to generate.

batch_names: list or np.ndarray, default to None

List or array with sample_size str values indicating the batch of each sample.

covariates_values: list or np.ndarray, default to None

List or array with sample_size float values indicating the covariate values of each sample to generate (as for training scVI model).

lib_size

Dictionary of mean library size per batch.

model

scVI model whose decoder is used to generate samples.

batch_key_dict

Dictionary linking the values of the batch (scVI) and the key used in scVI.

return_dist: bool, default to False

Whether to return the distribution parameters (True) or samples from this distribution (False).

batch_size: int, default to 10**3

Number of samples to generate per batch.

n_jobs: int, default to 1

Number of threads to launch. n_jobs=-1 will launch as many threads as there are CPUs available.

Returns:
If return_dist is False, torch.Tensor (on CPU) with artificial samples in the rows.
If return_dist is True, torch.Tensor with distribution parameters (following scVI
order) and one torch.Tensor with artificial samples in the rows (CPU).
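
The batching logic implied by batch_size and n_jobs can be sketched with the standard library as below. The helpers split_into_batches and generate_batch are hypothetical placeholders, and the package may rely on a different parallelisation backend.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(sample_size, batch_size):
    """Return the list of batch sizes needed to reach sample_size."""
    n_full, remainder = divmod(sample_size, batch_size)
    return [batch_size] * n_full + ([remainder] if remainder else [])


def generate_batch(n):
    """Placeholder for one call to a sample generator producing n samples."""
    return list(range(n))


sizes = split_into_batches(2500, 1000)

# Run the batches in parallel threads, then concatenate the results in order.
with ThreadPoolExecutor(max_workers=2) as pool:
    batches = list(pool.map(generate_batch, sizes))
samples = [s for batch in batches for s in batch]
```

Generating in batches bounds the memory used by each call, at the cost of one concatenation at the end.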

sobolev_alignment.interpolated_features module

Compute interpolated features.

@author: Soufiane Mourragui

sobolev_alignment.interpolated_features.compute_optimal_tau(PV_number, pv_projections, principal_angles, n_interpolation=100)

Compute the optimal interpolation step for each PV (Grassmann interpolation).

sobolev_alignment.interpolated_features.project_on_interpolate_PV(angle, PV_number, tau_step, pv_projections)

Project data on interpolated PVs.
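
For intuition, geodesic interpolation between a pair of unit vectors separated by a principal angle can be sketched as below. This is the generic spherical interpolation formula in a Euclidean setting; the function name interpolate is hypothetical, and the layout of pv_projections used by the package is not assumed here.

```python
import numpy as np


def interpolate(source_vec, target_vec, angle, tau):
    """Point at position tau in [0, 1] on the geodesic between two unit vectors."""
    return (
        np.sin((1.0 - tau) * angle) * source_vec + np.sin(tau * angle) * target_vec
    ) / np.sin(angle)


angle = np.pi / 3
source_pv = np.array([1.0, 0.0])
target_pv = np.array([np.cos(angle), np.sin(angle)])

# tau = 0 gives the source vector, tau = 1 the target vector.
midpoint = interpolate(source_pv, target_pv, angle, 0.5)
```

Because projecting is linear, the same weights can be applied directly to data projected on the source and target vectors.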

sobolev_alignment.kernel_operations module

Kernel operations.

@author: Soufiane Mourragui

Custom scripts for specific matrix operations.

sobolev_alignment.kernel_operations.mat_inv_sqrt(M, threshold=1e-06)

Compute the inverse square root of a symmetric matrix M by SVD.
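
A minimal NumPy sketch of an SVD-based inverse square root is given below. The function inv_sqrt_sketch is hypothetical, and discarding singular values below the threshold is an assumption about how the threshold is used.

```python
import numpy as np


def inv_sqrt_sketch(M, threshold=1e-6):
    """Inverse square root of a symmetric positive semi-definite matrix via SVD."""
    u, s, vt = np.linalg.svd(M)
    keep = s > threshold
    inv_sqrt_s = np.zeros_like(s)
    # Directions with a singular value below the threshold are dropped (assumption).
    inv_sqrt_s[keep] = 1.0 / np.sqrt(s[keep])
    return (u * inv_sqrt_s) @ vt


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
M = A @ A.T + np.eye(4)        # symmetric positive definite test matrix
R = inv_sqrt_sketch(M)
```

For a well-conditioned matrix, R @ M @ R is the identity up to numerical error.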

sobolev_alignment.krr_approx module

Encoder approximation by Kernel Ridge Regression.

@author: Soufiane Mourragui

This module trains a Kernel Ridge Regression (KRR) on a pair of samples (x_hat) and embeddings (z_hat) using two possible implementations:
- scikit-learn: deterministic, but limited in memory and time efficiency.
- Falkon: stochastic Nyström approximation, more efficient in both memory and computation time. Optimised for multiple GPUs.

References

Mourragui et al 2022
Meanti et al, Kernel methods through the roof: handling billions of points efficiently, NeurIPS, 2020.
Pedregosa et al, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 2011.

class sobolev_alignment.krr_approx.KRRApprox(method: str = 'sklearn', kernel: str = 'rbf', M: int = 100, kernel_params: Optional[dict] = None, penalization: float = 1e-06, maxiter: int = 20, falkon_options: Optional[dict] = None, mean_center: bool = False, unit_std: bool = False)

Bases: object

Kernel Ridge Regression approximation.

This class contains the functions used to approximate the encoding functions of a Variational Auto Encoder (VAE) by kernel machines, by means of Kernel Ridge Regression (KRR). It takes training data as input and executes the learning process. The generation of artificial samples and subsequent computation of embeddings is not part of this class.
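
The underlying computation can be sketched with the closed-form KRR solution in NumPy. This does not use KRRApprox, scikit-learn or Falkon: the helpers rbf_kernel, krr_fit and krr_predict are hypothetical, the Gaussian kernel parametrisation exp(-d^2 / (2 sigma^2)) is an assumption, and the toy embeddings are made up for the example.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def krr_fit(X, Y, penalization=1e-6, sigma=1.0):
    """Solve (K + penalization * I) alpha = Y, one column of alpha per embedding."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + penalization * np.eye(len(X)), Y)


def krr_predict(X_new, X_train, alpha, sigma=1.0):
    """Out-of-sample extension: kernel between new and training points times alpha."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha


rng = np.random.default_rng(0)
x_hat = rng.normal(size=(30, 2))                                   # artificial inputs
z_hat = np.column_stack([np.sin(x_hat[:, 0]), x_hat[:, 1] ** 2])   # toy embeddings
alpha = krr_fit(x_hat, z_hat)
z_pred = krr_predict(x_hat, x_hat, alpha)
```

A Nyström approximation such as the one in Falkon replaces the full set of training points in the kernel expansion by M anchor points, which is what keeps memory and compute manageable on large samples.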

Methods

anchors()

Return anchor points used in KRR.

fit(X, y)

Train a regression model (KRR) between X and all columns of y.

load()

Load a KRRApprox instance.

save([folder])

Save the instance.

transform(X)

Apply the trained KRR models to given data.

anchors()

Return anchor points used in KRR.

fit(X: Tensor, y: Tensor)

Train a regression model (KRR) between X and all columns of y.

Parameters:
X: torch.Tensor

Tensor containing the artificial input (x_hat), with samples in the rows.

y: torch.Tensor

Tensor containing the artificial embedding (z_hat). Called y for compliance with sklearn functions.

Returns:
self: fitted KRRApprox instance.
load()

Load a KRRApprox instance.

Parameters:
folder: str, default to ‘.’

Folder path where the instance is located

Returns:
KRRApprox: instance saved at the folder location.
save(folder: str = '.')

Save the instance.

Parameters:
folder: str, default to ‘.’

Folder path to use for saving the instance

Returns:
True if the instance was properly saved.
transform(X: Tensor)

Apply the trained KRR models to given data.

This corresponds to the out-of-sample extension.

Parameters:
X: torch.Tensor

Tensor containing gene expression profiles with samples in the rows. WARNING: genes (features) must follow the same order as in the training data.

Returns:
torch.Tensor with predicted values for each of the encoding functions.
Samples are in the rows and encoding functions (embedding) in the columns.
default_kernel_params = {'falkon': {'gaussian': {'sigma': 1}, 'laplacian': {'sigma': 1}, 'matern': {'nu': 0.5, 'sigma': 1}, 'rbf': {'sigma': 1}}, 'sklearn': {'gaussian': {}, 'laplacian': {}, 'matern': {}, 'rbf': {}}}
falkon_kernel = {'gaussian': None, 'laplacian': None, 'matern': None, 'rbf': None}
sklearn_kernel = {'gaussian': 'wrapper', 'laplacian': 'wrapper', 'matern': <class 'sklearn.gaussian_process.kernels.Matern'>, 'rbf': 'wrapper'}

sobolev_alignment.krr_model_selection module

Kernel Ridge Regression (KRR) model search.

@author: Soufiane Mourragui

Pipeline to perform model selection for the Kernel Ridge Regression (KRR) models, employing the protocol presented in the paper, i.e.:
- Selecting sigma as the value yielding an average of 0.5 for the Gaussian kernel.
- Selecting the model with the lowest training error on input data (trained on artificial data).
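
The first step of this protocol can be sketched as a one-dimensional search, as below. The helpers mean_gaussian_kernel and select_sigma are hypothetical; the kernel parametrisation exp(-d^2 / (2 sigma^2)) and the exclusion of the diagonal from the average are assumptions.

```python
import numpy as np


def mean_gaussian_kernel(X, sigma):
    """Average Gaussian kernel value over all pairs of distinct samples."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dist = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    off_diag = ~np.eye(len(X), dtype=bool)
    return np.mean(np.exp(-sq_dist[off_diag] / (2.0 * sigma ** 2)))


def select_sigma(X, target=0.5, low=1e-3, high=1e3, n_iter=60):
    """Bisection on a log scale: the mean kernel value increases with sigma."""
    for _ in range(n_iter):
        mid = np.sqrt(low * high)
        if mean_gaussian_kernel(X, mid) < target:
            low = mid
        else:
            high = mid
    return np.sqrt(low * high)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
sigma = select_sigma(X)
```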

sobolev_alignment.krr_model_selection.model_alignment_penalization(X_data: AnnData, data_source: str, sobolev_alignment_clf, sigma: float, optimal_nu: float, M: int = 250)

$\sigma$ and $\nu$ selection.

Select the optimal penalization parameter given $\sigma$ and $\nu$ by aligning the data_source model to itself and measuring the principal angles. Intuitively, aligning the model to itself must yield high principal angles. Low values indicate over-fitting of the KRR.

Parameters:
X_data: AnnData

Dataset to employ.

data_source: str, ‘source’ or ‘target’

Name of the data stream in SobolevAlignment parameters.

sobolev_alignment_clf: SobolevAlignment

SobolevAlignment instance with scVI models trained. Used to find optimal $\nu$ parameter on the KRR regression step.

sigma: float

$\sigma$ parameter in KRR.

optimal_nu: float

Value of $\nu$ (Falkon) to be used in the optimization. Can be established using model_selection_nu.

M: int, default to 250

Number of anchor points to use in the KRR approximation. A larger M typically improves the prediction, but at the cost of longer compute time and memory cost.

Returns:
DataFrame with principal angles between the same models.
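
As a Euclidean illustration of the self-alignment check, the cosines of the principal angles between two subspaces can be computed as below. The function cosine_principal_angles is hypothetical, and the package works with kernel matrices rather than raw column spaces; similarity is reported here as cosines, so a subspace aligned with itself gives values of 1.

```python
import numpy as np


def cosine_principal_angles(A, B):
    """Cosines of the principal angles between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)    # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)    # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)


rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))

self_cos = cosine_principal_angles(A, A)                          # model against itself
other_cos = cosine_principal_angles(A, rng.normal(size=(20, 3)))  # unrelated subspace
```
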
sobolev_alignment.krr_model_selection.model_selection_nu(X_source: AnnData, X_target: AnnData, sobolev_alignment_clf, sigma: float, M: int = 250, test_error_size: int = -1)

Select the optimal $\nu$ parameter.

Select the optimal $\nu$ parameter (Matérn kernel) by measuring the Spearman correlation for different values of $\nu$ and penalization, and selecting the $\nu$ with the highest correlation.

Parameters:
X_source: AnnData

Source dataset.

X_target: AnnData

Target dataset.

sobolev_alignment_clf: SobolevAlignment

SobolevAlignment instance with scVI models trained. Used to find optimal $\nu$ parameter on the KRR regression step.

sigma: float

$\sigma$ parameter in KRR.

M: int, default to 250

Number of anchor points to use in the KRR approximation. A larger M typically improves the prediction, but at the cost of longer compute time and memory cost.

test_error_size: int, default to -1

Number of input points to be considered when computing the error. The inputs (X_source and X_target) are not used to train the KRR (artificial points are) and act as a proxy for a validation set. Setting test_error_size=-1 uses the complete input data.

Returns:
DataFrame with Spearman correlation on source and target data for various
hyper-parameter values.
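
The selection rule can be sketched on synthetic numbers as below. The spearman helper ignores ties, and the predictions dictionary is simulated noise standing in for KRR outputs at different values of nu; none of these values come from the package.

```python
import numpy as np


def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks (no tie handling)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]


rng = np.random.default_rng(0)
z_true = rng.normal(size=200)           # stand-in for one latent factor

# Simulated predictions keyed by nu, with different noise levels (synthetic).
predictions = {
    0.5: z_true + rng.normal(scale=1.0, size=200),
    1.5: z_true + rng.normal(scale=0.1, size=200),
    2.5: z_true + rng.normal(scale=0.5, size=200),
}

# Keep the nu whose predictions correlate best with the reference embedding.
scores = {nu: spearman(z_true, pred) for nu, pred in predictions.items()}
best_nu = max(scores, key=scores.get)
```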

sobolev_alignment.multi_krr_approx module

Multi KRR approximation.

@author: Soufiane Mourragui

Scripts supporting the naive integration of several KRRs. No gain is provided by such an approach.

class sobolev_alignment.multi_krr_approx.MultiKRRApprox

Bases: object

Multi Kernel Ridge Regression approximation.

This class contains a wrapper around KRRApprox to serialise the approximation of latent factors. Several experiments show that such an approach does not yield any advantage.

Methods

add_clf(clf)

Add a classifier.

anchors()

Return anchors.

predict(X)

Predict latent factor values given a tensor.

process_clfs()

Process the different classifiers.

transform(X)

Predict latent factor values given a tensor.

add_clf(clf)

Add a classifier.

anchors()

Return anchors.

predict(X: Tensor)

Predict latent factor values given a tensor.

process_clfs()

Process the different classifiers.

transform(X: Tensor)

Predict latent factor values given a tensor.

sobolev_alignment.sobolev_alignment module

Sobolev Alignment.

@author: Soufiane Mourragui

References

Mourragui et al, Identifying commonalities between cell lines and tumors at the single cell level using Sobolev Alignment of deep generative models, Biorxiv, 2022.
Lopez et al, Deep generative modeling for single-cell transcriptomics, Nature Methods, 2018.
Meanti et al, Kernel methods through the roof: handling billions of points efficiently, NeurIPS, 2020.

class sobolev_alignment.sobolev_alignment.SobolevAlignment(source_batch_name: Optional[str] = None, target_batch_name: Optional[str] = None, continuous_covariate_names: Optional[list] = None, source_scvi_params: Optional[dict] = None, target_scvi_params: Optional[dict] = None, source_krr_params: Optional[dict] = None, target_krr_params: Optional[dict] = None, n_artificial_samples: int = 100000, n_samples_per_sample_batch: int = 100000, frac_save_artificial: float = 0.1, save_mmap: Optional[str] = None, log_input: bool = True, n_krr_clfs: int = 1, no_posterior_collapse=True, mean_center: bool = False, unit_std: bool = False, frob_norm_source: bool = False, lib_size_norm: bool = False, n_jobs=1)

Bases: object

Sobolev Alignment implementation.

Main class for Sobolev Alignment, which wraps all the different operations presented in the Sobolev Alignment procedure:
- Model selection (scVI and KRR).
- scVI model training.
- Generation of synthetic samples.
- KRR approximation.
- Alignment of KRR models.

Methods

compute_consensus_features(X_input, n_similar_pv)

Project data on interpolated consensus features.

compute_error([size])

Compute error of the KRR approximation on the input (data used for VAE training) and used for KRR.

compute_random_direction_(K_X, K_Y, K_XY)

Sample randomly two vectors and compute cosine similarity.

feature_analysis([max_order, gene_names])

Launch feature analysis for a trained scVI model.

fit(X_source, X_target[, fit_vae, ...])

Run complete Sobolev Alignment workflow between a source (e.g. cell line) and a target (e.g. tumor) dataset.

krr_model_selection(X_source, X_target[, M, ...])

Hyper-parameters selection for KRR.

load([with_krr, with_model])

Load a Sobolev Alignment instance.

null_model_similarity([n_iter, quantile, ...])

Compute the null model for PV similarities.

plot_cosine_similarity([folder, absolute_cos])

Plot cosine similarity.

plot_training_metrics([folder])

Plot the different training metrics for the source and target scVI modules.

sample_random_vector_(data_source, K)

Sample a vector randomly for either source or target.

save([folder, with_krr, with_model])

Save Sobolev Alignment model.

scvi_model_selection(X_source, X_target[, ...])

Hyperparameter selection for scVI models.

compute_consensus_features(X_input: dict, n_similar_pv: int, fit: bool = True, return_anndata=False)

Project data on interpolated consensus features.

Project the data on interpolated features, i.e., a linear combination of source and target SPVs which best balances the effect of source and target data.

Parameters:
X_input: dict

Dictionary of data (AnnData) to project. Two keys are needed: ‘source’ and ‘target’.

n_similar_pv: int

Number of top SPVs to project the data on.

fit: bool, default to True

Whether the interpolated times must be computed. If False, will use previously computed times, but will return an error if not previously fitted.

return_anndata: bool, default to False

Whether the projected consensus features must be formatted as an AnnData with overlapping indices in obs. This allows downstream analysis. By default, return a DataFrame.

Returns:
interpolated_proj_df: pd.DataFrame or sc.AnnData

DataFrame or AnnData of concatenated source and target samples after projection on consensus features.

compute_error(size=-1)

Compute error of the KRR approximation on the input (data used for VAE training) and used for KRR.

compute_random_direction_(K_X, K_Y, K_XY)

Sample randomly two vectors and compute cosine similarity.

feature_analysis(max_order: int = 1, gene_names: Optional[list] = None)

Launch feature analysis for a trained scVI model.

Computes the gene contributions (feature weights) associated with the KRRs which approximate the latent factors and the SPVs. Technically, given the kernel machine which approximates a latent factor (KRR), this method computes the weights associated with the orthonormal basis in the Gaussian-kernel associated Sobolev space.

Parameters:
max_order: int, default to 1

Order of the features to compute. 1 corresponds to linear features (genes), 2 to interaction terms.

gene_names: list of str, default to None

Names of the genes passed as input to Sobolev Alignment. WARNING: must be in the same order as the input to SobolevAlignment.fit.

fit(X_source: AnnData, X_target: AnnData, fit_vae: bool = True, krr_approx: bool = True, sample_artificial: bool = True)

Run complete Sobolev Alignment workflow between a source (e.g. cell line) and a target (e.g. tumor) dataset.

Source and target data should be passed as AnnData and potential batch names (source_batch_name, target_batch_name) should be part of the “obs” element of X_source and X_target.

Parameters:
X_source: AnnData

Source data.

X_target: AnnData

Target data.

fit_vae: bool, default to True

Whether a scVI model (VAE) should be trained. If pre-trained VAEs are available, setting the scvi_models to these models and using fit_vae=False allows these models to be used directly.

krr_approx: bool, default to True

Whether the KRR approximation should be performed for source and target scVI models.

sample_artificial: bool, default to True

Whether model points should be sampled. When artificial samples have already been sampled and saved, setting sample_artificial=False allows these points to be used without re-sampling.

Returns:
self: fitted Sobolev Alignment instance.
krr_model_selection(X_source: AnnData, X_target: AnnData, M: int = 1000, same_model_alignment_thresh: float = 0.9)

Hyper-parameters selection for KRR.

Routine to select the hyper-parameters of the KRR approximation (source and target). Can be called prior to fit.

Parameters:
X_source: AnnData

Source dataset.

X_target: AnnData

Target dataset.

M: int, default to 1000

Number of anchor points to use. Larger values of M lead to a better approximation of the latent factors, but come at the price of higher computation time and memory use.

same_model_alignment_thresh: float, default to 0.9

Minimum top principal angles used during same-model alignment, i.e., when source or target models are aligned to themselves.

Returns:
SobolevAlignment instance.
load(with_krr: bool = True, with_model: bool = True)

Load a Sobolev Alignment instance.

Parameters:
folder: str, default to ‘.’

Folder path where the instance is located

with_krr: bool, default to True

Whether KRR approximations must be loaded.

with_model: bool, default to True

Whether scvi models (VAEs) must be loaded.

Returns:
SobolevAlignment: instance saved at the folder location.
null_model_similarity(n_iter=100, quantile=0.95, return_all=False, n_jobs=1)

Compute the null model for PV similarities.
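
The idea of a null model for similarities can be sketched in a Euclidean setting as below: sample pairs of random directions, compute their cosine similarity, and keep a high quantile as a reference value. The function null_cosine_quantile is hypothetical, and the package samples its vectors in the kernel feature space rather than in Euclidean space.

```python
import numpy as np


def null_cosine_quantile(dim, n_iter=100, quantile=0.95, seed=0):
    """Quantile of |cosine similarity| between pairs of random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    sims = np.empty(n_iter)
    for i in range(n_iter):
        u, v = rng.standard_normal((2, dim))   # two independent random directions
        sims[i] = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.quantile(sims, quantile)


# Similarities above this value are unlikely to arise between random directions.
threshold = null_cosine_quantile(dim=50)
```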

plot_cosine_similarity(folder: str = '.', absolute_cos: bool = False)

Plot cosine similarity.

plot_training_metrics(folder: str = '.')

Plot the different training metrics for the source and target scVI modules.

sample_random_vector_(data_source, K)

Sample a vector randomly for either source or target.

save(folder: str = '.', with_krr: bool = True, with_model: bool = True)

Save Sobolev Alignment model.

scvi_model_selection(X_source: ~anndata._core.anndata.AnnData, X_target: ~anndata._core.anndata.AnnData, source_batch_name: ~typing.Optional[str] = None, target_batch_name: ~typing.Optional[str] = None, model=<class 'scvi.model._scvi.SCVI'>, space: dict = {'dispersion': <hyperopt.pyll.base.Apply object>, 'dropout_rate': <hyperopt.pyll.base.Apply object>, 'early_stopping': <hyperopt.pyll.base.Apply object>, 'gene_likelihood': <hyperopt.pyll.base.Apply object>, 'lr': <hyperopt.pyll.base.Apply object>, 'n_hidden': <hyperopt.pyll.base.Apply object>, 'n_latent': <hyperopt.pyll.base.Apply object>, 'n_layers': <hyperopt.pyll.base.Apply object>, 'reduce_lr_on_plateau': <hyperopt.pyll.base.Apply object>, 'weight_decay': <hyperopt.pyll.base.Apply object>}, max_eval: int = 100, test_size: float = 0.1)

Hyperparameter selection for scVI models.

Routine to perform Bayesian hyper-parameter optimisation for the scVI models (source and target). Can be called prior to fit. Best parameters will be saved in self.scvi_params.

Parameters:
X_source: AnnData

Source dataset.

X_target: AnnData

Target dataset.

source_batch_name: str, default to None

Batch key to use in scVI for the source dataset. If None, no native batch-effect correction is performed in the source scVI model.

target_batch_name: str, default to None

Batch key to use in scVI for the target dataset. If None, no native batch-effect correction is performed in the target scVI model.

model: default to scvi.model.SCVI

scvi-tools model to be used in the analysis.

space: dict, default to DEFAULT_HYPEROPT_SPACE

Hyper-parameter space to be used in Bayesian Optimisation. Default is provided in sobolev_alignment.scvi_model_search.

max_eval: int, default to 100

Number of iterations in the Bayesian optimisation procedures, i.e., number of models assessed.

test_size: float, default to 0.1

Proportion of samples (cells) to be taken inside the test data.

Returns:
SobolevAlignment instance.
default_scvi_params = {'model': {}, 'plan': {}, 'train': {}}

Module contents