sobolev_alignment package¶
Submodules¶
sobolev_alignment.SobolevAlignment module¶
Sobolev Alignment implementation.
Main class for Sobolev Alignment, which wraps all the operations of the Sobolev Alignment procedure:
- Model selection (scVI and KRR).
- scVI model training.
- Synthetic sample generation.
- KRR approximation.
- Alignment of KRR models.
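As a purely illustrative sketch of how these operations chain together (the `fit` entry point, its keyword names, and the file names below are assumptions, not documented on this page):

```python
import anndata as ad
from sobolev_alignment import SobolevAlignment

# Hypothetical inputs: source (e.g. cell line) and target (e.g. tumor)
# expression profiles stored as AnnData files.
X_source = ad.read_h5ad("source.h5ad")
X_target = ad.read_h5ad("target.h5ad")

# Assumed workflow: model selection, scVI training, synthetic sample
# generation, KRR approximation and alignment behind a single fit call.
sobolev_alignment_clf = SobolevAlignment()
sobolev_alignment_clf.fit(X_source=X_source, X_target=X_target)
```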
sobolev_alignment.feature_analysis module¶
Feature analysis.
@author: Soufiane Mourragui
This module contains the code used in the Taylor expansion of the Gaussian/Matérn kernel.
- sobolev_alignment.feature_analysis.basis(x, k, gamma)¶

  Compute the basis function for a single gene, excluding the offset term.

  - Parameters:
    - x: np.array
      Column vector (each row corresponds to a sample).
    - k: int
      Order to compute.
    - gamma: float
      Parameter of the Matérn kernel.
  - Returns:
    - np.array
      Value of the higher-order feature.
- sobolev_alignment.feature_analysis.combinatorial_product(x, idx, gamma)¶

  Compute the product of basis functions over a combination of genes, excluding the offset term.

  - Parameters:
    - x: np.array
      Data matrix with samples in the rows and genes in the columns.
    - idx: tuple
      Combination, i.e. tuple of features to take into account.
    - gamma: float
      Parameter of the Matérn kernel.
  - Returns:
    - scipy.sparse.csc_matrix
      Values of the higher-order feature.
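A minimal NumPy sketch of both helpers, assuming the Gaussian-kernel Taylor expansion $\exp(2\gamma \langle x, y \rangle) = \sum_k \frac{(2\gamma)^k}{k!} \langle x, y \rangle^k$, which gives $\sqrt{(2\gamma)^k / k!}\, x^k$ as the order-$k$ single-gene feature; the exact normalisation and sparse handling in the package may differ:

```python
import numpy as np
from scipy.special import factorial

def basis_sketch(x: np.ndarray, k: int, gamma: float) -> np.ndarray:
    """Order-k Taylor feature for one gene, without the offset term
    exp(-gamma * ||x||^2): sqrt((2*gamma)^k / k!) * x^k."""
    return np.sqrt((2 * gamma) ** k / factorial(k)) * np.power(x, k)

def combinatorial_product_sketch(x: np.ndarray, idx: tuple, gamma: float) -> np.ndarray:
    """Product of order-1 features over the genes in idx; repeated
    indices yield higher powers of the same gene."""
    return np.prod([basis_sketch(x[:, j], 1, gamma) for j in idx], axis=0)
```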
- sobolev_alignment.feature_analysis.higher_order_contribution(d: int, data: array, sample_offset: array, gene_names: list, gamma: float, n_jobs: int = 1, return_matrix: bool = False)¶

  Compute the features corresponding to the Taylor expansion of the kernel.

  Compute the features corresponding to the Taylor expansion of the kernel, i.e. $x_j e^{-\gamma x x^T}$ for linear features. Returns a sparse pandas DataFrame containing all the features (columns) by samples (rows). We critically rely on the sparsity of the data matrix to speed up computations. The current implementation is relevant in two cases:
  - when dimensionality is small;
  - when data is sparse.

  High-dimensional and dense data matrices would lead to a significant overhead without computational gains, and could benefit from another implementation strategy.

  - Parameters:
    - d: int
      Order of the features to compute, e.g. 1 for linear terms, 2 for interaction terms.
    - data: np.array
      Data to compute features on, with samples in the rows and genes (features) in the columns.
    - sample_offset: np.array
      Offset of each sample in data.
    - gene_names: list
      Name of each column in data; corresponds to feature naming.
    - gamma: float
      Value of the gamma parameter of the Matérn kernel.
    - n_jobs: int, default to 1
      Number of concurrent threads to use; -1 uses all available CPU cores. WARNING: for d >= 2 and a large number of genes, the routine can be memory-intensive and a high n_jobs can lead to crashes.
    - return_matrix: bool, default to False
      If True, returns the feature matrix without feature naming. When feature names are not relevant (e.g. computing the proportion of non-linearities), return_matrix=True can speed up the process.
  - Returns:
    - pd.DataFrame
      Sparse DataFrame with samples in the rows and named features in the columns. For instance, when d=1, returns each column of data scaled by the RKHS normalisation factor and multiplied by the offset value.
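An illustrative call, assuming sample_offset is the per-sample offset $e^{-\gamma \|x\|^2}$ from the expansion above (this page does not specify how the package expects it to be computed):

```python
import numpy as np
from sobolev_alignment.feature_analysis import higher_order_contribution

rng = np.random.default_rng(0)
data = rng.poisson(0.3, size=(100, 20)).astype(float)  # sparse-ish toy matrix
gamma = 0.05

# Assumed offset form, following the Taylor expansion above.
sample_offset = np.exp(-gamma * np.square(data).sum(axis=1))

linear_features = higher_order_contribution(
    d=1,                 # linear features; d=2 would add interaction terms
    data=data,
    sample_offset=sample_offset,
    gene_names=[f"gene_{i}" for i in range(data.shape[1])],
    gamma=gamma,
    n_jobs=1,
)
# linear_features: sparse pd.DataFrame, samples in rows, named features in columns.
```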
sobolev_alignment.generate_artificial_sample module¶
Generate artificial samples.
@author: Soufiane Mourragui
Generate samples using the scVI decoder from multivariate Gaussian noise. This module generates the training data used to approximate the VAE encoding functions by Matérn kernel machines.
- sobolev_alignment.generate_artificial_sample.generate_samples(sample_size: int, batch_names: list, covariates_values: list, lib_size: dict, model: SCVI, batch_key_dict: dict, return_dist: bool = False)¶

  Generate artificial gene expression profiles.

  Note to developers: this method has been designed to be used with scvi-tools classes. Other VAE implementations may break here.

  - Parameters:
    - sample_size: int
      Number of samples to generate.
    - batch_names: list or np.ndarray, default to None
      List or array with sample_size str values indicating the batch of each sample.
    - covariates_values: list or np.ndarray, default to None
      List or array with sample_size float values indicating the covariate values of each sample to generate (as for training the scVI model).
    - lib_size: dict
      Dictionary of mean library size per batch.
    - model: SCVI
      scVI model whose decoder is used to generate samples.
    - batch_key_dict: dict
      Dictionary linking the batch values to the key used in scVI.
    - return_dist: bool, default to False
      Whether to return the distribution parameters (True) or samples from this distribution (False).
  - Returns:
    - If return_dist is False: torch.Tensor (on CPU) with artificial samples in the rows.
    - If return_dist is True: one torch.Tensor with distribution parameters (following scVI order) and one torch.Tensor (on CPU) with artificial samples in the rows.
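A hedged usage sketch, with `scvi_model` standing in for an already-trained scvi.model.SCVI instance; the direction of the batch_key_dict mapping is an assumption:

```python
import numpy as np
from sobolev_alignment.generate_artificial_sample import generate_samples

n = 500
samples = generate_samples(
    sample_size=n,
    batch_names=np.repeat("batch_A", n),      # one batch label per sample
    covariates_values=None,
    lib_size={"batch_A": 10_000.0},           # mean library size per batch
    model=scvi_model,                         # trained scvi.model.SCVI (assumed)
    batch_key_dict={"batch_A": 0},            # assumed batch-name-to-key mapping
    return_dist=False,
)
# samples: torch.Tensor on CPU, one artificial expression profile per row.
```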
- sobolev_alignment.generate_artificial_sample.parallel_generate_samples(sample_size, batch_names, covariates_values, lib_size, model, batch_key_dict: Optional[dict] = None, return_dist: bool = False, batch_size=1000, n_jobs=1)¶

  Generate artificial gene expression profiles.

  Parallelised wrapper around generate_samples, running several threads in parallel. Note to developers: this function needs to be changed if applied to a VAE model other than scVI.

  - Parameters:
    - sample_size: int
      Number of samples to generate.
    - batch_names: list or np.ndarray, default to None
      List or array with sample_size str values indicating the batch of each sample.
    - covariates_values: list or np.ndarray, default to None
      List or array with sample_size float values indicating the covariate values of each sample to generate (as for training the scVI model).
    - lib_size: dict
      Dictionary of mean library size per batch.
    - model: SCVI
      scVI model whose decoder is used to generate samples.
    - batch_key_dict: dict
      Dictionary linking the batch values to the key used in scVI.
    - return_dist: bool, default to False
      Whether to return the distribution parameters (True) or samples from this distribution (False).
    - batch_size: int, default to 10**3
      Number of samples to generate per batch.
    - n_jobs: int, default to 1
      Number of threads to launch; n_jobs=-1 launches as many threads as there are available CPUs.
  - Returns:
    - If return_dist is False: torch.Tensor (on CPU) with artificial samples in the rows.
    - If return_dist is True: one torch.Tensor with distribution parameters (following scVI order) and one torch.Tensor (on CPU) with artificial samples in the rows.
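The parallel wrapper takes the same inputs plus batching controls; continuing the sketch above (`scvi_model` again assumed to be a trained scvi.model.SCVI instance):

```python
import numpy as np
from sobolev_alignment.generate_artificial_sample import parallel_generate_samples

n = 10_000
samples = parallel_generate_samples(
    sample_size=n,
    batch_names=np.repeat("batch_A", n),
    covariates_values=None,
    lib_size={"batch_A": 10_000.0},
    model=scvi_model,        # trained scvi.model.SCVI (assumed, as above)
    batch_size=1000,         # samples generated per batch
    n_jobs=4,                # four concurrent threads
)
```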
sobolev_alignment.interpolated_features module¶
Compute interpolated features.
@author: Soufiane Mourragui
- sobolev_alignment.interpolated_features.compute_optimal_tau(PV_number, pv_projections, principal_angles, n_interpolation=100)¶

  Compute the optimal interpolation step for each PV (Grassmann interpolation).

- sobolev_alignment.interpolated_features.project_on_interpolate_PV(angle, PV_number, tau_step, pv_projections)¶

  Project data on interpolated PVs.
sobolev_alignment.kernel_operations module¶
Kernel operations.
@author: Soufiane Mourragui
Custom scripts for specific matrix operations.
- sobolev_alignment.kernel_operations.mat_inv_sqrt(M, threshold=1e-06)¶

  Compute the inverse square root of a symmetric matrix M by SVD.
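A minimal NumPy sketch of the SVD route; discarding singular values below the threshold is an assumed interpretation of the threshold argument:

```python
import numpy as np

def mat_inv_sqrt_sketch(M: np.ndarray, threshold: float = 1e-06) -> np.ndarray:
    """Inverse square root of a symmetric PSD matrix via SVD:
    M = U diag(s) V^T, hence M^{-1/2} = U diag(s^{-1/2}) V^T."""
    U, s, Vt = np.linalg.svd(M)
    keep = s > threshold  # drop near-zero singular values (assumed behaviour)
    return U[:, keep] @ np.diag(1.0 / np.sqrt(s[keep])) @ Vt[keep, :]
```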
sobolev_alignment.krr_approx module¶
Encoder approximation by Kernel Ridge Regression.
@author: Soufiane Mourragui
This module trains a Kernel Ridge Regression (KRR) model on a pair of samples (x_hat) and embeddings (z_hat) using two possible implementations:
- scikit-learn: deterministic, but limited in memory and time efficiency.
- Falkon: stochastic Nyström approximation, faster in both memory and computation time; optimised for multiple GPUs.
References
- Mourragui et al., 2022.
- Meanti et al., Kernel methods through the roof: handling billions of points efficiently, NeurIPS, 2020.
- Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 2011.
- class sobolev_alignment.krr_approx.KRRApprox(method: str = 'sklearn', kernel: str = 'rbf', M: int = 100, kernel_params: Optional[dict] = None, penalization: float = 1e-06, maxiter: int = 20, falkon_options: Optional[dict] = None, mean_center: bool = False, unit_std: bool = False)¶
  Bases: object

  Kernel Ridge Regression approximation.

  This class contains the functions used to approximate the encoding functions of a Variational Auto-Encoder (VAE) by kernel machines by means of Kernel Ridge Regression (KRR). The class takes training data as input and executes the learning process; the generation of artificial samples and the subsequent computation of embeddings are not part of this class.
  Methods

  - anchors(): Return anchor points used in KRR.
  - fit(X, y): Train a regression model (KRR) between X and all columns of y.
  - load(): Load a KRRApprox instance.
  - save([folder]): Save the instance.
  - transform(X): Apply the trained KRR models to given data.
- anchors()¶

  Return anchor points used in KRR.
- fit(X: Tensor, y: Tensor)¶

  Train a regression model (KRR) between X and all columns of y.

  - Parameters:
    - X: torch.Tensor
      Tensor containing the artificial input (x_hat), with samples in the rows.
    - y: torch.Tensor
      Tensor containing the artificial embedding (z_hat). Called y for compliance with sklearn functions.
  - Returns:
    - self: fitted KRRApprox instance.
- load()¶

  Load a KRRApprox instance.

  - Parameters:
    - folder: str, default to '.'
      Folder path where the instance is located.
  - Returns:
    - KRRApprox instance saved at the folder location.
- save(folder: str = '.')¶

  Save the instance.

  - Parameters:
    - folder: str, default to '.'
      Folder path to use for saving the instance.
  - Returns:
    - True if the instance was properly saved.
- transform(X: Tensor)¶

  Apply the trained KRR models to given data.

  This corresponds to the out-of-sample extension.

  - Parameters:
    - X: torch.Tensor
      Tensor containing gene expression profiles with samples in the rows. WARNING: genes (features) must follow the same order as in the training data.
  - Returns:
    - torch.Tensor with predicted values for each of the encoding functions. Samples are in the rows and encoding functions (embedding) in the columns.
- default_kernel_params = {'falkon': {'gaussian': {'sigma': 1}, 'laplacian': {'sigma': 1}, 'matern': {'nu': 0.5, 'sigma': 1}, 'rbf': {'sigma': 1}}, 'sklearn': {'gaussian': {}, 'laplacian': {}, 'matern': {}, 'rbf': {}}}¶
- falkon_kernel = {'gaussian': None, 'laplacian': None, 'matern': None, 'rbf': None}¶
- sklearn_kernel = {'gaussian': 'wrapper', 'laplacian': 'wrapper', 'matern': <class 'sklearn.gaussian_process.kernels.Matern'>, 'rbf': 'wrapper'}¶
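A short end-to-end sketch with random placeholder tensors standing in for artificial samples and embeddings; when kernel_params is not supplied, defaults presumably come from default_kernel_params above:

```python
import torch
from sobolev_alignment.krr_approx import KRRApprox

# Placeholder artificial samples (x_hat) and their embeddings (z_hat).
x_hat = torch.randn(1000, 50)
z_hat = torch.randn(1000, 10)

krr = KRRApprox(
    method="sklearn",     # deterministic backend; "falkon" for Nyström
    kernel="matern",
    penalization=1e-06,
)
krr.fit(x_hat, z_hat)

# Out-of-sample extension: predicted embeddings, samples in rows.
z_pred = krr.transform(torch.randn(200, 50))
```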
sobolev_alignment.krr_model_selection module¶
Kernel Ridge Regression (KRR) model search.
@author: Soufiane Mourragui
Pipeline to perform model selection for the Kernel Ridge Regression (KRR) models, employing the protocol presented in the paper, i.e.:
- Selecting sigma as the value yielding an average Gaussian kernel value of 0.5.
- Selecting the model with the lowest training error on input data (trained on artificial data).
- sobolev_alignment.krr_model_selection.model_alignment_penalization(X_data: AnnData, data_source: str, sobolev_alignment_clf, sigma: float, optimal_nu: float, M: int = 250)¶

  Penalization selection given $\sigma$ and $\nu$.

  Select the optimal penalization parameter given $\sigma$ and $\nu$ by aligning the data_source model to itself and measuring the principal angles. Intuitively, aligning the model to itself must yield high principal angles; low values indicate over-fitting of the KRR.

  - Parameters:
    - X_data: AnnData
      Dataset to employ.
    - data_source: str, 'source' or 'target'
      Name of the data stream in SobolevAlignment parameters.
    - sobolev_alignment_clf: SobolevAlignment
      SobolevAlignment instance with trained scVI models. Used to find the optimal penalization parameter in the KRR regression step.
    - sigma: float
      $\sigma$ parameter of the KRR.
    - optimal_nu: float
      Value of $\nu$ (Falkon) to be used in the optimisation. Can be established using model_selection_nu.
    - M: int, default to 250
      Number of anchor points to use in the KRR approximation. A larger M typically improves the prediction, at the cost of longer computation time and higher memory usage.
  - Returns:
    - DataFrame with principal angles between the same models.
- sobolev_alignment.krr_model_selection.model_selection_nu(X_source: AnnData, X_target: AnnData, sobolev_alignment_clf, sigma: float, M: int = 250, test_error_size: int = -1)¶

  Select the optimal $\nu$ parameter.

  Select the optimal $\nu$ parameter (Matérn kernel) by measuring the Spearman correlation for different values of $\nu$ and penalization, and selecting the $\nu$ with the highest correlation.

  - Parameters:
    - X_source: AnnData
      Source dataset.
    - X_target: AnnData
      Target dataset.
    - sobolev_alignment_clf: SobolevAlignment
      SobolevAlignment instance with trained scVI models. Used to find the optimal $\nu$ parameter in the KRR regression step.
    - sigma: float
      $\sigma$ parameter of the KRR.
    - M: int, default to 250
      Number of anchor points to use in the KRR approximation. A larger M typically improves the prediction, at the cost of longer computation time and higher memory usage.
    - test_error_size: int, default to -1
      Number of input points to consider when computing the error. The input data (X_source and X_target) are not used to train the KRR (artificial points are) and act as a proxy for a validation set. Setting test_error_size=-1 uses the complete input data.
  - Returns:
    - DataFrame with Spearman correlations on source and target data for various hyper-parameter values.
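A hedged sketch of the two-step protocol above; `sobolev_alignment_clf` stands in for a SobolevAlignment instance with trained scVI models, `X_source`/`X_target` for AnnData inputs, and `sigma` for a value already chosen via the average-kernel-value criterion:

```python
from sobolev_alignment.krr_model_selection import (
    model_alignment_penalization,
    model_selection_nu,
)

# Step 1: screen nu values by Spearman correlation on source and target data.
nu_results = model_selection_nu(
    X_source=X_source,
    X_target=X_target,
    sobolev_alignment_clf=sobolev_alignment_clf,
    sigma=sigma,
    M=250,
)
optimal_nu = 0.5  # hypothetical pick: the nu with the highest correlation

# Step 2: check penalization by aligning the source model to itself;
# per the description above, low principal angles flag over-fitting.
angles = model_alignment_penalization(
    X_data=X_source,
    data_source="source",
    sobolev_alignment_clf=sobolev_alignment_clf,
    sigma=sigma,
    optimal_nu=optimal_nu,
    M=250,
)
```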
sobolev_alignment.multi_krr_approx module¶
Multi KRR approximation.
@author: Soufiane Mourragui
Scripts supporting the naive integration of several KRR models. No gain is provided by this approach.
- class sobolev_alignment.multi_krr_approx.MultiKRRApprox¶
  Bases: object

  Multi Kernel Ridge Regression approximation.

  This class contains a wrapper around KRRApprox to serialise the approximation of latent factors. Several experiments showed that this approach does not yield any advantage.
  Methods

  - add_clf(clf): Add a classifier.
  - anchors(): Return anchors.
  - predict(X): Predict latent factor values given a tensor.
  - process_clfs(): Process the different classifiers.
  - transform(X): Predict latent factor values given a tensor.
- add_clf(clf)¶

  Add a classifier.

- anchors()¶

  Return anchors.

- predict(X: Tensor)¶

  Predict latent factor values given a tensor.

- process_clfs()¶

  Process the different classifiers.

- transform(X: Tensor)¶

  Predict latent factor values given a tensor.
sobolev_alignment.scvi_model_search module¶
scVI model search.
@author: Soufiane Mourragui
Pipeline to perform model selection for the scVI model.
- sobolev_alignment.scvi_model_search.make_objective_function(train_data_an, test_data_an, batch_key=None, model=<class 'scvi.model._scvi.SCVI'>)¶

  Generate a Hyperopt objective function.

  Generate the Hyperopt objective function which, for one set of hyperparameters, performs the training, evaluates the model on test data, and gathers all the results in a dictionary usable by Hyperopt.

  - Parameters:
    - train_data_an: AnnData
      AnnData containing the train samples.
    - test_data_an: AnnData
      AnnData containing the test samples.
    - batch_key: str, default to None
      Name of the batch key to be used in scVI.
    - model: default to scvi.model.SCVI
      Model from scvi-tools to be used.
  - Returns:
    - Function which can be called using a dictionary of parameters.
- sobolev_alignment.scvi_model_search.model_selection(data_an: ~anndata._core.anndata.AnnData, batch_key: ~typing.Optional[str] = None, model=<class 'scvi.model._scvi.SCVI'>, space={'dispersion': <hyperopt.pyll.base.Apply object>, 'dropout_rate': <hyperopt.pyll.base.Apply object>, 'early_stopping': <hyperopt.pyll.base.Apply object>, 'gene_likelihood': <hyperopt.pyll.base.Apply object>, 'lr': <hyperopt.pyll.base.Apply object>, 'n_hidden': <hyperopt.pyll.base.Apply object>, 'n_latent': <hyperopt.pyll.base.Apply object>, 'n_layers': <hyperopt.pyll.base.Apply object>, 'reduce_lr_on_plateau': <hyperopt.pyll.base.Apply object>, 'weight_decay': <hyperopt.pyll.base.Apply object>}, max_eval=100, test_size=0.1, save=None)¶
  Model selection for scVI instances (hyper-parameter search).

  Perform model selection on an scVI model by dividing a dataset into training and testing sets, and subsequently performing Bayesian optimisation on the test data.

  - Parameters:
    - data_an: AnnData
      Dataset to be used in the model selection.
    - batch_key: str, default to None
      Name of the batch key to be used in scVI.
    - model: default to scvi.model.SCVI
      Model from scvi-tools to be used.
    - space: dict, default to DEFAULT_HYPEROPT_SPACE
      Dictionary with the hyper-parameter space to be used in the Bayesian optimisation.
    - max_eval: int, default to 100
      Number of iterations of the Bayesian optimisation procedure, i.e., number of models assessed.
    - test_size: float, default to 0.1
      Proportion of samples (cells) to be assigned to the test data.
    - save: str, default to None
      Path to save the Bayesian optimisation results to. Must be a csv file. If set to None, results are not saved.
  - Returns:
    - Tuple containing:
      - the best model given by Hyperopt;
      - a DataFrame with the Bayesian optimisation results;
      - the Trials instance from Hyperopt.
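An illustrative call; the file name and the batch column are hypothetical:

```python
import anndata as ad
from sobolev_alignment.scvi_model_search import model_selection

data = ad.read_h5ad("dataset.h5ad")          # hypothetical input file

best_model, results_df, trials = model_selection(
    data_an=data,
    batch_key="batch",                       # assumed column in data.obs
    max_eval=100,                            # number of models assessed
    test_size=0.1,                           # 10% of cells held out for testing
    save="scvi_model_selection.csv",         # optional csv output
)
```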
- sobolev_alignment.scvi_model_search.split_dataset(data_an, test_size=0.1)¶

  Split the dataset into training and testing sets.