emergene package
- emergene.EmerGene(adata: AnnData, use_rep: str = 'X_pca', use_rep_acrossDataset: str = 'X_pca', layer: str | None = None, n_nearest_neighbors: int = 10, condition_key: str = 'Sample', random_seed: int = 27, n_repeats: int = 3, mu: float = 1.0, beta: float = 1.0, sigma: float = 100.0, n_cells_expressed_threshold: int = 50, n_top_EG_genes: int = 500, remove_lowly_expressed=True, expressed_pct: float = 0.1, inplace: bool = False, gene_list_as_string: bool = False, verbose: int = 1)
Compute EmerGene scores and local fold-change matrices for genes across different conditions.
- Parameters:
adata (AnnData) – AnnData object for preprocessed data.
use_rep (str, optional) – Key in adata.obsm for the low-dimensional embedding used for condition-specific diffusion (default: ‘X_pca’).
use_rep_acrossDataset (str, optional) – Key in adata.obsm for computing the across-dataset connectivity matrix (default: ‘X_pca’).
layer (Optional[str], optional) – Key in adata.layers representing the gene expression matrix. If None, the function uses the default expression matrix stored in adata.X. (default: None)
n_nearest_neighbors (int, optional) – Number of nearest neighbors used when constructing adjacency matrices (default: 10).
condition_key (str, optional) – Key in adata.obs that specifies the condition (or batch) label for each cell (default: ‘Sample’).
random_seed (int, optional) – Seed for the random number generator to ensure reproducibility (default: 27).
n_repeats (int, optional) – Number of randomizations to perform for background generation (default: 3).
mu (float, optional) – Weight for subtracting the random background specificity in the final EmerGene score (default: 1.0).
beta (float, optional) – Weight for subtracting the condition-wise background specificity in the final EmerGene score (default: 1.0).
sigma (float, optional) – Parameter for scaling in the adjacency matrix construction (default: 100.0).
n_cells_expressed_threshold (int, optional) – Threshold for the number of cells expressing a gene (default: 50).
n_top_EG_genes (int, optional) – Number of top EmerGene genes to select for output (default: 500).
remove_lowly_expressed (bool, optional) – Flag indicating whether to remove lowly expressed genes (currently not implemented) (default: True).
expressed_pct (float, optional) – Minimum percentage of cells in which a gene must be expressed (currently not implemented) (default: 0.1).
inplace (bool, optional) – If True, saves EmerGene scores into adata.var. If False, returns a pandas DataFrame with the scores (default: False).
gene_list_as_string (bool, optional) – If True, saves the top genes and their scores as a string. If False, saves them as a pandas DataFrame with separate columns for genes and scores (default: False).
verbose (int, optional) – Verbosity level; if > 0, progress messages will be printed (default: 1).
- Returns:
If inplace is False, returns a tuple containing:
A dictionary of DataFrames of top gene sets, keyed by condition.
A DataFrame where each column is named EmerGene_{condition} and holds the corresponding EmerGene scores for all genes.
If inplace is True, returns the dictionary of top gene sets and modifies adata in place.
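The mu and beta parameters are both described as weights for subtracting a background term from the final EmerGene score. The sketch below illustrates that idea with a plain weighted subtraction. The linear form, the helper name combine_scores, and the three input lists are assumptions made for illustration; the source does not state the exact formula the package uses.

```python
def combine_scores(raw_specificity, random_background, condition_background,
                   mu=1.0, beta=1.0):
    """Hypothetical helper: subtract two weighted background terms per gene.

    Assumed form (not confirmed by the source):
        score = raw - mu * random_background - beta * condition_background
    """
    return [
        raw - mu * rand_bg - beta * cond_bg
        for raw, rand_bg, cond_bg in zip(
            raw_specificity, random_background, condition_background
        )
    ]


# Made-up per-gene values, used only to show the effect of the weights.
raw = [5.0, 2.0, 1.0]
rand_bg = [1.0, 1.0, 1.0]
cond_bg = [0.5, 2.0, 0.0]

scores = combine_scores(raw, rand_bg, cond_bg, mu=1.0, beta=1.0)
scores_no_condition_term = combine_scores(raw, rand_bg, cond_bg, mu=1.0, beta=0.0)
```

Under this assumed form, setting beta to 0 drops the condition-wise background term entirely, and larger values of mu or beta penalize genes whose background specificity is high.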
- emergene.computeScore(adata, geneset_dict, layer=None, n_ctrl: int = 1000, ctrl_match_key='mean_var', n_genebin: int = 200, n_mean_bin: int = 20, n_var_bin: int = 20, weight_opt: str = 'vs', return_ctrl_raw_score: bool = False, return_ctrl_norm_score: bool = False, random_seed: int = 27, verbose: int = 0)
- emergene.convertTopGeneDictToDF(data_dict, gene_list_as_string: bool = True)
Converts the dictionary of top genes and their scores returned by the EmerGene function into a wide-format DataFrame where each condition has two columns: “{condition}_Gene” and “{condition}_EG_score”.
- Parameters:
data_dict (dict) – Dictionary where keys are conditions.
If gene_list_as_string=True: values are “gene:score” formatted strings.
If gene_list_as_string=False: values are DataFrames with ‘Gene’ and ‘EG_score’ columns.
gene_list_as_string (bool, optional (default=True)) –
If True, assumes values in data_dict are strings formatted as “gene:score,gene2:score2,…”.
If False, assumes values in data_dict are DataFrames with ‘Gene’ and ‘EG_score’ columns.
- Returns:
A wide-format DataFrame where each condition has two columns: “{condition}_Gene” and “{condition}_EG_score”.
- Return type:
pd.DataFrame
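To make the string input format concrete, here is a minimal pure-Python sketch of how “gene:score,gene2:score2,…” strings could be split into the two per-condition columns named above. The helpers parse_gene_score_string and to_wide_columns are hypothetical and are not part of emergene; the condition names and gene names are placeholders. The real function returns a pandas DataFrame, whereas this sketch returns a plain dict of column lists.

```python
def parse_gene_score_string(text):
    """Hypothetical helper: split "gene:score,gene2:score2" into (gene, score) pairs."""
    pairs = []
    for item in text.split(","):
        # rpartition splits on the last ":" so a gene name containing ":" survives.
        gene, _, score = item.strip().rpartition(":")
        pairs.append((gene, float(score)))
    return pairs


def to_wide_columns(data_dict):
    """Hypothetical helper: build "{condition}_Gene" / "{condition}_EG_score" columns."""
    columns = {}
    for condition, text in data_dict.items():
        pairs = parse_gene_score_string(text)
        columns[f"{condition}_Gene"] = [gene for gene, _ in pairs]
        columns[f"{condition}_EG_score"] = [score for _, score in pairs]
    return columns


# Placeholder input in the gene_list_as_string=True format.
example = {"ctrl": "GeneA:1.5,GeneB:0.75", "treated": "GeneC:2.0"}
wide = to_wide_columns(example)
```

Conditions can have different numbers of top genes, so the columns in this sketch may differ in length; a DataFrame-based version would have to pad the shorter columns.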
- emergene.runMarkG(adata, use_rep: str = 'X_pca', layer: str = 'log1p', n_nearest_neighbors: int = 10, random_seed: int = 27, n_repeats: int = 3, mu: float = 1, sigma: float = 100, remove_lowly_expressed=True, expressed_pct=0.1)
- emergene.score(adata, gene_list, gene_weights=None, n_nearest_neighbors: int = 30, leaf_size: int = 40, layer: str = 'infog', random_seed: int = 1927, n_ctrl_set: int = 100, key_added: str | None = None, verbosity: int = 0)
For a given gene set, compute gene expression enrichment scores and P values for all the cells.
- Parameters:
adata (AnnData) – The AnnData object for the gene expression matrix.
gene_list (list of str) – A list of gene names for which the score will be computed.
gene_weights (list of floats, optional) – A list of weights corresponding to the genes in gene_list. The length of gene_weights must match the length of gene_list. If None, all genes in gene_list are weighted equally. Default is None.
n_nearest_neighbors (int, optional) – Number of nearest neighbors to consider for randomly selecting control gene sets based on the similarity of genes’ mean and variance among all cells. Default is 30.
leaf_size (int, optional) – Leaf size for the KD-tree or Ball-tree used in nearest neighbor calculations. Default is 40.
layer (str, optional) – The name of the layer in adata.layers to use for gene expression values. Default is ‘infog’.
random_seed (int, optional) – Random seed for reproducibility. Default is 1927.
n_ctrl_set (int, optional) – Number of control gene sets to be used for calculating P values. Default is 100.
key_added (str, optional) – If provided, the computed scores are stored in adata.obs[key_added], and the scores and P values are also stored in adata.uns[key_added]. Default is None, in which case INFOG_score is used as the key.
verbosity (int, optional (default: 0)) – Level of verbosity for logging information.
- Returns:
Modifies the adata object in-place, see key_added.
- Return type:
None
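The n_ctrl_set parameter indicates that P values are derived from a collection of control gene sets. One common way to turn control scores into a P value for a single cell is the empirical estimate sketched below. This is an assumption for illustration: the source does not state which formula emergene.score applies, and empirical_p_value is a hypothetical helper, not a library function.

```python
def empirical_p_value(observed, control_scores):
    """Hypothetical helper: one-sided empirical P value with a +1 correction.

    Counts the control scores that are at least as large as the observed
    score; the +1 in numerator and denominator keeps the estimate above 0.
    """
    n_at_least = sum(1 for c in control_scores if c >= observed)
    return (1 + n_at_least) / (1 + len(control_scores))


# Made-up scores for one cell: the observed gene-set score and five control-set scores.
control = [0.1, 0.4, 0.2, 0.9, 0.3]
p_mid = empirical_p_value(0.5, control)   # one control score is >= 0.5
p_top = empirical_p_value(1.0, control)   # no control score is >= 1.0
```

With this estimator, the smallest attainable P value is 1 / (1 + number of control sets), so a larger n_ctrl_set gives finer resolution at the cost of more computation.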
- emergene.identifyGeneModule(adata, gene_list, use_rep: str = 'X_pca', resolution: float = 0.5, n_components: int = 30, verbosity: int = 0)
- emergene.infog(adata, copy: bool = False, layer='raw', n_top_genes: int = 1000, key_added: str = 'infog', random_state: int = 10, trim: bool = True, verbosity: int = 1)