spacr package
Submodules
spacr.app_annotate module
spacr.app_classify module
spacr.app_make_masks module
spacr.app_mask module
spacr.app_measure module
spacr.app_sequencing module
spacr.app_umap module
spacr.core module
- spacr.core.analyze_data_reg(sequencing_loc, dv_loc, agg_type='mean', min_cell_count=50, min_reads=100, min_wells=2, max_wells=1000, remove_outlier_genes=False, refine_model=False, by_plate=False, threshold=0.5, fishers=False)[source]
- spacr.core.analyze_recruitment(settings={})[source]
Analyze recruitment data by grouping the DataFrame by well coordinates and plotting controls and recruitment data.
Parameters: settings (dict): Dictionary of analysis settings.
Returns: None
- spacr.core.apply_model(src, model_path, image_size=224, batch_size=64, normalize=True, n_jobs=10)[source]
- spacr.core.find_optimal_threshold(y_true, y_pred_proba)[source]
Find the optimal threshold for binary classification based on the F1-score.
Args: y_true (array-like): True binary labels. y_pred_proba (array-like): Predicted probabilities for the positive class.
Returns: float: The optimal threshold.
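The idea of choosing a threshold by F1-score can be illustrated with a minimal, dependency-free sketch. It is not the spacr implementation: the helper names `f1_at_threshold` and `best_f1_threshold` are hypothetical, and using each distinct predicted probability as a candidate threshold is an assumption.

```python
def f1_at_threshold(y_true, y_pred_proba, threshold):
    """F1-score when probabilities >= threshold are labelled positive."""
    tp = fp = fn = 0
    for label, proba in zip(y_true, y_pred_proba):
        predicted_positive = proba >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true, y_pred_proba):
    """Return the candidate threshold with the highest F1-score (hypothetical helper)."""
    # Assumption: every distinct predicted probability is a candidate threshold.
    candidates = sorted(set(y_pred_proba))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_pred_proba, t))


y_true = [0, 0, 1, 1]
y_pred_proba = [0.1, 0.4, 0.35, 0.8]
threshold = best_f1_threshold(y_true, y_pred_proba)
print(threshold)
```

In this toy data, lowering the threshold to 0.35 recovers both positives at the cost of one false positive, which gives the best F1 among the candidates.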
- spacr.core.generate_image_umap(settings={})[source]
Generate UMAP or tSNE embedding and visualize the data with clustering.
Parameters: settings (dict): Dictionary containing the following keys:
- src (str): Source directory containing the data.
- row_limit (int): Limit the number of rows to process.
- tables (list): List of table names to read from the database.
- visualize (str): Visualization type.
- image_nr (int): Number of images to display.
- dot_size (int): Size of dots in the scatter plot.
- n_neighbors (int): Number of neighbors for UMAP.
- figuresize (int): Size of the figure.
- black_background (bool): Whether to use a black background.
- remove_image_canvas (bool): Whether to remove the image canvas.
- plot_outlines (bool): Whether to plot outlines.
- plot_points (bool): Whether to plot points.
- smooth_lines (bool): Whether to smooth lines.
- verbose (bool): Whether to print verbose output.
- embedding_by_controls (bool): Whether to use embedding from controls.
- col_to_compare (str): Column to compare for control-based embedding.
- pos (str): Positive control value.
- neg (str): Negative control value.
- clustering (str): Clustering method (‘DBSCAN’ or ‘KMeans’).
- exclude (list): List of columns to exclude from the analysis.
- plot_images (bool): Whether to plot images.
- reduction_method (str): Dimensionality reduction method (‘UMAP’ or ‘tSNE’).
- save_figure (bool): Whether to save the figure as a PDF.
Returns: pd.DataFrame: DataFrame with the original data and an additional column ‘cluster’ containing the cluster identity.
- spacr.core.generate_loaders(src, mode='train', image_size=224, batch_size=32, classes=['nc', 'pc'], n_jobs=None, validation_split=0.0, pin_memory=False, normalize=False, channels=[1, 2, 3], augment=False, preload_batches=3, verbose=False)[source]
Generate data loaders for training and validation/test datasets.
Parameters: - src (str): The source directory containing the data. - mode (str): The mode of operation. Options are ‘train’ or ‘test’. - image_size (int): The size of the input images. - batch_size (int): The batch size for the data loaders. - classes (list): The list of classes to consider. - n_jobs (int): The number of worker threads for data loading. - validation_split (float): The fraction of data to use for validation. - pin_memory (bool): Whether to pin memory for faster data transfer. - normalize (bool): Whether to normalize the input images. - verbose (bool): Whether to print additional information and show images. - channels (list): The list of channels to retain. Options are [1, 2, 3] for all channels, [1, 2] for blue and green, etc.
Returns: - train_loaders (list): List of data loaders for training datasets. - val_loaders (list): List of data loaders for validation datasets.
- spacr.core.generate_masks_from_imgs(src, model, model_name, batch_size, diameter, cellprob_threshold, flow_threshold, grayscale, save, normalize, channels, percentiles, circular, invert, plot, resize, target_height, target_width, remove_background, background, Signal_to_noise, verbose)[source]
- spacr.core.generate_training_data_file_list(src, target='protein of interest', cell_dim=4, nucleus_dim=5, pathogen_dim=6, channel_of_interest=1, pathogen_size_min=0, nucleus_size_min=0, cell_size_min=0, pathogen_min=0, nucleus_min=0, cell_min=0, target_min=0, mask_chans=[0, 1, 2], filter_data=False, include_noninfected=False, include_multiinfected=False, include_multinucleated=False, cells_per_well=10, save_filtered_filelist=False)[source]
- spacr.core.jitterplot_by_annotation(src, x_column, y_column, plot_title='Jitter Plot', output_path=None, filter_column=None, filter_values=None)[source]
Reads a CSV file and creates a jitter plot of one column grouped by another column.
Args: src (str): Path to the source data. x_column (str): Name of the column to be used for the x-axis. y_column (str): Name of the column to be used for the y-axis. plot_title (str): Title of the plot. Default is ‘Jitter Plot’. output_path (str): Path to save the plot image. If None, the plot will be displayed. Default is None.
Returns: pd.DataFrame: The filtered and balanced DataFrame.
- spacr.core.join_measurments_and_annotation(src, tables=['cell', 'nucleus', 'pathogen', 'cytoplasm'])[source]
- spacr.core.merge_pred_mes(src, pred_loc, target='protein of interest', cell_dim=4, nucleus_dim=5, pathogen_dim=6, channel_of_interest=1, pathogen_size_min=0, nucleus_size_min=0, cell_size_min=0, pathogen_min=0, nucleus_min=0, cell_min=0, target_min=0, mask_chans=[0, 1, 2], filter_data=False, include_noninfected=False, include_multiinfected=False, include_multinucleated=False, cells_per_well=10, save_filtered_filelist=False, verbose=False)[source]
- spacr.core.ml_analysis(df, channel_of_interest=3, location_column='col', positive_control='c2', negative_control='c1', exclude=None, n_repeats=10, top_features=30, n_estimators=100, test_size=0.2, model_type='xgboost', n_jobs=-1, remove_low_variance_features=True, remove_highly_correlated_features=True, verbose=False)[source]
Calculates permutation importance for numerical features in the dataframe, comparing groups based on specified column values and uses the model to predict the class for all other rows in the dataframe.
Args: df (pandas.DataFrame): The DataFrame containing the data. feature_string (str): String to filter features that contain this substring. location_column (str): Column name to use for comparing groups. positive_control, negative_control (str): Values in location_column to create subsets for comparison. exclude (list or str, optional): Columns to exclude from features. n_repeats (int): Number of repeats for permutation importance. top_features (int): Number of top features to plot based on permutation importance. n_estimators (int): Number of trees in the random forest, gradient boosting, or XGBoost model. test_size (float): Proportion of the dataset to include in the test split. random_state (int): Random seed for reproducibility. model_type (str): Type of model to use (‘random_forest’, ‘logistic_regression’, ‘gradient_boosting’, ‘xgboost’). n_jobs (int): Number of jobs to run in parallel for applicable models.
Returns: pandas.DataFrame: The original dataframe with added prediction and data usage columns. pandas.DataFrame: DataFrame containing the importances and standard deviations.
- spacr.core.process_reads(df, min_reads, min_wells, max_wells, gene_column, remove_outliers=False)[source]
- spacr.core.reducer_hyperparameter_search(settings={}, reduction_params=None, dbscan_params=None, kmeans_params=None, save=False)[source]
Perform a hyperparameter search for UMAP or tSNE on the given data.
Parameters: settings (dict): Dictionary containing the following keys:
- src (str): Source directory containing the data.
- row_limit (int): Limit the number of rows to process.
- tables (list): List of table names to read from the database.
- filter_by (str): Column to filter the data.
- sample_size (int): Number of samples to use for the hyperparameter search.
- remove_highly_correlated (bool): Whether to remove highly correlated columns.
- log_data (bool): Whether to log transform the data.
- verbose (bool): Whether to print verbose output.
- reduction_method (str): Dimensionality reduction method (‘UMAP’ or ‘tSNE’).
- reduction_params (list): List of dictionaries containing hyperparameters to test for the reduction method.
- dbscan_params (list): List of dictionaries containing DBSCAN hyperparameters to test.
- kmeans_params (list): List of dictionaries containing KMeans hyperparameters to test.
- pointsize (int): Size of the points in the scatter plot.
- save (bool): Whether to save the resulting plot as a file.
Returns: None
- spacr.core.regression_analasys(dv_df, sequencing_loc, min_reads=75, min_wells=2, max_wells=0, model_type='mlr', min_cells=100, transform='logit', min_frequency=0.05, gene_column='gene', effect_size_threshold=0.25, fishers=True, clean_regression=False, VIF_threshold=10)[source]
- spacr.core.shap_analysis(model, X_train, X_test)[source]
Performs SHAP analysis on the given model and data.
Args: model: The trained model. X_train (pandas.DataFrame): Training feature set. X_test (pandas.DataFrame): Testing feature set.
Returns: fig: Matplotlib figure object containing the SHAP summary plot.
spacr.deep_spacr module
- spacr.deep_spacr.evaluate_model_performance(model, loader, epoch, loss_type)[source]
Evaluates the performance of a model on a given data loader.
- Parameters:
model (torch.nn.Module) – The model to evaluate.
loader (torch.utils.data.DataLoader) – The data loader to evaluate the model on.
loader_name (str) – The name of the data loader.
epoch (int) – The current epoch number.
loss_type (str) – The type of loss function to use.
- Returns:
data_df (pandas.DataFrame): The classification metrics data. prediction_pos_probs (list): The positive class probabilities for each prediction. all_labels (list): The true labels for each prediction.
- Return type:
tuple
- spacr.deep_spacr.test_model_performance(loaders, model, loader_name_list, epoch, loss_type)[source]
Test the performance of a model on given data loaders.
- Parameters:
loaders (list) – List of data loaders.
model – The model to be tested.
loader_name_list (list) – List of names for the data loaders.
epoch (int) – The current epoch.
loss_type – The type of loss function.
- Returns:
A tuple containing the test results and the results dataframe.
- Return type:
tuple
- spacr.deep_spacr.train_model(dst, model_type, train_loaders, epochs=100, learning_rate=0.0001, weight_decay=0.05, amsgrad=False, optimizer_type='adamw', use_checkpoint=False, dropout_rate=0, n_jobs=20, val_loaders=None, test_loaders=None, init_weights='imagenet', intermedeate_save=None, chan_dict=None, schedule=None, loss_type='binary_cross_entropy_with_logits', gradient_accumulation=False, gradient_accumulation_steps=4, channels=['r', 'g', 'b'], verbose=False)[source]
Trains a model using the specified parameters.
- Parameters:
dst (str) – The destination path to save the model and results.
model_type (str) – The type of model to train.
train_loaders (list) – A list of training data loaders.
epochs (int, optional) – The number of training epochs. Defaults to 100.
learning_rate (float, optional) – The learning rate for the optimizer. Defaults to 0.0001.
weight_decay (float, optional) – The weight decay for the optimizer. Defaults to 0.05.
amsgrad (bool, optional) – Whether to use AMSGrad for the optimizer. Defaults to False.
optimizer_type (str, optional) – The type of optimizer to use. Defaults to ‘adamw’.
use_checkpoint (bool, optional) – Whether to use checkpointing during training. Defaults to False.
dropout_rate (float, optional) – The dropout rate for the model. Defaults to 0.
n_jobs (int, optional) – The number of parallel workers for data loading. Defaults to 20.
val_loaders (list, optional) – A list of validation data loaders. Defaults to None.
test_loaders (list, optional) – A list of test data loaders. Defaults to None.
init_weights (str, optional) – The initialization weights for the model. Defaults to ‘imagenet’.
intermedeate_save (list, optional) – The intermediate save thresholds. Defaults to None.
chan_dict (dict, optional) – The channel dictionary. Defaults to None.
schedule (str, optional) – The learning rate schedule. Defaults to None.
loss_type (str, optional) – The loss function type. Defaults to ‘binary_cross_entropy_with_logits’.
gradient_accumulation (bool, optional) – Whether to use gradient accumulation. Defaults to False.
gradient_accumulation_steps (int, optional) – The number of steps for gradient accumulation. Defaults to 4.
- Returns:
None
- spacr.deep_spacr.visualize_grad_cam(src, model_path, target_layers=None, image_size=224, channels=[1, 2, 3], normalize=True, class_names=None, save_cam=False, save_dir='grad_cam')[source]
- spacr.deep_spacr.visualize_integrated_gradients(src, model_path, target_label_idx=0, image_size=224, channels=[1, 2, 3], normalize=True, save_integrated_grads=False, save_dir='integrated_grads')[source]
spacr.graph_learning module
- class spacr.graph_learning.Decoder(hidden_feats, out_feats)[source]
Bases:
Module
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class spacr.graph_learning.Encoder(in_feats, hidden_feats)[source]
Bases:
Module
- forward(g, features)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class spacr.graph_learning.GraphTransformer(in_feats, hidden_feats, out_feats)[source]
Bases:
Module
- forward(g, features)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- spacr.graph_learning.extract_metadata_from_path(path)[source]
Extract metadata from the image path. The path format is expected to be plate_well_field_objectnumber.png
Parameters: path (str): The path to the image file.
Returns: dict: A dictionary with the extracted metadata.
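As an illustration of the expected file-name pattern, the following sketch splits a `plate_well_field_objectnumber.png` name into its four parts. The function name `parse_object_path` and the dictionary keys are assumptions made for this example; the keys returned by `extract_metadata_from_path` are not documented here.

```python
import os


def parse_object_path(path):
    """Split 'plate_well_field_objectnumber.png' into a metadata dict (hypothetical keys)."""
    filename = os.path.splitext(os.path.basename(path))[0]
    parts = filename.split("_")
    if len(parts) != 4:
        raise ValueError(f"Expected plate_well_field_objectnumber, got {filename!r}")
    plate, well, field, object_number = parts
    return {"plate": plate, "well": well, "field": field, "object": object_number}


metadata = parse_object_path("/data/plate1_A01_3_42.png")
print(metadata)
```

The sketch assumes that none of the four components contains an underscore; a name with extra underscores is rejected rather than guessed at.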
- spacr.graph_learning.load_images(image_paths, image_size=224, channels=[1, 2, 3], normalize=True)[source]
spacr.gui module
spacr.gui_core module
- spacr.gui_core.initiate_root(parent, settings_type='mask')[source]
Initializes the root window and sets up the GUI components based on the specified settings type.
- Parameters:
parent (tkinter.Tk or tkinter.Toplevel) – The parent window for the GUI.
settings_type (str, optional) – The type of settings to be displayed in the GUI. Defaults to ‘mask’.
- Returns:
A tuple containing the parent frame and the dictionary of variables used in the GUI.
- Return type:
tuple
- spacr.gui_core.set_globals(thread_control_var, q_var, console_output_var, parent_frame_var, vars_dict_var, canvas_var, canvas_widget_var, scrollable_frame_var, fig_queue_var, figures_var, figure_index_var, progress_bar_var, usage_bars_var, fig_memory_limit_var, figure_current_memory_usage_var)[source]
spacr.gui_elements module
- class spacr.gui_elements.AnnotateApp(root, db_path, src, image_type=None, channels=None, image_size=200, annotation_column='annotate', normalize=False, percentiles=(1, 99), measurement=None, threshold=None)[source]
Bases:
object
- spacr.gui_elements.modify_figure_properties(fig, scale_x=None, scale_y=None, line_width=None, font_size=None, x_lim=None, y_lim=None, grid=False, legend=None, title=None, x_label_rotation=None, remove_axes=False, bg_color=None, text_color=None, line_color=None)[source]
Modifies the properties of the figure, including scaling, line widths, font sizes, axis limits, x-axis label rotation, background color, text color, line color, and other common options.
Parameters: - fig: The Matplotlib figure object to modify. - scale_x: Scaling factor for the width of subplots (optional). - scale_y: Scaling factor for the height of subplots (optional). - line_width: Desired line width for all lines (optional). - font_size: Desired font size for all text (optional). - x_lim: Tuple specifying the x-axis limits (min, max) (optional). - y_lim: Tuple specifying the y-axis limits (min, max) (optional). - grid: Boolean to add grid lines to the plot (optional). - legend: Boolean to show/hide the legend (optional). - title: String to set as the title of the plot (optional). - x_label_rotation: Angle to rotate the x-axis labels (optional). - remove_axes: Boolean to remove or show the axes labels (optional). - bg_color: Color for the figure and subplot background (optional). - text_color: Color for all text in the figure (optional). - line_color: Color for all lines in the figure (optional).
- spacr.gui_elements.set_dark_style(style, parent_frame=None, containers=None, widgets=None, font_family='OpenSans', font_size=12, bg_color='black', fg_color='white', active_color='blue', inactive_color='dark_gray')[source]
- class spacr.gui_elements.spacrButton(parent, text='', command=None, font=None, icon_name=None, size=50, show_text=True, outline=False, animation=True, *args, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrCheck(parent, text='', variable=None, *args, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrCheckbutton(parent, text='', variable=None, command=None, *args, **kwargs)[source]
Bases:
Checkbutton
- class spacr.gui_elements.spacrCombo(parent, textvariable=None, values=None, width=None, *args, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrContainer(parent, orient='vertical', bg=None, *args, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrDropdownMenu(parent, variable, options, command=None, font=None, size=50, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrEntry(parent, textvariable=None, outline=False, width=None, *args, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrFont(font_name, font_style, font_size=12)[source]
Bases:
object
- get_font(size=None)[source]
Returns the font in the specified size.
Parameters: - size: int, the size of the font (optional).
Returns: - tkFont.Font object.
- class spacr.gui_elements.spacrFrame(container, width=None, *args, bg='black', radius=20, scrollbar=True, textbox=False, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrLabel(parent, text='', font=None, style=None, align='right', height=None, **kwargs)[source]
Bases:
Frame
- class spacr.gui_elements.spacrProgressBar(parent, label=True, *args, **kwargs)[source]
Bases:
Progressbar
- class spacr.gui_elements.spacrSwitch(parent, text='', variable=None, command=None, *args, **kwargs)[source]
Bases:
Frame
spacr.gui_utils module
- class spacr.gui_utils.WriteToQueue(q)[source]
Bases:
TextIOBase
A custom file-like class that writes any output to a given queue. This can be used to redirect stdout and stderr.
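A file-like object of this kind can be sketched with the standard library alone. The class below is a hypothetical stand-in (named `QueueWriter`), not the `WriteToQueue` source; it shows how a `TextIOBase` subclass can forward everything written to stdout into a queue.

```python
import contextlib
import io
import queue


class QueueWriter(io.TextIOBase):
    """Minimal file-like object that forwards written text to a queue."""

    def __init__(self, q):
        super().__init__()
        self.q = q

    def writable(self):
        return True

    def write(self, text):
        if text:
            self.q.put(text)
        return len(text)


messages = queue.Queue()
with contextlib.redirect_stdout(QueueWriter(messages)):
    print("segmentation finished")

# Drain the queue, as a GUI console would do periodically.
captured = []
while not messages.empty():
    captured.append(messages.get())
```

A GUI can poll the queue from its main loop and append each message to a console widget, which keeps worker output off the real stdout.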
- spacr.gui_utils.create_input_field(frame, label_text, row, var_type='entry', options=None, default_value=None)[source]
Create an input field in the specified frame.
- Parameters:
frame (tk.Frame) – The frame in which the input field will be created.
label_text (str) – The text to be displayed as the label for the input field.
row (int) – The row in which the input field will be placed.
var_type (str, optional) – The type of input field to create. Defaults to ‘entry’.
options (list, optional) – The list of options for a combo box input field. Defaults to None.
default_value (str, optional) – The default value for the input field. Defaults to None.
- Returns:
A tuple containing the label, input widget, variable, and custom frame.
- Return type:
tuple
- Raises:
Exception – If an error occurs while creating the input field.
- spacr.gui_utils.download_dataset(q, repo_id, subfolder, local_dir=None, retries=5, delay=5)[source]
Downloads a dataset or settings files from Hugging Face and returns the local path.
- Parameters:
repo_id (str) – The repository ID (e.g., ‘einarolafsson/toxo_mito’ or ‘einarolafsson/spacr_settings’).
subfolder (str) – The subfolder path within the repository (e.g., ‘plate1’ or the settings subfolder).
local_dir (str) – The local directory where the files will be saved. Defaults to the user’s home directory.
retries (int) – Number of retry attempts in case of failure.
delay (int) – Delay in seconds between retries.
- Returns:
The local path to the downloaded files.
- Return type:
str
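The `retries` and `delay` parameters describe a retry loop with a fixed pause between attempts. A minimal sketch of that pattern follows; `call_with_retries` and `flaky_download` are hypothetical names, and no Hugging Face API is called. The actual error handling in `download_dataset` may differ.

```python
import time


def call_with_retries(action, retries=5, delay=5, sleep=time.sleep):
    """Call action() until it succeeds or `retries` attempts have failed."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception as exc:  # a real downloader would catch narrower errors
            last_error = exc
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"Gave up after {retries} attempts") from last_error


# Hypothetical flaky download that fails twice before returning a local path.
attempts = []


def flaky_download():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary network failure")
    return "/tmp/plate1"


# delay=0 and a no-op sleep keep the demonstration instant.
local_path = call_with_retries(flaky_download, retries=5, delay=0, sleep=lambda s: None)
```

Passing the sleep function as an argument is a design choice for this sketch only: it lets the loop be exercised without waiting.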
- spacr.gui_utils.function_gui_wrapper(function=None, settings={}, q=None, fig_queue=None, imports=1)[source]
Wraps the run_multiple_simulations function to integrate with GUI processes.
Parameters: - settings: dict, The settings for the run_multiple_simulations function. - q: multiprocessing.Queue, Queue for logging messages to the GUI. - fig_queue: multiprocessing.Queue, Queue for sending figures to the GUI.
- spacr.gui_utils.hide_all_settings(vars_dict, categories)[source]
Function to initially hide all settings in the GUI.
Parameters: - categories: dict, The categories of settings with their corresponding settings. - vars_dict: dict, The dictionary containing the settings and their corresponding widgets.
- spacr.gui_utils.initialize_cuda()[source]
Initializes CUDA in the main process by performing a simple GPU operation.
- spacr.gui_utils.parse_list(value)[source]
Parses a string representation of a list and returns the parsed list.
- Parameters:
value (str) – The string representation of the list.
- Returns:
The parsed list.
- Return type:
list
- Raises:
ValueError – If the input value is not a valid list format or contains mixed types or unsupported types.
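The behaviour described above (parse a list literal, reject mixed or unsupported element types) could be approximated as follows. This is a sketch under assumptions, not the spacr source: the name `parse_list_sketch` is hypothetical, and treating `int`, `float`, and `str` as the supported types is a guess.

```python
import ast


def parse_list_sketch(value):
    """Parse a string such as "[1, 2, 3]" into a list of uniformly typed items."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError) as exc:
        raise ValueError(f"Invalid list format: {value!r}") from exc
    if not isinstance(parsed, list):
        raise ValueError(f"Expected a list, got {type(parsed).__name__}")
    # Assumption: only int, float, and str items are supported (bool is rejected).
    for item in parsed:
        if isinstance(item, bool) or not isinstance(item, (int, float, str)):
            raise ValueError(f"Unsupported item type: {type(item).__name__}")
    if len({type(item) for item in parsed}) > 1:
        raise ValueError("List contains mixed types")
    return parsed


numbers = parse_list_sketch("[1, 2, 3]")
names = parse_list_sketch("['cell', 'nucleus']")
```

`ast.literal_eval` only evaluates Python literals, so arbitrary expressions in a settings field are rejected instead of executed.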
spacr.io module
- class spacr.io.CombineLoaders(train_loaders)[source]
Bases:
object
A class that combines multiple data loaders into a single iterator.
- Parameters:
train_loaders (list) – A list of data loaders.
- train_loaders
A list of data loaders.
- Type:
list
- loader_iters
A list of iterator objects for each data loader.
- Type:
list
- Raises:
StopIteration – If all data loaders have been exhausted.
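One way to combine several loaders into a single iterator is sketched below. The attribute names `train_loaders` and `loader_iters` follow the description above, but the class name `RoundRobinLoaders` and the round-robin draw order are assumptions; `CombineLoaders` may choose the next loader differently.

```python
class RoundRobinLoaders:
    """Iterate over several iterables in turn until all are exhausted."""

    def __init__(self, train_loaders):
        self.train_loaders = train_loaders
        self.loader_iters = [iter(loader) for loader in train_loaders]

    def __iter__(self):
        return self

    def __next__(self):
        while self.loader_iters:
            current = self.loader_iters.pop(0)
            try:
                batch = next(current)
            except StopIteration:
                continue  # this loader is exhausted; drop it and try the next
            self.loader_iters.append(current)  # move to the back of the rotation
            return batch
        raise StopIteration  # every loader has been exhausted


# Plain lists stand in for data loaders here.
batches = list(RoundRobinLoaders([[1, 2, 3], ["a"]]))
print(batches)
```

Any iterable works as a loader in this sketch, so the same class accepts `torch.utils.data.DataLoader` objects without modification.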
- class spacr.io.CombinedDataset(datasets, shuffle=True)[source]
Bases:
Dataset
A dataset that combines multiple datasets into one.
- Parameters:
datasets (list) – A list of datasets to be combined.
shuffle (bool, optional) – Whether to shuffle the combined dataset. Defaults to True.
- class spacr.io.NoClassDataset(data_dir, transform=None, shuffle=True, load_to_memory=False)[source]
Bases:
Dataset
- spacr.io.concatenate_and_normalize(src, channels, save_dtype=<class 'numpy.float32'>, settings={})[source]
- spacr.io.convert_numpy_to_tiff(folder_path, limit=None)[source]
Converts all numpy files in a folder to TIFF format and saves them in a subdirectory ‘tiff’.
Args: folder_path (str): The path to the folder containing numpy files.
- spacr.io.delete_empty_subdirectories(folder_path)[source]
Deletes all empty subdirectories in the specified folder.
Args: - folder_path (str): The path to the folder in which to look for empty subdirectories.
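Removing empty subdirectories is commonly done with a bottom-up directory walk, so that a folder containing only empty folders is itself removed. The sketch below shows that approach with the standard library; `remove_empty_subdirectories` is a hypothetical name, and whether `delete_empty_subdirectories` also removes nested directories is not stated above.

```python
import os
import tempfile


def remove_empty_subdirectories(folder_path):
    """Delete empty subdirectories of folder_path, deepest first; return removed paths."""
    removed = []
    for dirpath, _dirnames, _filenames in os.walk(folder_path, topdown=False):
        if dirpath == folder_path:
            continue  # never remove the top-level folder itself
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "empty", "nested"))
    os.makedirs(os.path.join(root, "full"))
    with open(os.path.join(root, "full", "image.txt"), "w") as handle:
        handle.write("placeholder")

    removed = remove_empty_subdirectories(root)
    remaining = sorted(os.listdir(root))
```

Checking `os.listdir` at visit time, instead of the names reported by `os.walk`, accounts for children that were deleted earlier in the same pass.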
spacr.logger module
spacr.measure module
- spacr.measure.get_components(cell_mask, nucleus_mask, pathogen_mask)[source]
Get the components (nucleus and pathogens) for each cell in the given masks.
- Parameters:
cell_mask (ndarray) – Binary mask of cell labels.
nucleus_mask (ndarray) – Binary mask of nucleus labels.
pathogen_mask (ndarray) – Binary mask of pathogen labels.
- Returns:
A tuple containing two dataframes, nucleus_df and pathogen_df.
- nucleus_df (DataFrame): Dataframe with columns ‘cell_id’ and ‘nucleus’, representing the mapping of each cell to its nucleus.
- pathogen_df (DataFrame): Dataframe with columns ‘cell_id’ and ‘pathogen’, representing the mapping of each cell to its pathogens.
- Return type:
tuple
- spacr.measure.img_list_to_grid(grid, titles=None)[source]
Plot a grid of images with optional titles.
- Parameters:
grid (list) – List of images to be plotted.
titles (list) – List of titles for the images.
- Returns:
The matplotlib figure object containing the image grid.
- Return type:
fig (Figure)
- spacr.measure.measure_crop(settings)[source]
Measure the crop of an image based on the provided settings.
- Parameters:
settings (dict) – The settings for measuring the crop.
- Returns:
None
- spacr.measure.process_meassure_crop_results(partial_results, settings)[source]
Process the results, display, and optionally save the figures.
- Parameters:
partial_results (list) – List of partial results.
settings (dict) – Settings dictionary.
save_figures (bool) – Flag to save figures or not.
- spacr.measure.save_and_add_image_to_grid(png_channels, img_path, grid, plot=False)[source]
Add an image to a grid and save it as PNG.
- Parameters:
png_channels (ndarray) – The array representing the image channels.
img_path (str) – The path to save the image as PNG.
grid (list) – The grid of images to be plotted later.
- Returns:
Updated grid with the new image added.
- Return type:
grid (list)
spacr.plot module
- spacr.plot.generate_mask_random_cmap(mask)[source]
Generate a random colormap based on the unique labels in the given mask.
Parameters: mask (numpy.ndarray): The input mask array.
Returns: matplotlib.colors.ListedColormap: The random colormap.
- spacr.plot.generate_plate_heatmap(df, plate_number, variable, grouping, min_max, min_count)[source]
- spacr.plot.normalize_and_visualize(image, normalized_image, title='')[source]
Utility function for visualization
- spacr.plot.plot_arrays(src, figuresize=10, cmap='inferno', nr=1, normalize=True, q1=1, q2=99)[source]
Plot randomly selected arrays from a given directory.
Parameters: - src (str): The directory path containing the arrays. - figuresize (int): The size of the figure (default: 10). - cmap (str): The colormap to use for displaying the arrays (default: ‘inferno’). - nr (int): The number of arrays to plot (default: 1). - normalize (bool): Whether to normalize the arrays (default: True). - q1 (int): The lower percentile for normalization (default: 1). - q2 (int): The upper percentile for normalization (default: 99).
Returns: None
- spacr.plot.plot_image_mask_overlay(file, channels, cell_channel, nucleus_channel, pathogen_channel, figuresize=10, normalize=True, thickness=3, save_pdf=True)[source]
Plot image and mask overlays.
- spacr.plot.plot_images_and_arrays(folders, lower_percentile=1, upper_percentile=99, threshold=1000, extensions=['.npy', '.tif', '.tiff', '.png'], overlay=False, max_nr=None, randomize=True)[source]
Plot images and arrays from the given folders.
- Parameters:
folders (list) – A list of folder paths containing the images and arrays.
lower_percentile (int, optional) – The lower percentile for image normalization. Defaults to 1.
upper_percentile (int, optional) – The upper percentile for image normalization. Defaults to 99.
threshold (int, optional) – The threshold for determining whether to display an image as a mask or normalize it. Defaults to 1000.
extensions (list, optional) – A list of file extensions to consider. Defaults to [‘.npy’, ‘.tif’, ‘.tiff’, ‘.png’].
overlay (bool, optional) – If True, overlay the outlines of the objects on the image. Defaults to False.
- spacr.plot.plot_masks(batch, masks, flows, cmap='inferno', figuresize=10, nr=1, file_type='.npz', print_object_number=True)[source]
Plot the masks and flows for a given batch of images.
- Parameters:
batch (numpy.ndarray) – The batch of images.
masks (list or numpy.ndarray) – The masks corresponding to the images.
flows (list or numpy.ndarray) – The flows corresponding to the images.
cmap (str, optional) – The colormap to use for displaying the images. Defaults to ‘inferno’.
figuresize (int, optional) – The size of the figure. Defaults to 10.
nr (int, optional) – The maximum number of images to plot. Defaults to 1.
file_type (str, optional) – The file type of the flows. Defaults to ‘.npz’.
print_object_number (bool, optional) – Whether to print the object number on the mask. Defaults to True.
- Returns:
None
- spacr.plot.plot_merged(src, settings)[source]
Plot the merged images after applying various filters and modifications.
- Parameters:
src (path) – Path to folder with images.
settings (dict) – The settings for the plot.
- Returns:
None
- spacr.plot.plot_object_outlines(src, objects=['nucleus', 'cell', 'pathogen'], channels=[0, 1, 2], max_nr=10)[source]
- spacr.plot.random_cmap(num_objects=100)[source]
Generate a random colormap.
Parameters: num_objects (int): The number of objects to generate colors for. Default is 100.
Returns: random_cmap (matplotlib.colors.ListedColormap): A random colormap.
- spacr.plot.read_and_plot__vision_results(base_dir, y_axis='accuracy', name_split='_time', y_lim=[0.8, 0.9])[source]
- spacr.plot.visualize_cellpose_masks(masks, titles=None, filename=None, save=False, src=None)[source]
Visualize multiple masks with optional titles.
- Parameters:
masks (list of np.ndarray) – A list of masks to visualize.
titles (list of str, optional) – A list of titles for the masks. If None, default titles will be used.
comparison_title (str) – Title for the entire figure.
spacr.sequencing module
- spacr.sequencing.check_normality(data, variable_name, verbose=False)[source]
Check if the data is normally distributed using the Shapiro-Wilk test.
- spacr.sequencing.consensus_sequence(fastq_r1, fastq_r2, output_file, chunk_size=1000000, n_jobs=None)[source]
Calculate the consensus sequence from two FASTQ files (R1 and R2) and write the result to an output file.
Parameters: - fastq_r1 (str): Path to the R1 FASTQ file. - fastq_r2 (str): Path to the R2 FASTQ file. - output_file (str): Path to the output file where the consensus sequence will be written. - chunk_size (int): Number of reads to process in each chunk. Default is 1000000. - n_jobs (int): Number of parallel processes to use. If None, it will use the number of available CPUs minus 2.
Returns: None
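How the consensus is computed is not documented above. One common approach for a pair of aligned reads is to keep, at each position, the base with the higher quality score; the sketch below shows only that idea. The function name `consensus_read`, the Phred+33 encoding, and the assumption that both reads are already in the same orientation and of equal length are all illustrative choices, not a description of `consensus_sequence`.

```python
def consensus_read(seq1, qual1, seq2, qual2):
    """Per position, keep the base (and quality) from the read with the higher quality."""
    bases, quals = [], []
    for base1, q1, base2, q2 in zip(seq1, qual1, seq2, qual2):
        # With Phred+33 encoding, comparing the characters compares the scores.
        if q1 >= q2:
            bases.append(base1)
            quals.append(q1)
        else:
            bases.append(base2)
            quals.append(q2)
    return "".join(bases), "".join(quals)


# R1 has a low-quality call ('#') at the second position; R2 is confident there.
sequence, quality = consensus_read("ACGT", "I#II", "ATGT", "IIII")
print(sequence, quality)
```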
- spacr.sequencing.consensus_sequence_v1(fastq_r1, fastq_r2, output_file, chunk_size=1000000)[source]
Generate a consensus sequence from paired-end FASTQ files.
- Parameters:
fastq_r1 (str) – Path to the first input FASTQ file.
fastq_r2 (str) – Path to the second input FASTQ file.
output_file (str) – Path to the output FASTQ file.
chunk_size (int, optional) – Number of reads to process in each iteration. Defaults to 1000000.
- Returns:
None
- spacr.sequencing.extract_barcodes_from_fastq(fastq, output_file, chunk_size, barcode_mapping, n_jobs=None, compression='zlib', complevel=9)[source]
Extracts barcodes from a FASTQ file and maps them based on a barcode mapping.
- Parameters:
fastq (str) – Path to the input FASTQ file.
output_file (str) – Path to the output file where the mapped barcodes will be saved.
chunk_size (int) – Number of records to process in each chunk.
barcode_mapping (dict) – Dictionary containing barcode mapping information. The keys are the names of the barcode sets, and the values are tuples containing the path to the CSV file, barcode coordinates, and reverse complement flag.
n_jobs (int, optional) – Number of parallel processes to use for mapping. Defaults to None.
compression (str, optional) – Compression algorithm to use for saving the output file. Defaults to ‘zlib’.
complevel (int, optional) – Compression level to use for saving the output file. Defaults to 9.
- Returns:
None
- spacr.sequencing.extract_barcodes_from_fastq_v1(fastq, output_file, chunk_size, barcode_mapping, n_jobs=None, compression='zlib', complevel=9)[source]
Extracts barcodes from a FASTQ file and saves the results to an output file.
Parameters: - fastq (str): Path to the input FASTQ file. - output_file (str): Path to the output file where the barcode data will be saved. - chunk_size (int): Number of records to process in each chunk. - barcode_mapping (dict): Mapping of barcode keys to CSV file paths, barcode coordinates, and reverse complement flags. - n_jobs (int, optional): Number of parallel processes to use for barcode mapping. Defaults to None. - compression (str, optional): Compression algorithm to use for the output file. Defaults to ‘zlib’. - complevel (int, optional): Compression level to use for the output file. Defaults to 9.
- spacr.sequencing.generate_fraction_map(df, gene_column, min_=10, plates=['p1', 'p2', 'p3', 'p4'], metric='count', plot=False)[source]
- spacr.sequencing.get_top_two_matches(seq, barcode_dict)[source]
Finds the top two closest matches for a given sequence in a barcode dictionary.
- Parameters:
seq (str) – The sequence to find the closest matches for.
barcode_dict (dict) – A dictionary containing barcodes as keys and their corresponding values.
- Returns:
A list containing up to two tuples, each with a barcode match and its score.
- Return type:
list of tuples
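The scoring rule behind the two best matches is not stated here. A minimal sketch, assuming a simple per-position identity score (the helper name `top_two_matches` and the scoring function are hypothetical, not spacr's implementation):

```python
def top_two_matches(seq, barcode_dict):
    """Return up to two (barcode, score) tuples, best score first."""
    def score(a, b):
        # Hypothetical scoring: number of positions where the bases agree.
        return sum(x == y for x, y in zip(a, b))

    scored = [(barcode, score(seq, barcode)) for barcode in barcode_dict]
    # Python's sort is stable, so ties keep the dictionary's insertion order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:2]


barcodes = {"ACGT": "well_A1", "ACGA": "well_A2", "TTTT": "well_A3"}
matches = top_two_matches("ACGT", barcodes)
```

Keeping the runner-up score alongside the best one makes it possible to discard reads whose two top candidates are nearly tied.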
- spacr.sequencing.grna_plate_heatmap(path, specific_grna=None, min_max='all', cmap='viridis', min_count=0, save=True)[source]
Generate a heatmap of gRNA plate data.
- Parameters:
path (str) – The path to the CSV file containing the gRNA plate data.
specific_grna (str, optional) – The specific gRNA to filter the data for. Defaults to None.
min_max (str or list or tuple, optional) – The range of values to use for the color scale. If ‘all’, the range will be determined by the minimum and maximum values in the data. If ‘allq’, the range will be determined by the 2nd and 98th percentiles of the data. If a list or tuple of two values, the range will be determined by those values. Defaults to ‘all’.
cmap (str, optional) – The colormap to use for the heatmap. Defaults to ‘viridis’.
min_count (int, optional) – The minimum count threshold for including a gRNA in the heatmap. Defaults to 0.
save (bool, optional) – Whether to save the heatmap as a PDF file. Defaults to True.
- Returns:
The generated heatmap figure.
- Return type:
matplotlib.figure.Figure
- spacr.sequencing.parse_gz_files(folder_path)[source]
Parses the .fastq.gz files in the specified folder path and returns a dictionary containing the sample names and their corresponding file paths.
- Parameters:
folder_path (str) – The path to the folder containing the .fastq.gz files.
- Returns:
A dictionary where the keys are the sample names and the values are dictionaries containing the file paths for the ‘R1’ and ‘R2’ read directions.
- Return type:
dict
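The nested dictionary described above can be sketched as follows. The file-naming convention (a `_R1` or `_R2` marker after the sample name) and the helper name `parse_fastq_gz_folder` are assumptions for illustration; the actual matching rules may differ.

```python
import os


def parse_fastq_gz_folder(folder_path):
    """Map sample name -> {'R1': path, 'R2': path} for .fastq.gz files."""
    samples = {}
    for name in sorted(os.listdir(folder_path)):
        if not name.endswith(".fastq.gz"):
            continue
        stem = name[: -len(".fastq.gz")]
        for direction in ("R1", "R2"):
            marker = "_" + direction
            if marker in stem:
                # Assumed convention: text before "_R1"/"_R2" is the sample name.
                sample = stem.split(marker)[0]
                path = os.path.join(folder_path, name)
                samples.setdefault(sample, {})[direction] = path
                break
    return samples
```

Files without a read-direction marker, and files with other extensions, are skipped in this sketch.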
- spacr.sequencing.plot_data(df, v, h, color, n_col, ax, x_axis, y_axis, fontsize=12, lw=2, ls='-', log_x=False, log_y=False, title=None)[source]
- spacr.sequencing.process_chunk_for_consensus(r1_chunk, r2_chunk)[source]
Process a chunk of paired-end sequencing reads to generate consensus sequences.
- Parameters:
r1_chunk (list) – List of SeqRecord objects representing the first read in each pair.
r2_chunk (list) – List of SeqRecord objects representing the second read in each pair.
- Returns:
List of SeqRecord objects representing the consensus sequences.
- Return type:
list
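How the two reads are merged is not described here. One common approach, shown as a sketch only, keeps the higher-quality base at each position. It assumes plain strings and quality lists rather than SeqRecord objects, and that read 2 has already been oriented and aligned to read 1; `consensus_pair` is a hypothetical name.

```python
def consensus_pair(seq1, qual1, seq2, qual2):
    """Merge two equal-length, already aligned reads base by base."""
    bases, quals = [], []
    for b1, q1, b2, q2 in zip(seq1, qual1, seq2, qual2):
        # Assumed rule: keep the base with the higher quality score
        # (ties go to read 1).
        if q1 >= q2:
            bases.append(b1)
            quals.append(q1)
        else:
            bases.append(b2)
            quals.append(q2)
    return "".join(bases), quals


merged_seq, merged_qual = consensus_pair(
    "ACGT", [30, 10, 30, 30],
    "ACCA", [20, 40, 10, 30],
)
```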
- spacr.sequencing.process_chunk_for_mapping(records, barcode_mapping, barcode_dicts, barcode_coordinates, reverse_complements)[source]
Process a chunk of records for barcode mapping, including highest and second-highest scores.
- Parameters:
records (list) – A list of records to process.
barcode_mapping (dict) – A dictionary mapping barcodes to their corresponding keys.
barcode_dicts (dict) – A dictionary of barcode dictionaries.
barcode_coordinates (dict) – A dictionary mapping barcode keys to their start and end coordinates.
reverse_complements (dict) – A dictionary indicating whether to reverse complement the extracted sequences for each barcode key.
- Returns:
A DataFrame containing the processed data.
- Return type:
pandas.DataFrame
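The roles of `barcode_coordinates` and `reverse_complements` can be illustrated with a small sketch. It assumes 0-based, half-open (start, end) coordinates and plain string reads; `extract_barcode` is a hypothetical helper, not part of spacr.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def extract_barcode(read, start, end, reverse_complement=False):
    """Slice read[start:end] and optionally reverse complement it."""
    fragment = read[start:end]
    if reverse_complement:
        fragment = fragment.translate(COMPLEMENT)[::-1]
    return fragment


read = "AACCGATTAC"
coordinates = {"row": (0, 4), "column": (4, 8)}    # assumed layout
flags = {"row": False, "column": True}             # assumed layout
extracted = {
    key: extract_barcode(read, *coordinates[key], flags[key])
    for key in coordinates
}
```

Each extracted fragment would then be compared against the barcode dictionary for its key.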
- spacr.sequencing.process_scores(df, dependent_variable, plate, min_cell_count=25, agg_type='mean', transform=None, regression_type='ols')[source]
- spacr.sequencing.regression(df, csv_path, dependent_variable='predictions', regression_type=None, alpha=1.0, remove_row_column_effect=False)[source]
- spacr.sequencing.regression_model(X, y, regression_type='ols', groups=None, alpha=1.0, remove_row_column_effect=True)[source]
- spacr.sequencing.save_to_hdf(queue, output_file, complevel=9, compression='zlib')[source]
Save data from a queue to an HDF file.
Parameters: - queue: Queue object containing chunks of data to be saved - output_file: Path to the output HDF file - complevel: Compression level (default: 9) - compression: Compression algorithm (default: ‘zlib’)
Returns: None
spacr.settings module
spacr.sim module
- spacr.sim.append_database(src, table, table_name)[source]
Append a pandas DataFrame to an SQLite database table.
Parameters: src (str): The source directory where the database file is located. table (pandas.DataFrame): The DataFrame to be appended to the database table. table_name (str): The name of the database table.
Returns: None
- spacr.sim.calculate_permutation_importance(df, target='prauc', exclude=None, n_repeats=10, clean=True)[source]
Calculates permutation importance for the given features in the dataframe.
Args: df (pandas.DataFrame): The DataFrame containing the data. target (str): The name of the target variable column. exclude (list or str, optional): Column names to exclude from features. n_repeats (int): Number of times each feature is permuted.
Returns: dict: Dictionary containing the importances and standard deviations.
- spacr.sim.cell_level_roc_auc(cell_scores)[source]
Compute the ROC AUC and precision-recall metrics at the cell level.
- Parameters:
cell_scores (list) – List of scores for each cell.
- Returns:
A tuple containing the following:
cell_roc_dict_df (DataFrame): DataFrame containing the ROC AUC metrics for each cell.
cell_pr_dict_df (DataFrame): DataFrame containing the precision-recall metrics for each cell.
cell_scores (list): Updated list of scores after applying the optimum threshold.
cell_cm (array): Confusion matrix for the cell-level classification.
- Return type:
tuple
- spacr.sim.classifier(positive_mean, positive_variance, negative_mean, negative_variance, classifier_accuracy, df)[source]
Classifies the data in the DataFrame based on the given parameters and a classifier error rate.
- Parameters:
positive_mean (float) – The mean of the positive distribution.
positive_variance (float) – The variance of the positive distribution.
negative_mean (float) – The mean of the negative distribution.
negative_variance (float) – The variance of the negative distribution.
classifier_accuracy (float) – The likelihood (0 to 1) that a gene is correctly classified according to its true label.
df (pandas.DataFrame) – The DataFrame containing the data to be classified.
- Returns:
The DataFrame with an additional ‘score’ column containing the classification scores.
- Return type:
pandas.DataFrame
- spacr.sim.classifier_v2(positive_mean, positive_variance, negative_mean, negative_variance, df)[source]
Classifies the data in the DataFrame based on the given parameters.
- Parameters:
positive_mean (float) – The mean of the positive distribution.
positive_variance (float) – The variance of the positive distribution.
negative_mean (float) – The mean of the negative distribution.
negative_variance (float) – The variance of the negative distribution.
df (pandas.DataFrame) – The DataFrame containing the data to be classified.
- Returns:
The DataFrame with an additional ‘score’ column containing the classification scores.
- Return type:
pandas.DataFrame
- spacr.sim.compute_precision_recall(cell_scores)[source]
Compute precision, recall, F1 score, and PR AUC for a given set of cell scores.
Parameters: - cell_scores (DataFrame): A DataFrame containing the cell scores with columns ‘is_active’ and ‘score’.
Returns: - cell_pr_dict (dict): A dictionary containing the computed precision, recall, F1 score, PR AUC, and threshold values.
- spacr.sim.compute_roc_auc(cell_scores)[source]
Compute the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) for cell scores.
Parameters: - cell_scores (DataFrame): DataFrame containing cell scores with columns ‘is_active’ and ‘score’.
Returns: - cell_roc_dict (dict): Dictionary containing the ROC curve information, including the threshold, true positive rate (TPR), false positive rate (FPR), and ROC AUC.
- spacr.sim.create_database(db_path)[source]
Creates a SQLite database at the specified path.
- Parameters:
db_path (str) – The path where the database should be created.
- Returns:
None
- spacr.sim.dist_gen(mean, sd, df)[source]
Generate a Poisson distribution based on a gamma distribution.
Parameters: mean (float): Mean of the gamma distribution. sd (float): Standard deviation of the gamma distribution. df (pandas.DataFrame): Input data.
Returns: tuple: A tuple containing the generated Poisson distribution and the length of the input data.
- spacr.sim.generate_gene_list(number_of_genes, number_of_all_genes)[source]
Generates a list of randomly selected genes.
- Parameters:
number_of_genes (int) – The number of genes to be selected.
number_of_all_genes (int) – The total number of genes available.
- Returns:
A list of randomly selected genes.
- Return type:
list
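A minimal sketch of the selection step, assuming genes are represented by integer indices from 0 to `number_of_all_genes - 1` and sampled without replacement. The `seed` argument and the name `sample_gene_ids` are additions for illustration.

```python
import random


def sample_gene_ids(number_of_genes, number_of_all_genes, seed=None):
    """Pick number_of_genes distinct gene indices at random."""
    rng = random.Random(seed)  # seed is optional, for reproducibility
    return rng.sample(range(number_of_all_genes), number_of_genes)


selected = sample_gene_ids(3, 10, seed=0)
```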
- spacr.sim.generate_gene_weights(positive_mean, positive_variance, df)[source]
Generate gene weights using a beta distribution.
Parameters: - positive_mean (float): The mean value for the positive distribution. - positive_variance (float): The variance value for the positive distribution. - df (pandas.DataFrame): The DataFrame containing the data.
Returns: - weights (numpy.ndarray): An array of gene weights generated using a beta distribution.
- spacr.sim.generate_paramiters(settings)[source]
Generate a list of parameter sets for simulation based on the given settings.
- Parameters:
settings (dict) – A dictionary containing the simulation settings.
- Returns:
A list of parameter sets for simulation.
- Return type:
list
- spacr.sim.generate_plate_map(nr_plates)[source]
Generate a plate map based on the number of plates.
Parameters: nr_plates (int): The number of plates to generate the map for.
Returns: pandas.DataFrame: The generated plate map dataframe.
- spacr.sim.generate_power_law_distribution(num_elements, coeff)[source]
Generate a power law distribution.
Parameters: - num_elements (int): The number of elements in the distribution. - coeff (float): The coefficient of the power law.
Returns: - normalized_distribution (ndarray): The normalized power law distribution.
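The exact functional form is not given here. A sketch under the assumption that element `rank` receives weight `rank ** -coeff` and that "normalized" means the weights sum to 1 (`power_law_weights` is a hypothetical name):

```python
def power_law_weights(num_elements, coeff):
    """Weights proportional to rank ** -coeff, normalized to sum to 1."""
    raw = [rank ** -coeff for rank in range(1, num_elements + 1)]
    total = sum(raw)
    return [value / total for value in raw]


weights = power_law_weights(4, 1.0)
```

With `coeff` equal to 0 every element gets the same weight; larger coefficients concentrate the mass on the first few elements.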
- spacr.sim.generate_shap_summary_plot(df, target='prauc', clean=True)[source]
Generates a SHAP summary plot for the given features in the dataframe.
Args: df (pandas.DataFrame): The DataFrame containing the data. target (str): The name of the target variable column.
Returns: None
- spacr.sim.generate_well_score(cell_scores)[source]
Generate well scores based on cell scores.
- Parameters:
cell_scores (DataFrame) – DataFrame containing cell scores.
- Returns:
DataFrame containing well scores with average active score, gene list, and score.
- Return type:
DataFrame
- spacr.sim.get_optimum_threshold(cell_pr_dict)[source]
Calculates the optimum threshold based on the f1_score in the given cell_pr_dict.
Parameters: cell_pr_dict (dict): A dictionary containing precision, recall, and f1_score values for different thresholds.
Returns: float: The optimum threshold value.
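Selecting the threshold with the highest F1 score can be sketched as below. The dictionary layout (parallel lists under the keys `threshold` and `f1_score`) is an assumption; only the `f1_score` name appears in the description above.

```python
def optimum_threshold(pr_dict):
    """Return the threshold whose F1 score is largest."""
    f1_scores = pr_dict["f1_score"]
    best_index = max(range(len(f1_scores)), key=lambda i: f1_scores[i])
    return pr_dict["threshold"][best_index]


pr = {
    "threshold": [0.2, 0.5, 0.8],
    "precision": [0.5, 0.8, 1.0],
    "recall": [1.0, 0.8, 0.4],
}
# F1 is the harmonic mean of precision and recall.
pr["f1_score"] = [
    2 * p * r / (p + r) for p, r in zip(pr["precision"], pr["recall"])
]
best = optimum_threshold(pr)
```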
- spacr.sim.gini(x)[source]
Calculate the Gini coefficient for a given array of values.
Parameters: x (array-like): The input array of values.
Returns: float: The Gini coefficient.
References: - Based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif - From: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm - All values are treated equally, arrays must be 1d.
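For a sorted 1-D array of n non-negative values, a widely used closed form is G = sum((2i - n - 1) * x_i) / (n * sum(x)), with i running from 1 to n. The sketch below implements that form in plain Python; whether it matches this function line for line is an assumption, and `gini_sketch` is a hypothetical name.

```python
def gini_sketch(values):
    """Gini coefficient of a 1-D collection of non-negative numbers."""
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # assumed convention for empty or all-zero input
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


equal = gini_sketch([1, 1, 1, 1])      # 0.0
unequal = gini_sketch([0, 0, 0, 1])    # 0.75
```

With this form the maximum for n values is (n - 1) / n, so a single non-zero entry among four gives 0.75 rather than 1.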
- spacr.sim.gini_coefficient(x)[source]
Compute Gini coefficient of array of values.
Parameters: x (array-like): Array of values.
Returns: float: Gini coefficient.
- spacr.sim.gini_gene_well(x)[source]
Calculate the Gini coefficient for a given income distribution.
The Gini coefficient measures income inequality in a population. A value of 0 represents perfect income equality (everyone has the same income), while a value of 1 represents perfect income inequality (one individual has all the income).
Parameters: x (array-like): An array-like object representing the income distribution.
Returns: float: The Gini coefficient for the given income distribution.
- spacr.sim.normalize_array(arr)[source]
Normalize an array by scaling its values between 0 and 1.
Parameters: arr (numpy.ndarray): The input array to be normalized.
Returns: numpy.ndarray: The normalized array.
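Scaling to the 0 to 1 range is usually done with min-max normalization, (x - min) / (max - min). A plain-Python sketch, using lists instead of a NumPy array; the handling of constant input is an assumption:

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: constant input maps to zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([2, 4, 6])
```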
- spacr.sim.plot_confusion_matrix(data, ax, title)[source]
Plots a confusion matrix using a heatmap.
Parameters: data (numpy.ndarray): The confusion matrix data. ax (matplotlib.axes.Axes): The axes object to plot the heatmap on. title (str): The title of the plot.
Returns: None
- spacr.sim.plot_correlation_matrix(df, annot=False, cmap='inferno', clean=True)[source]
Plots a correlation matrix for the columns of the DataFrame.
Args: df (pandas.DataFrame): The DataFrame containing the data. annot (bool, optional): Whether to annotate the cells. Defaults to False. cmap (str, optional): The colormap to use. Defaults to ‘inferno’.
Returns: None
- spacr.sim.plot_feature_importance(df, target='prauc', exclude=None, clean=True)[source]
Trains a RandomForestRegressor to determine the importance of each feature in predicting the target.
Args: df (pandas.DataFrame): The DataFrame containing the data. target (str): The target variable column name. exclude (list or str, optional): Column names to exclude from features.
Returns: matplotlib.figure.Figure: The figure object containing the feature importance plot.
- spacr.sim.plot_histogram(data, x_label, ax, color, title, binwidth=0.01, log=False)[source]
Plots a histogram of the given data.
Parameters: - data: The data to be plotted. - x_label: The label for the x-axis. - ax: The matplotlib axis object to plot on. - color: The color of the histogram bars. - title: The title of the plot. - binwidth: The width of each histogram bin. - log: Whether to use a logarithmic scale for the y-axis.
Returns: None
- spacr.sim.plot_partial_dependences(df, target='prauc', clean=True)[source]
Creates partial dependence plots for the specified features, with improved layout to avoid text overlap.
Args: df (pandas.DataFrame): The DataFrame containing the data. target (str): The target variable.
Returns: None
- spacr.sim.plot_roc_pr(data, ax, title, x_label, y_label)[source]
Plot the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) curves.
Parameters: - data: DataFrame containing the data to be plotted. - ax: The matplotlib axes object to plot on. - title: The title of the plot. - x_label: The label for the x-axis. - y_label: The label for the y-axis.
- spacr.sim.plot_simulations(df, variable, x_rotation=None, legend=False, grid=False, clean=True, verbose=False)[source]
Creates separate line plots for ‘prauc’ against a specified ‘variable’, for each unique combination of conditions defined by ‘grouping_vars’, displayed on a grid.
Args: df (pandas.DataFrame): DataFrame containing the necessary columns. variable (str): Name of the column to use as the x-axis for grouping and plotting. x_rotation (int, optional): Degrees to rotate the x-axis labels. legend (bool, optional): Whether to display a legend. grid (bool, optional): Whether to display grid lines. verbose (bool, optional): Whether to print the filter conditions.
Returns: None
- spacr.sim.power_law_dist_gen(df, avg, well_ineq_coeff)[source]
Generate a power-law distribution for wells.
Parameters: - df: DataFrame: The input DataFrame containing the wells. - avg: float: The average value for the distribution. - well_ineq_coeff: float: The inequality coefficient for the power-law distribution.
Returns: - dist: ndarray: The generated power-law distribution for the wells.
- spacr.sim.read_simulations_table(db_path)[source]
Reads the ‘simulations’ table from an SQLite database into a pandas DataFrame.
Args: db_path (str): The file path to the SQLite database.
Returns: pandas.DataFrame: DataFrame containing the ‘simulations’ table data.
- spacr.sim.regression_roc_auc(results_df, active_gene_list, control_gene_list, alpha=0.05, optimal=False)[source]
Calculate regression ROC AUC and other statistics.
Parameters: results_df (DataFrame): DataFrame containing the results of regression analysis. active_gene_list (list): List of active gene IDs. control_gene_list (list): List of control gene IDs. alpha (float, optional): Significance level for determining hits. Default is 0.05. optimal (bool, optional): Whether to use the optimal threshold for classification. Default is False.
Returns: tuple: A tuple containing the following: - results_df (DataFrame): Updated DataFrame with additional columns. - reg_roc_dict_df (DataFrame): DataFrame containing regression ROC curve data. - reg_pr_dict_df (DataFrame): DataFrame containing precision-recall curve data. - reg_cm (ndarray): Confusion matrix. - sim_stats (DataFrame): DataFrame containing simulation statistics.
- spacr.sim.remove_columns_with_single_value(df)[source]
Removes columns from the DataFrame that have the same value in all rows.
Args: df (pandas.DataFrame): The original DataFrame.
Returns: pandas.DataFrame: A DataFrame with the columns removed that contained only one unique value.
- spacr.sim.remove_constant_columns(df)[source]
Removes columns in the DataFrame where all entries have the same value.
Parameters: df (pd.DataFrame): The input DataFrame from which to remove constant columns.
Returns: pd.DataFrame: A DataFrame with the constant columns removed.
- spacr.sim.run_and_save(i, settings, time_ls, total_sims)[source]
Run the simulation and save the results.
- Parameters:
i (int) – The simulation index.
settings (dict) – The simulation settings.
time_ls (list) – The list to store simulation times.
total_sims (int) – The total number of simulations.
- Returns:
A tuple containing the simulation index, simulation time, and None.
- Return type:
tuple
- spacr.sim.run_experiment(plate_map, number_of_genes, active_gene_list, avg_genes_per_well, sd_genes_per_well, avg_cells_per_well, sd_cells_per_well, well_ineq_coeff, gene_ineq_coeff)[source]
Run a simulation experiment.
- Parameters:
plate_map (DataFrame) – The plate map containing information about the wells.
number_of_genes (int) – The total number of genes.
active_gene_list (list) – The list of active genes.
avg_genes_per_well (float) – The average number of genes per well.
sd_genes_per_well (float) – The standard deviation of genes per well.
avg_cells_per_well (float) – The average number of cells per well.
sd_cells_per_well (float) – The standard deviation of cells per well.
well_ineq_coeff (float) – The coefficient for well inequality.
gene_ineq_coeff (float) – The coefficient for gene inequality.
- Returns:
- A tuple containing the following:
cell_df (DataFrame): The DataFrame containing information about the cells.
genes_per_well_df (DataFrame): The DataFrame containing gene counts per well.
wells_per_gene_df (DataFrame): The DataFrame containing well counts per gene.
df_ls (list): A list containing gene counts per well, well counts per gene, Gini coefficients for wells, Gini coefficients for genes, gene weights array, and well weights.
- Return type:
tuple
- spacr.sim.run_multiple_simulations(settings)[source]
Run multiple simulations in parallel using the provided settings.
- Parameters:
settings (dict) – A dictionary containing the simulation settings.
- Returns:
None
- spacr.sim.run_simulation(settings)[source]
Run the simulation based on the given settings.
- Parameters:
settings (dict) – A dictionary containing the simulation settings.
- Returns:
A tuple containing the simulation results and distances:
cell_scores (DataFrame): Scores for each cell.
cell_roc_dict_df (DataFrame): ROC AUC scores for each cell.
cell_pr_dict_df (DataFrame): Precision-Recall AUC scores for each cell.
cell_cm (DataFrame): Confusion matrix for each cell.
well_score (DataFrame): Scores for each well.
gene_fraction_map (DataFrame): Fraction of genes for each well.
metadata (DataFrame): Metadata for each well.
results_df (DataFrame): Results of the regression analysis.
reg_roc_dict_df (DataFrame): ROC AUC scores for each gene.
reg_pr_dict_df (DataFrame): Precision-Recall AUC scores for each gene.
reg_cm (DataFrame): Confusion matrix for each gene.
sim_stats (dict): Additional simulation statistics.
genes_per_well_df (DataFrame): Number of genes per well.
wells_per_gene_df (DataFrame): Number of wells per gene.
dists (list): List of distances.
- Return type:
tuple
- spacr.sim.save_data(src, output, settings, save_all=False, i=0, variable='all')[source]
Save simulation data to specified location.
- Parameters:
src (str) – The directory path where the data will be saved.
output (list) – A list of dataframes containing simulation output.
settings (dict) – A dictionary containing simulation settings.
save_all (bool, optional) – Flag indicating whether to save all tables or only a subset. Defaults to False.
i (int, optional) – The simulation number. Defaults to 0.
variable (str, optional) – The variable name. Defaults to ‘all’.
- Returns:
None
- spacr.sim.save_plot(fig, src, variable, i)[source]
Save a matplotlib figure as a PDF file.
Parameters: - fig: The matplotlib figure to be saved. - src: The directory where the file will be saved. - variable: The name of the variable being plotted. - i: The index of the figure.
Returns: None
- spacr.sim.sequence_plates(well_score, number_of_genes, avg_reads_per_gene, sd_reads_per_gene, sequencing_error=0.01)[source]
Simulates the sequencing of plates and calculates gene fractions and metadata.
Parameters: well_score (pd.DataFrame): DataFrame containing well scores and gene lists. number_of_genes (int): Number of genes. avg_reads_per_gene (float): Average number of reads per gene. sd_reads_per_gene (float): Standard deviation of reads per gene. sequencing_error (float, optional): Probability of introducing sequencing error. Defaults to 0.01.
Returns: gene_fraction_map (pd.DataFrame): DataFrame containing gene fractions for each well. metadata (pd.DataFrame): DataFrame containing metadata for each well.
- spacr.sim.update_scores_and_get_cm(cell_scores, optimum)[source]
Update the cell scores based on the given optimum value and calculate the confusion matrix.
- Parameters:
cell_scores (DataFrame) – The DataFrame containing the cell scores.
optimum (float) – The optimum value used for updating the scores.
- Returns:
A tuple containing the updated cell scores DataFrame and the confusion matrix.
- Return type:
tuple
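A sketch of the two steps, thresholding scores and tallying a 2x2 confusion matrix. It uses plain lists instead of a DataFrame, and the `>=` comparison and the row/column orientation are assumptions; `threshold_and_confusion` is a hypothetical name.

```python
def threshold_and_confusion(scores, labels, optimum):
    """Binarize scores at `optimum` and count outcomes per true label."""
    predictions = [1 if s >= optimum else 0 for s in scores]
    cm = [[0, 0], [0, 0]]  # rows: true label, columns: predicted label
    for truth, pred in zip(labels, predictions):
        cm[truth][pred] += 1
    return predictions, cm


scores = [0.1, 0.4, 0.6, 0.9]
labels = [0, 1, 0, 1]
preds, cm = threshold_and_confusion(scores, labels, optimum=0.5)
```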
- spacr.sim.validate_and_adjust_beta_params(sim_params)[source]
Validates and adjusts Beta distribution parameters in simulation settings to ensure they define a valid distribution.
Args: sim_params (list of dict): List of dictionaries, each containing the simulation parameters.
Returns: list of dict: The adjusted list of simulation parameter sets.
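A Beta distribution with mean m (between 0 and 1) and variance v exists only when v < m(1 - m). The sketch below checks that bound and derives the shape parameters by the method of moments. The adjustment rule (shrinking the variance to 99% of the bound) and the name `beta_params` are assumptions; the actual adjustment may differ.

```python
def beta_params(mean, variance):
    """Return (alpha, beta, variance_used) for a Beta distribution."""
    if not 0 < mean < 1:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_variance = mean * (1 - mean)
    if variance >= max_variance:
        # Assumed adjustment: pull the variance just under the bound.
        variance = 0.99 * max_variance
    common = max_variance / variance - 1
    return mean * common, (1 - mean) * common, variance


alpha, beta, used = beta_params(0.5, 0.05)        # valid as given
alpha2, beta2, used2 = beta_params(0.5, 0.5)      # variance too large
```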
- spacr.sim.vis_dists(dists, src, v, i)[source]
Visualizes the distributions of given distances.
- Parameters:
dists (list) – List of distance arrays.
src (str) – Source directory for saving the plot.
v (int) – Number of vertices.
i (int) – Index of the plot.
- Returns:
None
- spacr.sim.visualize_all(output)[source]
Visualizes various plots based on the given output data.
- Parameters:
output (list) – A list containing the following elements:
cell_scores (DataFrame): DataFrame containing cell scores.
cell_roc_dict_df (DataFrame): DataFrame containing ROC curve data for cell classification.
cell_pr_dict_df (DataFrame): DataFrame containing precision-recall curve data for cell classification.
cell_cm (array-like): Confusion matrix for cell classification.
well_score (DataFrame): DataFrame containing well scores.
gene_fraction_map (dict): Dictionary mapping genes to fractions.
metadata (dict): Dictionary containing metadata.
results_df (DataFrame): DataFrame containing results.
reg_roc_dict_df (DataFrame): DataFrame containing ROC curve data for gene regression.
reg_pr_dict_df (DataFrame): DataFrame containing precision-recall curve data for gene regression.
reg_cm (array-like): Confusion matrix for gene regression.
sim_stats (dict): Dictionary containing simulation statistics.
genes_per_well_df (DataFrame): DataFrame containing genes per well data.
wells_per_gene_df (DataFrame): DataFrame containing wells per gene data.
- Returns:
The generated figure object.
- Return type:
matplotlib.figure.Figure
spacr.sim_app module
spacr.timelapse module
spacr.utils module
- class spacr.utils.Cache(max_size)[source]
Bases:
object
A class representing a cache with a maximum size.
- max_size
The maximum size of the cache.
- Type:
int
- cache
The cache data structure.
- Type:
OrderedDict
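A size-limited cache built on OrderedDict is often implemented with least-recently-used eviction. The sketch below shows that pattern; the eviction policy and the method names `get` and `put` are assumptions, since only the `max_size` and `cache` attributes are documented here.

```python
from collections import OrderedDict


class BoundedCache:
    """Cache holding at most max_size entries, evicting the oldest used."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)  # drop the least recently used


demo = BoundedCache(max_size=2)
demo.put("a", 1)
demo.put("b", 2)
demo.get("a")       # "a" becomes the most recently used entry
demo.put("c", 3)    # exceeds max_size, so "b" is evicted
```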
- class spacr.utils.CustomCellClassifier(num_classes, pathogen_channel, use_attention, use_checkpoint, dropout_rate)[source]
Bases:
Module
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class spacr.utils.EarlyFusion(in_channels)[source]
Bases:
Module
Early Fusion module for image classification.
- Parameters:
in_channels (int) – Number of input channels.
- class spacr.utils.FocalLossWithLogits(alpha=1, gamma=2)[source]
Bases:
Module
- forward(logits, target)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class spacr.utils.MultiScaleBlockWithAttention(in_channels, out_channels)[source]
Bases:
Module
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class spacr.utils.ResNet(resnet_type='resnet50', dropout_rate=None, use_checkpoint=False, init_weights='imagenet')[source]
Bases:
Module
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class spacr.utils.SelfAttention(in_channels, d_k)[source]
Bases:
Module
Self-Attention module that applies scaled dot-product attention mechanism.
- Parameters:
in_channels (int) – Number of input channels.
d_k (int) – Dimensionality of the key and query vectors.
- class spacr.utils.TorchModel(model_name='resnet50', pretrained=True, dropout_rate=None, use_checkpoint=False)[source]
Bases:
Module
- spacr.utils.adjust_cell_masks(parasite_folder, cell_folder, nuclei_folder, overlap_threshold=5, perimeter_threshold=30)[source]
Process all npy files in the given folders. Merge and relabel cells in cell masks based on parasite overlap and cell perimeter sharing conditions.
- Parameters:
parasite_folder (str) – Path to the folder containing parasite masks.
cell_folder (str) – Path to the folder containing cell masks.
nuclei_folder (str) – Path to the folder containing nuclei masks.
overlap_threshold (float) – The percentage threshold for merging cells based on parasite overlap.
perimeter_threshold (float) – The percentage threshold for merging cells based on shared perimeter.
- spacr.utils.annotate_conditions(df, cells=['HeLa'], cell_loc=None, pathogens=['rh'], pathogen_loc=None, treatments=['cm'], treatment_loc=None, types=['col', 'col', 'col'])[source]
Annotates conditions in a DataFrame based on specified criteria.
- Parameters:
df (pandas.DataFrame) – The DataFrame to annotate.
cells (list, optional) – List of host cell types. Defaults to [‘HeLa’].
cell_loc (list, optional) – List of corresponding values for each host cell type. Defaults to None.
pathogens (list, optional) – List of pathogens. Defaults to [‘rh’].
pathogen_loc (list, optional) – List of corresponding values for each pathogen. Defaults to None.
treatments (list, optional) – List of treatments. Defaults to [‘cm’].
treatment_loc (list, optional) – List of corresponding values for each treatment. Defaults to None.
types (list, optional) – List of column types for host cells, pathogens, and treatments. Defaults to [‘col’, ‘col’, ‘col’].
- Returns:
The annotated DataFrame.
- Return type:
pandas.DataFrame
- spacr.utils.augment_dataset(dataset, is_grayscale=False)[source]
Perform data augmentation on the entire dataset by rotating and reflecting the images.
Parameters: - dataset (list of tuples): The input dataset, each entry is a tuple (image, label, filename). - is_grayscale (bool): Flag indicating if the images are grayscale.
Returns: - augmented_dataset (list of tuples): A dataset with augmented (image, label, filename) tuples.
- spacr.utils.augment_image(image)[source]
Perform data augmentation by rotating and reflecting the image.
Parameters: - image (PIL Image or numpy array): The input image.
Returns: - augmented_images (list): A list of augmented images.
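Which rotations and reflections are produced is not listed here. A common choice is the eight variants obtained from the four 90-degree rotations, each with and without a horizontal flip. The sketch below shows that scheme on a nested-list image rather than a PIL Image or NumPy array; all function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return 8 variants: 4 rotations, each plain and mirrored."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


image = [[1, 2], [3, 4]]
augmented = augment(image)
```

For images with internal symmetry some of the eight variants coincide, so deduplication may be worthwhile before training.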
- spacr.utils.check_multicollinearity(x)[source]
Checks multicollinearity of the predictors by computing the VIF.
- spacr.utils.check_normality(series)[source]
Helper function to check if a feature is normally distributed.
- spacr.utils.choose_model(model_type, device, init_weights=True, dropout_rate=0, use_checkpoint=False, channels=3, height=224, width=224, chan_dict=None, num_classes=2, verbose=False)[source]
Choose a model for classification.
- Parameters:
model_type (str) – The type of model to choose. Can be one of the pre-defined TorchVision models or ‘custom’ for a custom model.
device (str) – The device to use for model inference.
init_weights (bool, optional) – Whether to initialize the model with pre-trained weights. Defaults to True.
dropout_rate (float, optional) – The dropout rate to use in the model. Defaults to 0.
use_checkpoint (bool, optional) – Whether to use checkpointing during model training. Defaults to False.
channels (int, optional) – The number of input channels for the model. Defaults to 3.
height (int, optional) – The height of the input images for the model. Defaults to 224.
width (int, optional) – The width of the input images for the model. Defaults to 224.
chan_dict (dict, optional) – A dictionary containing channel information for custom models. Defaults to None.
num_classes (int, optional) – The number of output classes for the model. Defaults to 2.
- Returns:
The chosen model.
- Return type:
torch.nn.Module
- spacr.utils.class_visualization(target_y, model_path, dtype, img_size=224, channels=[0, 1, 2], l2_reg=0.001, learning_rate=25, num_iterations=100, blur_every=10, max_jitter=16, show_every=25, class_names=['nc', 'pc'])[source]
- spacr.utils.classification_metrics(all_labels, prediction_pos_probs, loss, epoch)[source]
Calculate classification metrics for binary classification.
Parameters: - all_labels (list): List of true labels. - prediction_pos_probs (list): List of predicted positive probabilities. - loss (float): Loss value. - epoch (int): Epoch number.
Returns: - data_df (DataFrame): DataFrame containing the calculated metrics.
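The sketch below shows how binary metrics can be derived from true labels and positive-class probabilities. Which metrics spacr actually stores in data_df is not stated here, so the metric set, the 0.5 threshold, and the helper name `binary_metrics` are assumptions.

```python
def binary_metrics(labels, pos_probs, threshold=0.5):
    """Compute common binary metrics from labels and positive-class probabilities."""
    preds = [1 if p >= threshold else 0 for p in pos_probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(labels)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

metrics = binary_metrics([1, 0, 1, 1, 0, 0], [0.9, 0.2, 0.4, 0.8, 0.6, 0.1])
```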
- spacr.utils.cluster_feature_analysis(all_df, cluster_col='cluster')[source]
Perform Random Forest feature importance, ANOVA for normally distributed features, and Kruskal-Wallis for non-normally distributed features. Combine results into a single DataFrame.
- spacr.utils.combine_results(rf_df, anova_df, kruskal_df)[source]
Combine the results into a single DataFrame.
- spacr.utils.compute_irm_penalty(losses, dummy_w, device)[source]
Computes the Invariant Risk Minimization (IRM) penalty.
- Parameters:
losses (list) – A list of losses.
dummy_w (torch.Tensor) – A dummy weight tensor.
device (torch.device) – The device to perform computations on.
- Returns:
The computed IRM penalty.
- Return type:
float
- spacr.utils.compute_segmentation_ap(true_masks, pred_masks, iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]
- spacr.utils.convert_and_relabel_masks(folder_path)[source]
Converts all int64 npy masks in a folder to uint16 with relabeling to ensure all labels are retained.
Parameters: - folder_path (str): The path to the folder containing int64 npy mask files.
Returns: - None
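One way to relabel a single int64 mask so that every object survives the cast to uint16 is sketched below. It works on an in-memory array rather than on .npy files in a folder, assumes 0 is the background label, and uses a hypothetical helper name `relabel_to_uint16`; the library's own procedure may differ.

```python
import numpy as np

def relabel_to_uint16(mask):
    """Map arbitrary integer labels to 1..N (0 stays background) and cast to uint16."""
    foreground = mask != 0
    object_labels = np.unique(mask[foreground])
    if object_labels.size > np.iinfo(np.uint16).max:
        raise ValueError("more objects than uint16 can represent")
    relabeled = np.zeros(mask.shape, dtype=np.uint16)
    # searchsorted gives each label's rank among the sorted object labels;
    # adding 1 keeps 0 reserved for the background.
    relabeled[foreground] = np.searchsorted(object_labels, mask[foreground]) + 1
    return relabeled

# Labels 70000 and 300000 would overflow a direct cast to uint16.
mask = np.array([[0, 70000, 70000], [5, 0, 300000]], dtype=np.int64)
relabeled = relabel_to_uint16(mask)
```

Relabeling before the cast matters because a direct `astype(np.uint16)` wraps values above 65535 and can merge or lose objects.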
- spacr.utils.extract_features(image_paths, resnet=<function resnet50>)[source]
Extract features from images using a pre-trained ResNet model.
- spacr.utils.filter_dataframe_features(df, channel_of_interest, exclude=None, remove_low_variance_features=True, remove_highly_correlated_features=True, verbose=False)[source]
Filter the dataframe df based on the specified channel_of_interest and exclude parameters.
Parameters: - df (pandas.DataFrame): The input dataframe to be filtered. - channel_of_interest (str, int, list, None): The channel(s) of interest to filter the dataframe. If None, no filtering is applied. If ‘morphology’, only morphology features are included. If an integer, only the specified channel is included. If a list, only the specified channels are included. If a string, only the specified channel is included. - exclude (str, list, None): The feature(s) to exclude from the filtered dataframe. If None, no features are excluded. If a string, the specified feature is excluded. If a list, the specified features are excluded.
Returns: - filtered_df (pandas.DataFrame): The filtered dataframe based on the specified parameters. - features (list): The list of selected features after filtering.
- spacr.utils.find_non_overlapping_position(x, y, image_positions, threshold, max_attempts=100)[source]
- spacr.utils.generate_dependent_variable(df, dv_loc, pc_min=0.95, nc_max=0.05, agg_type='mean')[source]
- spacr.utils.is_multiprocessing_process(process)[source]
Check if the process is a multiprocessing process.
- spacr.utils.mask_object_count(mask)[source]
Counts the number of objects in a given mask.
Parameters: - mask: numpy.ndarray. The mask containing object labels.
Returns: - int. The number of objects in the mask.
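A minimal sketch of counting objects in a labeled mask, assuming that 0 marks the background and every other distinct value is one object. The helper name `count_objects` is hypothetical.

```python
import numpy as np

def count_objects(mask):
    """Count distinct non-zero labels in a labeled mask."""
    labels = np.unique(mask)
    # count_nonzero skips the background label 0 if it is present.
    return int(np.count_nonzero(labels))

mask = np.array([[0, 1, 1], [0, 3, 3], [7, 0, 0]])
n_objects = count_objects(mask)
```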
- spacr.utils.merge_regression_res_with_metadata(results_file, metadata_file, name='_metadata')[source]
- spacr.utils.merge_touching_objects(mask, threshold=0.25)[source]
Merges touching objects in a binary mask based on the percentage of their shared boundary.
- Parameters:
mask (ndarray) – Binary mask representing objects.
threshold (float, optional) – Threshold value for merging objects. Defaults to 0.25.
- Returns:
Merged mask.
- Return type:
ndarray
- spacr.utils.normalize_to_dtype(array, p1=2, p2=98, percentile_list=None, new_dtype=None)[source]
Normalize each image in the stack to its own percentiles.
Parameters: - array (numpy array): The input stack to be normalized. - p1 (int, optional): The lower percentile value for normalization. Default is 2. - p2 (int, optional): The upper percentile value for normalization. Default is 98. - percentile_list (list, optional): A list of pre-calculated percentiles for each image in the stack. Default is None.
Returns: - new_stack (numpy array): The normalized stack with the same shape as the input stack.
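Per-image percentile normalization can be sketched as follows. The example assumes images are stacked along the first axis and returns floats in [0, 1] instead of converting to a target dtype; both choices, and the helper name `percentile_rescale`, are assumptions and not a description of the library's behaviour.

```python
import numpy as np

def percentile_rescale(stack, p1=2, p2=98):
    """Rescale each image in a (n_images, height, width) stack to [0, 1]."""
    out = np.empty(stack.shape, dtype=np.float64)
    for i, img in enumerate(stack):
        lo, hi = np.percentile(img, [p1, p2])
        if hi > lo:
            # Values outside the percentile window are clipped to 0 or 1.
            out[i] = np.clip((img - lo) / (hi - lo), 0.0, 1.0)
        else:
            # A constant image has no contrast to stretch.
            out[i] = 0.0
    return out

ramp = np.arange(100, dtype=np.float64).reshape(10, 10)
stack = np.stack([ramp, 10 * ramp])  # same pattern, different brightness
normalized = percentile_rescale(stack)
```

Because each image uses its own percentiles, the two images above end up identical after normalization even though their raw intensities differ tenfold.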
- spacr.utils.perform_statistical_tests(all_df, cluster_col='cluster')[source]
Perform ANOVA or Kruskal-Wallis tests depending on normality of features.
- spacr.utils.plot_clusters(ax, embedding, labels, colors, cluster_centers, plot_outlines, plot_points, smooth_lines, figuresize=10, dot_size=50, verbose=False)[source]
- spacr.utils.plot_clusters_grid(embedding, labels, image_nr, image_paths, colors, figuresize, black_background, verbose)[source]
- spacr.utils.plot_embedding(embedding, image_paths, labels, image_nr, img_zoom, colors, plot_by_cluster, plot_outlines, plot_points, plot_images, smooth_lines, black_background, figuresize, dot_size, remove_image_canvas, verbose)[source]
- spacr.utils.plot_images_by_cluster(ax, image_paths, embedding, labels, image_nr, img_zoom, colors, cluster_indices, remove_image_canvas, verbose)[source]
- spacr.utils.plot_umap_images(ax, image_paths, embedding, labels, image_nr, img_zoom, colors, plot_by_cluster, remove_image_canvas, verbose)[source]
- spacr.utils.preprocess_data(df, filter_by, remove_highly_correlated, log_data, exclude)[source]
Preprocesses the given dataframe by applying filtering, removing highly correlated columns, applying log transformation, filling NaN values, and scaling the numeric data.
Args: df (pandas.DataFrame): The input dataframe. filter_by (str or None): The channel of interest to filter the dataframe by. remove_highly_correlated (bool or float): Whether to remove highly correlated columns. If a float is provided, it represents the correlation threshold. log_data (bool): Whether to apply log transformation to the numeric data. exclude (list or None): List of features to exclude from the filtering process.
Returns: numpy.ndarray: The preprocessed numeric data.
Raises: ValueError: If no numeric columns are available after filtering.
- spacr.utils.preprocess_image(image_path, normalize=True, image_size=224, channels=[1, 2, 3])[source]
- spacr.utils.print_progress(files_processed, files_to_process, n_jobs, time_ls=None, batch_size=None, operation_type='')[source]
- spacr.utils.process_masks(mask_folder, image_folder, channel, batch_size=50, n_clusters=2, plot=False)[source]
- spacr.utils.random_forest_feature_importance(all_df, cluster_col='cluster')[source]
Random Forest feature importance.
- spacr.utils.reduction_and_clustering(numeric_data, n_neighbors, min_dist, metric, eps, min_samples, clustering, reduction_method='umap', verbose=False, embedding=None, n_jobs=-1, mode='fit', model=False)[source]
Perform dimensionality reduction and clustering on the given data.
Parameters: numeric_data (np.ndarray): Numeric data for embedding and clustering. n_neighbors (int or float): Number of neighbors for UMAP or perplexity for t-SNE. min_dist (float): Minimum distance for UMAP. metric (str): Metric for UMAP and DBSCAN. eps (float): Epsilon for DBSCAN. min_samples (int): Minimum samples for DBSCAN or number of clusters for KMeans. clustering (str): Clustering method (‘DBSCAN’ or ‘KMeans’). reduction_method (str): Dimensionality reduction method (‘UMAP’ or ‘tSNE’). verbose (bool): Whether to print verbose output. embedding (np.ndarray, optional): Precomputed embedding. Default is None. return_model (bool): Whether to return the reducer model. Default is False.
Returns: tuple: embedding, labels (and optionally the reducer model)
Removes columns from the dataframe that are highly correlated with one another.
Parameters: df (pandas.DataFrame): The DataFrame containing the data. threshold (float): The correlation threshold above which columns will be removed.
Returns: pandas.DataFrame: The DataFrame with highly correlated columns removed.
- spacr.utils.remove_intensity_objects(image, mask, intensity_threshold, mode)[source]
Removes objects from the mask based on their mean intensity in the original image.
- Parameters:
image (ndarray) – The original image.
mask (ndarray) – The mask containing labeled objects.
intensity_threshold (float) – The threshold value for mean intensity.
mode (str) – The mode for intensity comparison. Can be ‘low’ or ‘high’.
- Returns:
The updated mask with objects removed.
- Return type:
ndarray
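The sketch below illustrates intensity-based object removal on a labeled mask. It assumes that mode ‘low’ removes objects whose mean intensity is below the threshold and ‘high’ removes those above it; that reading of mode, and the helper name `drop_objects_by_mean_intensity`, are assumptions rather than documented behaviour.

```python
import numpy as np

def drop_objects_by_mean_intensity(image, mask, intensity_threshold, mode="low"):
    """Zero out labeled objects whose mean intensity falls on the wrong side of a threshold."""
    if mode not in ("low", "high"):
        raise ValueError("mode must be 'low' or 'high'")
    cleaned = mask.copy()
    for label in np.unique(mask):
        if label == 0:
            continue  # 0 is treated as background
        region = mask == label
        mean_intensity = image[region].mean()
        too_dim = mode == "low" and mean_intensity < intensity_threshold
        too_bright = mode == "high" and mean_intensity > intensity_threshold
        if too_dim or too_bright:
            cleaned[region] = 0
    return cleaned

image = np.array([[10.0, 10.0, 200.0], [10.0, 0.0, 200.0]])
mask = np.array([[1, 1, 2], [1, 0, 2]])
without_dim = drop_objects_by_mean_intensity(image, mask, 50, mode="low")
without_bright = drop_objects_by_mean_intensity(image, mask, 50, mode="high")
```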
- spacr.utils.remove_low_variance_columns(df, threshold=0.01, verbose=False)[source]
Removes columns from the dataframe that have low variance.
Parameters: df (pandas.DataFrame): The DataFrame containing the data. threshold (float): The variance threshold below which columns will be removed.
Returns: pandas.DataFrame: The DataFrame with low variance columns removed.
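A minimal pandas sketch of variance-based column filtering is shown below. It assumes that only numeric columns are tested and that non-numeric columns are kept unchanged; the helper name `drop_low_variance` and the example column names are hypothetical.

```python
import pandas as pd

def drop_low_variance(df, threshold=0.01):
    """Drop numeric columns whose sample variance is below threshold."""
    variances = df.select_dtypes(include="number").var()
    low_variance_columns = variances[variances < threshold].index
    return df.drop(columns=low_variance_columns)

df = pd.DataFrame(
    {
        "area": [10.0, 25.0, 40.0],
        "constant": [1.0, 1.0, 1.0],  # zero variance, carries no information
        "well": ["A1", "A2", "A3"],   # non-numeric, left untouched
    }
)
filtered = drop_low_variance(df)
```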
- spacr.utils.resize_images_and_labels(images, labels, target_height, target_width, show_example=True)[source]
- spacr.utils.search_reduction_and_clustering(numeric_data, n_neighbors, min_dist, metric, eps, min_samples, clustering, reduction_method, verbose, reduction_param=None, embedding=None, n_jobs=-1)[source]
Perform dimensionality reduction and clustering on the given data.
Parameters: numeric_data (np.array): Numeric data to process. n_neighbors (int): Number of neighbors for UMAP or perplexity for tSNE. min_dist (float): Minimum distance for UMAP. metric (str): Metric for UMAP, tSNE, and DBSCAN. eps (float): Epsilon for DBSCAN clustering. min_samples (int): Minimum samples for DBSCAN or number of clusters for KMeans. clustering (str): Clustering method (‘DBSCAN’ or ‘KMeans’). reduction_method (str): Dimensionality reduction method (‘UMAP’ or ‘tSNE’). verbose (bool): Whether to print verbose output. reduction_param (dict): Additional parameters for the reduction method. embedding (np.array): Precomputed embedding (optional). n_jobs (int): Number of parallel jobs to run.
Returns: embedding (np.array): Embedding of the data. labels (np.array): Cluster labels.
- spacr.utils.split_my_dataset(dataset, split_ratio=0.1)[source]
Splits a dataset into training and validation subsets.
- Parameters:
dataset (torch.utils.data.Dataset) – The dataset to be split.
split_ratio (float, optional) – The ratio of validation samples to total samples. Defaults to 0.1.
- Returns:
A tuple containing the training dataset and validation dataset.
- Return type:
tuple
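The split can be pictured as partitioning sample indices, as in the sketch below. It uses only the standard library and does not touch PyTorch; whether spacr shuffles before splitting is not stated here, so the shuffling, the seed, and the helper name `split_indices` are assumptions.

```python
import random

def split_indices(n_samples, split_ratio=0.1, seed=0):
    """Return (train_indices, val_indices) with about split_ratio of samples in validation."""
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * split_ratio)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(100, split_ratio=0.1)
```

With a PyTorch dataset, index lists like these could be wrapped in `torch.utils.data.Subset` to obtain the two subsets.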
spacr.version module
Copyright © 2024 Something