MIRA is a Python package and statistical methodology for analyzing single-cell multiomics data, that is, scRNA-seq and scATAC-seq measured in the same cell.

This tutorial assumes:

What does MIRA do?

Part 1: Topic modeling


  1. Represents cells in a joint expression and accessibility space.
  2. Performs lineage inference and visualization.
  3. Identifies patterns of co-regulation in the data.

Getting started

Installation

To start, you can install MIRA from conda or PyPI:

conda install -c conda-forge -c bioconda -c liulab-dfci mira-multiome

Data

MIRA just needs scRNA-seq and scATAC-seq count matrices from a multiomics experiment. No pre-processing of the counts is required.

First, we'll import some data. Today, we'll analyze the SHARE-seq skin dataset used in our manuscript.

Part 1.1: Joint expression and accessibility representation

The first thing we need to do when analyzing a new dataset is to train a MIRA topic model.

A topic model finds "topics" of co-regulated elements in single-cell data. For instance, an expression topic may be a group of genes that are coregulated by some signaling element. Then, the model represents cells as a composition of those topics depending on which genes are active.
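To make the idea of "cells as compositions of topics" concrete, here is a toy sketch in plain Python. It is not MIRA's model or API: the two topics, the gene names, and the weights are all made up for illustration. It only shows how a cell's expected gene profile can be written as a mixture of topic-level gene distributions.

```python
# Toy illustration only; this is not MIRA's model or API.
# Two hypothetical topics, each a probability distribution over three genes.
topic_gene = {
    "topic_0": {"Krt5": 0.7, "Krt14": 0.2, "Hoxc13": 0.1},
    "topic_1": {"Krt5": 0.1, "Krt14": 0.1, "Hoxc13": 0.8},
}


def expected_gene_profile(composition, topic_gene):
    """Mix topic-level gene distributions by a cell's topic composition."""
    genes = next(iter(topic_gene.values())).keys()
    return {
        gene: sum(
            weight * topic_gene[topic][gene]
            for topic, weight in composition.items()
        )
        for gene in genes
    }


# A hypothetical cell that is mostly topic_1 with a little topic_0.
cell_composition = {"topic_0": 0.25, "topic_1": 0.75}
profile = expected_gene_profile(cell_composition, topic_gene)
print(profile)
```

Because the composition and each topic both sum to one, the mixed profile is again a distribution over genes, dominated here by the gene that topic_1 weights most heavily.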

With a trained topic model, the first thing we can do is find the topic compositions of our cells:

The model is constrained such that the cell-topic compositions are sparse, so only a few topics may describe any given cell:

We have found expression and accessibility topic compositions for each cell. Let's combine them into a joint representation to visualize the dataset.

First, we need to transform the topic compositions into a numerical basis that can be used to find inter-cell distances. We transform the topic compositions using the get_umap_features command.
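Why is a transformation needed at all? Topic compositions sum to one, so plain Euclidean distances between them can be misleading. A common remedy is a log-ratio transform. The sketch below uses the centered log-ratio (CLR) purely as an illustration; what get_umap_features does internally may differ, and the pseudocount and the three example cells are assumptions.

```python
import math


def centered_log_ratio(composition, pseudocount=1e-6):
    """Map a composition (non-negative, sums to 1) to log-ratio coordinates."""
    # The pseudocount (an assumed value) avoids log(0) for absent topics.
    log_values = [math.log(value + pseudocount) for value in composition]
    mean_log = sum(log_values) / len(log_values)
    return [value - mean_log for value in log_values]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Three hypothetical cells described by three topics.
cell_a = [0.90, 0.05, 0.05]
cell_b = [0.80, 0.15, 0.05]
cell_c = [0.05, 0.05, 0.90]

features = [centered_log_ratio(c) for c in (cell_a, cell_b, cell_c)]
print(euclidean(features[0], features[1]))  # similar cells: small distance
print(euclidean(features[0], features[2]))  # different cells: large distance
```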

Then, we use mira.utils.make_joint_representation to combine the transformed spaces for each modality into a joint representation for each cell:

Finally, we can use scanpy to define a KNN graph of cells using the joint space representation. The resulting KNN graph identifies cells in similar states based on both modalities.
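The sketch below shows the general idea of a joint KNN graph on four toy cells. It assumes the joint representation is a simple concatenation of the per-modality feature vectors, which may not match what mira.utils.make_joint_representation does, and it uses a brute-force neighbor search rather than scanpy. All values are made up.

```python
import math


def joint_representation(rna_features, atac_features):
    """Concatenate per-cell feature vectors from the two modalities."""
    return [rna + atac for rna, atac in zip(rna_features, atac_features)]


def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in distances[:k]])
    return neighbors


# Hypothetical transformed features: cells 0 and 1 are alike, as are 2 and 3.
rna = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1], [2.1, 2.0]]
atac = [[1.0], [1.1], [-1.0], [-0.9]]

joint = joint_representation(rna, atac)
graph = knn_graph(joint, k=1)
print(graph)
```

Each cell's nearest neighbor is the cell that resembles it in both modalities, which is the property the joint graph is meant to capture.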

With a joint KNN graph, there are a variety of useful things we can do. For instance, clustering and UMAP visualization:

Part 1.2: Lineage inference using the joint KNN graph

The joint KNN graph encodes paths between cells in similar states. We can exploit this to identify lineages using both modalities. Let's focus in on a group of cells that are undergoing differentiation:

The MIRA lineage inference algorithm requires just a start cell, and optionally user-provided terminal cells. The workflow has four steps:

1. Make a diffusion map of the data

2. Make the forward Markov matrix

3. Choose terminal states and get lineage probabilities

4. Parse lineage structure
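To give a feel for step 3, the sketch below computes lineage probabilities on a tiny hand-written forward Markov matrix by treating the terminal states as absorbing. This is a generic absorbing-Markov-chain calculation, not MIRA's implementation: the four states, the transition values, and the fixed iteration count are assumptions.

```python
def lineage_probabilities(transition, terminal_states, n_iter=200):
    """Probability that a walk from each state ends in each terminal state."""
    n = len(transition)
    # Make terminal states absorbing: once reached, the walk stays there.
    absorbing = [row[:] for row in transition]
    for t in terminal_states:
        absorbing[t] = [1.0 if j == t else 0.0 for j in range(n)]
    # probs[i][k] approximates P(end in terminal_states[k] | start at i).
    probs = [[1.0 if i == t else 0.0 for t in terminal_states] for i in range(n)]
    for _ in range(n_iter):
        probs = [
            [
                sum(absorbing[i][j] * probs[j][k] for j in range(n))
                for k in range(len(terminal_states))
            ]
            for i in range(n)
        ]
    return probs


# States: 0 = start, 1 = branch point, 2 and 3 = terminal states.
forward = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.4, 0.4, 0.2],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
probs = lineage_probabilities(forward, terminal_states=[2, 3])
print(probs[0])  # lineage probabilities for the start state
```

In this toy chain the branch point sends twice as much probability toward state 2 as toward state 3, so the start cell inherits lineage probabilities of roughly two thirds and one third.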

Important feature: mira.pl.plot_stream is a workhorse for analyzing regulatory dynamics over time courses. This plot has four modes you can use to analyze data:

  1. Swarm mode
  2. Stream mode
  3. Scatter/Line mode
  4. Heatmap mode

Stream mode can show high-dimensional comparisons between many features over the lineage tree:

Swarm mode is good for plotting discrete or sparse values, like cluster identity or expression counts:

Heatmap mode cannot show lineage trees, but it works best for comparing a large number of features at once:

Part 1.3: Topic Analysis

We can use streams to investigate gene expression over time, but they are also useful for analyzing the flow of topic composition over the course of differentiation. This is where the sparsity constraint of the MIRA topic model comes in handy.

We can plot topic composition projected onto the UMAP view.

Remember, we also have accessibility topics from the ATAC-seq data:

One interesting question is how the emergence of expression topics coincides with the emergence of accessibility topics. How do state changes line up between modes?

What does each topic mean? We can use enrichment analysis to find out.

For expression topics, we simply take the top n genes and perform gene set enrichment analysis to find overlaps with precompiled ontologies. We use Enrichr.
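The overlap test behind this kind of enrichment analysis can be sketched with a hypergeometric tail probability. This is a generic illustration, not Enrichr's scoring method: the gene lists and the background size of 100 genes are made up.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when drawing n_selected genes without replacement."""
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical top genes for a topic and a hypothetical ontology gene set.
top_genes = {"Krt5", "Krt14", "Krt15", "Col17a1", "Actb"}
gene_set = {"Krt5", "Krt14", "Krt15", "Krt1"}

overlap = len(top_genes & gene_set)
p_value = hypergeom_enrichment_pvalue(
    n_background=100,  # assumed number of genes that could have been selected
    n_in_set=len(gene_set),
    n_selected=len(top_genes),
    n_overlap=overlap,
)
print(overlap, p_value)
```

A small p-value means the top genes overlap the ontology term more than expected by chance. In practice the choice of background and the correction for testing many terms matter as much as the test itself.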

For accessibility topics, we can find the top transcription factor regulators that bind in the peaks most strongly associated with that topic. First, we have to call motif or ChIP hits in our peaks using either mira.tl.get_motif_hits_in_peaks or mira.tl.get_ChIP_hits_in_peaks.

I ran the command below ahead-of-time, so we will load the results and continue.

mira.tl.get_motif_hits_in_peaks(atac_data, genome_fasta='/Users/alynch/genomes/mm10/mm10.fa')

We can use mira.utils.subset_factors to down-select which factors we are analyzing. Often, it is convenient to limit the analysis to motifs that represent factors that you have measured expression for:

Then, we get enrichments of factors in the top peaks associated with the topics of interest:

Finally, the most effective way to analyze influential factors is to compare enrichments across the transition between similar topics. Using atac_model.plot_compare_module_enrichments, we can see below that the transition from topic 15 to topic 23 is facilitated by the increasing influence of the terminal HF-specific factor HOXC13.

Part 1 Summary: Using streams and topics, one can find key regulators and identify drivers of state transitions while comparing and contrasting modes for a deeper understanding of multiomics data.

Part 2: Regulatory Dynamics

One of the key advantages of multiomics data is the ability to study cis-regulatory relationships between genes and their local chromatin. We learn regulatory relationships between genes and chromatin using RP models. In this section, we will cover:


  1. Learning cis-regulatory relationships.
  2. Finding genes where local chromatin and expression are out of sync.
  3. Identifying transcription factor drivers of expression.

Part 2.1: Training RP models

We'll start by training models for a couple of genes. LITE_Model stands for Locally-Influenced Transcriptional Expression model. It integrates local chromatin accessibility around a gene to try to predict its expression.

Before we start, we have to give MIRA gene locations so that the RP model knows how far peaks are away from the TSS. I loaded in a dataframe with the required gene information (chrom, name, start, end, strand):

I then provided that dataframe to mira.tl.get_distance_to_TSS to mark gene locations and find peak-gene distances:

Next, we instantiate a LITE_Model:

To fit a model, you provide both your RNA and ATAC-seq data aligned at the barcode:

The LITE_Model learns upstream and downstream decay rates of local cis-regulatory influence for each gene:
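As a rough picture of what distance-dependent decay means, the sketch below scores a gene by summing peak accessibility weighted by an exponential decay in distance from the TSS, with separate upstream and downstream decay distances. The functional form, the parameter names, and the numbers are assumptions for illustration. The actual LITE_Model parameterization may differ, and it learns these rates from data rather than taking them as inputs.

```python
def regulatory_potential(peaks, upstream_decay, downstream_decay):
    """Sum peak accessibility weighted by exponential decay with TSS distance.

    peaks: list of (signed_distance_bp, accessibility); negative = upstream.
    upstream_decay, downstream_decay: distance in bp over which a peak's
    influence halves (hypothetical parameters).
    """
    score = 0.0
    for distance, accessibility in peaks:
        decay = upstream_decay if distance < 0 else downstream_decay
        score += accessibility * 0.5 ** (abs(distance) / decay)
    return score


# Hypothetical peaks around one gene.
peaks = [(-10_000, 1.0), (5_000, 2.0), (-40_000, 1.0)]
rp = regulatory_potential(peaks, upstream_decay=10_000, downstream_decay=5_000)
print(rp)
```

With these toy values, the two nearby peaks dominate the score and the distal upstream peak contributes very little.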

Part 2.2: Comparing RP model predictions to expression

The local chromatin prediction for BRAF looks pretty good! But for KRT23, less so. This presents an interesting case where local chromatin does not seem to explain changes in expression. To quantify this effect, we need to train another model: the NITE_Model, or Non-locally Influenced Transcriptional Expression model. This model uses information from the accessibility topic model to predict expression not just from the local chromatin around a gene, but from the cell's global chromatin state.

Comparing the LITE model prediction to the NITE model prediction:

Because it incorporates genome-wide information, the NITE_Model will always predict expression better than the local-only LITE_Model. We can compare the predictions of the two models on a per-cell basis to see where local chromatin over- or under-estimates expression using mira.tl.get_chromatin_differential:

Use mira.pl.plot_chromatin_differential to show panels for genes. Notice that chromatin_differential indicates that local chromatin accessibility strongly over-estimated observed gene expression for KRT23 in Cortex cells.
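One simple way to think about a per-cell differential is as a log-ratio of the two model predictions. The sketch below assumes that definition and that sign convention (positive when the LITE prediction exceeds the NITE prediction). Both are assumptions for illustration and may not match what mira.tl.get_chromatin_differential computes. The prediction values are made up.

```python
import math


def chromatin_differential(lite_prediction, nite_prediction):
    """Per-cell log2 ratio of local-only to genome-wide-informed predictions."""
    return [
        math.log2(lite / nite)
        for lite, nite in zip(lite_prediction, nite_prediction)
    ]


# Hypothetical predicted expression for one gene in three cells.
lite = [4.0, 1.0, 0.5]
nite = [1.0, 1.0, 2.0]

differential = chromatin_differential(lite, nite)
print(differential)
```

Under this convention, a positive value marks a cell where local chromatin alone predicts more expression than the genome-wide-informed model, zero means the two agree, and a negative value marks the opposite.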

Another way we can investigate differences between local chromatin and expression is with streams. Using the line mode of mira.pl.plot_stream, we can quantitatively compare levels of the two modes:

We can quantify the strength of the decoupling of expression and accessibility for each gene by calculating a NITE score for that gene. First, run the get_logp function with both models:

Then, run mira.tl.get_NITE_score_genes:

(This function calculates statistics across many genes to produce a NITE score that is not subject to variability due to differences in count distributions. Since we are only using two genes, I manually specify a statistic.)

As expected, KRT23 has a much higher NITE score than BRAF.

Now, let's load in data with close to 5000 genes tested.

With many genes tested, one can also find the NITE score for entire cell states across genes. The cell-level NITE score determines how well local chromatin state in the cell predicts expression observed from that cell.

In the hair follicle, the Cortex and Medulla lineages showed the most NITE-style gene expression. Interestingly, in all three systems we tested, terminal gene expression was more NITE than early differentiation expression.

Part 2.3: Predicting driver TF expression using RP models

Finally, we can use RP models to predict drivers of gene expression based on occupancy of transcription factors in local chromatin around the gene.

We use an algorithm called Probabilistic In-Silico Deletion (pISD), which compares the ability of the RP model to predict expression before and after the binding sites of a certain TF are masked.
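The before-and-after comparison can be illustrated with a deliberately simplified sketch. It scores a TF by how much the Poisson log-likelihood of an observed count drops when peaks bound by that TF are removed from a toy predictor. This is not MIRA's pISD implementation: the predictor, the baseline rate, the peak values, and the TF assignments (including LEF1 and GATA3) are all hypothetical.

```python
import math


def poisson_logp(observed, rate):
    """Log-probability of an observed count under a Poisson rate."""
    return observed * math.log(rate) - rate - math.lgamma(observed + 1)


def predicted_rate(peaks, masked_tf=None):
    """Toy predictor: baseline plus accessibility of peaks not bound by masked_tf."""
    baseline = 0.1  # assumed small baseline so the rate stays positive
    return baseline + sum(
        accessibility
        for accessibility, bound_tfs in peaks
        if masked_tf not in bound_tfs
    )


def isd_score(observed, peaks, tf):
    """Drop in log-likelihood after masking the binding sites of one TF."""
    full = poisson_logp(observed, predicted_rate(peaks))
    masked = poisson_logp(observed, predicted_rate(peaks, masked_tf=tf))
    return full - masked


# Hypothetical peaks near one gene: (accessibility, TFs with a binding site).
peaks = [(3.0, {"HOXC13"}), (1.0, {"LEF1"}), (1.0, set())]
observed = 5

scores = {tf: isd_score(observed, peaks, tf) for tf in ("HOXC13", "LEF1", "GATA3")}
print(scores)
```

A TF whose sites sit in the most informative peaks gets the largest score, while a TF with no sites near the gene scores zero because masking it changes nothing.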

ISD scores for a particular gene are noisy, so it's best to test for TF drivers across gene sets with similar regulatory dynamics to find commonly-influential factors.

I've pre-computed ISD scores for many genes:

I will load gene sets containing genes involved in Cortex and Medulla fate commitment, which we outlined in our paper.

Then, you can compare drivers of two sets of genes by using mira.pl.compare_driver_TFs_plot:

Summary

MIRA offers a comprehensive methodology for multiomics analysis that enhances many stages of single-cell data analysis:

Acknowledgements