Basic Cell Clustering Using mCG-5Kb Bins

Content

Here we go through the basic steps of cell clustering using non-overlapping 5Kb genomic bins as features. We start from hypo-methylation probability data stored in mC-AnnData format (the .mcad extension distinguishes it from normal .h5ad files). This workflow gives a quick picture of the cell-type composition of a single-cell methylome dataset (e.g., the dataset from a single experiment). Compared to the 100Kb-bin clustering process, it is better suited to samples with a low mCH fraction (many non-brain tissues) and narrow methylation diversity, where smaller features work better.

Dataset used in this notebook

  • Adult (age P56) male mouse brain pituitary (PIT) snmC-seq2 data from Frederique et al. 2021 (REF)

Input

  • MCAD file

  • Cell metadata

Output

  • Cell-by-5kb-bin AnnData (sparse matrix) with embedding coordinates and cluster labels.

Import

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import anndata
import scanpy as sc

from ALLCools.clustering import tsne, significant_pc_test, filter_regions, remove_black_list_region, tf_idf, binarize_matrix
from ALLCools.plot import *

Parameters

metadata_path = '../../data/PIT/PIT.CellMetadata.csv.gz'
mcad_path = '../../data/PIT/PIT.mcad'

# Basic filtering parameters
mapping_rate_cutoff = 0.5
mapping_rate_col_name = 'MappingRate'  # Name may change
final_reads_cutoff = 500000
final_reads_col_name = 'FinalmCReads'  # Name may change
mccc_cutoff = 0.03
mccc_col_name = 'mCCCFrac'  # Name may change
mch_cutoff = 0.2
mch_col_name = 'mCHFrac'  # Name may change
mcg_cutoff = 0.5
mcg_col_name = 'mCGFrac'  # Name may change

# PC cutoff
pc_cutoff = 0.1

# KNN
knn = -1  # -1 means auto determine

# Leiden
resolution = 1

Load Cell Metadata

metadata = pd.read_csv(metadata_path, index_col=0)
print(f'Metadata of {metadata.shape[0]} cells')
metadata.head()
Metadata of 2756 cells
CellInputReadPairs MappingRate FinalmCReads mCCCFrac mCGFrac mCHFrac Plate Col384 Row384 CellTypeAnno
index
PIT_P1-PIT_P2-A1-AD001 1858622.0 0.685139 1612023.0 0.003644 0.679811 0.005782 PIT_P1 0 0 Outlier
PIT_P1-PIT_P2-A1-AD004 1599190.0 0.686342 1367004.0 0.004046 0.746012 0.008154 PIT_P1 1 0 Gonadotropes
PIT_P1-PIT_P2-A1-AD006 1932242.0 0.669654 1580990.0 0.003958 0.683584 0.005689 PIT_P1 1 1 Somatotropes
PIT_P1-PIT_P2-A1-AD007 1588505.0 0.664612 1292770.0 0.003622 0.735217 0.005460 PIT_P2 0 0 Rbpms+
PIT_P1-PIT_P2-A1-AD010 1738409.0 0.703835 1539676.0 0.003769 0.744640 0.006679 PIT_P2 1 0 Rbpms+

Filter Cells

judge = (metadata[mapping_rate_col_name] > mapping_rate_cutoff) & \
        (metadata[final_reads_col_name] > final_reads_cutoff) & \
        (metadata[mccc_col_name] < mccc_cutoff) & \
        (metadata[mch_col_name] < mch_cutoff) & \
        (metadata[mcg_col_name] > mcg_cutoff)

metadata = metadata[judge].copy()
# cell metadata for this example is filtered already
print(f'{metadata.shape[0]} cells passed filtering')
2756 cells passed filtering
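
When a dataset has not been pre-filtered, it can help to see which criterion removes the most cells before applying the combined mask. The following is a minimal sketch on made-up toy metadata; the column names and cutoffs mirror the Parameters cell above, while the values and the names toy, checks, failed_per_criterion and passed are illustrative only.

```python
import pandas as pd

# Toy metadata: column names follow the tutorial, values are invented.
toy = pd.DataFrame(
    {
        "MappingRate": [0.68, 0.45, 0.70, 0.66],
        "FinalmCReads": [1_600_000, 900_000, 300_000, 1_200_000],
        "mCCCFrac": [0.004, 0.003, 0.005, 0.050],
        "mCHFrac": [0.006, 0.008, 0.005, 0.007],
        "mCGFrac": [0.68, 0.75, 0.70, 0.74],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# One boolean column per criterion, using the same cutoffs as the Parameters cell.
checks = pd.DataFrame(
    {
        "MappingRate": toy["MappingRate"] > 0.5,
        "FinalmCReads": toy["FinalmCReads"] > 500_000,
        "mCCCFrac": toy["mCCCFrac"] < 0.03,
        "mCHFrac": toy["mCHFrac"] < 0.2,
        "mCGFrac": toy["mCGFrac"] > 0.5,
    }
)

failed_per_criterion = (~checks).sum()  # number of cells failing each filter
passed = checks.all(axis=1)             # same logic as the combined `judge` mask
print(failed_per_criterion)
print(f"{passed.sum()} of {len(toy)} toy cells pass all filters")
```

On real data, replace toy with the loaded metadata table and the literal cutoffs with the parameter variables.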

Load MCAD

mcad = anndata.read_h5ad(mcad_path)
# add cell metadata to mcad:
mcad.obs = pd.concat([mcad.obs, metadata.reindex(mcad.obs_names)], axis=1)
mcad
AnnData object with n_obs × n_vars = 2756 × 545118
    obs: 'CellInputReadPairs', 'MappingRate', 'FinalmCReads', 'mCCCFrac', 'mCGFrac', 'mCHFrac', 'Plate', 'Col384', 'Row384', 'CellTypeAnno'
    var: 'chrom', 'start', 'end'

Binarize

binarize_matrix(mcad, cutoff=0.95)
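
To make the binarization step concrete, here is a small NumPy sketch of thresholding a hypo-methylation probability matrix. It is an illustration of the idea only, not the ALLCools implementation: the helper name binarize_sketch is hypothetical, and whether the real function uses a strict or inclusive comparison at the cutoff is an assumption.

```python
import numpy as np

def binarize_sketch(prob, cutoff=0.95):
    """Return a 0/1 matrix marking entries whose probability exceeds `cutoff`.

    Assumption: a strict '>' comparison; the library may treat the boundary differently.
    """
    return (np.asarray(prob) > cutoff).astype(np.int8)

# Toy cell-by-bin matrix of hypo-methylation probabilities (invented values).
toy_prob = np.array([[0.99, 0.50, 0.96],
                     [0.10, 0.97, 0.95]])
toy_binary = binarize_sketch(toy_prob)
print(toy_binary)
```

After this step each entry only records whether a bin is confidently hypo-methylated in a cell, which is the kind of binary input the later TF-IDF transform works on.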

Filter Features

filter_regions(mcad, hypo_cutoff=6)
remove_black_list_region(mcad, black_list_path='/home/hanliu/ref/blacklist/mm10-blacklist.v2.bed.gz')
11202 features removed due to overlapping (bedtools intersect -f 0.2) with black list regions.
mcad
AnnData object with n_obs × n_vars = 2756 × 388143
    obs: 'CellInputReadPairs', 'MappingRate', 'FinalmCReads', 'mCCCFrac', 'mCGFrac', 'mCHFrac', 'Plate', 'Col384', 'Row384', 'CellTypeAnno'
    var: 'chrom', 'start', 'end'
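
The feature filter can be pictured as dropping bins that are hypo-methylated in too few cells to be informative. The sketch below assumes that hypo_cutoff acts as a minimum number of hypo-methylated cells per bin; that reading, the inclusive comparison, and the helper name filter_regions_sketch are assumptions for illustration rather than the documented behavior of filter_regions.

```python
import numpy as np

def filter_regions_sketch(binary, min_cells):
    """Keep columns (bins) that are 1 in at least `min_cells` rows (cells)."""
    binary = np.asarray(binary)
    n_hypo_cells = binary.sum(axis=0)   # hypo-methylated cell count per bin
    keep = n_hypo_cells >= min_cells    # assumption: inclusive cutoff
    return binary[:, keep], keep

# Toy binary matrix: 4 cells x 3 bins (invented values).
toy_bins = np.array([[1, 0, 1],
                     [1, 0, 0],
                     [1, 1, 0],
                     [0, 0, 0]])
filtered, keep_mask = filter_regions_sketch(toy_bins, min_cells=2)
print(keep_mask)
print(filtered.shape)
```

In the run above, the two filtering steps together reduce the feature count from 545118 to 388143 bins.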

TF-IDF Transform and Dimension Reduction

# By default we save the result in adata.obsm['X_pca'], the key that many downstream scanpy functions read by default.
# Note that this matrix is not calculated by PCA; it is the output of tf_idf.
tf_idf(mcad, algorithm='arpack', obsm='X_pca')
# choose significant components
significant_pc_test(mcad, p_cutoff=pc_cutoff, update=True)
17 components passed P cutoff of 0.1.
Changing adata.obsm['X_pca'] from shape (2756, 50) to (2756, 17)
17
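
The internals of tf_idf are not shown in this notebook. As a rough mental model, the sketch below applies a generic latent semantic indexing recipe (TF-IDF weighting followed by a truncated SVD) to a tiny binary matrix using NumPy only. The normalization, the log form of the IDF term, and the helper name tf_idf_svd_sketch are assumptions and may differ from what ALLCools does; the significance test on components is not reproduced here.

```python
import numpy as np

def tf_idf_svd_sketch(binary, n_components=2):
    """Generic TF-IDF + truncated SVD on a cell-by-bin 0/1 matrix.

    Assumes every row and column has at least one nonzero entry.
    """
    binary = np.asarray(binary, dtype=float)
    # Term frequency: scale each cell by its total number of hypo-methylated bins.
    tf = binary / binary.sum(axis=1, keepdims=True)
    # Inverse document frequency: bins seen in fewer cells get a larger weight.
    idf = np.log(1 + binary.shape[0] / binary.sum(axis=0))
    weighted = tf * idf
    # Full SVD is fine for a toy matrix; keep only the leading components.
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy matrix: 4 cells x 5 bins; the first two cells are identical.
toy_matrix = np.array([[1, 1, 0, 0, 1],
                       [1, 1, 0, 0, 1],
                       [0, 0, 1, 1, 1],
                       [0, 1, 1, 1, 0]])
embedding = tf_idf_svd_sketch(toy_matrix, n_components=2)
print(embedding.shape)
```

Cells with identical bin profiles land on the same point in the reduced space, which is the property the neighbor graph in the next step relies on.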

Clustering

Calculate Nearest Neighbors

if knn == -1:
    knn = max(15, int(np.log2(mcad.shape[0])*2))
sc.pp.neighbors(mcad, n_neighbors=knn)
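
The automatic choice of k above grows slowly with dataset size and never drops below 15. For the 2756 cells in this dataset, log2(2756) is about 11.43, doubling gives about 22.86, truncating gives 22, and since 22 exceeds the floor of 15 the graph is built with 22 neighbors. The helper name auto_knn below is only for illustration; the formula is the one used in the cell above.

```python
import numpy as np

def auto_knn(n_cells, floor=15):
    """Number of neighbors from the rule max(floor, int(2 * log2(n_cells)))."""
    return max(floor, int(np.log2(n_cells) * 2))

for n in (100, 2756, 100_000):
    print(n, auto_knn(n))
```

Small datasets therefore stay at the floor of 15, while k only passes 30 once the cell count exceeds roughly 2**15.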

Leiden Clustering

sc.tl.leiden(mcad, resolution=resolution)

Manifold learning

def dump_embedding(adata, name, n_dim=2):
    # put manifold coordinates into adata.obs
    for i in range(n_dim):
        adata.obs[f'{name}_{i}'] = adata.obsm[f'X_{name}'][:, i]
    return adata

tSNE

tsne(mcad,
     obsm='X_pca',
     metric='euclidean',
     exaggeration=-1,  # auto determined
     perplexity=30,
     n_jobs=-1)
mcad = dump_embedding(mcad, 'tsne')
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
_ = categorical_scatter(data=mcad.obs,
                        ax=ax,
                        coord_base='tsne',
                        hue='leiden',
                        text_anno='leiden',
                        show_legend=True)
[Figure: tSNE embedding of cells, colored and labeled by Leiden cluster]

UMAP

sc.tl.umap(mcad)
mcad = dump_embedding(mcad, 'umap')
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
_ = categorical_scatter(data=mcad.obs,
                        ax=ax,
                        coord_base='umap',
                        hue='leiden',
                        text_anno='leiden',
                        show_legend=True)
[Figure: UMAP embedding of cells, colored and labeled by Leiden cluster]

Interactive plot

interactive_scatter(data=mcad.obs, hue='leiden', coord_base='umap')

Save results

mcad.write_h5ad('PIT.mCG-5K-clustering.h5ad')
mcad
... storing 'Plate' as categorical
... storing 'CellTypeAnno' as categorical
AnnData object with n_obs × n_vars = 2756 × 388143
    obs: 'CellInputReadPairs', 'MappingRate', 'FinalmCReads', 'mCCCFrac', 'mCGFrac', 'mCHFrac', 'Plate', 'Col384', 'Row384', 'CellTypeAnno', 'leiden', 'tsne_0', 'tsne_1', 'umap_0', 'umap_1'
    var: 'chrom', 'start', 'end'
    uns: 'neighbors', 'leiden', 'umap'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    obsp: 'distances', 'connectivities'
mcad.obs.to_csv('PIT.ClusteringResults.csv.gz')
mcad.obs.head()
CellInputReadPairs MappingRate FinalmCReads mCCCFrac mCGFrac mCHFrac Plate Col384 Row384 CellTypeAnno leiden tsne_0 tsne_1 umap_0 umap_1
cell
PIT_P1-PIT_P2-A1-AD001 1858622.0 0.685139 1612023.0 0.003644 0.679811 0.005782 PIT_P1 0 0 Outlier 1 11.064474 0.924930 -4.100986 -6.430517
PIT_P1-PIT_P2-A1-AD004 1599190.0 0.686342 1367004.0 0.004046 0.746012 0.008154 PIT_P1 1 0 Gonadotropes 8 -46.723765 9.968594 2.603770 3.784141
PIT_P1-PIT_P2-A1-AD006 1932242.0 0.669654 1580990.0 0.003958 0.683584 0.005689 PIT_P1 1 1 Somatotropes 2 5.542319 11.836112 -1.326179 -6.594718
PIT_P1-PIT_P2-A1-AD007 1588505.0 0.664612 1292770.0 0.003622 0.735217 0.005460 PIT_P2 0 0 Rbpms+ 4 -1.566156 -44.860068 4.273837 9.352254
PIT_P1-PIT_P2-A1-AD010 1738409.0 0.703835 1539676.0 0.003769 0.744640 0.006679 PIT_P2 1 0 Rbpms+ 6 23.956624 -32.991981 6.260434 3.239084

Sanity test

This test dataset comes from Frederique et al. 2021 (REF), so the cell types are already annotated. For new datasets, see the following notebooks on identifying cluster markers and annotating clusters.

if 'CellTypeAnno' in mcad.obs:
    mcad.obs['CellTypeAnno'] = mcad.obs['CellTypeAnno'].fillna('Outlier')
    
    fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
    _ = categorical_scatter(data=mcad.obs,
                            ax=ax,
                            coord_base='umap',
                            hue='CellTypeAnno',
                            text_anno='CellTypeAnno',
                            palette='tab20',
                            show_legend=True)
[Figure: UMAP embedding of cells, colored and labeled by CellTypeAnno]
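
Besides the visual comparison, agreement between clusters and known annotations can be summarized in a contingency table. The sketch below uses pandas on a made-up table that has the same two columns as mcad.obs ('leiden' and 'CellTypeAnno'); the rows are invented for illustration and do not reflect the actual results, and the purity measure is one simple choice among several.

```python
import pandas as pd

# Toy stand-in for mcad.obs[['leiden', 'CellTypeAnno']] (invented rows).
toy_obs = pd.DataFrame(
    {
        "leiden": ["0", "0", "1", "1", "1", "2"],
        "CellTypeAnno": [
            "Somatotropes", "Somatotropes",
            "Gonadotropes", "Gonadotropes", "Somatotropes",
            "Rbpms+",
        ],
    }
)

# Rows are clusters, columns are annotated cell types, entries are cell counts.
table = pd.crosstab(toy_obs["leiden"], toy_obs["CellTypeAnno"])

# Purity: fraction of each cluster taken up by its most common annotation.
purity = table.max(axis=1) / table.sum(axis=1)
print(table)
print(purity)
```

Clusters with low purity are candidates for a closer look, for example with a different resolution value in the Leiden step.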