Feature Basic Filtering¶
Purpose¶
Apply basic filters to remove these problematic features:
Features with extremely low or extremely high coverage
Features overlapping the ENCODE blacklist
Features on certain chromosomes (usually chrY and chrM)
Input¶
Cell metadata (after basic cell filter)
MCDS files
Output¶
FeatureList.BasicFilter.txt: List of feature IDs that passed all filters
Import¶
import pandas as pd
import seaborn as sns
from ALLCools import MCDS
sns.set_context(context='notebook', font_scale=1.3)
Parameters¶
# change this to the path to your filtered metadata
metadata_path = 'CellMetadata.PassQC.csv.gz'
# change this to the paths to your MCDS files
mcds_path_list = [
    '../../../data/Brain/3C-171206.mcds',
    '../../../data/Brain/3C-171207.mcds',
    '../../../data/Brain/9H-190212.mcds',
    '../../../data/Brain/9H-190219.mcds',
]
# Dimension name used to do clustering
obs_dim = 'cell' # observation
var_dim = 'chrom100k' # feature
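# features whose mean coverage across cells falls outside this range are removed
min_cov = 500
max_cov = 3000
# change this to the path to the ENCODE blacklist.
# The ENCODE blacklist can be downloaded from https://github.com/Boyle-Lab/Blacklist/
black_list_path = '../../../data/genome/mm10-blacklist.v2.bed.gz'
# features overlapping any blacklist region by more than this fraction are removed
f = 0.2
# features on these chromosomes are removed
exclude_chromosome = ['chrM', 'chrY']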
Load Data¶
Metadata¶
metadata = pd.read_csv(metadata_path, index_col=0)
total_cells = metadata.shape[0]
print(f'Metadata of {total_cells} cells')
Metadata of 4958 cells
metadata.head()
cell | CCC_Rate | CG_Rate | CH_Rate | FinalReads | InputReads | MappingRate | Plate | Col384 | Row384 | CellTypeAnno |
---|---|---|---|---|---|---|---|---|---|---|
3C_M_0 | 0.00738 | 0.75953 | 0.02543 | 1195574.0 | 2896392 | 0.625773 | CEMBA171206-3C-1 | 18 | 0 | IT-L23 |
3C_M_1 | 0.00938 | 0.77904 | 0.03741 | 1355517.0 | 3306366 | 0.631121 | CEMBA171206-3C-1 | 18 | 1 | IT-L5 |
3C_M_10 | 0.00915 | 0.82430 | 0.03678 | 2815807.0 | 7382298 | 0.657560 | CEMBA171206-3C-1 | 21 | 1 | L6b |
3C_M_100 | 0.00978 | 0.79705 | 0.04231 | 2392650.0 | 5865154 | 0.671600 | CEMBA171206-3C-1 | 0 | 3 | MGE-Pvalb |
3C_M_1000 | 0.00776 | 0.78781 | 0.02789 | 1922013.0 | 4800236 | 0.646285 | CEMBA171206-3C-4 | 3 | 8 | IT-L6 |
MCDS¶
mcds = MCDS.open(mcds_path_list, obs_dim=obs_dim, use_obs=metadata.index)
total_feature = mcds.get_index(var_dim).size
mcds
<xarray.MCDS>
Dimensions:              (cell: 4958, chrom100k: 27269, count_type: 2, gene: 55487, mc_type: 2)
Coordinates:
  * mc_type              (mc_type) object 'CGN' 'CHN'
  * cell                 (cell) object '3C_M_0' '3C_M_1' ... '9H_M_3061'
  * gene                 (gene) object 'ENSMUSG00000102693.1' ... 'ENSMUSG000...
  * count_type           (count_type) object 'mc' 'cov'
    strand_type          <U4 'both'
    gene_chrom           (gene) object dask.array<chunksize=(55487,), meta=np.ndarray>
    gene_start           (gene) int64 dask.array<chunksize=(55487,), meta=np.ndarray>
    gene_end             (gene) int64 dask.array<chunksize=(55487,), meta=np.ndarray>
  * chrom100k            (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268
    chrom100k_chrom      (chrom100k) object dask.array<chunksize=(27269,), meta=np.ndarray>
    chrom100k_bin_start  (chrom100k) int64 dask.array<chunksize=(27269,), meta=np.ndarray>
    chrom100k_bin_end    (chrom100k) int64 dask.array<chunksize=(27269,), meta=np.ndarray>
Data variables:
    gene_da              (cell, gene, mc_type, count_type) uint16 dask.array<chunksize=(1199, 55487, 2, 2), meta=np.ndarray>
    chrom100k_da         (cell, chrom100k, mc_type, count_type) uint16 dask.array<chunksize=(1199, 27269, 2, 2), meta=np.ndarray>
Filter Features¶
Filter by mean coverage¶
mcds.add_feature_cov_mean(var_dim=var_dim)
Feature chrom100k mean cov across cells added in MCDS.coords['chrom100k_cov_mean'].

mcds = mcds.filter_feature_by_cov_mean(
    var_dim=var_dim,
    min_cov=min_cov,  # minimum coverage
    max_cov=max_cov  # maximum coverage
)
Before cov mean filter: 27269 chrom100k
After cov mean filter: 25235 chrom100k 92.5%
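To make the coverage filter concrete, here is a minimal sketch in plain Python of the idea behind it: average each feature's coverage across cells and keep the feature only if that mean lies between `min_cov` and `max_cov`. This is not the ALLCools implementation. The `toy_cov` values and the `filter_by_cov_mean` helper are hypothetical, and the use of strict inequalities at the bounds is an assumption.

```python
from statistics import mean

# Toy coverage table (made-up values): feature id -> coverage in each cell.
toy_cov = {
    'bin_0': [120, 90, 150],      # mean 120, below min_cov
    'bin_1': [800, 1200, 1000],   # mean 1000, inside the range
    'bin_2': [4000, 3500, 4500],  # mean 4000, above max_cov
}


def filter_by_cov_mean(cov_by_feature, min_cov, max_cov):
    """Return ids of features whose mean coverage lies between the bounds.

    Strict inequalities are an assumption of this sketch.
    """
    return [
        feature
        for feature, cov in cov_by_feature.items()
        if min_cov < mean(cov) < max_cov
    ]


kept_features = filter_by_cov_mean(toy_cov, min_cov=500, max_cov=3000)
print(kept_features)
```

With the notebook's bounds of 500 and 3000, only the middle toy bin survives. Bins with very low mean coverage carry little signal, and bins with very high mean coverage often reflect mapping artifacts.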
Filter by ENCODE Blacklist¶
mcds = mcds.remove_black_list_region(
    var_dim,
    black_list_path,
    f=f  # Features having overlap > f with any black list region will be removed.
)
1187 chrom100k features removed due to overlapping (bedtools intersect -f 0.2) with black list regions.
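The log line above shows that the overlap test is delegated to `bedtools intersect -f 0.2`, where `-f` is the minimum overlap expressed as a fraction of the feature. The sketch below illustrates that criterion in plain Python for a single feature and a small blacklist. It is an illustration only, not what ALLCools or bedtools run internally. The helper names and all coordinates are hypothetical, and intervals are assumed to be half-open as in BED files.

```python
def overlap_fraction(feature, region):
    """Fraction of `feature` covered by `region`.

    Both arguments are (chrom, start, end) tuples with half-open coordinates.
    """
    f_chrom, f_start, f_end = feature
    r_chrom, r_start, r_end = region
    if f_chrom != r_chrom:
        return 0.0
    overlap = min(f_end, r_end) - max(f_start, r_start)
    return max(overlap, 0) / (f_end - f_start)


def is_blacklisted(feature, blacklist, f=0.2):
    """True if any single blacklist region covers at least fraction f of the feature."""
    return any(overlap_fraction(feature, region) >= f for region in blacklist)


# Made-up blacklist region and 100 kb bins, for illustration only.
toy_blacklist = [('chr1', 90_000, 130_000)]
bin_a = ('chr1', 0, 100_000)        # 10 kb overlap, fraction 0.1
bin_b = ('chr1', 100_000, 200_000)  # 30 kb overlap, fraction 0.3
bin_c = ('chr2', 100_000, 200_000)  # different chromosome, no overlap

for name, feature in [('bin_a', bin_a), ('bin_b', bin_b), ('bin_c', bin_c)]:
    print(name, is_blacklisted(feature, toy_blacklist, f=0.2))
```

Each blacklist region is tested against the feature on its own, so several small regions that together cover more than `f` of a bin would not trigger removal in this sketch.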
Remove chromosomes¶
mcds = mcds.remove_chromosome(var_dim, exclude_chromosome)
18 chrom100k features in ['chrM', 'chrY'] removed.
Save Feature List¶
print(
    f'{mcds.get_index(var_dim).size} ({mcds.get_index(var_dim).size * 100 / total_feature:.1f}%) '
    f'{var_dim} remained after all the basic filter.')
24042 (88.2%) chrom100k remained after all the basic filter.
# use a distinct handle name so the blacklist overlap parameter f is not overwritten
with open('FeatureList.BasicFilter.txt', 'w') as out_file:
    for var in mcds.get_index(var_dim).astype(str):
        out_file.write(var + '\n')
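A later step will need to read this list back. The sketch below shows one way to round-trip a one-id-per-line file using only the standard library. The `feature_ids` values are hypothetical stand-ins for `mcds.get_index(var_dim)`, and a temporary directory is used so the example does not touch the real output file.

```python
import tempfile
from pathlib import Path

# Hypothetical feature ids standing in for mcds.get_index(var_dim).astype(str)
feature_ids = ['0', '1', '5', '42']

with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = Path(tmp_dir) / 'FeatureList.BasicFilter.txt'
    # Write one id per line, the same layout as the file saved above
    list_path.write_text(''.join(f'{fid}\n' for fid in feature_ids))
    # Read the ids back; splitlines() drops the newline at the end of each line
    loaded_ids = list_path.read_text().splitlines()

print(loaded_ids)
```

The ids come back as strings. For an integer-indexed dimension such as `chrom100k`, they would need converting, for example with `[int(i) for i in loaded_ids]`, before being used to select features.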