Filtering

We now move on to filtering out BCR contigs (and corresponding cells if necessary) from the BCR data and transcriptome object loaded in scanpy.
Import dandelion module
[1]:
import os
os.chdir(os.path.expanduser('/Users/kt16/Documents/Github/dandelion'))
import dandelion as ddl
# change directory to somewhere more workable
os.chdir(os.path.expanduser('/Users/kt16/Downloads/dandelion_tutorial/'))
Import modules for use with scanpy
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import warnings
import functools
import seaborn as sns
import scipy.stats
import anndata
warnings.filterwarnings('ignore')
sc.logging.print_header()
scanpy==1.6.0 anndata==0.7.4 umap==0.4.6 numpy==1.19.4 scipy==1.5.3 pandas==1.1.4 scikit-learn==0.23.2 statsmodels==0.12.1 python-igraph==0.8.3 leidenalg==0.8.3
Import the transcriptome data
[3]:
samples = ['sc5p_v2_hs_PBMC_1k', 'sc5p_v2_hs_PBMC_10k', 'vdj_v1_hs_pbmc3', 'vdj_nextgem_hs_pbmc3']
adata_list = []
for sample in samples:
    adata = sc.read_10x_h5(sample + '/' + sample + '_filtered_feature_bc_matrix.h5', gex_only=True)
    adata.obs['sampleid'] = sample
    # rename cells to sample id + barcode
    adata.obs_names = [str(sample) + '_' + str(j) for j in adata.obs_names]
    adata.var_names_make_unique()
    adata_list.append(adata)
adata = adata_list[0].concatenate(adata_list[1:])
# rename the obs_names again, this time cleaving the trailing -#
adata.obs_names = [str(j).split('-')[0] for j in adata.obs_names]
adata
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[3]:
AnnData object with n_obs × n_vars = 30471 × 31915
obs: 'sampleid', 'batch'
var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3'
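The two renaming steps in the cell above are there so that the cell names line up with the sample-prefixed barcodes used in the BCR contig names. The following is a minimal sketch of the string handling only, on a single made-up call; `harmonise_barcode` is a hypothetical helper, and it assumes that the 10x barcodes carry a trailing `-1` and that `concatenate` appends a `-<batch>` suffix.

```python
def harmonise_barcode(sample, barcode, batch_index):
    """Mimic the renaming done in the loading loop (illustrative helper only)."""
    # step 1: prefix the barcode with the sample id
    name = f"{sample}_{barcode}"
    # step 2: assumed effect of concatenate(), which adds a '-<batch>' suffix
    name = f"{name}-{batch_index}"
    # step 3: keep everything before the first '-', dropping both suffixes
    return name.split("-")[0]


cell = harmonise_barcode("sc5p_v2_hs_PBMC_1k", "AACTCCCAGGCTAGGT-1", 0)
print(cell)
```

Note that splitting on the first `-` would also truncate a sample id that itself contains a hyphen, so this only works for sample names like the ones used here.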
I’m using a wrapper called pp.recipe_scanpy_qc to run through a generic scanpy QC workflow. You can skip this step if you already have a pre-processed AnnData object for the subsequent steps.
[4]:
ddl.pp.recipe_scanpy_qc(adata)
adata
[4]:
AnnData object with n_obs × n_vars = 30471 × 31915
obs: 'sampleid', 'batch', 'scrublet_score', 'n_genes', 'percent_mito', 'n_counts', 'is_doublet', 'filter_rna'
var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3'
To proceed, the .obs slot needs to contain a filter_rna column.
If you have a pre-processed/filtered AnnData object that is ready, just add a 'filter_rna' column to the .obs slot with every value set to False and you should be good to go:
adata.obs['filter_rna'] = False
[5]:
# adata.obs['filter_rna'] = False
Filter cells that are potential doublets or of poor quality in both the BCR data and transcriptome data
We use the function pp.filter_bcr to mark and filter out cells and contigs from both the BCR data and the transcriptome data in AnnData. The operation removes poor-quality cells based on transcriptome information and removes BCR doublets (multiplet heavy and/or light chains) from the BCR data. In some situations, a single cell can have multiple heavy/light chain contigs that share an identical V(D)J+C alignment; in these cases, the contigs with fewer UMIs are dropped and the UMIs are transferred to the duplicate_count column. The same procedure is applied to both heavy and light chains before doublets are identified.
Cells in the gene expression object without BCR information are not affected, i.e. the AnnData object can hold non-B cells. Run ?ddl.pp.filter_bcr to check what each option does.
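To make the duplicate-contig step more concrete, here is a rough pandas sketch on a made-up table. It is not the code inside pp.filter_bcr: the grouping key (cell plus V, J and C calls) and the choice to store the UMIs of the dropped contigs in duplicate_count are assumptions based on the description above.

```python
import pandas as pd

# toy contig table; column names follow the AIRR-style columns used in this tutorial
contigs = pd.DataFrame({
    "sequence_id": ["c1_contig_1", "c1_contig_2", "c1_contig_3", "c2_contig_1"],
    "cell_id": ["c1", "c1", "c1", "c2"],
    "v_call": ["IGHV1-69*01", "IGHV1-69*01", "IGKV3-20*01", "IGHV3-23*01"],
    "j_call": ["IGHJ4*02", "IGHJ4*02", "IGKJ2*01", "IGHJ4*02"],
    "c_call": ["IGHM", "IGHM", "IGKC", "IGHM"],
    "umi_count": [12, 3, 7, 5],
})

# assumed definition of "identical alignment" for this sketch
key = ["cell_id", "v_call", "j_call", "c_call"]

# rank contigs so the one with the most UMIs comes first within each group
ranked = contigs.sort_values("umi_count", ascending=False)
grouped = ranked.groupby(key, sort=False)

# keep the top contig per group and record the UMIs of the dropped duplicates
kept = grouped.head(1).copy()
group_total = grouped["umi_count"].transform("sum")
kept["duplicate_count"] = group_total.loc[kept.index] - kept["umi_count"]
kept = kept.sort_index()

print(kept[["sequence_id", "umi_count", "duplicate_count"]])
```

In this toy example the second heavy chain contig of cell c1 is dropped and its 3 UMIs are recorded against the contig that is kept.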
[6]:
# first we read in the 4 bcr files
bcr_files = []
for sample in samples:
    file_location = sample + '/dandelion/data/' + sample + '_b_filtered_contig_igblast_db-pass_genotyped.tsv'
    bcr_files.append(pd.read_csv(file_location, sep = '\t'))
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
bcr = pd.concat(bcr_files)
bcr.reset_index(inplace = True, drop = True)
bcr
[6]:
 | sequence_id | sequence | rev_comp | productive | v_call | d_call | j_call | sequence_alignment | germline_alignment | junction | ... | fwr3_aa | fwr4_aa | cdr1_aa | cdr2_aa | cdr3_aa | sequence_alignment_aa | v_sequence_alignment_aa | d_sequence_alignment_aa | j_sequence_alignment_aa | mu_freq |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_2 | ATACTTTCTGAGAGTCCTGGACCTCCTGTGCAAGAACATGAAACAT... | F | T | IGHV4-61*02 | IGHD3-3*01 | IGHJ6*02 | CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTT... | CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTT... | TGTGCGAGAGAAAATTACGATTTTTGGAGTGGTTATTACCACGGTG... | ... | NYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYC | WGQGTTVTVSS | GGSISSGSYY | IYTSGST | ARENYDFWSGYYHGADV | QVQLQESGPGLVKPSQTLSLTCTVSGGSISSGSYYWSWIRQPAGKG... | QVQLQESGPGLVKPSQTLSLTCTVSGGSISSGSYYWSWIRQPAGKG... | YDFWSGY | YHGADVWGQGTTVTVSS | 0.008571 |
1 | sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_3 | GGCTGGGGTCTCAGGAGGCAGCGCTCTGGGGACGTCTCCACCATGG... | F | F | IGLV2-5*01 | NaN | IGLJ3*02 | CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTG... | CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTG... | TGCTGCTCATATACAAGCAGTGCCACTTTCTTGGGTGTTC | ... | TQPSGVPDRFSGSKSGNTASMTISGLQAEDEADY*C | RRRDQADRP | SSDVGSYDY | NVN | CSYTSSATFLG | QSALIQPPSVSGSPGQSVTISCTGTSSDVGSYDYVSWYQQHPGTVP... | QSALIQPPSVSGSPGQSVTISCTGTSSDVGSYDYVSWYQQHPGTVP... | NaN | LGVRRRDQADRP | 0.000000 |
2 | sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_1 | ACTGCGGGGGTAAGAGGTTGTGTCCACCATGGCCTGGACTCCTCTC... | F | T | IGLV5-45*03 | NaN | IGLJ3*02 | CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... | CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... | TGTATGATTTGGCACAGCAGCGCTTGGGTGTTC | ... | QQGSGVPSRFSGSKDASANAGILLISGLQSEDEADYYC | FGGGTKLTVL | SGINVGTYR | YKSDSDK | MIWHSSAWV | QAVLTQPSSLSASPGASASLTCTLRSGINVGTYRIYWYQQKPGSPP... | QAVLTQPSSLSASPGASASLTCTLRSGINVGTYRIYWYQQKPGSPP... | NaN | VFGGGTKLTVL | 0.000000 |
3 | sc5p_v2_hs_PBMC_1k_AACTCTTGTCATCGGC_contig_3 | AGCAGAGCTCTGGGGAGTCTGCACCATGGCTTGGACCCCACTCCTC... | F | F | IGLV4-69*01 | NaN | IGLJ3*02 | CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG... | CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG... | TGTCAGACCTGGGGCACTGGCATTCTTGGGTGTTC | ... | SKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYC | SAEGPS*PS | SGHSSYA | LNSDGSH | QTWGTGILG | QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRY... | QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRY... | NaN | GCSAEGPS*PS* | 0.000000 |
4 | sc5p_v2_hs_PBMC_1k_AACTCTTGTCATCGGC_contig_1 | AGAGCTCTGGGGAGTCTGCACCATGGCTTGGACCCCACTCCTCTTC... | F | T | IGLV4-69*01 | NaN | IGLJ1*01 | CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG... | CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG... | TGTCAGACCTGGGGCACTGGCATTTATGTCTTC | ... | SKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYC | FGTGTKVTVL | SGHSSYA | LNSDGSH | QTWGTGIYV | QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRY... | QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRY... | NaN | YVFGTGTKVTVL | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7507 | vdj_nextgem_hs_pbmc3_TTTGCGCTCTGTCAAG_contig_2 | ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCCGGGAGCC... | F | T | IGHV1-69*01,IGHV1-69D*01 | IGHD3-22*01 | IGHJ4*02 | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAAGTGAAGAAGCCTG... | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | TGTGCGAGGGGGAAGTATTACTATGATAAAAGTGGGTCTCCACCTC... | ... | NYAQKFQGRVSITADESTTTAYMELSSLRSEDSAVYYC | WGQGTLVTVSS | GGIFSSYA | IIPIFGAT | ARGKYYYDKSGSPPPIYSFDY | QVQLVQSGAEVKKPGSSVKVSCKVSGGIFSSYAISWVRQAPGQGLE... | QVQLVQSGAEVKKPGSSVKVSCKVSGGIFSSYAISWVRQAPGQGLE... | YYYDKSG | FDYWGQGTLVTVSS | 0.047619 |
7508 | vdj_nextgem_hs_pbmc3_TTTGGTTGTAAGGATT_contig_1 | AGAGCTCTGGAGAAGAGCTGCTCAGTTAGGACCCAGAGGGAACCAT... | F | T | IGKV3-20*01 | NaN | IGKJ2*01,IGKJ2*02 | GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... | GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... | TGTCAGCAGTATGATGAGTCACCTCTGACTTTT | ... | SRATGIPDRFSGSGSGTDFTLTISRLVPEDFAVYYC | FGQGTKLEIK | QSLTNSQ | GAS | QQYDESPLT | EIVLTQSPGTLSLSPGERATLSCRASQSLTNSQLAWYQQKPGQAPR... | EIVLTQSPGTLSLSPGERATLSCRASQSLTNSQLAWYQQKPGQAPR... | NaN | TFGQGTKLEIK | 0.034161 |
7509 | vdj_nextgem_hs_pbmc3_TTTGGTTGTAAGGATT_contig_2 | AGCTCTGGGAGAGGAGCCCCAGCCCTGAGATTCCCAGGTGTTTCCA... | F | T | IGHV3-9*01 | IGHD5-18*01,IGHD5-5*01 | IGHJ6*03 | GAAGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | GAAGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | TGTGCAAAAGACGGATACAGCTATCGTTCGTCATACTACTTTTACA... | ... | GYADSVKGRFTISRDNAKNSLYLQMNSLRAEDTALYYC | WGKGTTVTVSS | GFSFDDYV | ISWNSGRT | AKDGYSYRSSYYFYMDV | EVQLVESGGGLVQPGRSLRLSCAASGFSFDDYVMHWVRQAPGKGLE... | EVQLVESGGGLVQPGRSLRLSCAASGFSFDDYVMHWVRQAPGKGLE... | GYSYR | YYFYMDVWGKGTTVTVSS | 0.028571 |
7510 | vdj_nextgem_hs_pbmc3_TTTGTCACAGTAGAGC_contig_1 | AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCAT... | F | T | IGHV3-23*01,IGHV3-23D*01 | IGHD4-17*01 | IGHJ4*02 | GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | TGTGCGAAAGATTTTAGGTCGCCATACGGTGACTACTACTTTGACT... | ... | YYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYC | WGQGTLVTVSS | GFTFSSYA | ISGSGGST | AKDFRSPYGDYYFDY | EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE... | EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE... | YGD | YFDYWGQGTLVTVSS | 0.000000 |
7511 | vdj_nextgem_hs_pbmc3_TTTGTCACAGTAGAGC_contig_2 | GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGAT... | F | T | IGLV3-25*03 | NaN | IGLJ1*01 | TCCTATGAGCTGACACAGCCACCCTCG...GTGTCAGTGTCCCCAG... | TCCTATGAGCTGACACAGCCACCCTCG...GTGTCAGTGTCCCCAG... | TGTCAATCAGCAGACAGCAGTGGTACTTATCTTTATGTCTTC | ... | ERPSGIPERFSGSSSGTTVTLTISGVQAEDEADYYC | FGTGTKVTVL | ALPKQY | KDS | QSADSSGTYLYV | SYELTQPPSVSVSPGQTARITCSGDALPKQYAYWYQQKPGQAPVLV... | SYELTQPPSVSVSPGQTARITCSGDALPKQYAYWYQQKPGQAPVLV... | NaN | YVFGTGTKVTVL | 0.000000 |
7512 rows × 81 columns
[7]:
# The function will return both objects.
vdj, adata = ddl.pp.filter_bcr(bcr, adata)
Scanning for poor quality/ambiguous contigs with 3 cpus
Annotating in anndata obs slot : 100%|██████████| 30471/30471 [00:00<00:00, 79083.21it/s]
Finishing up filtering
Initializing Dandelion object
The default mode is to filter any remaining ‘doublet’ light chains, but some users may be interested in keeping them. This behaviour can be changed by toggling:
filter_lightchains=False
Another default behaviour is that if a cell in the BCR table cannot be found in the transcriptomic data, it is also removed from the BCR data. This can be changed by toggling:
filter_missing=False
Also, when contigs are marked as poor quality, the default behaviour is to remove the contigs associated with the barcode, but not the barcode itself from the transcriptome data. This can be toggled to remove the entire cell if the intention is to retain a conservative dataset for both BCR and transcriptome data:
filter_poorqualitybcr=True
Lastly, the default behaviour is to rescue the heavy chain contig with the highest UMI count if there are multiple heavy chain contigs for a single cell. The function requires a minimum fold-difference of 2 between the highest and lowest UMI counts in order to rescue the contig. However, if the contigs have similar numbers of UMIs, or if the sum of the UMIs is very low, the entire cell is filtered. The fold-difference cut-off can be specified via the option umi_foldchange_cutoff. The rescue can also be switched off, i.e. all multiple IGH contigs are dropped, by toggling:
rescue_igh = False
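The heavy chain rescue rule described above can be sketched in plain Python. This is only an illustration of the rule as worded in this section, not the code used by pp.filter_bcr; `rescue_heavy_contig` and its `min_total_umi` threshold are hypothetical, since the tutorial does not say what counts as a "very low" UMI sum.

```python
def rescue_heavy_contig(umis, foldchange_cutoff=2, min_total_umi=3):
    """Return the index of the heavy chain contig to keep, or None to drop the cell.

    umis: UMI counts of all heavy chain contigs found in one cell.
    min_total_umi: hypothetical stand-in for "sum of the UMIs is very low".
    """
    if not umis:
        return None
    if len(umis) == 1:
        return 0  # nothing to rescue, a single contig is kept as is
    if sum(umis) < min_total_umi:
        return None  # too little evidence overall
    highest, lowest = max(umis), min(umis)
    # similar UMI counts: the fold-difference is below the cut-off
    if lowest > 0 and highest / lowest < foldchange_cutoff:
        return None
    return umis.index(highest)


print(rescue_heavy_contig([20, 3]))  # clear winner
print(rescue_heavy_contig([5, 4]))   # too similar
```

With the tutorial objects, the corresponding options would simply be passed as keyword arguments to ddl.pp.filter_bcr, for example rescue_igh=False or a different umi_foldchange_cutoff.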
Check the output V(D)J table
The vdj table is returned as a Dandelion class object, with the table held in the .data slot (described in further detail in the next notebook); if a file was provided to filter_bcr above, a new file will be created in the same folder with the filtered prefix. Note that this vdj table is indexed by contig (sequence_id).
[8]:
vdj
[8]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count'
metadata: 'sample_id', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light'
distance: None
edges: None
layout: None
graph: None
Check the AnnData object as well
And the AnnData object is indexed by cell.
[9]:
adata
[9]:
View of AnnData object with n_obs × n_vars = 16492 × 31915
obs: 'sampleid', 'batch', 'scrublet_score', 'n_genes', 'percent_mito', 'n_counts', 'is_doublet', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr'
var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3'
The .obs slot in the AnnData object now contains a few new columns related to BCR:
1. has_bcr: True/False statement marking cells with a matching BCR (pre-contig filtering).
2. filter_bcr_quality: True/False recommendation for filtering cells identified as having poor quality contigs if filter_poorqualitybcr=True (pre-contig filtering).
3. filter_bcr_heavy: True/False recommendation for filtering cells identified as heavy chain ‘doublets’ (after rescue if rescue_igh=True) (pre-contig filtering).
4. filter_bcr_light: True/False recommendation for filtering cells identified as having multiple light chain contigs (pre-contig filtering).
Most importantly:
5. bcr_QC_pass: True/False statement marking cells where BCR contigs were removed from #1 due to contigs failing QC (post-contig filtering).
6. filter_bcr: True/False recommendation for filtering cells flagged in 2-4 (post-contig filtering).
So to go forward, you want to select only cells whose BCR passed QC (has_bcr == True and bcr_QC_pass == True) and whose filtering recommendation is False (filter_bcr == False).
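As a small illustration of that selection rule, the three columns can be combined into one boolean mask. The sketch below uses a made-up .obs-like table and assumes the columns hold real booleans; if they are stored as the strings 'True'/'False' in your object, compare against those strings instead.

```python
import pandas as pd

# toy stand-in for adata.obs with the three BCR-related columns
obs = pd.DataFrame(
    {
        "has_bcr": [True, True, False, True],
        "bcr_QC_pass": [True, False, False, True],
        "filter_bcr": [False, False, False, True],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# keep cells with a BCR that passed QC and that are not recommended for filtering
keep = obs["has_bcr"] & obs["bcr_QC_pass"] & ~obs["filter_bcr"]
selected = obs.index[keep].tolist()
print(selected)
```

The same kind of mask, built from adata.obs, could then be used to subset the AnnData object by cell.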
The number of cells that actually have a matching BCR can be tabulated.
[10]:
pd.crosstab(adata.obs['has_bcr'], adata.obs['filter_bcr'])
[10]:
filter_bcr | False |
---|---|
has_bcr | |
False | 15603 |
True | 889 |
[11]:
pd.crosstab(adata.obs['has_bcr'], adata.obs['bcr_QC_pass'])
[11]:
bcr_QC_pass | False | True |
---|---|---|
has_bcr | ||
False | 15603 | 0 |
True | 51 | 838 |
[12]:
pd.crosstab(adata.obs['bcr_QC_pass'], adata.obs['filter_bcr'])
[12]:
filter_bcr | False |
---|---|
bcr_QC_pass | |
False | 15654 |
True | 838 |
Now actually filter the AnnData object and run through a standard workflow starting by filtering genes and normalizing the data
Because the ‘filtered’ AnnData object was returned as a filtered but otherwise unprocessed object, we still need to normalize it and run through the usual process here. The following is just a standard scanpy workflow.
[13]:
# filter genes
sc.pp.filter_genes(adata, min_cells=3)
# Normalize the counts
sc.pp.normalize_total(adata, target_sum=1e4)
# Logarithmize the data
sc.pp.log1p(adata)
# Stash the normalised counts
adata.raw = adata
Trying to set attribute `.var` of view, copying.
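For readers new to scanpy, the normalisation and log steps in the cell above amount to the following arithmetic, shown here with numpy on a tiny made-up count matrix (cells in rows, genes in columns). This is a sketch of the idea only and ignores details such as sparse matrices.

```python
import numpy as np

# toy count matrix: 2 cells x 3 genes
counts = np.array([[10.0, 0.0, 30.0],
                   [2.0, 2.0, 4.0]])
target_sum = 1e4

# scale every cell so that its counts add up to target_sum
per_cell_total = counts.sum(axis=1, keepdims=True)
normalised = counts / per_cell_total * target_sum

# log(1 + x) keeps zeros at zero and compresses large values
logged = np.log1p(normalised)

print(normalised.sum(axis=1))
```

After the first step every cell has the same total, so differences in sequencing depth no longer dominate the comparison between cells.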
Identify highly-variable genes
[14]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

Filter the genes to only those marked as highly-variable
[15]:
adata = adata[:, adata.var.highly_variable]
Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.
[16]:
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
sc.pp.scale(adata, max_value=10)
Trying to set attribute `.obs` of view, copying.
... storing 'sampleid' as categorical
Trying to set attribute `.var` of view, copying.
... storing 'feature_types' as categorical
Trying to set attribute `.var` of view, copying.
... storing 'genome' as categorical
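The idea behind regressing out a covariate and then scaling can be sketched for a single gene with numpy. This is an illustration on simulated data, not scanpy's implementation: it assumes an ordinary least-squares fit with an intercept, and the variance estimator used for scaling may differ from the one scanpy uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200
covariate = rng.normal(size=n_cells)  # stand-in for e.g. n_counts
expression = 3.0 * covariate + rng.normal(size=n_cells)  # one simulated gene

# regress out: fit expression ~ intercept + covariate and keep the residuals
design = np.column_stack([np.ones(n_cells), covariate])
coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
residuals = expression - design @ coef

# scale: centre, divide by the standard deviation, cap values above max_value
max_value = 10
scaled = (residuals - residuals.mean()) / residuals.std()
scaled = np.minimum(scaled, max_value)

print(np.corrcoef(residuals, covariate)[0, 1])
```

The residuals are uncorrelated with the covariate (up to floating point error), which is what "regressing out" is meant to achieve.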
Run PCA
[17]:
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)

Computing the neighborhood graph, umap and clusters
[18]:
# Computing the neighborhood graph
sc.pp.neighbors(adata)
# Embedding the neighborhood graph
sc.tl.umap(adata)
# Clustering the neighborhood graph
sc.tl.leiden(adata)
Visualizing the clusters and whether or not there’s a corresponding BCR
[19]:
sc.pl.umap(adata, color=['leiden', 'bcr_QC_pass'])

Visualizing some B cell genes
[20]:
sc.pl.umap(adata, color=['IGHM', 'JCHAIN'])

Save AnnData
We can save this AnnData object for now.
[21]:
adata.write('adata.h5ad', compression = 'gzip')
... storing 'feature_types' as categorical
... storing 'genome' as categorical
Save dandelion
To save the vdj object, we have two options: either save the .data and .metadata slots with pandas’ functions:
[22]:
vdj.data.to_csv('filtered_vdj_table.tsv', sep = '\t')
Or save the whole Dandelion class object with either .write_h5, which saves the class in HDF5 format, or the pickle-based .write_pkl function.
[23]:
vdj.write_h5('dandelion_results.h5', complib = 'bzip2')
[24]:
vdj.write_pkl('dandelion_results.pkl.pbz2') # this will automatically use bzip2 for compression; switch the extension to .gz for gzip
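For intuition, a pickle-based writer that picks the compression from the file extension could look roughly like the standard-library sketch below. It is not dandelion's implementation; `dump_compressed` and `load_compressed` are hypothetical helpers, and the rule of using gzip for `.gz` and bzip2 otherwise is an assumption based on the comment above.

```python
import bz2
import gzip
import os
import pickle
import tempfile


def _opener(filename):
    # assumed rule for this sketch: '.gz' means gzip, anything else means bzip2
    return gzip.open if filename.endswith(".gz") else bz2.open


def dump_compressed(obj, filename):
    """Pickle obj into a compressed file chosen by extension."""
    with _opener(filename)(filename, "wb") as handle:
        pickle.dump(obj, handle)


def load_compressed(filename):
    """Read back an object written by dump_compressed."""
    with _opener(filename)(filename, "rb") as handle:
        return pickle.load(handle)


payload = {"n_obs": 838, "n_contigs": 1700}
with tempfile.TemporaryDirectory() as tmp:
    bz_path = os.path.join(tmp, "toy.pkl.pbz2")
    gz_path = os.path.join(tmp, "toy.pkl.gz")
    dump_compressed(payload, bz_path)
    dump_compressed(payload, gz_path)
    restored_bz2 = load_compressed(bz_path)
    restored_gz = load_compressed(gz_path)
    # the first bytes of the file show which compressor was used
    with open(bz_path, "rb") as raw:
        bz_magic = raw.read(3)

print(restored_bz2 == payload, restored_gz == payload)
```

As with any pickle, files written this way should only be loaded when they come from a trusted source.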