BCR clustering

There are many ways to cluster BCRs into clones/clonotypes, and almost all of them rely on some measure of sequence similarity. The BCR community also maintains well-established guidelines and criteria. For example, Immcantation uses a number of model-based methods to group clones based on the distribution of length-normalised junctional Hamming distance, while others use the whole BCR V(D)J sequence to define clones, as shown in this recent paper.

Import modules

[1]:
import os
import pandas as pd
import dandelion as ddl
ddl.logging.print_header()
dandelion==0.1.0 pandas==1.1.4 numpy==1.19.4 matplotlib==3.3.3 networkx==2.5 scipy==1.5.3 skbio==0.5.6
[2]:
# change directory to somewhere more workable
os.chdir(os.path.expanduser('/Users/kt16/Downloads/dandelion_tutorial/'))
# I'm importing scanpy here to make use of its logging module.
import scanpy as sc
sc.settings.verbosity = 3
import warnings
warnings.filterwarnings('ignore')
sc.logging.print_header()
scanpy==1.6.0 anndata==0.7.4 umap==0.4.6 numpy==1.19.4 scipy==1.5.3 pandas==1.1.4 scikit-learn==0.23.2 statsmodels==0.12.1 python-igraph==0.8.3 leidenalg==0.8.3

Read in the previously saved files

I will work with the same example from the previous notebook since I have the filtered V(D)J data stored in a Dandelion class.

[3]:
vdj = ddl.read_h5('dandelion_results.h5')
vdj
[3]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count'
    metadata: 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status'
    distance: None
    edges: None
    layout: None
    graph: None

Finding clones

The following is dandelion’s implementation of a rather conventional method to define clones, tl.find_clones.

Clone definition is based on the following criteria:

(1) Identical IGH V-J gene usage.

(2) Identical CDR3 junctional sequence length.

(3) CDR3 junctional sequences attain a minimum percentage of sequence similarity, based on Hamming distance. The similarity cut-off is tunable (default is 85%).

(4) Light chain usage. If cells within a clone use different light chains, the clone will be split following the same conditions as for heavy chains in (1-3) above.
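
Criteria (1) to (3) can be sketched in plain Python as follows. This is an illustration only, not dandelion's own code: the function names, the tuple layout, the single-linkage grouping rule and the two-part labels are all assumptions made for the example, and the light chain refinement in (4) is left out.

```python
from collections import defaultdict
from itertools import combinations


def hamming_similarity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_heavy_chains(contigs, identity=0.85):
    """Group heavy-chain contigs into illustrative clones.

    `contigs` is a list of (cell_id, v_gene, j_gene, junction_aa) tuples
    (a hypothetical layout). Returns a dict of cell_id -> label.
    """
    # Criteria (1) and (2): bucket by V gene, J gene and junction length.
    buckets = defaultdict(list)
    for cell_id, v_gene, j_gene, junction in contigs:
        buckets[(v_gene, j_gene, len(junction))].append((cell_id, junction))

    labels = {}
    for bucket_no, key in enumerate(sorted(buckets), start=1):
        members = buckets[key]
        # Criterion (3): single-linkage grouping with a small union-find.
        parent = list(range(len(members)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for a, b in combinations(range(len(members)), 2):
            if hamming_similarity(members[a][1], members[b][1]) >= identity:
                parent[find(a)] = find(b)

        roots = {}
        for i, (cell_id, _) in enumerate(members):
            group_no = roots.setdefault(find(i), len(roots) + 1)
            labels[cell_id] = f"{bucket_no}_{group_no}"
    return labels


example = [
    ("cell1", "IGHV1-69", "IGHJ3", "CARDGGYSSGWYFDYWGQGT"),
    ("cell2", "IGHV1-69", "IGHJ3", "CARDGGYSSGWYFDVWGQGS"),  # 18/20 match cell1
    ("cell3", "IGHV1-69", "IGHJ3", "CAKEAAFTTGWYFDYWAAAA"),  # 9/20 match cell1
    ("cell4", "IGHV3-23", "IGHJ3", "CARDGGYSSGWYFDYWGQGT"),  # different V gene
]
labels = group_heavy_chains(example)
print(labels)
```

Here cell1 and cell2 end up together, cell3 shares the V/J/length bucket but falls below the 85% cut-off, and cell4 is kept apart because it uses a different V gene.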

The ‘clone_id’ name follows a {A}_{B}_{C}_{D} format and largely reflects the conditions above where:

{A} indicates if the contigs use the same IGH V/J genes.

{B} indicates if IGH junctional sequences are equal in length.

{C} indicates if clones are split based on the junctional Hamming distance threshold.

{D} indicates light chain pairing.

The last position will not be annotated if there’s only one group of light chains usage detected in the clone.
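
If you need the individual positions of a clone_id downstream, a small helper along these lines can split them out. The helper and its field names are hypothetical (they are not part of dandelion); the '|' handling reflects the multi-assignment ids, such as '124_4_1_1|124_4_1_2', that appear in the metadata for cells with more than one light chain.

```python
def parse_clone_id(clone_id: str):
    """Split a clone_id string into its {A}_{B}_{C}_{D} positions.

    Returns one dict per id; cells with several assignments carry ids
    joined by '|'. Field names are illustrative only.
    """
    fields = (
        "vj_group",
        "junction_length_group",
        "distance_group",
        "light_chain_group",
    )
    parsed = []
    for single_id in clone_id.split("|"):
        parts = single_id.split("_")
        # Start with every field set to None; a missing last position
        # (no light chain split) therefore stays None.
        record = dict.fromkeys(fields)
        record.update(zip(fields, parts))
        parsed.append(record)
    return parsed


single = parse_clone_id("140_3_1")
multi = parse_clone_id("124_4_1_1|124_4_1_2")
print(single)
print(multi)
```

The values are kept as strings on purpose, since the positions are labels rather than quantities.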

Running tl.find_clones

The function takes a file path, a pandas DataFrame (for example, if you’ve already read in the filtered file with pandas), or a Dandelion class object. By default, the junctional Hamming distance is calculated on the CDR3 junction amino acid sequences, specified via the key option (None defaults to junction_aa). You can switch to the CDR3 junction nucleotide sequences (key = 'junction'), or even the full V(D)J amino acid sequence (key = 'sequence_alignment_aa'), as long as the column name exists in the .data slot.

If you want to use alleles for defining V-J gene usage, specify:

by_alleles = True
[4]:
ddl.tl.find_clones(vdj)
vdj
Finding clonotypes
Finding clones based on heavy chains : 100%|██████████| 176/176 [00:00<00:00, 2821.65it/s]
Refining clone assignment based on light chain pairing : 100%|██████████| 819/819 [00:00<00:00, 1216.02it/s]
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:01)
[4]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status'
    distance: None
    edges: None
    layout: None
    graph: None

This adds a new column named 'clone_id', as per convention. If a file path is provided as input, the file is also saved automatically into the base directory of the file name. Otherwise, a Dandelion object will be returned.

[5]:
vdj.metadata
[5]:
clone_id clone_id_by_size sample_id locus_heavy locus_light productive_heavy productive_light v_call_genotyped_heavy v_call_genotyped_light j_call_heavy ... umi_count_light_0 umi_count_light_1 umi_count_light_2 junction_aa_heavy junction_aa_light status productive isotype vdj_status_detail vdj_status
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC 140_3_1 362 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV1-69 IGKV1-8 IGHJ3 ... 43.0 NaN NaN CATTYYYDSSGYYQNDAFDIW CQQYYSYPRTF IGH + IGK T + T IgM Single + Single Single
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG 88_4_1 278 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV1-2 IGLV5-45 IGHJ3 ... 90.0 NaN NaN CAREIEGDGVFEIW CMIWHSSAWVV IGH + IGL T + T IgM Single + Single Single
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC 98_2_2 726 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV5-51 IGKV1D-8 IGHJ3 ... 22.0 NaN NaN CARHIRGNRFGNDAFDIW CQQYYSFPYTF IGH + IGK T + T IgM Single + Single Single
sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT 64_8_1 645 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV3-15 IGLV6-57 IGHJ4 ... 40.0 NaN NaN CTTDDEKRPYSGSYLPFDYW CQSYDSSNVVF IGH + IGL T + T IgM Single + Multi_light_j Single
sc5p_v2_hs_PBMC_10k_AACCATGCAAGCTGTT 9_4_1 677 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV3-33 IGLV2-14 IGHJ6 ... 36.0 NaN NaN CARDWVRGVNDMDVW CSSYTSSSTRVF IGH + IGL T + T IgM Single + Single Single
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
vdj_v1_hs_pbmc3_TCTTCGGTCCTAAGTG 63_8_2 181 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-59 IGKV1-12 IGHJ4 ... 33.0 NaN NaN CARVNVGGIAVAGYFDYW CQQANSFPLTF IGH + IGK T + T IgM Single + Single Single
vdj_v1_hs_pbmc3_TGCACCTCAGACAAAT 128_8_1 723 vdj_v1_hs_pbmc3 IGH IGK T T IGHV3-21 IGKV3-20 IGHJ6 ... 18.0 NaN NaN CARVRQEYYDFWSGYPAEVYYYMDVW CQQYGSSPLFTF IGH + IGK T + T IgM Single + Single Single
vdj_v1_hs_pbmc3_TGTATTCTCTGTTGAG 146_7_2 799 vdj_v1_hs_pbmc3 IGH IGL T T IGHV3-48 IGLV2-14 IGHJ4 ... 80.0 NaN NaN CAREKYDFWSGDSYYFDYW CSSYTSSSTRVF IGH + IGL T + T IgM Single + Single Single
vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT 145_4_2 276 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-34 IGKV1-39|IGKV1D-39 IGHJ3 ... 5.0 NaN NaN CARRRLTYYYDSSGPLSAFDIW CQQSYSTPRTF IGH + IGK T + T IgM Single + Multi_light_v Single
vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC 124_4_1_1|124_4_1_2 635|368 vdj_v1_hs_pbmc3 IGH IGL|IGL T T|F IGHV4-4 IGLV1-40|IGLV1-51 IGHJ5 ... 58.0 15.0 NaN CARGGVSTAFWFDPW CQSYDRSLGGHYVF|CGTWDSSLSAGCA IGH + IGL|IGL T + T|F IgM Single + Multi_light_v|Multi_light_j Multi

838 rows × 24 columns

Alternative : Running tl.define_clones

Alternatively, a wrapper to call changeo’s DefineClones.py is also included. To run it, you need to choose the distance threshold for clonal assignment. To facilitate this, the function pp.calculate_threshold will run shazam’s distToNearest function and return a plot showing the distribution of length-normalised Hamming distances together with the automatically determined threshold value.

Again, pp.calculate_threshold will take a file path, pandas DataFrame or Dandelion object as input. If a Dandelion object is provided, the threshold value will be inserted into the .threshold slot. For finer control, please use DefineClones.py directly.
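
To make the quantity being thresholded more concrete, here is a simplified sketch of a length-normalised Hamming distance to the nearest neighbour. It is not shazam’s distToNearest, which has its own distance models and options; the function names and the tuple layout are hypothetical, and I only compare sequences that share V gene, J gene and junction length.

```python
from collections import defaultdict


def normalised_hamming(a: str, b: str) -> float:
    """Hamming distance divided by length (0 = identical, 1 = all differ)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_to_nearest(records):
    """Smallest normalised distance from each record to any other record
    in the same V gene / J gene / junction length group.

    `records` is a list of (sequence_id, v_gene, j_gene, junction) tuples.
    Records with no comparable partner get None.
    """
    groups = defaultdict(list)
    for seq_id, v_gene, j_gene, junction in records:
        groups[(v_gene, j_gene, len(junction))].append((seq_id, junction))

    nearest = {}
    for members in groups.values():
        for seq_id, junction in members:
            distances = [
                normalised_hamming(junction, other)
                for other_id, other in members
                if other_id != seq_id
            ]
            nearest[seq_id] = min(distances) if distances else None
    return nearest


records = [
    ("s1", "IGHV1-2", "IGHJ3", "TGTGCGAGAGAT"),
    ("s2", "IGHV1-2", "IGHJ3", "TGTGCGAGAGAC"),  # 1 mismatch vs s1
    ("s3", "IGHV1-2", "IGHJ3", "TGTGCTAGGGAC"),  # 2 mismatches vs s2
    ("s4", "IGHV3-23", "IGHJ4", "TGTGCGAAA"),    # no partner in its group
]
nearest = distance_to_nearest(records)
for seq_id, value in nearest.items():
    print(seq_id, value)
```

A threshold is then a cut-off on these values: sequences whose nearest neighbour lies below it are treated as clonally related, and those above it are not.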

[6]:
ddl.pp.calculate_threshold(vdj)
Calculating threshold
      Threshold method 'density' did not return with any values. Switching to method = 'gmm'.
../_images/notebooks_3_dandelion_findingclones-10x_data_13_1.png
<ggplot: (360081101)>
 finished: Updated Dandelion object:
   'threshold', threshold value for tuning clonal assignment
 (0:00:43)
[7]:
# see the actual value in .threshold slot
vdj.threshold
[7]:
0.21354295894548617

You can also manually select a value as the threshold if you wish.

[8]:
ddl.pp.calculate_threshold(vdj, manual_threshold = 0.1)
Calculating threshold
      Threshold method 'density' did not return with any values. Switching to method = 'gmm'.
../_images/notebooks_3_dandelion_findingclones-10x_data_16_1.png
<ggplot: (360720905)>
 finished: Updated Dandelion object:
   'threshold', threshold value for tuning clonal assignment
 (0:00:26)
[9]:
# see the updated .threshold slot
vdj.threshold
[9]:
0.1

We can run tl.define_clones to call changeo’s DefineClones.py; see here for more info. Note that if a pandas.DataFrame or file path is provided as the input, the value for the dist option (which corresponds to the threshold value) must be supplied manually. If a Dandelion object is provided, the value is retrieved automatically from the threshold slot.

[10]:
ddl.tl.define_clones(vdj, key_added = 'changeo_clone_id')
vdj
Finding clones
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:09)
[10]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id'
    distance: None
    edges: None
    layout: None
    graph: None

Note that I specified the option key_added, which writes the output from tl.define_clones into a separate column. If left as default (None), the output is written into the clone_id column. The same option can be specified in tl.find_clones earlier.

Generation of BCR network

dandelion generates a network to facilitate visualisation of results. This uses the full V(D)J contig sequences instead of just the junctional sequences to chart a tree-like network for each clone. The actual visualization will be achieved through scanpy later.
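
One common way to get a tree-like network out of pairwise distances is a minimum spanning tree. The sketch below shows that idea with Prim’s algorithm on a toy distance matrix; treating the clone network as a minimum spanning tree is my assumption for illustration, not a description of what dandelion does internally, and the cell names and distances are made up.

```python
def minimum_spanning_tree(names, dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of (source, target, weight) edges that connect all
    nodes with the smallest possible total weight.
    """
    n = len(names)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge linking the current tree to a new node.
        weight, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        in_tree.add(j)
        edges.append((names[i], names[j], weight))
    return edges


names = ["cellA", "cellB", "cellC", "cellD"]
dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
edges = minimum_spanning_tree(names, dist)
for source, target, weight in edges:
    print(source, target, weight)
```

Four cells give three edges, and each edge is a (source, target, weight) triple.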

tl.generate_network

First we need to generate the network. tl.generate_network will take a V(D)J table that has clones defined, specifically under the 'clone_id' column. The default mode is to use amino acid sequences for constructing Levenshtein distance matrices, but this can be changed using the key option.

If you have a pre-processed table parsed from Immcantation’s method, or any other method, it can be used as well, as long as it’s in an AIRR format.
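
For readers unfamiliar with it, Levenshtein distance counts the substitutions, insertions and deletions needed to turn one sequence into another, so unlike Hamming distance it copes with sequences of different lengths. Below is a minimal pure-Python sketch of a pairwise matrix; it is only meant to illustrate the measure, the function names are mine, and dandelion’s own implementation may well differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance allowing substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def distance_matrix(sequences):
    """Dense symmetric matrix of pairwise Levenshtein distances."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(sequences[i], sequences[j])
    return matrix


toy_sequences = ["CARDYW", "CARDW", "CTRDYW"]
matrix = distance_matrix(toy_sequences)
for row in matrix:
    print(row)
```

Note that the matrix is dense and grows with the square of the number of sequences, which is why this step slows down as more contigs are provided.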

You can specify the clone_key option for generating the network for the clone id definition of choice as long as it exists as a column in the .data slot.

[11]:
ddl.tl.generate_network(vdj)
Generating network
Calculating distances... : 100%|██████████| 4/4 [00:03<00:00,  1.15it/s]
Generating edge list : 100%|██████████| 7/7 [00:00<00:00, 848.90it/s]
Linking edges : 100%|██████████| 821/821 [00:00<00:00, 5324.08it/s]
generating network layout
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
   'distance', heavy and light chain distance matrices
   'edges', network edges
   'layout', network layout
   'graph', network (0:00:10)

This step works reasonably fast here but will take quite a while when a lot of contigs are provided.

You can also downsample the number of cells. This will return a new object as a downsampled copy of the original with its own distance matrix.

[12]:
vdj_downsample = ddl.tl.generate_network(vdj, downsample = 500)
vdj_downsample
Generating network
Downsampling to 500 cells.
Calculating distances... : 100%|██████████| 4/4 [00:01<00:00,  3.40it/s]
Generating edge list : 100%|██████████| 2/2 [00:00<00:00, 757.03it/s]
Linking edges : 100%|██████████| 492/492 [00:00<00:00, 6924.50it/s]
generating network layout
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
   'distance', heavy and light chain distance matrices
   'edges', network edges
   'layout', network layout
   'graph', network (0:00:08)
[12]:
Dandelion class object with n_obs = 500 and n_contigs = 1016
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status'
    distance: 'heavy_0', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 500 vertices, layout for 10 vertices
    graph: networkx graph of 500 vertices, networkx graph of 10 vertices

Check the newly re-initialized Dandelion object.

[13]:
vdj
[13]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id'
    distance: 'heavy_0', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices

The graphs/networks can be accessed through the .graph slot as a networkx graph object if you want to extract the data for network statistics or make any changes to the network.

At this point, we can save the dandelion object; the file can be quite big because the distance matrix is not sparse. I recommend some form of compression (I use bzip2 below, but that can significantly affect read/write times). See here for compression options.
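
To get a feel for the size versus time trade-off, the rough sketch below compresses a toy dense matrix with two standard-library codecs. It does not touch HDF5 or write_h5 at all; the matrix contents and size are made up, and real distance matrices will compress differently, so treat the numbers as indicative only.

```python
import bz2
import struct
import time
import zlib

n = 300
# Toy dense symmetric matrix of small distances, stored as 8-byte floats.
rows = [[float(abs(i - j) % 7) for j in range(n)] for i in range(n)]
raw = b"".join(struct.pack(f"{n}d", *row) for row in rows)

results = {}
for name, compress in (("zlib", zlib.compress), ("bzip2", bz2.compress)):
    start = time.perf_counter()
    packed = compress(raw)
    results[name] = (len(packed), time.perf_counter() - start)

print(f"raw: {len(raw)} bytes")
for name, (size, seconds) in results.items():
    print(f"{name}: {size} bytes in {seconds:.4f} s")
```

Timings vary from machine to machine, so it is worth running a quick check like this on your own data before settling on a compression library.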

[14]:
vdj.write_h5('dandelion_results.h5', complib = 'bzip2')