Dandelion class

Much of the functionality of the dandelion package revolves around the Dandelion class object. The class acts as an intermediary object for storage and for flexible interaction with other tools. This notebook is a quick primer on the Dandelion class.

Import modules

[1]:
import os
os.chdir(os.path.expanduser('/Users/kt16/Downloads/dandelion_tutorial/'))
import dandelion as ddl
ddl.logging.print_versions()
dandelion==0.1.0 pandas==1.1.4 numpy==1.19.4 matplotlib==3.3.3 networkx==2.5 scipy==1.5.3 skbio==0.5.6
[2]:
vdj = ddl.read_h5('dandelion_results.h5')
vdj
[2]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id', 'clone_id_heavy_only'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id', 'd_call_heavy', 'd_call_light', 'clone_id_heavy_only'
    distance: 'heavy', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices

The object can be summarized in the following illustration:

[Figure: dandelion_class]

The .data slot holds the AIRR contig table, while the .metadata slot holds a collapsed version that can be combined with AnnData’s .obs slot. Slots are accessed like ordinary attributes of a class object; for example, to retrieve the metadata:

[3]:
vdj.metadata
[3]:
clone_id clone_id_by_size sample_id locus_heavy locus_light productive_heavy productive_light v_call_genotyped_heavy v_call_genotyped_light j_call_heavy ... junction_aa_light status productive isotype vdj_status_detail vdj_status changeo_clone_id d_call_heavy d_call_light clone_id_heavy_only
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC 102_3_1 563 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV1-69 IGKV1-8 IGHJ3 ... CQQYYSYPRTF IGH + IGK T + T IgM Single + Single Single 110_33 IGHD3-22 102_3_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG 141_4_1 658 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV1-2 IGLV5-45 IGHJ3 ... CMIWHSSAWVV IGH + IGL T + T IgM Single + Single Single 467_34 IGHD3-16|IGHD4-17 141_4_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC 26_2_2 670 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV5-51 IGKV1D-8 IGHJ3 ... CQQYYSFPYTF IGH + IGK T + T IgM Single + Single Single 306_35 IGHD1/OR15-1a|IGHD1/OR15-1b|IGHD1-26 26_2_2
sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT 66_8_3 527 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV3-15 IGLV6-57 IGHJ4 ... CQSYDSSNVVF IGH + IGL T + T IgM Single + Multi_light_j Single 56_36 IGHD1-26 66_8_3
sc5p_v2_hs_PBMC_10k_AACCATGCAAGCTGTT 18_4_1 244 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV3-33 IGLV2-14 IGHJ6 ... CSSYTSSSTRVF IGH + IGL T + T IgM Single + Single Single 125_37 IGHD3-10 18_4_1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
vdj_v1_hs_pbmc3_TCTTCGGTCCTAAGTG 15_8_1 653 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-59 IGKV1-12 IGHJ4 ... CQQANSFPLTF IGH + IGK T + T IgM Single + Single Single 348_483 IGHD6-19 15_8_1
vdj_v1_hs_pbmc3_TGCACCTCAGACAAAT 69_8_1 189 vdj_v1_hs_pbmc3 IGH IGK T T IGHV3-21 IGKV3-20 IGHJ6 ... CQQYGSSPLFTF IGH + IGK T + T IgM Single + Single Single 731_484 IGHD3-3 69_8_1
vdj_v1_hs_pbmc3_TGTATTCTCTGTTGAG 90_7_2 713 vdj_v1_hs_pbmc3 IGH IGL T T IGHV3-48 IGLV2-14 IGHJ4 ... CSSYTSSSTRVF IGH + IGL T + T IgM Single + Single Single 229_485 IGHD3-3 90_7_2
vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT 172_4_2 372 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-34 IGKV1D-39|IGKV1-39 IGHJ3 ... CQQSYSTPRTF IGH + IGK T + T IgM Single + Multi_light_v Single 702_486 IGHD3-22 172_4_2
vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC 48_4_1_1|48_4_1_2 699|28 vdj_v1_hs_pbmc3 IGH IGL|IGL T T|F IGHV4-4 IGLV1-51|IGLV1-40 IGHJ5 ... CQSYDRSLGGHYVF|CGTWDSSLSAGCA IGH + IGL|IGL T + T|F IgM Single + Multi_light_j|Multi_light_v Multi 155_487 IGHD4-17|IGHD4-23 48_4_1

838 rows × 28 columns

copy

You can deep copy a Dandelion object to another variable; the copy inherits all slots:

[4]:
vdj2 = vdj.copy()
vdj2.metadata
[4]:
clone_id clone_id_by_size sample_id locus_heavy locus_light productive_heavy productive_light v_call_genotyped_heavy v_call_genotyped_light j_call_heavy ... junction_aa_light status productive isotype vdj_status_detail vdj_status changeo_clone_id d_call_heavy d_call_light clone_id_heavy_only
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC 102_3_1 563 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV1-69 IGKV1-8 IGHJ3 ... CQQYYSYPRTF IGH + IGK T + T IgM Single + Single Single 110_33 IGHD3-22 102_3_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG 141_4_1 658 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV1-2 IGLV5-45 IGHJ3 ... CMIWHSSAWVV IGH + IGL T + T IgM Single + Single Single 467_34 IGHD3-16|IGHD4-17 141_4_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC 26_2_2 670 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV5-51 IGKV1D-8 IGHJ3 ... CQQYYSFPYTF IGH + IGK T + T IgM Single + Single Single 306_35 IGHD1/OR15-1a|IGHD1/OR15-1b|IGHD1-26 26_2_2
sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT 66_8_3 527 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV3-15 IGLV6-57 IGHJ4 ... CQSYDSSNVVF IGH + IGL T + T IgM Single + Multi_light_j Single 56_36 IGHD1-26 66_8_3
sc5p_v2_hs_PBMC_10k_AACCATGCAAGCTGTT 18_4_1 244 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV3-33 IGLV2-14 IGHJ6 ... CSSYTSSSTRVF IGH + IGL T + T IgM Single + Single Single 125_37 IGHD3-10 18_4_1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
vdj_v1_hs_pbmc3_TCTTCGGTCCTAAGTG 15_8_1 653 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-59 IGKV1-12 IGHJ4 ... CQQANSFPLTF IGH + IGK T + T IgM Single + Single Single 348_483 IGHD6-19 15_8_1
vdj_v1_hs_pbmc3_TGCACCTCAGACAAAT 69_8_1 189 vdj_v1_hs_pbmc3 IGH IGK T T IGHV3-21 IGKV3-20 IGHJ6 ... CQQYGSSPLFTF IGH + IGK T + T IgM Single + Single Single 731_484 IGHD3-3 69_8_1
vdj_v1_hs_pbmc3_TGTATTCTCTGTTGAG 90_7_2 713 vdj_v1_hs_pbmc3 IGH IGL T T IGHV3-48 IGLV2-14 IGHJ4 ... CSSYTSSSTRVF IGH + IGL T + T IgM Single + Single Single 229_485 IGHD3-3 90_7_2
vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT 172_4_2 372 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-34 IGKV1D-39|IGKV1-39 IGHJ3 ... CQQSYSTPRTF IGH + IGK T + T IgM Single + Multi_light_v Single 702_486 IGHD3-22 172_4_2
vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC 48_4_1_1|48_4_1_2 699|28 vdj_v1_hs_pbmc3 IGH IGL|IGL T T|F IGHV4-4 IGLV1-51|IGLV1-40 IGHJ5 ... CQSYDRSLGGHYVF|CGTWDSSLSAGCA IGH + IGL|IGL T + T|F IgM Single + Multi_light_j|Multi_light_v Multi 155_487 IGHD4-17|IGHD4-23 48_4_1

838 rows × 28 columns
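
To illustrate what a deep copy means in practice, here is a minimal sketch using only the standard library. ToyContainer is a hypothetical stand-in, not the Dandelion class, and whether Dandelion's copy is implemented with copy.deepcopy is an assumption; the point is only that edits to the copy do not leak back into the original.

```python
import copy


class ToyContainer:
    """Hypothetical stand-in for an object with .data and .metadata slots."""

    def __init__(self, data, metadata):
        self.data = data
        self.metadata = metadata

    def copy(self):
        # Deep copy: nested containers are duplicated rather than shared.
        return copy.deepcopy(self)


original = ToyContainer(
    data=[{"sequence_id": "contig_1", "cell_id": "cell_A"}],
    metadata={"cell_A": {"isotype": "IgM"}},
)
duplicate = original.copy()
duplicate.metadata["cell_A"]["isotype"] = "IgG"  # edit only the copy

print(original.metadata["cell_A"]["isotype"])   # IgM
print(duplicate.metadata["cell_A"]["isotype"])  # IgG
```

A shallow copy would have shared the nested dictionaries, so the edit would have shown up in both objects.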

Retrieving entries with update_metadata

The .metadata slot of the Dandelion class is initialized automatically whenever the .data slot is filled. However, it only contains a pre-specified set of standard columns. To retrieve other columns from the .data slot, update the metadata with ddl.update_metadata and name the column in the retrieve option.

The following options determine how the retrieval is completed:

split - splits the retrieval into heavy and light chain calls.

split_locus - similar to split, but splits the retrieval by locus (IGH/IGK/IGL).

collapse - joins every element, using a | as the separator.

combine - similar to collapse, but retains only unique elements (separated by a | if multiple are found).
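
The difference between these options is easiest to see on a toy example. The sketch below uses plain Python and invented contig rows for a single cell; the helper functions are hypothetical illustrations of the behaviour described above and are not dandelion's implementation (in particular, the first-seen ordering in combine is an assumption).

```python
# Toy contig rows for one cell; the values are invented for illustration.
contigs = [
    {"locus": "IGH", "j_call": "IGHJ4"},
    {"locus": "IGL", "j_call": "IGLJ2"},
    {"locus": "IGL", "j_call": "IGLJ2"},
]


def collapse(values):
    """Join every element with '|', keeping duplicates."""
    return "|".join(values)


def combine(values):
    """Join only the unique elements with '|', in first-seen order."""
    return "|".join(dict.fromkeys(values))


def split_heavy_light(rows, column):
    """Group a column's values into heavy (IGH) and light (IGK/IGL) chains."""
    heavy = [row[column] for row in rows if row["locus"] == "IGH"]
    light = [row[column] for row in rows if row["locus"] in ("IGK", "IGL")]
    return heavy, light


heavy, light = split_heavy_light(contigs, "j_call")
print(collapse(light))  # IGLJ2|IGLJ2
print(combine(light))   # IGLJ2
print(combine(heavy))   # IGHJ4
```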

Example 1 : retrieving D gene calls

[5]:
ddl.update_metadata(vdj, retrieve = 'd_call')
vdj
[5]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id', 'clone_id_heavy_only'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id', 'd_call_heavy', 'd_call_light', 'clone_id_heavy_only'
    distance: 'heavy', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices

Note the additional d_call_heavy and d_call_light columns in the metadata slot.

By default, dandelion will not try to merge numerical columns, as doing so can create mixed dtype columns.
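
As a rough illustration of the mixed dtype problem, the sketch below merges invented per-cell UMI counts with a hypothetical naive_merge helper (not a dandelion function). Cells with a single contig keep a number, while cells with several contigs end up with a |-separated string. The metadata shown above avoids this by keeping separate numbered columns such as umi_count_light_0 and umi_count_light_1.

```python
# Invented UMI counts per cell: cell_B has two light chain contigs.
umi_counts_per_cell = {"cell_A": [12], "cell_B": [7, 3]}


def naive_merge(values):
    """Keep a lone value numeric; join multiple values into a '|' string."""
    if len(values) == 1:
        return values[0]
    return "|".join(str(value) for value in values)


merged = {cell: naive_merge(values) for cell, values in umi_counts_per_cell.items()}
dtypes = sorted({type(value).__name__ for value in merged.values()})

print(merged)  # {'cell_A': 12, 'cell_B': '7|3'}
print(dtypes)  # ['int', 'str']
```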

Example 2 : editing clone_id column

Perhaps you want more control over how clones are called. The clone assignments can be edited directly in the .data slot and then retrieved into the metadata.

[6]:
# if we only want to keep the heavy chain clone assignment
clones = []
for clone in vdj.data['clone_id']:
    if '|' in clone: # clones are merged into the same column (separated by |) if they have different pairings of BCR chains
        clone_list = clone.split('|')
        clones.append('|'.join(list(set([clone_2.rsplit('_', 1)[0] if clone_2.count('_') == 3 else clone_2 for clone_2 in clone_list]))))
    else:
        if clone.count('_') == 3: # this means it's looking for X_X_X_X, 3 underscores
            clones.append(clone.rsplit('_', 1)[0]) # split at the last (3rd) underscore and keep only the part before it
        else:
            clones.append(clone)
vdj.data['clone_id_heavy_only'] = clones
ddl.update_metadata(vdj, retrieve = 'clone_id_heavy_only', split = False, collapse = True)
vdj.metadata[['clone_id', 'clone_id_heavy_only']]
[6]:
clone_id clone_id_heavy_only
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC 102_3_1 102_3_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG 141_4_1 141_4_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC 26_2_2 26_2_2
sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT 66_8_3 66_8_3
sc5p_v2_hs_PBMC_10k_AACCATGCAAGCTGTT 18_4_1 18_4_1
... ... ...
vdj_v1_hs_pbmc3_TCTTCGGTCCTAAGTG 15_8_1 15_8_1
vdj_v1_hs_pbmc3_TGCACCTCAGACAAAT 69_8_1 69_8_1
vdj_v1_hs_pbmc3_TGTATTCTCTGTTGAG 90_7_2 90_7_2
vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT 172_4_2 172_4_2
vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC 48_4_1_1|48_4_1_2 48_4_1

838 rows × 2 columns
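
The same string logic can be written as a small standalone function, which makes it easy to check on a few ids before touching the .data slot. This is a sketch in plain Python: heavy_only is a hypothetical helper name, and the third test id is invented to show the case where two different clones remain after trimming.

```python
def heavy_only(clone_id):
    """Drop the fourth underscore-separated field from each clone id."""
    parts = []
    for single in clone_id.split("|"):
        if single.count("_") == 3:  # ids of the form X_X_X_X
            single = single.rsplit("_", 1)[0]
        if single not in parts:  # keep unique ids, in first-seen order
            parts.append(single)
    return "|".join(parts)


for clone in ["102_3_1", "48_4_1_1|48_4_1_2", "48_4_1_1|49_2_1_1"]:
    print(clone, "->", heavy_only(clone))
# 102_3_1 -> 102_3_1
# 48_4_1_1|48_4_1_2 -> 48_4_1
# 48_4_1_1|49_2_1_1 -> 48_4_1|49_2_1
```

Unlike the set-based version in the cell above, this keeps the ids in a deterministic order, which can matter when the merged strings are later compared or grouped.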

concatenating multiple objects

ddl.concat is a simple function that concatenates (appends) two or more Dandelion objects or pandas DataFrames.

[7]:
# for example, the original dandelion class has 838 unique cell barcodes and 1700 contigs
vdj
[7]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id', 'clone_id_heavy_only'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id', 'd_call_heavy', 'd_call_light', 'clone_id_heavy_only'
    distance: 'heavy', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices
[8]:
# now it has 5100 contigs instead, and the metadata should also be properly populated
vdj_concat = ddl.concat([vdj, vdj, vdj])
vdj_concat
[8]:
Dandelion class object with n_obs = 838 and n_contigs = 5100
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id', 'clone_id_heavy_only'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_heavy_1', 'umi_count_heavy_2', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'umi_count_light_3', 'umi_count_light_4', 'umi_count_light_5', 'umi_count_light_6', 'umi_count_light_7', 'umi_count_light_8', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status'
    distance: None
    edges: None
    layout: None
    graph: None
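
The output above shows n_contigs tripling (1700 to 5100) while n_obs stays at 838, because the same cell barcodes appear in every copy. The following plain Python sketch reproduces that counting on an invented three-row table. concat_tables is a hypothetical helper and does not reflect how ddl.concat is implemented or how it treats repeated sequence ids.

```python
# Invented contig table: two cells, three contigs.
contigs = [
    {"sequence_id": "cell_A_contig_1", "cell_id": "cell_A"},
    {"sequence_id": "cell_A_contig_2", "cell_id": "cell_A"},
    {"sequence_id": "cell_B_contig_1", "cell_id": "cell_B"},
]


def concat_tables(tables):
    """Append contig tables row-wise into one new list."""
    combined = []
    for table in tables:
        # Copy each row so the input tables are left untouched.
        combined.extend(dict(row) for row in table)
    return combined


combined = concat_tables([contigs, contigs, contigs])
n_contigs = len(combined)
n_obs = len({row["cell_id"] for row in combined})

print(n_contigs, n_obs)  # 9 3
```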

read/write

A Dandelion object can be saved with the .write_h5 and .write_pkl functions, each with accompanying compression options. write_h5 primarily uses pandas' to_hdf method, and write_pkl simply uses pickle. The read_h5 and read_pkl functions read the respective file formats.

[9]:
%time vdj.write_h5('dandelion_results.h5', complib = 'bzip2')
CPU times: user 1.53 s, sys: 65.7 ms, total: 1.59 s
Wall time: 1.64 s
[10]:
%time vdj_1 = ddl.read_h5('dandelion_results.h5')
vdj_1
CPU times: user 564 ms, sys: 54.6 ms, total: 619 ms
Wall time: 631 ms
[10]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id', 'clone_id_heavy_only'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id', 'd_call_heavy', 'd_call_light', 'clone_id_heavy_only'
    distance: 'heavy', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices

Depending on the situation and on which compression is used, reading and writing with pickle can be faster or slower than with h5, and the resulting files can be smaller or larger.

[11]:
%time vdj.write_pkl('dandelion_results.pkl.gz')
CPU times: user 9.14 s, sys: 68 ms, total: 9.21 s
Wall time: 9.41 s
[12]:
%time vdj_2 = ddl.read_pkl('dandelion_results.pkl.gz')
vdj_2
CPU times: user 89.9 ms, sys: 9.16 ms, total: 99.1 ms
Wall time: 106 ms
[12]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id', 'clone_id_heavy_only'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id', 'd_call_heavy', 'd_call_light', 'clone_id_heavy_only'
    distance: 'heavy', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices
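
For readers unfamiliar with compressed pickles, this is a minimal standard-library sketch of a gzip plus pickle round trip on an invented payload. It is not the code behind write_pkl or read_pkl, and the file name and payload are made up for illustration.

```python
import gzip
import os
import pickle
import tempfile

# Invented payload standing in for an object with two slots.
payload = {
    "data": [{"sequence_id": "contig_1", "cell_id": "cell_A"}],
    "metadata": {"cell_A": {"isotype": "IgM"}},
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_results.pkl.gz")

    # Write: pickle the object through a gzip-compressed file handle.
    with gzip.open(path, "wb") as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read: decompress and unpickle in one pass.
    with gzip.open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == payload)  # True
```

As with any pickle file, only load files from sources you trust, since unpickling can execute arbitrary code.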