BCR clustering and visualization

Now that we have both 1) pre-processed BCR data and 2) a matching AnnData object, we can start finding clones and ‘integrate’ the results. All the BCR analysis files can be saved in .tsv format so that they can be used in other tools such as immcantation, immunarch, vdjtools, etc.

There are many ways to identify BCR clones, almost all involving some measure of sequence similarity. There are also many well-established guidelines and criteria maintained by the BCR community. For example, immcantation uses a number of model-based methods to group clones based on the distribution of length-normalised junctional Hamming distance, while rbr1 developed a method that uses the whole BCR VDJ sequence to define clones, as shown in this recent paper. While these methods have mainly been applied to bulk BCR-seq protocols, they are biologically grounded and should be applicable to single cells.

Import modules

[1]:
import os
import pandas as pd
os.chdir(os.path.expanduser('/Users/kt16/Documents/Github/dandelion'))
import dandelion as ddl
# change directory to somewhere more workable
os.chdir(os.path.expanduser('/Users/kt16/Downloads/dandelion_tutorial/'))
# I'm importing scanpy here to make use of its logging module.
import scanpy as sc
sc.settings.verbosity = 3
import warnings
warnings.filterwarnings('ignore')

Read in the previously saved files

I will work with the same example from the previous notebook since I have the AnnData object saved and vdj table filtered.

[2]:
adata = sc.read_h5ad('adata.h5ad')
adata
[2]:
AnnData object with n_obs × n_vars = 16492 × 1497
    obs: 'sampleid', 'batch', 'scrublet_score', 'n_genes', 'percent_mito', 'n_counts', 'is_doublet', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'leiden'
    var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'bcr_QC_pass_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
[3]:
vdj = ddl.read_h5('dandelion_results.h5')
vdj
[3]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count'
    metadata: 'sample_id', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light'
    distance: None
    edges: None
    layout: None
    graph: None

Quick primer to the Dandelion class

So far, we have been operating with the Dandelion class:

[4]:
vdj
[4]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count'
    metadata: 'sample_id', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light'
    distance: None
    edges: None
    layout: None
    graph: None

Essentially, the .data slot holds the AIRR contig table (one row per contig) while the .metadata slot holds a collapsed, cell-indexed version that can be combined with AnnData’s .obs slot. The other slots will be gradually filled as we go through the notebook. You can retrieve these slots as you would the attributes of any class object; for example, if I want the metadata:

[5]:
vdj.metadata
[5]:
cell_id | sample_id | isotype | lightchain | status | vdj_status | productive | umi_counts_heavy | umi_counts_light | v_call_heavy | v_call_light | j_call_heavy | j_call_light | c_call_heavy | c_call_light
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG | sc5p_v2_hs_PBMC_1k | IgM | IgL | IGH + IGL | Single | T | 17 | 34.0 | IGHV3-15 | IGLV3-1 | IGHJ4 | IGLJ3,IGLJ2 | IGHM | IGLC
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 25 | 36.0 | IGHV3-30 | IGKV3-20 | IGHJ6 | IGKJ3 | IGHM | IGKC
sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 41 | 41.0 | IGHV1-18 | IGKV4-1 | IGHJ4 | IGKJ5 | IGHM | IGKC
sc5p_v2_hs_PBMC_1k_ACACCGGGTTATTCTC | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 24 | 37.0 | IGHV3-23 | IGKV1-8 | IGHJ4 | IGKJ1 | IGHM | IGKC
sc5p_v2_hs_PBMC_1k_ACCTTTAAGACAAAGG | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 34 | 53.0 | IGHV1-18 | IGKV1-39,IGKV1D-39 | IGHJ4 | IGKJ2 | IGHM | IGKC
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
vdj_nextgem_hs_pbmc3_TTGCGTCTCGTAGGTT | vdj_nextgem_hs_pbmc3 | IgM | IgK | IGH + IGK | Single | T | 14 | 65.0 | IGHV4-59 | IGKV3-11 | IGHJ6 | IGKJ5 | IGHM | IGKC
vdj_nextgem_hs_pbmc3_TTTATGCTCCGCATAA | vdj_nextgem_hs_pbmc3 | IgM | IgL | IGH + IGL | Single | T | 31 | 38.0 | IGHV3-7 | IGLV1-51 | IGHJ6 | IGLJ3,IGLJ2 | IGHM | IGLC
vdj_nextgem_hs_pbmc3_TTTATGCTCTGATTCT | vdj_nextgem_hs_pbmc3 | IgM | IgL | IGH + IGL | Single | T | 16 | 4.0 | IGHV2-26 | IGLV9-49 | IGHJ4 | IGLJ3 | IGHM | IGLC
vdj_nextgem_hs_pbmc3_TTTGCGCCACCATGTA | vdj_nextgem_hs_pbmc3 | IgM | IgK | IGH + IGK | Single | T | 30 | 38.0 | IGHV1-18 | IGKV1-27 | IGHJ4 | IGKJ4 | IGHM | IGKC
vdj_nextgem_hs_pbmc3_TTTGCGCTCCAGTATG | vdj_nextgem_hs_pbmc3 | IgM | IgL | IGH + IGL | Single | T | 40 | 83.0 | IGHV2-5 | IGLV1-47 | IGHJ4 | IGLJ3,IGLJ2 | IGHM | IGLC

838 rows × 14 columns

You can deep copy the Dandelion object to another variable which will inherit all slots:

[6]:
vdj2 = vdj.copy()
vdj2.metadata
[6]:
cell_id | sample_id | isotype | lightchain | status | vdj_status | productive | umi_counts_heavy | umi_counts_light | v_call_heavy | v_call_light | j_call_heavy | j_call_light | c_call_heavy | c_call_light
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG | sc5p_v2_hs_PBMC_1k | IgM | IgL | IGH + IGL | Single | T | 17 | 34.0 | IGHV3-15 | IGLV3-1 | IGHJ4 | IGLJ3,IGLJ2 | IGHM | IGLC
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 25 | 36.0 | IGHV3-30 | IGKV3-20 | IGHJ6 | IGKJ3 | IGHM | IGKC
sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 41 | 41.0 | IGHV1-18 | IGKV4-1 | IGHJ4 | IGKJ5 | IGHM | IGKC
sc5p_v2_hs_PBMC_1k_ACACCGGGTTATTCTC | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 24 | 37.0 | IGHV3-23 | IGKV1-8 | IGHJ4 | IGKJ1 | IGHM | IGKC
sc5p_v2_hs_PBMC_1k_ACCTTTAAGACAAAGG | sc5p_v2_hs_PBMC_1k | IgM | IgK | IGH + IGK | Single | T | 34 | 53.0 | IGHV1-18 | IGKV1-39,IGKV1D-39 | IGHJ4 | IGKJ2 | IGHM | IGKC
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
vdj_nextgem_hs_pbmc3_TTGCGTCTCGTAGGTT | vdj_nextgem_hs_pbmc3 | IgM | IgK | IGH + IGK | Single | T | 14 | 65.0 | IGHV4-59 | IGKV3-11 | IGHJ6 | IGKJ5 | IGHM | IGKC
vdj_nextgem_hs_pbmc3_TTTATGCTCCGCATAA | vdj_nextgem_hs_pbmc3 | IgM | IgL | IGH + IGL | Single | T | 31 | 38.0 | IGHV3-7 | IGLV1-51 | IGHJ6 | IGLJ3,IGLJ2 | IGHM | IGLC
vdj_nextgem_hs_pbmc3_TTTATGCTCTGATTCT | vdj_nextgem_hs_pbmc3 | IgM | IgL | IGH + IGL | Single | T | 16 | 4.0 | IGHV2-26 | IGLV9-49 | IGHJ4 | IGLJ3 | IGHM | IGLC
vdj_nextgem_hs_pbmc3_TTTGCGCCACCATGTA | vdj_nextgem_hs_pbmc3 | IgM | IgK | IGH + IGK | Single | T | 30 | 38.0 | IGHV1-18 | IGKV1-27 | IGHJ4 | IGKJ4 | IGHM | IGKC
vdj_nextgem_hs_pbmc3_TTTGCGCTCCAGTATG | vdj_nextgem_hs_pbmc3 | IgM | IgL | IGH + IGL | Single | T | 40 | 83.0 | IGHV2-5 | IGLV1-47 | IGHJ4 | IGLJ3,IGLJ2 | IGHM | IGLC

838 rows × 14 columns

Updating metadata

The .metadata slot in the Dandelion class is initialized automatically whenever the .data slot is filled. However, it only contains a standard, pre-specified set of columns. To retrieve other columns from the .data slot, we can update the metadata with ddl.update_metadata and specify the option retrieve. You can also pass collapse = True and split_heavy_light = False if the heavy and light chain columns can be reasonably combined for that cell/barcode.

Example 1 : retrieving junction amino acid sequences

[7]:
ddl.update_metadata(vdj, retrieve = 'junction_aa')
vdj
[7]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count'
    metadata: 'sample_id', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'junction_aa_heavy', 'junction_aa_light'
    distance: None
    edges: None
    layout: None
    graph: None

Note the additional junction_aa_heavy and junction_aa_light columns in the metadata slot. A second example follows later in this notebook.
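To make the idea of collapsing a contig-level table into a cell-level table more concrete, here is a minimal pure-Python sketch. It is not dandelion's implementation: the function collapse_to_cells, the toy contigs and the '|' separator are all assumptions made for illustration. It only mimics the naming pattern seen above, where a retrieved column is split into a heavy and a light version, or kept as a single combined column.

```python
from collections import defaultdict

# Hypothetical contig-level rows, loosely modelled on the .data slot.
contigs = [
    {"cell_id": "cell_A", "locus": "IGH", "junction_aa": "CARDYW"},
    {"cell_id": "cell_A", "locus": "IGK", "junction_aa": "CQQYNSYPLTF"},
    {"cell_id": "cell_B", "locus": "IGH", "junction_aa": "CTTEALDYW"},
    {"cell_id": "cell_B", "locus": "IGL", "junction_aa": "CQAWDSSNVVF"},
]


def collapse_to_cells(rows, column, split_heavy_light=True):
    """Collapse contig rows into one record per cell (illustration only)."""
    per_cell = defaultdict(lambda: defaultdict(list))
    for row in rows:
        chain = "heavy" if row["locus"] == "IGH" else "light"
        key = f"{column}_{chain}" if split_heavy_light else column
        values = per_cell[row["cell_id"]][key]
        if row[column] not in values:  # keep unique values, in order of appearance
            values.append(row[column])
    # Join multiple values with '|' (an arbitrary separator for this sketch).
    return {
        cell: {key: "|".join(vals) for key, vals in cols.items()}
        for cell, cols in per_cell.items()
    }


split = collapse_to_cells(contigs, "junction_aa")
combined = collapse_to_cells(contigs, "locus", split_heavy_light=False)
print(split["cell_A"])
print(combined["cell_B"])
```

The split version yields one column per chain type, while the combined version yields a single column per cell, which only makes sense when the heavy and light chain values can reasonably share a column.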

Saving

As described in the previous notebook, a Dandelion object can be saved using the .write_h5 and .write_pkl functions, with accompanying compression methods.

Finding clones

The following is dandelion’s implementation of a rather conventional method to define clones, tl.find_clones.

Clone definition is based on the following criteria:

(1) Identical IGH V-J gene usage.

(2) Identical CDR3 junctional sequence length.

(3) CDR3 junctional sequences attain a minimum percentage of sequence similarity, based on Hamming distance. The similarity cut-off is tunable (default is 85%).

(4) Light chain usage. If cells within a clone use different light chains, the clone will be split following the same conditions as for heavy chains in (1-3) above.

The ‘clone_id’ name follows a {A}_{B}_{C}_{D} format and largely reflects the conditions above where:

{A} indicates if the contigs use the same IGH V/J genes.

{B} indicates if IGH junctional sequences are equal in length.

{C} indicates if clones are split based on the junctional Hamming distance threshold.

{D} indicates light chain pairing.

The last position will not be annotated if there’s only one group of light chains usage detected in the clone.
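The heavy chain criteria (1-3) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not dandelion's actual code: the function names, the toy cells and the use of single-linkage grouping are my own choices, and the light chain refinement in (4) is left out. The labels follow an {A}_{B}_{C} pattern only to echo the format described above.

```python
from collections import defaultdict


def hamming_similarity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_heavy_chain_clones(cells, min_similarity=0.85):
    """Label cells by V/J usage, junction length and junction similarity."""
    # Criteria (1) and (2): bucket by V/J genes, then by junction length.
    by_vj = defaultdict(lambda: defaultdict(list))
    for cell_id, (v_gene, j_gene, junction) in cells.items():
        by_vj[(v_gene, j_gene)][len(junction)].append(cell_id)

    labels = {}
    for a, vj in enumerate(sorted(by_vj), start=1):
        for b, length in enumerate(sorted(by_vj[vj]), start=1):
            # Criterion (3): single-linkage grouping on junction similarity.
            clusters = []
            for cell_id in by_vj[vj][length]:
                junction = cells[cell_id][2]
                linked = [c for c in clusters if any(
                    hamming_similarity(junction, cells[m][2]) >= min_similarity
                    for m in c)]
                clusters = [c for c in clusters if c not in linked]
                clusters.append([cell_id] + [m for c in linked for m in c])
            for c_index, cluster in enumerate(clusters, start=1):
                for cell_id in cluster:
                    labels[cell_id] = f"{a}_{b}_{c_index}"
    return labels


# Hypothetical cells: (heavy V gene, heavy J gene, junction amino acids).
cells = {
    "cell_1": ("IGHV3-15", "IGHJ4", "CARDYGANSFDYW"),
    "cell_2": ("IGHV3-15", "IGHJ4", "CARDYGANSFDFW"),  # one mismatch vs cell_1
    "cell_3": ("IGHV3-15", "IGHJ4", "CTTEALRSPNFDY"),  # same length, dissimilar
    "cell_4": ("IGHV1-18", "IGHJ4", "CARDYGANSFDYW"),  # different V gene
}
labels = assign_heavy_chain_clones(cells)
print(labels)
```

In this toy input, cell_1 and cell_2 share a label because their junctions differ at one of 13 positions (about 92% similarity), whereas cell_3 and cell_4 each end up in their own clone.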

Running tl.find_clones

The function will take a file path, a pandas DataFrame (for example, if you’ve already used pandas to read in the filtered file), or a Dandelion class object (described above). By default, the junctional Hamming distance is calculated on the junction amino acid sequences. To use nucleotide sequences instead, specify the option:

clustering_by = 'nt'

If you want to use the alleles for defining V-J gene usage, specify:

by_alleles = True
[8]:
ddl.tl.find_clones(vdj)
Finding clones
Finding clones based on heavy chains : 100%|██████████| 176/176 [00:00<00:00, 2060.53it/s]
Refining clone assignment based on light chain pairing : 100%|██████████| 819/819 [00:00<00:00, 2126.98it/s]
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:01)
[9]:
vdj
[9]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id'
    metadata: 'sample_id', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light'
    distance: None
    edges: None
    layout: None
    graph: None

This adds a new column named 'clone_id', as per convention. If a file path is provided as input, the result will also be saved automatically into the base directory of the input file. Otherwise, if a Dandelion object or pandas.DataFrame is provided, it will just be returned as a Dandelion object.

Updating metadata example 2 : editing clone_id column

Perhaps you want a bit more control over how clones are called. We can edit the clone calls directly in the .data slot and then retrieve them into the metadata.

[10]:
# using a list comprehension to remove the light chain clone assignment
vdj.data['clone_id_heavyonly'] = [x.rsplit('_', 1)[0] if x.count('_') == 3 else x for x in vdj.data['clone_id']]
ddl.update_metadata(vdj, retrieve = 'clone_id_heavyonly', split_heavy_light = False, collapse = True)
vdj
[10]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'clone_id_heavyonly'
    metadata: 'sample_id', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'clone_id_heavyonly'
    distance: None
    edges: None
    layout: None
    graph: None
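To see what the list comprehension in the cell above does to individual values, here is a small stand-alone check. The helper name drop_light_chain_suffix and the four-part id '23_6_2_1' are made up for illustration; the sketch also assumes every clone_id is a string.

```python
def drop_light_chain_suffix(clone_id):
    """Remove the trailing light chain field from a four-part clone_id."""
    # Three underscores means the id has all four {A}_{B}_{C}_{D} fields.
    return clone_id.rsplit("_", 1)[0] if clone_id.count("_") == 3 else clone_id


examples = ["121_8_3", "23_6_2_1"]
trimmed = [drop_light_chain_suffix(x) for x in examples]
print(trimmed)  # ['121_8_3', '23_6_2']
```

Ids that already lack the light chain field are returned unchanged, so applying the function twice gives the same result as applying it once.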

Alternative : Running tl.define_clones

Alternatively, a wrapper to call changeo’s DefineClones.py is also included. To run it, you need to choose the distance threshold for clonal assignment. To facilitate this, the function pp.calculate_threshold will run shazam’s distToNearest function and return a plot showing the length-normalized Hamming distance distribution and the automated threshold value. Again, pp.calculate_threshold will take a file path, pandas DataFrame or Dandelion object as input. If a Dandelion object is provided, the threshold value will be inserted into the .threshold slot. For finer control, please use DefineClones.py directly.

[11]:
ddl.pp.calculate_threshold(vdj)
Calculating threshold
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_26_1.png
<ggplot: (371799273)>
 finished: Updated Dandelion object:
   'threshold', threshold value for tuning clonal assignment
 (0:00:32)
[12]:
# see the actual value in .threshold slot
vdj.threshold
[12]:
0.21946248392024584

You can also manually select a value as the threshold if you wish.

[13]:
ddl.pp.calculate_threshold(vdj, manual_threshold = 0.1)
Calculating threshold
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_29_1.png
<ggplot: (371001341)>
 finished: Updated Dandelion object:
   'threshold', threshold value for tuning clonal assignment
 (0:00:34)
[14]:
# see the updated .threshold slot
vdj.threshold
[14]:
0.1
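The quantity behind the plots above is, for each sequence, the length-normalized Hamming distance to its nearest neighbour. The following is a rough pure-Python sketch of that idea only. It is not shazam's distToNearest: it ignores V/J grouping and the model options, the junctions are hypothetical, and no automated threshold is fitted; the value 0.1 is simply reused from the manual example.

```python
def normalised_hamming(a, b):
    """Hamming distance divided by sequence length (equal-length inputs)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_to_nearest(junctions):
    """Smallest normalised distance from each junction to another of equal length."""
    nearest = {}
    for seq in junctions:
        others = [normalised_hamming(seq, other) for other in junctions
                  if other != seq and len(other) == len(seq)]
        nearest[seq] = min(others) if others else None  # None: no comparable sequence
    return nearest


# Hypothetical nucleotide junctions (unique sequences only).
junctions = ["TGTGCGAGAGAT", "TGTGCGAGAGAC", "TGTACCACAGAA", "TGTCAGCAG"]
nearest = distance_to_nearest(junctions)

threshold = 0.1
close = [seq for seq, dist in nearest.items() if dist is not None and dist <= threshold]
print(nearest)
print(close)
```

Sequences whose nearest-neighbour distance falls at or below the threshold are candidates for sharing a clone with that neighbour; those above it are treated as unrelated.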

We can run tl.define_clones to call changeo’s DefineClones.py; see here for more info. Note that if a pandas.DataFrame or file path is provided as the input, the value for the dist option (which corresponds to the threshold value) needs to be supplied manually. If a Dandelion object is provided, the value is retrieved automatically from the .threshold slot.

[15]:
ddl.tl.define_clones(vdj, key_added = 'changeo_clone_id')
Finding clones
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:05)

Note that I specified the option key_added, which writes the output from tl.define_clones into a separate column. If left as the default (None), the output is written into the clone_id column. The same option can be specified in tl.find_clones earlier. If a clone_id column already exists, the group column for the new key will not be computed; if you want the grouped column, simply run tl.define_clones without key_added to replace the existing clone_id and clone_id_group columns.

Visualization of BCR network

dandelion generates a network to facilitate visualisation of results. This uses the full V(D)J contig sequences instead of just the junctional sequences to chart a tree-like network for each clone. The actual visualization will be achieved through scanpy later.

tl.generate_network

First we need to generate the network. The tool function tl.generate_network will take a V(D)J table that has clones defined, specifically under the 'clone_id' column. The default mode is to use amino acid sequences for constructing Levenshtein distance matrices.

If you have a pre-processed table parsed with immcantation’s method, or any other method, it can be used as well, as long as it is in AIRR format.

You can specify the clone_key option for generating the network for the clone id definition of choice as long as it exists as a column in the .data slot.
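For readers unfamiliar with Levenshtein distance, the sketch below computes a pairwise distance matrix for a few short amino acid strings. It is a textbook dynamic-programming implementation written for illustration, not the routine dandelion calls internally, and the short strings merely stand in for full V(D)J sequences.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def distance_matrix(sequences):
    """Symmetric matrix of pairwise Levenshtein distances."""
    return [[levenshtein(a, b) for b in sequences] for a in sequences]


# Toy amino acid strings; the second differs from the first by one deletion.
sequences = ["QAWDSSNVV", "QAWDSSVV", "QQYGSSPPFT"]
matrix = distance_matrix(sequences)
for row in matrix:
    print(row)
```

Unlike Hamming distance, Levenshtein distance is defined for sequences of different lengths, which is why it suits comparisons of whole contig sequences.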

[16]:
ddl.tl.generate_network(vdj)
Generating network
Calculating distances... : 100%|██████████| 4/4 [00:04<00:00,  1.05s/it]
Generating edge list : 100%|██████████| 7/7 [00:00<00:00, 653.23it/s]
Linking edges : 100%|██████████| 821/821 [00:00<00:00, 4375.56it/s]
generating network layout
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
   'distance', heavy and light chain distance matrices
   'edges', network edges
   'layout', network layout
   'graph', network (0:00:11)

This step works reasonably fast here but will take quite a while when a lot of contigs are provided.

You can also downsample the number of cells. This will return a new object, a downsampled copy of the original with its own distance matrix.

[17]:
vdj_downsample = ddl.tl.generate_network(vdj, downsample = 500)
vdj_downsample
Generating network
Downsampling to 500 cells.
Calculating distances... : 100%|██████████| 4/4 [00:01<00:00,  2.09it/s]
Generating edge list : 100%|██████████| 4/4 [00:00<00:00, 526.16it/s]
Linking edges : 100%|██████████| 491/491 [00:00<00:00, 3719.38it/s]
generating network layout
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
   'distance', heavy and light chain distance matrices
   'edges', network edges
   'layout', network layout
   'graph', network (0:00:07)
[17]:
Dandelion class object with n_obs = 500 and n_contigs = 1015
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'clone_id_heavyonly', 'changeo_clone_id'
    metadata: 'sample_id', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'changeo_clone_id'
    distance: 'heavy', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 500 vertices, layout for 13 vertices
    graph: networkx graph of 500 vertices, networkx graph of 13 vertices

Check the newly re-initialized Dandelion object:

[18]:
vdj
[18]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'clone_id_heavyonly', 'changeo_clone_id'
    metadata: 'sample_id', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'changeo_clone_id'
    distance: 'heavy', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices

The graphs/networks can be accessed through the .graph slot as networkx graph objects if you want to extract the data for network statistics or make any changes to the network.

Integration with scanpy

The results can also be ported into the AnnData object for access to more plotting functions provided through scanpy.

To proceed, we first need to initialise the AnnData object with our network. This is done by using the tool function tl.transfer.

[19]:
ddl.tl.transfer(adata, vdj) # this will include singletons. To show only expanded clones, specify expanded_only=True
adata
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:21)
[19]:
AnnData object with n_obs × n_vars = 16492 × 1497
    obs: 'sampleid', 'batch', 'scrublet_score', 'n_genes', 'percent_mito', 'n_counts', 'is_doublet', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'leiden', 'sample_id', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'changeo_clone_id'
    var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'bcr_QC_pass_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'umap', 'rna_neighbors'
    obsm: 'X_pca', 'X_umap', 'X_bcr'
    varm: 'PCs'
    obsp: 'connectivities', 'distances', 'rna_connectivities', 'rna_distances', 'bcr_connectivities', 'bcr_distances'

You can see that the AnnData object now contains several more columns in the .obs slot, corresponding to the metadata returned after tl.generate_network, as well as newly populated .obsm and .obsp slots. The original RNA connectivities and distances are also kept in the .obsp slot, under the rna_ prefix.

Plotting in scanpy

pl.clone_network

We can now plot in scanpy using its plotting modules. I’ve included a plotting function in dandelion, pl.clone_network, which is really just a wrapper around scanpy’s pl.embedding function.

[20]:
sc.set_figure_params(figsize = [4,4])
ddl.pl.clone_network(adata,
                     color = ['sampleid'],
                     edges_width = 1,
                     size = 50)
... storing 'sample_id' as categorical
... storing 'clone_id' as categorical
... storing 'isotype' as categorical
... storing 'lightchain' as categorical
... storing 'status' as categorical
... storing 'vdj_status' as categorical
... storing 'productive' as categorical
... storing 'umi_counts_light' as categorical
... storing 'v_call_heavy' as categorical
... storing 'v_call_light' as categorical
... storing 'j_call_heavy' as categorical
... storing 'j_call_light' as categorical
... storing 'c_call_heavy' as categorical
... storing 'c_call_light' as categorical
... storing 'changeo_clone_id' as categorical
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_46_1.png

tl.extract_edge_weights

dandelion provides an edge weight extractor tool tl.extract_edge_weights to retrieve the edge weights that can be used to specify the edge widths according to weight/distance.

[21]:
edgeweights = [1/(e+1) for e in ddl.tl.extract_edge_weights(vdj)] # add 1 to each edge weight (e) so that distance of 0 becomes the thickest edge
ddl.pl.clone_network(adata,
                     color = ['isotype'],
                     legend_fontoutline=3,
                     edges_width = edgeweights,
                     size = 50)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_49_0.png
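The 1/(e+1) transform used in the cell above can be checked on a handful of made-up distances. The helper name and the input values below are hypothetical; only the formula is taken from the notebook.

```python
def distance_to_width(distance):
    """Map an edge distance to a line width; distance 0 gives the thickest edge."""
    return 1 / (distance + 1)


distances = [0, 1, 3, 9]  # hypothetical edge weights (distances)
widths = [distance_to_width(d) for d in distances]
print(widths)  # [1.0, 0.5, 0.25, 0.1]
```

Adding 1 avoids division by zero for identical sequences and keeps every width in the interval (0, 1], so more similar sequences are joined by thicker edges.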

You can interact with pl.clone_network just as you would with the rest of the scatterplot modules in scanpy.

[22]:
from scanpy.plotting.palettes import default_28, default_102
sc.set_figure_params(figsize = [4,4])
# plot the 3 largest clones by size
ddl.pl.clone_network(adata, color = ['clone_id_by_size'], groups = ['1', '2', '3'], ncols = 2, legend_loc = 'on data', legend_fontoutline=3, edges_width = edgeweights, size = 50, palette = default_28)
WARNING: Length of palette colors is smaller than the number of categories (palette length: 28, categories length: 821. Some categories will have the same color.
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_51_1.png
[23]:
ddl.pl.clone_network(adata, color = ['status', 'vdj_status'], ncols = 1, legend_fontoutline=3, edges_width = 1, size = 50)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_52_0.png

By specifying expanded_only=True, you will transfer the graph of only expanded clones (> 1 cell in a clone).

[24]:
ddl.tl.transfer(adata, vdj, expanded_only = True)
edgeweights = [1/(e+1) for e in ddl.tl.extract_edge_weights(vdj)] # add 1 to each edge weight (e) so that distance of 0 becomes the thickest edge
ddl.pl.clone_network(adata, color = ['clone_id_by_size'], groups = ['1', '2', '3', '4', '5', '6', '7'], legend_loc = 'on data', legend_fontoutline=3, edges_width = edgeweights, size = 50)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:18)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_54_1.png
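
To make "expanded" concrete, here is a minimal sketch of picking out the cells that belong to clones with more than one cell. The cell names and clone assignments are made up, and this only illustrates the idea; it is not how tl.transfer filters the graph internally.

```python
from collections import Counter

# Hypothetical cell -> clone_id assignments.
cell_to_clone = {
    "cell1": "121_8_3",
    "cell2": "121_8_3",
    "cell3": "23_6_2",
    "cell4": "84_3_2",
    "cell5": "84_3_2",
}

clone_counts = Counter(cell_to_clone.values())

# A clone counts as "expanded" when it contains more than one cell.
expanded_cells = [
    cell for cell, clone in cell_to_clone.items() if clone_counts[clone] > 1
]
print(expanded_cells)
```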

Calculating mutational load

To calculate mutational load, I’ve ported the functions from the immcantation suite’s changeo and shazam to work with the Dandelion class object.

This can be run immediately after pp.reassign_alleles during pre-processing because the required germline columns should be present in the genotyped .tsv file. I would recommend running this after TIgGER, once the v_calls have been corrected.

But if you want to run it now instead, this can be achieved as follows:

[26]:
# let's recreate the vdj object with only the first two samples
subset_data = vdj.data[vdj.data['sample_id'].isin(['sc5p_v2_hs_PBMC_1k', 'sc5p_v2_hs_PBMC_10k'])]
subset_data
[26]:
sequence_id sequence rev_comp productive v_call d_call j_call sequence_alignment germline_alignment junction ... cdr2_aa cdr3_aa sequence_alignment_aa v_sequence_alignment_aa d_sequence_alignment_aa j_sequence_alignment_aa mu_freq duplicate_count clone_id changeo_clone_id
sequence_id
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_1 sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_1 CTGGGCCTCAGGAAGCAGCATCGGAGGTGCCTCAGCCATGGCATGG... F T IGLV3-1*01 IGLJ2*01,IGLJ3*01 TCCTATGAGCTGACTCAGCCACCCTCA...GTGTCCGTGTCCCCAG... TCCTATGAGCTGACTCAGCCACCCTCA...GTGTCCGTGTCCCCAG... TGTCAGGCGTGGGACAGCAGCAATGTGGTATTC ... QDN QAWDSSNVV SYELTQPPSVSVSPGQTASITCSGDKLGHKYACWYQQKPGQSPVLV... SYELTQPPSVSVSPGQTASITCSGDKLGHKYACWYQQKPGQSPVLV... VFGGGTKLTVL 0.009434 0 121_8_3 57_0
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_2 sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_2 AGCTCTGGGAGAGGAGCCCCAGCCTTGGGATTCCCAAGTGTTTTCA... F T IGHV3-15*01 IGHD4-23*01 IGHJ4*02 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTAAAGCCTG... GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTAAAGCCTG... TGTACCACAGAAGCTTTAGACTACGGTGCTAACTCGCGATCCCCGA... ... IKSNTDGATT TTEALDYGANSRSPNFDY EVQLVESGGGLVKPGGSLRLSCAASGFIFSNAWMSWVRQAPGKGLE... EVQLVESGGGLVKPGGSLRLSCAASGFIFSNAWMSWVRQAPGKGLE... DYGANS FDYWGQGTLVTVSS 0.008671 0 121_8_3 57_0
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_1 sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_1 AGAGCTCTGGAGAAGAGCTGCTCAGTTAGGACCCAGAGGGAACCAT... F T IGKV3-20*01 IGKJ3*01 GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... TGTCAGCAGTATGGTAGCTCACCTCCATTCACTTTC ... GAS QQYGSSPPFT EIVLTQSPGTLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPR... EIVLTQSPGTLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPR... FTFGPGTKVDIK 0.000000 0 23_6_2 76_1
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_2 sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_2 GAGCTCTGGGAGAGGAGCCCAGCACTAGAAGTCGGCGGTGTTTCCA... F T IGHV3-30*02,IGHV3-30-5*02 IGHD1-26*01 IGHJ6*02 CAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCGTGGTCCAGCCTG... CAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCGTGGTCCAGCCTG... TGTGCGAAAGATACTGAAGTGGGAGCGAGCCGATACTACTACTACT... ... IRYDGSNK AKDTEVGASRYYYYYGMDV QVQLVESGGGVVQPGGSLRLSCAASGFTFSSYGMHWVRQAPGKGLE... QVQLVESGGGVVQPGGSLRLSCAASGFTFSSYGMHWVRQAPGKGLE... VGA YYYYYGMDVWGQGTTVTVSS 0.000000 0 23_6_2 76_1
sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG_contig_1 sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG_contig_1 GAGCTACAACAGGCAGGCAGGGGCAGCAAGATGGTGTTGCAGACCC... F T IGKV4-1*01 IGKJ5*01 GACATCGTGATGACCCAGTCTCCAGACTCCCTGGCTGTGTCTCTGG... GACATCGTGATGACCCAGTCTCCAGACTCCCTGGCTGTGTCTCTGG... TGTCAGCAATATTATAGTACTCCGATCACCTTC ... WAS QQYYSTPIT DIVMTQSPDSLAVSLGERATINCKSSQSVLYSSNNKNYLAWYQQKP... DIVMTQSPDSLAVSLGERATINCKSSQSVLYSSNNKNYLAWYQQKP... ITFGQGTRLEIK 0.000000 0 84_3_2 184_2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
sc5p_v2_hs_PBMC_10k_TTTCCTCTCTTCAACT_contig_1 sc5p_v2_hs_PBMC_10k_TTTCCTCTCTTCAACT_contig_1 TTTCCTCTCTTCAACTGCGAACCGACTTTCTGCGATGGGGACTCAA... F T IGHV1-8*01 IGHD2-2*02 IGHJ6*02 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... TGTGCGAGATATTGTAGTAGTACCAGCTGCTATACGACCTATTACT... ... MNPNSGNT ARYCSSTSCYTTYYYYYYGMDV QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYDINWVRQATGQGLE... QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYDINWVRQATGQGLE... YCSSTSCYT YYYYYGMDVWGQGTTVTVSS 0.000000 0 28_6_2 757_448
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_2 sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_2 GGGAGAGCCCTGGGGAGGAACTGCTCAGTTAGGACCCAGAGGGAAC... F T IGKV3-11*01 IGKJ5*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... TGTCAGCAGCGTAGCAACTGGCTCACCTTC ... DAS QQRSNWLT EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... TFGQGTRLEIK 0.000000 0 159_2_1 425_449
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_1 sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_1 ACAACCACACCCCTCCTAAGAAGAAGCCCCTAGACCACAGCTCCAC... F T IGHV7-4-1*02 IGHD3-10*01 IGHJ5*02 CAGGTGCAGCTGGTGCAATCTGGGTCT...GAGTTGAAGAAGCCTG... CAGGTGCAGCTGGTGCAATCTGGGTCT...GAGTTGAAGAAGCCTG... TGTGCGAGAGTTTTTAGACGCTATGGTTCGGGGAGTTATTATAACC... ... INTNTGNP ARVFRRYGSGSYYNL QVQLVQSGSELKKPGASVKVSCKASGYTFTSYAMNWVRQAPGQGLE... QVQLVQSGSELKKPGASVKVSCKASGYTFTSYAMNWVRQAPGQGLE... YGSGSYY LWGQGTLVTVSS 0.003003 0 159_2_1 425_449
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_1 sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_1 GAGAGAGGAGCCTTAGCCCTGGATTCCAAGGCCTATCCACTTGGTG... F T IGHV3-21*01 IGHD4-17*01 IGHJ2*01 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG... GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG... TGTGCGAGAGATCCCGGTGACTACGTAGAAATTGAGTGGTACTTCG... ... ISSSSSYI ARDPGDYVEIEWYFDL EVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLE... EVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLE... GDY WYFDLWGRGTLVTVSS 0.000000 0 10_1_1 357_450
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_2 sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_2 GGGAGAGCCCTGGGGAGGAACTGCTCAGTTAGGACCCAGAGGGAAC... F T IGKV3-11*01 IGKJ4*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... TGTCAGCAGCGTAGCAACTGGCCTAGGCTCACTTTC ... DAS QQRSNWPRLT EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... LTFGGGTKVEIK 0.000000 0 10_1_1 357_450

926 rows × 84 columns

[27]:
# create a new Dandelion class with this subset
vdj2 = ddl.Dandelion(subset_data)
vdj2
[27]:
Dandelion class object with n_obs = 454 and n_contigs = 926
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id'
    metadata: 'sample_id', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light'
    distance: None
    edges: None
    layout: None
    graph: None
[28]:
# update the germline using the corrected files after tigger
vdj2.update_germline(corrected = 'tutorial_scgp1/tutorial_scgp1_heavy_igblast_db-pass_genotype.fasta', germline = None, org = 'human')
Updating germline reference
 finished: Updated Dandelion object:
   'germline', updated germline reference
 (0:00:00)

pp.create_germlines

Then we run pp.create_germlines to (re)create the germline_alignment_d_mask column in the data. If update_germline was run as above, there’s no need to specify the germline option as the function will simply retrieve it from the Dandelion object.

Note: running the original CreateGermlines.py with the --cloned option is currently not possible through pp.create_germlines(). It is possible with pp.external.creategermlines, but this requires a physical file for CreateGermlines.py to work on. Thus, I would recommend running CreateGermlines.py separately if you intend to use the --cloned option. See here for more info.

[29]:
ddl.pp.create_germlines(vdj2, v_field = 'v_call_genotyped', germ_types='dmask')
Reconstructing germline sequences
   Building dmask germline sequences: 926it [00:00, 967.80it/s]
 finished: Updated Dandelion object:
   'data', updated germline alignment in contig-indexed clone table
   'germline', updated germline reference
 (0:00:01)

Ensure that the germline_alignment_d_mask column is populated or subsequent steps will fail.

[30]:
vdj2.data[['v_call_genotyped', 'germline_alignment_d_mask']]
[30]:
v_call_genotyped germline_alignment_d_mask
sequence_id
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_1 IGLV3-1*01 TCCTATGAGCTGACTCAGCCACCCTCA...GTGTCCGTGTCCCCAG...
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_2 IGHV3-15*01 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTAAAGCCTG...
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_1 IGKV3-20*01 GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG...
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_2 IGHV3-30*02 CAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCGTGGTCCAGCCTG...
sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG_contig_1 IGKV4-1*01 GACATCGTGATGACCCAGTCTCCAGACTCCCTGGCTGTGTCTCTGG...
... ... ...
sc5p_v2_hs_PBMC_10k_TTTCCTCTCTTCAACT_contig_1 IGHV1-8*01 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_2 IGKV3-11*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_1 IGHV7-4-1*02 CAGGTGCAGCTGGTGCAATCTGGGTCT...GAGTTGAAGAAGCCTG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_1 IGHV3-21*01 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_2 IGKV3-11*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG...

926 rows × 2 columns

The default behaviour, set with the option germ_types = 'dmask', is to mask the D region with Ns. See here for more info.
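
As a rough picture of what masking with Ns means, here is a minimal sketch. It assumes the masked span is simply everything between the end of the V segment and the start of the J segment; the function name, the toy sequence and the coordinates are all hypothetical, and the real work is done by the ported CreateGermlines code.

```python
def mask_d_region(germline: str, v_end: int, j_start: int) -> str:
    """Replace germline[v_end:j_start] with Ns, keeping the length unchanged."""
    if not 0 <= v_end <= j_start <= len(germline):
        raise ValueError("expected 0 <= v_end <= j_start <= len(germline)")
    return germline[:v_end] + "N" * (j_start - v_end) + germline[j_start:]


# Toy germline: 9 bases of V, 12 bases to mask, 6 bases of J (made-up split).
masked = mask_d_region("TGTGCGAGAGATCCCGGTGACTACTGG", v_end=9, j_start=21)
print(masked)
```

Keeping the sequence length unchanged means the masked germline still lines up position by position with the observed sequence, and masked positions can simply be skipped when counting mutations.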

pp.quantify_mutations

The options for pp.quantify_mutations are the same as in the basic mutational load analysis vignette. The default behaviour is to sum all mutation scores (heavy and light chains, silent and replacement mutations) for the same cell.
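
For intuition only, here is a minimal sketch of a per-sequence mutation frequency: the fraction of comparable positions where the observed alignment differs from the germline. It is a simplification and not the ported shazam method (it does not separate silent from replacement mutations), and the sequences are made up.

```python
def mutation_frequency(observed: str, germline: str) -> float:
    """Fraction of comparable positions where observed differs from germline."""
    bases = set("ACGT")
    compared = 0
    mutated = 0
    for obs, germ in zip(observed.upper(), germline.upper()):
        # Skip gaps ('.', '-') and masked or ambiguous bases such as 'N'.
        if obs not in bases or germ not in bases:
            continue
        compared += 1
        if obs != germ:
            mutated += 1
    return mutated / compared if compared else 0.0


# Hypothetical heavy and light chain alignments for one cell.
heavy = mutation_frequency("GAGGTGCAGC...TGGTA", "GAGGTGCAGC...TGGTG")
light = mutation_frequency("TCCTATGAGC", "TCCTATGAGC")
print(heavy, light)
```

Here the heavy chain has 1 mismatch over 15 comparable positions and the light chain has none. How the per-contig values are combined per cell is handled by pp.quantify_mutations and is not reproduced in this sketch.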

Again, this function can be run immediately after pp.reassign_alleles on the genotyped .tsv files (without loading into pandas or Dandelion). Here I’m illustrating a few other options that may be useful.

[31]:
# switching back to using the full vdj object
ddl.pp.quantify_mutations(vdj)
Quantifying mutations
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:05)
[32]:
ddl.pp.quantify_mutations(vdj, combine = False)
Quantifying mutations
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:04)

Specifying split_locus = True will split up the results for the different chains.

[33]:
ddl.pp.quantify_mutations(vdj, split_locus = True)
Quantifying mutations
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:06)

To update the AnnData object, simply rerun tl.transfer with the keep_raw option set to False to avoid overwriting the stashed neighborhood graph.

[34]:
ddl.tl.transfer(adata, vdj)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:18)
[35]:
sc.set_figure_params(figsize = [4,4])
ddl.pl.clone_network(adata, color = ['clone_id', 'mu_freq', 'mu_freq_seq_r', 'mu_freq_seq_s', 'mu_freq_IGH', 'mu_freq_IGL'], ncols = 2, legend_loc = 'none', legend_fontoutline=3, edges_width = 1, palette = default_28 + default_102, color_map = 'viridis', size = 50)
WARNING: Length of palette colors is smaller than the number of categories (palette length: 130, categories length: 822. Some categories will have the same color.
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_72_1.png

Calculating size of clones

tl.clone_size

Sometimes it’s useful to evaluate the size of the clone. Here tl.clone_size does a simple calculation to enable that.

[36]:
ddl.tl.clone_size(vdj)
ddl.tl.transfer(adata, vdj)
Quantifying clone sizes
 finished: Updated Dandelion object:
   'metadata', cell-indexed clone table (0:00:00)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:18)
[37]:
ddl.pl.clone_network(adata, color = ['clone_id_size'], legend_loc = 'none', legend_fontoutline=3, edges_width = 1, size = 50)
sc.pl.umap(adata, color = ['clone_id_size'])
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_75_0.png
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_75_1.png

You can also specify max_size to clip off the calculation at a fixed value.
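
The idea behind the size calculation and the clipping can be sketched in a few lines. This is a minimal illustration with made-up clone assignments; the helper name is hypothetical, and how tl.clone_size labels the clipped category is not reproduced here.

```python
from collections import Counter

# Hypothetical clone assignments, one entry per cell.
clone_ids = ["A", "A", "A", "A", "B", "B", "C"]


def clone_sizes(clone_ids, max_size=None):
    """Return the size of each cell's clone, optionally capped at max_size."""
    counts = Counter(clone_ids)
    sizes = [counts[c] for c in clone_ids]
    if max_size is not None:
        sizes = [min(s, max_size) for s in sizes]
    return sizes


sizes = clone_sizes(clone_ids)
clipped = clone_sizes(clone_ids, max_size=3)
print(sizes, clipped)
```

Clipping is mostly useful for plotting, since a handful of categories is easier to colour than every distinct clone size.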

[38]:
ddl.tl.clone_size(vdj, max_size = 3)
ddl.tl.transfer(adata, vdj)
Quantifying clone sizes
 finished: Updated Dandelion object:
   'metadata', cell-indexed clone table (0:00:00)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:19)
[39]:
ddl.pl.clone_network(adata, color = ['clone_id_size'], ncols = 2, legend_fontoutline=3, edges_width = 1, palette = ['grey', 'red', 'blue', 'white'], size = 50)
sc.pl.umap(adata[adata.obs['has_bcr'] == 'True'], color = ['clone_id_size'], palette = ['grey', 'red', 'blue', 'white'])
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_78_0.png
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_78_1.png

Calculating diversity

Disclaimer: the functions here are not complete yet. Please look to other sources/methods for doing this properly. I would also appreciate any help to finalise this!

tl.clone_rarefaction/pl.clone_rarefaction

We can use pl.clone_rarefaction to generate rarefaction curves for the clones. tl.clone_rarefaction will populate the .uns slot with the results. The groupby option must be specified; in this case, I decided to group by sample. The function only works on an AnnData object and not on a Dandelion object.
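
For readers new to rarefaction, here is a minimal sketch of the textbook calculation: the expected number of distinct clones in a random subsample of n cells, drawn without replacement. The function name and data are hypothetical, and this is not necessarily how tl.clone_rarefaction computes its curves.

```python
from collections import Counter
from math import comb


def rarefied_richness(clone_ids, n):
    """Expected number of distinct clones in a random subsample of n cells."""
    counts = Counter(clone_ids)
    total = sum(counts.values())
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the number of cells")
    # A clone of size k is missed with probability C(total - k, n) / C(total, n).
    return sum(1 - comb(total - k, n) / comb(total, n) for k in counts.values())


# Hypothetical sample: 10 cells in 4 clones.
cells = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
curve = [(n, rarefied_richness(cells, n)) for n in range(0, len(cells) + 1, 2)]
print(curve)
```

A curve that flattens early suggests most clones in the group have been sampled, while a curve still rising at the full sample size suggests more clones would be found with deeper sampling.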

[40]:
ddl.pl.clone_rarefaction(adata, groupby = 'sampleid')
removing due to zero counts:
Calculating rarefaction curve : 100%|██████████| 4/4 [00:00<00:00, 14.54it/s]
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_80_2.png
[40]:
<ggplot: (349039913)>

tl.clone_diversity

tl.clone_diversity allows for calculation of diversity measures such as Chao1, Shannon Entropy and Gini indices.

While the function can work on both AnnData and Dandelion objects, the methods for gini index calculation will only work on a Dandelion object as it requires access to the network.

For Gini indices, we provide several types of measures, inspired by bulk BCRseq analysis methods from Rachael Bashford-Rogers’ works:

Default

i) network cluster/clone size Gini index - metric = clone_network

In a contracted BCR network (where identical BCRs are collapsed into the same node/vertex), disparity in the distribution of cluster sizes should correlate with the number of mutation events, i.e. larger networks should indicate more mutation events and smaller networks fewer.

ii) network vertex/node size Gini index - metric = clone_network

In the same contracted network, we can count the number of merged/contracted nodes; nodes with higher counts indicate more clonal expansion. Thus, disparity in the distribution of counts (referred to as vertex size) should correlate with overall clonality, i.e. clones with larger vertex sizes are more monoclonal and clones with smaller vertex sizes are more polyclonal.

Therefore, a Gini index of 1 on either measure represents perfect inequality (i.e. monoclonal and highly mutated) and a value of 0 represents perfect equality (i.e. polyclonal and unmutated).
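
As a reference point, here is a minimal sketch of a plain Gini index computed from the mean absolute difference between all pairs of values. This is the textbook definition, assumed here purely for illustration; tl.clone_diversity may use a different estimator or correction, so its numbers need not match this function.

```python
def gini(values):
    """Gini index via the sum of absolute differences between all pairs."""
    values = list(values)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    abs_diff_sum = sum(abs(x - y) for x in values for y in values)
    # Equivalent to abs_diff_sum / (2 * n**2 * mean).
    return abs_diff_sum / (2 * n * total)


equal_sizes = gini([5, 5, 5, 5])   # every cluster the same size
one_dominant = gini([0, 0, 0, 10])  # everything in one cluster
print(equal_sizes, one_dominant)
```

Note that with a finite number of clusters n, this plain version tops out at (n - 1) / n rather than exactly 1, which is why the second example gives 0.75.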

However, there are a few limitations/challenges that come with single-cell data:

  1. In the process of contracting the network, we discard the single-cell level information.

  2. Contraction of the network is very slow, particularly when there are many clonally-related cells.

  3. The full implementation and interpretation of both measures, although this is more evident with cluster/clone size, requires the BCR repertoire to be reasonably/deeply sampled, and we know that this is currently limited by the low recovery from single-cell data with current technologies.

Therefore, we implement a few workarounds and ‘experimental’ options below to try and circumvent these issues.

Firstly, as a workaround for (3), the cluster size Gini index can be calculated before or after network contraction. If performed before network contraction (default), it is calculated from the sizes of the subgraphs of connected components in the main graph. This retains the single-cell information and should appropriately show the distribution of the data. If performed after network contraction, it achieves the same effect as the method for bulk BCR-seq described above. This option can be toggled with use_contracted and only applies to the network cluster size Gini index calculation.
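
The "size of subgraphs of connected components" can be sketched without any graph library. This is a minimal illustration on a made-up graph with hypothetical cell names; the resulting list of sizes is the kind of input a cluster size Gini index would be computed from.

```python
from collections import defaultdict, deque


def component_sizes(nodes, edges):
    """Sizes of connected components, found with a breadth-first search."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = set()
    sizes = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical graph: one clone of 3 cells, one of 2 cells and a singleton.
cells = ["c1", "c2", "c3", "c4", "c5", "c6"]
edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
sizes = component_sizes(cells, edges)
print(sizes)
```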

Alternative

iii) clone centrality Gini index - metric = clone_centrality

Node/vertex closeness centrality indicates how tightly packed clones are (more clonally related) and thus the distribution of the number of cells connected in each clone informs on whether clones in general are more monoclonal or polyclonal.

iv) clone degree Gini index - metric = clone_degree

Node/vertex degree indicates how many cells are connected to an individual cell, another indication of how clonally related cells are. However, this would also highlight cells that are in the middle of large networks but are not necessarily within clonally expanded regions (e.g. intermediate connecting cells within the minimum spanning tree).

v) clone size Gini index - metric = clone_size

This is not to be confused with the network cluster size Gini index calculation above, as it doesn’t rely on the network, although the values should be similar. It is a simple implementation based on the data frame for the relevant clone_id column. By default, this metric is also returned when running metric = clone_centrality or metric = clone_degree.

For (i) and (ii), we can specify the expanded_only option to compute the statistic for all clones or for expanded clones only. Unlike (i) and (ii), the current calculation for (iii) and (iv) is largely influenced by the number of expanded clones, i.e. clones with at least 2 cells, and is not affected by the number of singleton clones because singleton clones will have a value of 0 regardless.
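
To see why singleton clones contribute a value of 0, here is a minimal sketch of node degree and closeness centrality on a made-up graph. It assumes unweighted edges and defines closeness as the number of reachable cells divided by the sum of shortest path lengths to them; the function and cell names are hypothetical, and the exact definition used by the package may differ.

```python
from collections import deque


def degree_and_closeness(nodes, edges):
    """Return (degree, closeness) dictionaries for an undirected graph."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    degree = {n: len(adjacency[n]) for n in nodes}
    closeness = {}
    for source in nodes:
        # Breadth-first search gives shortest path lengths in hops.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        reachable = len(dist) - 1
        closeness[source] = reachable / total if total else 0.0
    return degree, closeness


cells = ["c1", "c2", "c3", "c4"]
edges = [("c1", "c2"), ("c2", "c3")]  # c4 is a singleton clone
degree, closeness = degree_and_closeness(cells, edges)
print(degree, closeness)
```

The middle cell c2 has the highest degree and closeness, the two ends score lower, and the singleton c4 scores 0 on both.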

The diversity functions also have the option to perform downsampling to a fixed number of cells, or to the smallest sample size specified via groupby (default) so that sample sizes are even when comparing between groups.
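
The downsampling step can be pictured as follows. This is a minimal sketch with made-up data and a hypothetical helper name; it draws cells without replacement down to the smallest group size, and it keeps a group whole if it is smaller than a requested fixed size, which is an assumption rather than documented behaviour.

```python
import random
from collections import defaultdict


def downsample_groups(cells, size=None, seed=0):
    """cells: list of (group, clone_id). Returns {group: sampled clone_ids}."""
    by_group = defaultdict(list)
    for group, clone in cells:
        by_group[group].append(clone)
    if size is None:
        # Default: match the smallest group.
        size = min(len(v) for v in by_group.values())
    rng = random.Random(seed)
    # rng.sample draws without replacement.
    return {g: rng.sample(v, min(size, len(v))) for g, v in by_group.items()}


cells = [("s1", c) for c in ["A", "A", "B", "C", "D", "E"]]
cells += [("s2", c) for c in ["F", "G", "G"]]
sampled = downsample_groups(cells)
print({g: len(v) for g, v in sampled.items()})
```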

If update_obs_meta=False, a data frame is returned; otherwise, the values are added to AnnData.obs or Dandelion.metadata accordingly.

[41]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_network')
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_centrality')
ddl.tl.transfer(adata, vdj)
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished: updated `.metadata` with Gini indices.
 (0:00:12)
Calculating Gini indices
Computing gini indices for clone size using metadata and node closeness centrality using network.
Calculating node closeness centrality
 finished: Updated Dandelion metadata
 (0:00:00)
 finished: updated `.metadata` with Gini indices.
 (0:00:01)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:20)
[42]:
ddl.pl.clone_network(adata, color = ['clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_size_gini', 'clone_centrality_gini'], ncols = 2, size = 50)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_84_0.png

With these particular samples, because there are not many expanded clones in general, the Gini indices are quite low when calculated within each sample. We can re-run it by specifying expanded_only = True to only factor in expanded clones. We also specify the key_added option to create a new column instead of writing over the original columns.

[43]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_network', expanded_only = True, key_added = ['clone_network_cluster_size_gini_expanded', 'clone_network_vertex_size_gini_expanded'])
ddl.tl.transfer(adata, vdj)
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished: updated `.metadata` with Gini indices.
 (0:00:12)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:19)
[44]:
ddl.pl.clone_network(adata, color = ['clone_network_cluster_size_gini_expanded', 'clone_network_vertex_size_gini_expanded'], ncols = 2, size = 50)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_87_0.png

We can also choose not to update the metadata to return a pandas dataframe.

[45]:
gini = ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', update_obs_meta=False)
gini
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished (0:00:12)
[45]:
clone_network_cluster_size_gini clone_network_vertex_size_gini
vdj_v1_hs_pbmc3 0.026316 0.000000
sc5p_v2_hs_PBMC_1k 0.029412 0.000000
sc5p_v2_hs_PBMC_10k 0.004739 0.001584
vdj_nextgem_hs_pbmc3 0.048583 0.021134
[46]:
gini2 = ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', update_obs_meta=False, expanded_only = True, key_added = ['clone_network_cluster_size_gini_expanded', 'clone_network_vertex_size_gini_expanded'])
gini2
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished (0:00:12)
[46]:
clone_network_cluster_size_gini_expanded clone_network_vertex_size_gini_expanded
vdj_v1_hs_pbmc3 0.026316 0.000000
sc5p_v2_hs_PBMC_1k 0.029412 0.000000
sc5p_v2_hs_PBMC_10k 0.000000 0.333333
vdj_nextgem_hs_pbmc3 0.467532 0.333333
[47]:
import seaborn as sns
p = sns.scatterplot(x = 'clone_network_cluster_size_gini', y = 'clone_network_vertex_size_gini', data = gini, hue = gini.index, palette = dict(zip(adata.obs['sampleid'].cat.categories, adata.uns['sampleid_colors'])))
p.set(ylim=(-0.1,1), xlim = (-0.1,1))
p
[47]:
<AxesSubplot:xlabel='clone_network_cluster_size_gini', ylabel='clone_network_vertex_size_gini'>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_91_1.png
[48]:
p2 = sns.scatterplot(x = 'clone_network_cluster_size_gini_expanded', y = 'clone_network_vertex_size_gini_expanded', data = gini2, hue = gini2.index, palette = dict(zip(adata.obs['sampleid'].cat.categories, adata.uns['sampleid_colors'])))
p2.set(ylim=(-0.1,1), xlim = (-0.1,1))
p2
[48]:
<AxesSubplot:xlabel='clone_network_cluster_size_gini_expanded', ylabel='clone_network_vertex_size_gini_expanded'>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_92_1.png

We can also visualise the results for the clone centrality Gini indices.

[49]:
gini = ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_centrality', update_obs_meta=False)
gini
Calculating Gini indices
Computing gini indices for clone size using metadata and node closeness centrality using network.
Calculating node closeness centrality
 finished: Updated Dandelion metadata
 (0:00:00)
 finished (0:00:01)
[49]:
clone_size_gini clone_centrality_gini
vdj_v1_hs_pbmc3 0.026316 0.000000
sc5p_v2_hs_PBMC_1k 0.029412 0.000000
sc5p_v2_hs_PBMC_10k 0.004739 0.000000
vdj_nextgem_hs_pbmc3 0.048583 0.045455
[50]:
# not a great example because there's only 1 big clone in 1 sample.
p = sns.scatterplot(x = 'clone_size_gini', y = 'clone_centrality_gini', data = gini, hue = gini.index, palette = dict(zip(adata.obs['sampleid'].cat.categories, adata.uns['sampleid_colors'])))
p.set(ylim=(-0.1,1), xlim = (-0.1,1))
p
[50]:
<AxesSubplot:xlabel='clone_size_gini', ylabel='clone_centrality_gini'>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_95_1.png

Chao1 is an abundance-based estimator of clone richness.

[51]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'chao1', update_obs_meta = False)
Calculating Chao1 estimates
 finished (0:00:01)
[51]:
clone_size_chao1
vdj_v1_hs_pbmc3 703.0
sc5p_v2_hs_PBMC_1k 561.0
sc5p_v2_hs_PBMC_10k 44205.5
vdj_nextgem_hs_pbmc3 9106.0
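
For reference, here is a minimal sketch of the bias-corrected form of Chao1, which estimates richness from the number of clones seen exactly once (F1) and exactly twice (F2). Whether tl.clone_diversity uses exactly this variant is an assumption, and the function name and inputs are hypothetical.

```python
from collections import Counter


def chao1(clone_sizes):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    sizes = [s for s in clone_sizes if s > 0]
    observed = len(sizes)
    freq = Counter(sizes)
    f1, f2 = freq[1], freq[2]  # clones seen exactly once / exactly twice
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


all_singletons = chao1([1] * 37)
mixed = chao1([5, 2, 2, 1, 1, 1])
print(all_singletons, mixed)
```

With 37 singleton clones this form gives 703.0, so a sample made up entirely of singletons produces a very large estimate; many singletons and few doubletons suggest that a lot of clones remain unobserved.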

For Shannon Entropy, we can calculate a normalized (inspired by scirpy’s function) and non-normalized value.

[52]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'shannon', update_obs_meta = False)
Calculating Shannon entropy
 finished (0:00:01)
[52]:
clone_size_normalized_shannon
vdj_v1_hs_pbmc3 1.000000
sc5p_v2_hs_PBMC_1k 1.000000
sc5p_v2_hs_PBMC_10k 0.999849
vdj_nextgem_hs_pbmc3 0.989883
[53]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'shannon', update_obs_meta = False, normalize = False)
Calculating Shannon entropy
 finished (0:00:01)
[53]:
clone_size_shannon
vdj_v1_hs_pbmc3 5.209453
sc5p_v2_hs_PBMC_1k 5.044394
sc5p_v2_hs_PBMC_10k 8.712926
vdj_nextgem_hs_pbmc3 8.285998
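
The two flavours of Shannon entropy can be sketched as follows. This is a minimal illustration assuming base-2 logarithms and normalisation by the maximum possible entropy, log2 of the number of clones; the function name and data are hypothetical and the package's own implementation may differ in details.

```python
from collections import Counter
from math import log2


def shannon_entropy(clone_ids, normalize=True):
    """Shannon entropy of clone frequencies, optionally scaled to [0, 1]."""
    counts = Counter(clone_ids)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    if not normalize:
        return entropy
    # Divide by the maximum possible entropy for this number of clones.
    return entropy / log2(len(counts)) if len(counts) > 1 else 0.0


singletons = list(range(37))  # 37 clones, one cell each
raw = shannon_entropy(singletons, normalize=False)
scaled = shannon_entropy(singletons)
print(raw, scaled)
```

When every clone has the same size, the raw value equals log2 of the number of clones (about 5.21 for 37 clones) and the normalised value is 1; values below 1 indicate that some clones are larger than others.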
[54]:
adata
[54]:
AnnData object with n_obs × n_vars = 16492 × 1497
    obs: 'sampleid', 'batch', 'scrublet_score', 'n_genes', 'percent_mito', 'n_counts', 'is_doublet', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'leiden', 'sample_id', 'clone_id', 'clone_id_by_size', 'isotype', 'lightchain', 'status', 'vdj_status', 'productive', 'umi_counts_heavy', 'umi_counts_light', 'v_call_heavy', 'v_call_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'changeo_clone_id', 'mu_freq', 'mu_freq_seq_r', 'mu_freq_seq_s', 'mu_freq_seq_r_IGK', 'mu_freq_seq_s_IGK', 'mu_freq_IGK', 'mu_freq_seq_r_IGH', 'mu_freq_seq_s_IGH', 'mu_freq_IGH', 'mu_freq_seq_r_IGL', 'mu_freq_seq_s_IGL', 'mu_freq_IGL', 'clone_id_size', 'clone_id_size_max_3', 'clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_centrality', 'clone_size_gini', 'clone_centrality_gini', 'clone_network_cluster_size_gini_expanded', 'clone_network_vertex_size_gini_expanded'
    var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'bcr_QC_pass_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'umap', 'rna_neighbors', 'sampleid_colors', 'isotype_colors', 'clone_id_by_size_colors', 'status_colors', 'vdj_status_colors', 'clone_id_colors'
    obsm: 'X_pca', 'X_umap', 'X_bcr'
    varm: 'PCs'
    obsp: 'connectivities', 'distances', 'rna_connectivities', 'rna_distances', 'bcr_connectivities', 'bcr_distances'

Additional plotting functions

barplot

pl.barplot is a generic barplot function that will plot items in the metadata slot as a bar plot. This function will also interact with the .obs slot if a scanpy object is used in place of a Dandelion object. However, if your scanpy object holds a lot of non-B cells, then the plot will just be saturated with nan values.

[55]:
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
ddl.pl.barplot(vdj, variable = 'v_call_heavy', figsize = (12, 4))
[55]:
(<Figure size 1200x400 with 1 Axes>,
 <AxesSubplot:title={'center':'v call heavy usage'}, ylabel='proportion'>)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_103_1.png

You can prevent it from sorting by specifying sort_descending = None. Colours can be changed with the palette option.

[56]:
ddl.pl.barplot(vdj, variable = 'v_call_heavy', figsize = (12, 4), sort_descending = None, palette = 'tab20')
[56]:
(<Figure size 1200x400 with 1 Axes>,
 <AxesSubplot:title={'center':'v call heavy usage'}, ylabel='proportion'>)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_105_1.png

Specifying normalize = False will change the y-axis to counts.

[57]:
ddl.pl.barplot(vdj, variable = 'v_call_heavy', normalize = False, figsize = (12, 4), sort_descending = None, palette = 'tab20')
[57]:
(<Figure size 1200x400 with 1 Axes>,
 <AxesSubplot:title={'center':'v call heavy usage'}, ylabel='count'>)
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_107_1.png
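
What the bars represent can be sketched with a simple tally. This is a minimal illustration with made-up V gene calls; the helper name is hypothetical, its arguments only mirror the plotting options used above, and missing values are represented as None here.

```python
from collections import Counter


def usage(values, normalize=True, sort_descending=True):
    """Tally values as proportions (normalize=True) or raw counts."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    table = {k: (c / total if normalize else c) for k, c in counts.items()}
    if sort_descending is None:
        return table  # keep the order of first appearance
    ordered = sorted(table.items(), key=lambda kv: kv[1], reverse=sort_descending)
    return dict(ordered)


v_calls = ["IGHV3-15", "IGHV3-30", "IGHV3-30", "IGHV1-8", None]
proportions = usage(v_calls)
counts = usage(v_calls, normalize=False, sort_descending=None)
print(proportions, counts)
```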

stackedbarplot

pl.stackedbarplot is similar to above but can split between specified groups. Some examples below:

[58]:
import matplotlib.pyplot as plt
ddl.pl.stackedbarplot(vdj, variable = 'isotype', groupby = 'status', xtick_rotation =0, figsize = (4,4))
plt.legend(bbox_to_anchor = (1,1), loc='upper left', frameon=False)
[58]:
<matplotlib.legend.Legend at 0x169239d50>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_109_1.png
[59]:
ddl.pl.stackedbarplot(vdj, variable = 'v_call_heavy', groupby = 'isotype')
plt.legend(bbox_to_anchor = (1,1), loc='upper left', frameon=False)
[59]:
<matplotlib.legend.Legend at 0x16c216750>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_110_1.png
[60]:
ddl.pl.stackedbarplot(vdj, variable = 'v_call_heavy', groupby = 'isotype', normalize = True)
plt.legend(bbox_to_anchor = (1,1), loc='upper left', frameon=False)
[60]:
<matplotlib.legend.Legend at 0x16adeda90>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_111_1.png
[61]:
ddl.pl.stackedbarplot(vdj, variable = 'v_call_heavy', groupby = 'vdj_status')
plt.legend(bbox_to_anchor = (1,1), loc='upper left', frameon=False)
[61]:
<matplotlib.legend.Legend at 0x16aade450>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_112_1.png

It’s obviously more useful if you don’t have too many groups, but you could try to plot everything and adjust the legend options and colours.

[62]:
ddl.pl.stackedbarplot(vdj, variable = 'v_call_heavy', groupby = 'sample_id')
plt.legend(bbox_to_anchor = (1, 0.5), loc='center left', frameon=False)
[62]:
<matplotlib.legend.Legend at 0x1698e8250>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_114_1.png

spectratype

Spectratype plots display the CDR3 length distribution for specified groups. The current method only works for Dandelion objects as it requires access to the contig-indexed .data slot.

[63]:
ddl.pl.spectratype(vdj, variable = 'junction_length', groupby = 'c_call', locus='IGH', width = 2.3)
plt.legend(bbox_to_anchor = (1,1), loc='upper left', frameon=False)
[63]:
<matplotlib.legend.Legend at 0x1690f9590>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_116_1.png
[64]:
ddl.pl.spectratype(vdj, variable = 'junction_aa_length', groupby = 'c_call', locus='IGH')
plt.legend(bbox_to_anchor = (1,1), loc='upper left', frameon=False)
[64]:
<matplotlib.legend.Legend at 0x16b8d9d10>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_117_1.png
[65]:
ddl.pl.spectratype(vdj, variable = 'junction_aa_length', groupby = 'c_call', locus=['IGK','IGL'])
plt.legend(bbox_to_anchor = (1,1), loc='upper left', frameon=False)
[65]:
<matplotlib.legend.Legend at 0x16b793e50>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_118_1.png
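
To make the idea behind a spectratype concrete, here is a minimal sketch that tallies junction lengths per group for a chosen locus, using only the standard library. It is an illustration of the concept and not how dandelion computes it; the toy contigs and the helper name `spectratype_counts` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy contigs as (locus, c_call, junction_aa); all values are invented.
contigs = [
    ("IGH", "IGHM", "CARDYW"),
    ("IGH", "IGHM", "CARGGDYW"),
    ("IGH", "IGHG1", "CARDYW"),
    ("IGH", "IGHG1", "CAKDLW"),
    ("IGK", "IGKC", "CQQYF"),
]


def spectratype_counts(rows, locus):
    """Count junction lengths per c_call, keeping only the requested loci."""
    wanted = {locus} if isinstance(locus, str) else set(locus)
    table = defaultdict(Counter)
    for row_locus, c_call, junction_aa in rows:
        if row_locus in wanted:
            table[c_call][len(junction_aa)] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {group: dict(lengths) for group, lengths in table.items()}


heavy = spectratype_counts(contigs, "IGH")
light = spectratype_counts(contigs, ["IGK", "IGL"])
print(heavy)
print(light)
```

Each inner dictionary maps a junction length to the number of contigs with that length, which is the histogram a spectratype plot draws per group.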

clone_overlap

There is now a circos-style clone overlap function that checks whether different samples share a clone. If they do, an arc/connection is drawn between them. This requires the Python module nxviz to be installed; at the time of writing this notebook, pip install nxviz has some dependency issues, so I’ve adjusted the requirements in a forked repository, which you can install via: pip install git+https://github.com/zktuong/nxviz.git

[66]:
ddl.tl.clone_overlap(adata, groupby = 'leiden', colorby = 'leiden')
Finding clones
 finished: Updated AnnData:
   'uns', clone overlap table (0:00:00)
[67]:
ddl.pl.clone_overlap(adata, groupby = 'leiden', colorby = 'leiden', return_graph=True, group_label_offset=.5)
[67]:
<nxviz.plots.CircosPlot at 0x16c3e4f90>
../_images/notebooks_3_dandelion_findingclones_and_analysis-10x_data_121_1.png
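
The logic behind the arcs can be sketched in a few lines: two groups are connected whenever at least one clone ID appears in both. The block below is a conceptual illustration on invented data, not dandelion's implementation; the helper name `shared_clone_edges` and the toy group and clone labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy per-cell annotations as (group label, clone ID); all values are invented.
cells = [
    ("0", "clone_A"), ("0", "clone_B"),
    ("1", "clone_A"), ("1", "clone_C"),
    ("2", "clone_C"), ("2", "clone_A"),
    ("3", "clone_D"),
]


def shared_clone_edges(pairs):
    """Map each pair of groups to the set of clone IDs found in both."""
    groups_by_clone = defaultdict(set)
    for group, clone in pairs:
        groups_by_clone[clone].add(group)

    edges = defaultdict(set)
    for clone, groups in groups_by_clone.items():
        # Every pair of groups containing this clone gets a connection.
        for a, b in combinations(sorted(groups), 2):
            edges[(a, b)].add(clone)
    return dict(edges)


edges = shared_clone_edges(cells)
for (a, b), clones in sorted(edges.items()):
    print(a, b, sorted(clones))
```

Groups that share no clone with any other group (group "3" here) produce no edges, so they would appear as unconnected nodes in a circos-style layout.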

Another use case would be to plot the nodes as individual samples and color them by a group classification of the samples. As long as this information is available as a column in the AnnData .obs, or even in Dandelion.metadata, this will work.

That sums it up for now! If you have any ideas, let me know at [kt16@sanger.ac.uk] and I can see whether I can implement them, or we can work something out to collaborate on!
