Calculating diversity and mutation


Calculating mutational load

To calculate mutational load, functions from shazam (part of the Immcantation suite) are accessed via rpy2 and applied to the Dandelion class object.

This can be run immediately after pp.reassign_alleles during the reannotation pre-processing stage because the required germline columns should be present in the genotyped .tsv file. I would recommend running this after TIgGER, once the v_calls have been corrected. Otherwise, if the reannotation was skipped, you can run it now as follows:

Import modules

[1]:
import os
import pandas as pd
import dandelion as ddl
ddl.logging.print_header()
dandelion==0.1.0 pandas==1.1.4 numpy==1.19.4 matplotlib==3.3.3 networkx==2.5 scipy==1.5.3 skbio==0.5.6
[2]:
# change directory to somewhere more workable
os.chdir(os.path.expanduser('/Users/kt16/Downloads/dandelion_tutorial/'))
# I'm importing scanpy here to make use of its logging module.
import scanpy as sc
sc.settings.verbosity = 3
import warnings
warnings.filterwarnings('ignore')
sc.logging.print_header()
scanpy==1.6.0 anndata==0.7.4 umap==0.4.6 numpy==1.19.4 scipy==1.5.3 pandas==1.1.4 scikit-learn==0.23.2 statsmodels==0.12.1 python-igraph==0.8.3 leidenalg==0.8.3

Read in the previously saved files

[3]:
adata = sc.read_h5ad('adata.h5ad')
adata
[3]:
AnnData object with n_obs × n_vars = 16492 × 1497
    obs: 'sampleid', 'batch', 'scrublet_score', 'n_genes', 'percent_mito', 'n_counts', 'is_doublet', 'filter_rna', 'has_bcr', 'filter_bcr_quality', 'filter_bcr_heavy', 'filter_bcr_light', 'bcr_QC_pass', 'filter_bcr', 'leiden', 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id'
    var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'bcr_QC_pass_colors', 'clone_id_by_size_colors', 'hvg', 'isotype_colors', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rna_neighbors', 'sampleid_colors', 'status_colors', 'umap', 'vdj_status_colors'
    obsm: 'X_bcr', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'bcr_connectivities', 'bcr_distances', 'connectivities', 'distances', 'rna_connectivities', 'rna_distances'
[4]:
vdj = ddl.read_h5('dandelion_results.h5')
vdj
[4]:
Dandelion class object with n_obs = 838 and n_contigs = 1700
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status', 'changeo_clone_id'
    distance: 'heavy_0', 'light_0', 'light_1', 'light_2'
    edges: 'source', 'target', 'weight'
    layout: layout for 838 vertices, layout for 24 vertices
    graph: networkx graph of 838 vertices, networkx graph of 24 vertices
[5]:
# let's recreate the vdj object with only the first two samples
subset_data = vdj.data[vdj.data['sample_id'].isin(['sc5p_v2_hs_PBMC_1k', 'sc5p_v2_hs_PBMC_10k'])]
subset_data
[5]:
sequence_id sequence rev_comp productive v_call d_call j_call sequence_alignment germline_alignment junction ... cdr2_aa cdr3_aa sequence_alignment_aa v_sequence_alignment_aa d_sequence_alignment_aa j_sequence_alignment_aa mu_freq duplicate_count clone_id changeo_clone_id
sequence_id
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_1 sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_1 CTGGGCCTCAGGAAGCAGCATCGGAGGTGCCTCAGCCATGGCATGG... F T IGLV3-1*01 IGLJ2*01,IGLJ3*01 TCCTATGAGCTGACTCAGCCACCCTCA...GTGTCCGTGTCCCCAG... TCCTATGAGCTGACTCAGCCACCCTCA...GTGTCCGTGTCCCCAG... TGTCAGGCGTGGGACAGCAGCAATGTGGTATTC ... QDN QAWDSSNVV SYELTQPPSVSVSPGQTASITCSGDKLGHKYACWYQQKPGQSPVLV... SYELTQPPSVSVSPGQTASITCSGDKLGHKYACWYQQKPGQSPVLV... VFGGGTKLTVL 0.009434 0 64_8_3 55_0
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_2 sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_2 AGCTCTGGGAGAGGAGCCCCAGCCTTGGGATTCCCAAGTGTTTTCA... F T IGHV3-15*01 IGHD4-23*01 IGHJ4*02 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTAAAGCCTG... GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTAAAGCCTG... TGTACCACAGAAGCTTTAGACTACGGTGCTAACTCGCGATCCCCGA... ... IKSNTDGATT TTEALDYGANSRSPNFDY EVQLVESGGGLVKPGGSLRLSCAASGFIFSNAWMSWVRQAPGKGLE... EVQLVESGGGLVKPGGSLRLSCAASGFIFSNAWMSWVRQAPGKGLE... DYGANS FDYWGQGTLVTVSS 0.008671 0 64_8_3 55_0
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_1 sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_1 AGAGCTCTGGAGAAGAGCTGCTCAGTTAGGACCCAGAGGGAACCAT... F T IGKV3-20*01 IGKJ3*01 GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... TGTCAGCAGTATGGTAGCTCACCTCCATTCACTTTC ... GAS QQYGSSPPFT EIVLTQSPGTLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPR... EIVLTQSPGTLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPR... FTFGPGTKVDIK 0.000000 0 116_6_2 76_1
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_2 sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_2 GAGCTCTGGGAGAGGAGCCCAGCACTAGAAGTCGGCGGTGTTTCCA... F T IGHV3-30*02,IGHV3-30-5*02 IGHD1-26*01 IGHJ6*02 CAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCGTGGTCCAGCCTG... CAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCGTGGTCCAGCCTG... TGTGCGAAAGATACTGAAGTGGGAGCGAGCCGATACTACTACTACT... ... IRYDGSNK AKDTEVGASRYYYYYGMDV QVQLVESGGGVVQPGGSLRLSCAASGFTFSSYGMHWVRQAPGKGLE... QVQLVESGGGVVQPGGSLRLSCAASGFTFSSYGMHWVRQAPGKGLE... VGA YYYYYGMDVWGQGTTVTVSS 0.000000 0 116_6_2 76_1
sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG_contig_1 sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG_contig_1 GAGCTACAACAGGCAGGCAGGGGCAGCAAGATGGTGTTGCAGACCC... F T IGKV4-1*01 IGKJ5*01 GACATCGTGATGACCCAGTCTCCAGACTCCCTGGCTGTGTCTCTGG... GACATCGTGATGACCCAGTCTCCAGACTCCCTGGCTGTGTCTCTGG... TGTCAGCAATATTATAGTACTCCGATCACCTTC ... WAS QQYYSTPIT DIVMTQSPDSLAVSLGERATINCKSSQSVLYSSNNKNYLAWYQQKP... DIVMTQSPDSLAVSLGERATINCKSSQSVLYSSNNKNYLAWYQQKP... ITFGQGTRLEIK 0.000000 0 134_3_1 185_2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
sc5p_v2_hs_PBMC_10k_TTTCCTCTCTTCAACT_contig_1 sc5p_v2_hs_PBMC_10k_TTTCCTCTCTTCAACT_contig_1 TTTCCTCTCTTCAACTGCGAACCGACTTTCTGCGATGGGGACTCAA... F T IGHV1-8*01 IGHD2-2*02 IGHJ6*02 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... TGTGCGAGATATTGTAGTAGTACCAGCTGCTATACGACCTATTACT... ... MNPNSGNT ARYCSSTSCYTTYYYYYYGMDV QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYDINWVRQATGQGLE... QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYDINWVRQATGQGLE... YCSSTSCYT YYYYYGMDVWGQGTTVTVSS 0.000000 0 36_6_1 758_448
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_2 sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_2 GGGAGAGCCCTGGGGAGGAACTGCTCAGTTAGGACCCAGAGGGAAC... F T IGKV3-11*01 IGKJ5*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... TGTCAGCAGCGTAGCAACTGGCTCACCTTC ... DAS QQRSNWLT EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... TFGQGTRLEIK 0.000000 0 155_2_1 432_449
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_1 sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_1 ACAACCACACCCCTCCTAAGAAGAAGCCCCTAGACCACAGCTCCAC... F T IGHV7-4-1*02 IGHD3-10*01 IGHJ5*02 CAGGTGCAGCTGGTGCAATCTGGGTCT...GAGTTGAAGAAGCCTG... CAGGTGCAGCTGGTGCAATCTGGGTCT...GAGTTGAAGAAGCCTG... TGTGCGAGAGTTTTTAGACGCTATGGTTCGGGGAGTTATTATAACC... ... INTNTGNP ARVFRRYGSGSYYNL QVQLVQSGSELKKPGASVKVSCKASGYTFTSYAMNWVRQAPGQGLE... QVQLVQSGSELKKPGASVKVSCKASGYTFTSYAMNWVRQAPGQGLE... YGSGSYY LWGQGTLVTVSS 0.003003 0 155_2_1 432_449
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_1 sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_1 GAGAGAGGAGCCTTAGCCCTGGATTCCAAGGCCTATCCACTTGGTG... F T IGHV3-21*01 IGHD4-17*01 IGHJ2*01 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG... GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG... TGTGCGAGAGATCCCGGTGACTACGTAGAAATTGAGTGGTACTTCG... ... ISSSSSYI ARDPGDYVEIEWYFDL EVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLE... EVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLE... GDY WYFDLWGRGTLVTVSS 0.000000 0 52_1_1 357_450
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_2 sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_2 GGGAGAGCCCTGGGGAGGAACTGCTCAGTTAGGACCCAGAGGGAAC... F T IGKV3-11*01 IGKJ4*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG... TGTCAGCAGCGTAGCAACTGGCCTAGGCTCACTTTC ... DAS QQRSNWPRLT EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... EIVLTQSPATLSLSPGERATLSCRASQSVSSYLAWYQQKPGQAPRL... LTFGGGTKVEIK 0.000000 0 52_1_1 357_450

926 rows × 84 columns

[6]:
# create a new Dandelion class with this subset
vdj2 = ddl.Dandelion(subset_data)
vdj2
[6]:
Dandelion class object with n_obs = 454 and n_contigs = 926
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_support', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_freq', 'duplicate_count', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_heavy', 'locus_light', 'productive_heavy', 'productive_light', 'v_call_genotyped_heavy', 'v_call_genotyped_light', 'j_call_heavy', 'j_call_light', 'c_call_heavy', 'c_call_light', 'umi_count_heavy_0', 'umi_count_light_0', 'umi_count_light_1', 'umi_count_light_2', 'junction_aa_heavy', 'junction_aa_light', 'status', 'productive', 'isotype', 'vdj_status_detail', 'vdj_status'
    distance: None
    edges: None
    layout: None
    graph: None

update_germline

We can store the corrected germline fasta files (after running TIgGER) in the Dandelion class as a dictionary.

[7]:
# update the germline using the corrected files after tigger
vdj2.update_germline(corrected = 'tutorial_scgp1/tutorial_scgp1_heavy_igblast_db-pass_genotype.fasta', germline = None, org = 'human')
Updating germline reference
 finished: Updated Dandelion object:
   'germline', updated germline reference
 (0:00:00)

pp.create_germlines

Then we run pp.create_germlines to (re)create the germline_alignment_d_mask column in the data. If update_germline was run as above, there is no need to specify the germline option, as the function will simply retrieve it from the Dandelion object.

Note: running the original CreateGermlines.py with the --cloned option is not currently possible through pp.create_germlines(). It is possible with pp.external.creategermlines, but that requires a physical file for CreateGermlines.py to work on. I would therefore recommend running CreateGermlines.py separately if you intend to use the --cloned option. See https://changeo.readthedocs.io/en/stable/examples/germlines.html for more info.

[8]:
ddl.pp.create_germlines(vdj2, v_field = 'v_call_genotyped', germ_types='dmask')
Reconstructing germline sequences
   Building dmask germline sequences: 926it [00:01, 739.83it/s]
 finished: Updated Dandelion object:
   'data', updated germline alignment in contig-indexed clone table
   'germline', updated germline reference
 (0:00:01)

Ensure that the germline_alignment_d_mask column is populated or subsequent steps will fail.

[9]:
vdj2.data[['v_call_genotyped', 'germline_alignment_d_mask']]
[9]:
v_call_genotyped germline_alignment_d_mask
sequence_id
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_1 IGLV3-1*01 TCCTATGAGCTGACTCAGCCACCCTCA...GTGTCCGTGTCCCCAG...
sc5p_v2_hs_PBMC_1k_AACTGGTTCTCTAAGG_contig_2 IGHV3-15*01 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTAAAGCCTG...
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_1 IGKV3-20*01 GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG...
sc5p_v2_hs_PBMC_1k_AATCCAGTCAGTTGAC_contig_2 IGHV3-30*02 CAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCGTGGTCCAGCCTG...
sc5p_v2_hs_PBMC_1k_AATCGGTGTTAGGGTG_contig_1 IGKV4-1*01 GACATCGTGATGACCCAGTCTCCAGACTCCCTGGCTGTGTCTCTGG...
... ... ...
sc5p_v2_hs_PBMC_10k_TTTCCTCTCTTCAACT_contig_1 IGHV1-8*01 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_2 IGKV3-11*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCAGAGCTT_contig_1 IGHV7-4-1*02 CAGGTGCAGCTGGTGCAATCTGGGTCT...GAGTTGAAGAAGCCTG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_1 IGHV3-21*01 GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG...
sc5p_v2_hs_PBMC_10k_TTTGGTTTCGGTGTCG_contig_2 IGKV3-11*01 GAAATTGTGTTGACACAGTCTCCAGCCACCCTGTCTTTGTCTCCAG...

926 rows × 2 columns

The default behaviour, with option germ_types = 'dmask', is to mask the D region with Ns. See here for more info.

pp.quantify_mutations

The options for pp.quantify_mutations are the same as in the basic mutational load analysis vignette. The default behaviour is to sum all mutation scores (heavy and light chains, silent and replacement mutations) for the same cell.
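
To make that default more concrete, here is a minimal sketch of what summing mutation scores per cell could look like. It is not shazam's algorithm (which works codon by codon and separates replacement from silent mutations); it simply counts mismatching bases between an aligned sequence and its D-masked germline, skipping gaps and Ns. The function names and the toy contigs are made up for illustration.

```python
from collections import defaultdict

# Characters that carry no base information: IMGT gaps, alignment gaps, masked bases.
UNINFORMATIVE = set(".-N")


def mutation_frequency(sequence, germline):
    """Fraction of informative aligned positions where sequence differs from germline."""
    mismatches = 0
    informative = 0
    for obs, ref in zip(sequence.upper(), germline.upper()):
        if obs in UNINFORMATIVE or ref in UNINFORMATIVE:
            continue  # skip gaps and the N-masked D region
        informative += 1
        if obs != ref:
            mismatches += 1
    return mismatches / informative if informative else 0.0


def sum_per_cell(contigs):
    """Sum per-contig frequencies over all contigs that share a cell_id."""
    totals = defaultdict(float)
    for cell_id, sequence, germline in contigs:
        totals[cell_id] += mutation_frequency(sequence, germline)
    return dict(totals)


# Toy contigs: (cell_id, sequence_alignment, germline_alignment_d_mask)
toy_contigs = [
    ("cell_1", "GAGGTACAGC...TGG", "GAGGTGCAGC...TGG"),  # 1 mismatch in 13 bases
    ("cell_1", "TCCTATGAGC", "TCCTATGAGC"),              # second chain, no mismatch
    ("cell_2", "CAGGTGNNNNCTG", "CAGGTGNNNNCTG"),        # masked bases are ignored
]
print(sum_per_cell(toy_contigs))
```

Keeping the per-contig values separate instead of summing them is, loosely, the idea behind the combine = False option shown further down.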

Again, this function can be run immediately after pp.reassign_alleles on the genotyped .tsv files (without loading into pandas or Dandelion). Here I’m illustrating a few other options that may be useful.

[10]:
# switching back to using the full vdj object
ddl.pp.quantify_mutations(vdj)
Quantifying mutations
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:08)
[11]:
ddl.pp.quantify_mutations(vdj, combine = False)
Quantifying mutations
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:05)

Specifying split_locus = True will split up the results for the different chains.

[12]:
ddl.pp.quantify_mutations(vdj, split_locus = True)
Quantifying mutations
 finished: Updated Dandelion object:
   'data', contig-indexed clone table
   'metadata', cell-indexed clone table
 (0:00:07)

To update the AnnData object, simply rerun tl.transfer.

[13]:
ddl.tl.transfer(adata, vdj)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:31)
[14]:
from scanpy.plotting.palettes import default_28, default_102
sc.set_figure_params(figsize = [4,4])
ddl.pl.clone_network(adata, color = ['clone_id', 'mu_freq', 'mu_freq_seq_r', 'mu_freq_seq_s', 'mu_freq_IGH', 'mu_freq_IGL'], ncols = 2, legend_loc = 'none', legend_fontoutline=3, edges_width = 1, palette = default_28 + default_102, color_map = 'viridis', size = 50)
WARNING: Length of palette colors is smaller than the number of categories (palette length: 130, categories length: 822. Some categories will have the same color.
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_24_1.png

Calculating diversity

Disclaimer: the functions here are experimental. Please look to other sources/methods for doing this properly. I would also appreciate any help in finalising this!

tl.clone_rarefaction and pl.clone_rarefaction

We can use pl.clone_rarefaction to generate rarefaction curves for the clones. tl.clone_rarefaction will populate the .uns slot with the results. The groupby option must be specified; in this case, I decided to group by sample. The function only works on an AnnData object, not a Dandelion object.
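
For readers unfamiliar with rarefaction, the sketch below shows the general idea with the standard analytical formula: the expected number of distinct clones seen when n cells are drawn without replacement. I am not claiming this is how tl.clone_rarefaction computes its curves; the function names and clone sizes here are hypothetical.

```python
from math import comb


def rarefied_richness(clone_sizes, n):
    """Expected number of distinct clones among n cells drawn without replacement."""
    total = sum(clone_sizes)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of cells")
    # A clone of size s is missed entirely with probability C(total - s, n) / C(total, n).
    return sum(1 - comb(total - s, n) / comb(total, n) for s in clone_sizes)


def rarefaction_curve(clone_sizes, step=1):
    """(subsample size, expected clone count) pairs, always ending at the full sample."""
    total = sum(clone_sizes)
    depths = list(range(step, total + 1, step))
    if not depths or depths[-1] != total:
        depths.append(total)
    return [(n, rarefied_richness(clone_sizes, n)) for n in depths]


# Hypothetical sample: one expanded clone of 5 cells, one of 2 cells, three singletons.
toy_sizes = [5, 2, 1, 1, 1]
for depth, expected in rarefaction_curve(toy_sizes, step=3):
    print(depth, round(expected, 3))
```

A curve that is still climbing steeply at the full sample size suggests that the repertoire has not been sampled deeply enough to see most clones.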

[15]:
ddl.pl.clone_rarefaction(adata, groupby = 'sampleid')
removing due to zero counts:
Calculating rarefaction curve : 100%|██████████| 4/4 [00:00<00:00, 15.28it/s]
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_26_2.png
[15]:
<ggplot: (343225997)>

tl.clone_diversity

tl.clone_diversity allows for calculation of diversity measures such as Chao1, Shannon Entropy and Gini indices.

While the function can work on both AnnData and Dandelion objects, the Gini index calculations only work on a Dandelion object because they require access to the network.

For Gini indices, we provide several types of measures, inspired by bulk BCRseq analysis methods from Rachael Bashford-Rogers’ works:

Default

i) network cluster/clone size Gini index - metric = clone_network

In a contracted BCR network (where identical BCRs are collapsed into the same node/vertex), disparity in the distribution should be correlated to the number of mutation events, i.e. larger networks should indicate more mutation events and smaller networks should indicate fewer mutation events.

ii) network vertex/node size Gini index - metric = clone_network

In the same contracted network, we can count the number of merged/contracted nodes; nodes with higher count numbers indicate more clonal expansion. Thus, disparity in the distribution of count numbers (referred to as vertex size) should be correlated to the overall clonality i.e. clones with larger vertex sizes are more monoclonal and clones with smaller vertex sizes are more polyclonal.

Therefore, a Gini index of 1 on either measure represents perfect inequality (i.e. monoclonal and highly mutated) and a value of 0 represents perfect equality (i.e. polyclonal and unmutated).
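
As a rough illustration of the vertex size measure, the sketch below collapses identical sequences into vertices, counts how many cells each vertex absorbed, and computes a plain Gini index over those counts. This is a generic textbook estimator written for this example, not necessarily the exact one used inside tl.clone_diversity; the sequence labels are placeholders.

```python
from collections import Counter


def gini(values):
    """Gini index of non-negative values: 0 means all equal, larger means more unequal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form of the mean absolute difference between all pairs.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Contract identical BCRs into one vertex; vertex size = number of cells merged.
toy_bcrs = ["seqA", "seqA", "seqA", "seqA", "seqB", "seqC"]  # placeholder sequences
vertex_sizes = list(Counter(toy_bcrs).values())  # one vertex of 4 cells, two of 1 cell

print(round(gini(vertex_sizes), 3))
print(gini([1, 1, 1, 1]))  # all vertices the same size -> 0.0
```

Note that with this uncorrected estimator the maximum for n values is (n - 1) / n, so a value of exactly 1 is only approached as the number of vertices grows.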

However, there are a few limitations/challenges that come with single-cell data:

  1. In the process of contracting the network, we discard the single-cell level information.

  2. Contraction of the network is very slow, particularly when there are many clonally related cells.

  3. Full implementation and interpretation of both measures (more evidently for cluster/clone size) requires the BCR repertoire to be reasonably deeply sampled, which is currently limited by the low recovery from single-cell data with current technologies.

Therefore, we implement a few workarounds, and 'experimental' options below, to try to circumvent these issues.

Firstly, as a workaround for (3), the cluster size Gini index can be calculated before or after network contraction. If calculated before network contraction (default), it is based on the size of subgraphs of connected components in the main graph. This retains the single-cell information and should appropriately show the distribution of the data. If calculated after network contraction, it achieves the same effect as the method for bulk BCR-seq described above. This option can be toggled with use_contracted and only applies to the network cluster size Gini index calculation.
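
To show what "size of subgraphs of connected components" means in practice, here is a small standard-library sketch that finds the connected components of an uncontracted cell graph and reports their sizes. The cell names and edges are invented, and dandelion itself works on a networkx graph rather than on plain lists like these.

```python
from collections import defaultdict, deque


def component_sizes(nodes, edges):
    """Sizes of the connected components of an undirected graph, largest first."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    sizes = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every cell reachable from `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical graph: each node is a single cell, each edge links clonally related cells.
toy_cells = ["c1", "c2", "c3", "c4", "c5", "c6"]
toy_edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
print(component_sizes(toy_cells, toy_edges))  # [3, 2, 1]
```

The list of component sizes is the distribution over which a cluster size Gini index would then be computed; because every cell is still its own node, no single-cell information is lost.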

Alternative

iii) clone centrality Gini index - metric = clone_centrality

Node/vertex closeness centrality indicates how tightly packed clones are (more clonally related) and thus the distribution of the number of cells connected in each clone informs on whether clones in general are more monoclonal or polyclonal.

iv) clone degree Gini index - metric = clone_degree

Node/vertex degree indicates how many cells are connected to an individual cell, another indication of how clonally related cells are. However, this would also highlight cells that are in the middle of large networks but are not necessarily within clonally expanded regions (e.g. intermediate connecting cells within the minimum spanning tree).

v) clone size Gini index - metric = clone_size

This is not to be confused with the network cluster size gini index calculation above as this doesn’t rely on the network, although the values should be similar. This is just a simple implementation based on the data frame for the relevant clone_id column. By default, this metric is also returned when running metric = clone_centrality or metric = clone_degree.

For (i) and (ii), we can specify the expanded_only option to compute the statistic for all clones or for expanded clones only. Unlike options (i) and (ii), the current calculation for (iii) and (iv) is largely influenced by the number of expanded clones, i.e. clones with at least 2 cells, and is not affected by the number of singleton clones, because singleton clones will have a value of 0 regardless.
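
The sketch below illustrates why singleton clones always contribute 0 to the degree and centrality measures. It computes node degree and a plain, within-component closeness centrality on a toy graph using only the standard library. Library implementations (networkx, for instance) may apply extra scaling for disconnected graphs, so treat the exact values as illustrative; the helper names are my own.

```python
from collections import deque


def build_adjacency(nodes, edges):
    """Undirected adjacency sets, including isolated nodes."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degree(adjacency):
    """Number of directly connected cells per node."""
    return {node: len(nbrs) for node, nbrs in adjacency.items()}


def closeness(adjacency):
    """Reachable nodes divided by the summed shortest-path length, per node."""
    scores = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives unweighted shortest paths
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


# Hypothetical clone of three cells in a chain (a-b-c) plus one singleton (d).
toy_graph = build_adjacency(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(degree(toy_graph))
print(closeness(toy_graph))
```

Here the middle cell b has the highest degree and closeness, while the singleton d scores 0 on both, regardless of how many other singletons are present.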

The diversity functions also have the option to perform downsampling to a fixed number of cells, or to the smallest sample size specified via groupby (default) so that sample sizes are even when comparing between groups.
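
A minimal sketch of the downsampling idea, assuming each group is simply a list of clone labels, one per cell: every group is sampled without replacement down to a fixed size, or to the size of the smallest group when no size is given. The function name, its parameters and the toy data are hypothetical and do not mirror dandelion's actual arguments.

```python
import random


def downsample_groups(groups, n=None, seed=0):
    """Sample each group without replacement down to n cells (default: smallest group)."""
    if n is None:
        n = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    # rng.sample raises ValueError if n is larger than a group, which is intended here.
    return {name: rng.sample(members, n) for name, members in groups.items()}


# Hypothetical per-cell clone labels for two samples of unequal size.
toy_groups = {
    "sample_A": ["clone1", "clone1", "clone2", "clone3", "clone4"],
    "sample_B": ["clone5", "clone6", "clone6"],
}
even = downsample_groups(toy_groups)
print({name: len(cells) for name, cells in even.items()})  # both groups now have 3 cells
```

Because diversity estimates depend on how many cells were sampled, comparing groups at equal size avoids mistaking a difference in depth for a difference in diversity.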

If update_obs_meta=False, a data frame is returned; otherwise, the value is added to AnnData.obs or Dandelion.metadata accordingly.

[16]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_network')
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_centrality')
ddl.tl.transfer(adata, vdj)
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished: updated `.metadata` with Gini indices.
 (0:00:14)
Calculating Gini indices
Computing gini indices for clone size using metadata and node closeness centrality using network.
Calculating node closeness centrality
 finished: Updated Dandelion metadata
 (0:00:00)
 finished: updated `.metadata` with Gini indices.
 (0:00:01)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:22)
[17]:
ddl.pl.clone_network(adata, color = ['clone_network_cluster_size_gini', 'clone_network_vertex_size_gini', 'clone_size_gini', 'clone_centrality_gini'], ncols = 2, size = 50)
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_30_0.png

With these particular samples, because there are not many expanded clones in general, the Gini indices are quite low when calculated within each sample. We can re-run it by specifying expanded_only = True to only factor in expanded clones. We also specify the key_added option to create a new column instead of writing over the original columns.

[18]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_network', expanded_only = True, key_added = ['clone_network_cluster_size_gini_expanded', 'clone_network_vertex_size_gini_expanded'])
ddl.tl.transfer(adata, vdj)
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished: updated `.metadata` with Gini indices.
 (0:00:14)
Transferring network
converting matrices
Updating anndata slots
 finished: updated `.obs` with `.metadata`
added to `.uns['neighbors']` and `.obsp`
   'distances', cluster-weighted adjacency matrix
   'connectivities', cluster-weighted adjacency matrix (0:00:20)
[19]:
ddl.pl.clone_network(adata, color = ['clone_network_cluster_size_gini_expanded', 'clone_network_vertex_size_gini_expanded'], ncols = 2, size = 50)
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_33_0.png

We can also choose not to update the metadata and return a pandas data frame instead.

[20]:
gini = ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', update_obs_meta=False)
gini
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished (0:00:16)
[20]:
clone_network_cluster_size_gini clone_network_vertex_size_gini
sc5p_v2_hs_PBMC_1k 0.029412 0.000000
vdj_nextgem_hs_pbmc3 0.048583 0.021134
vdj_v1_hs_pbmc3 0.026316 0.000000
sc5p_v2_hs_PBMC_10k 0.004739 0.001584
[21]:
gini2 = ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', update_obs_meta=False, expanded_only = True, key_added = ['clone_network_cluster_size_gini_expanded', 'clone_network_vertex_size_gini_expanded'])
gini2
Calculating Gini indices
Computing Gini indices for cluster and vertex size using network.
 finished (0:00:15)
[21]:
clone_network_cluster_size_gini_expanded clone_network_vertex_size_gini_expanded
sc5p_v2_hs_PBMC_1k 0.029412 0.000000
vdj_nextgem_hs_pbmc3 0.467532 0.333333
vdj_v1_hs_pbmc3 0.026316 0.000000
sc5p_v2_hs_PBMC_10k 0.000000 0.333333
[22]:
ddl.pl.clone_network(adata, color = 'sampleid', size = 50)
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_37_0.png
[23]:
import seaborn as sns
p = sns.scatterplot(x = 'clone_network_cluster_size_gini', y = 'clone_network_vertex_size_gini', data = gini, hue = gini.index, palette = dict(zip(adata.obs['sampleid'].cat.categories, adata.uns['sampleid_colors'])))
p.set(ylim=(-0.1,1), xlim = (-0.1,1))
p
[23]:
<AxesSubplot:xlabel='clone_network_cluster_size_gini', ylabel='clone_network_vertex_size_gini'>
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_38_1.png
[24]:
p2 = sns.scatterplot(x = 'clone_network_cluster_size_gini_expanded', y = 'clone_network_vertex_size_gini_expanded', data = gini2, hue = gini2.index, palette = dict(zip(adata.obs['sampleid'].cat.categories, adata.uns['sampleid_colors'])))
p2.set(ylim=(-0.1,1), xlim = (-0.1,1))
p2
[24]:
<AxesSubplot:xlabel='clone_network_cluster_size_gini_expanded', ylabel='clone_network_vertex_size_gini_expanded'>
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_39_1.png

We can also visualise the results for the clone centrality Gini indices.

[25]:
gini = ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'gini', metric = 'clone_centrality', update_obs_meta=False)
gini
Calculating Gini indices
Computing gini indices for clone size using metadata and node closeness centrality using network.
Calculating node closeness centrality
 finished: Updated Dandelion metadata
 (0:00:00)
 finished (0:00:01)
[25]:
clone_size_gini clone_centrality_gini
sc5p_v2_hs_PBMC_1k 0.029412 0.000000
vdj_nextgem_hs_pbmc3 0.048583 0.045455
vdj_v1_hs_pbmc3 0.026316 0.000000
sc5p_v2_hs_PBMC_10k 0.004739 0.000000
[26]:
# not a great example because there's only 1 big clone in 1 sample.
p = sns.scatterplot(x = 'clone_size_gini', y = 'clone_centrality_gini', data = gini, hue = gini.index, palette = dict(zip(adata.obs['sampleid'].cat.categories, adata.uns['sampleid_colors'])))
p.set(ylim=(-0.1,1), xlim = (-0.1,1))
p
[26]:
<AxesSubplot:xlabel='clone_size_gini', ylabel='clone_centrality_gini'>
../_images/notebooks_5_dandelion_diversity_and_mutation-10x_data_42_1.png

Chao1 is an abundance-based estimator of the total number of clones (richness).

[27]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'chao1', update_obs_meta = False)
Calculating Chao1 estimates
 finished (0:00:01)
[27]:
clone_size_chao1
sc5p_v2_hs_PBMC_1k 561.0
vdj_nextgem_hs_pbmc3 9106.0
vdj_v1_hs_pbmc3 703.0
sc5p_v2_hs_PBMC_10k 44205.5
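
For reference, the Chao1 estimator can be sketched in a few lines. The version below takes a list of clone sizes and, by default, uses the bias-corrected form; I am assuming (not asserting) that this matches what tl.clone_diversity does under the hood, and the function name and inputs are illustrative.

```python
from collections import Counter


def chao1(clone_sizes, bias_corrected=True):
    """Chao1 richness estimate from clone sizes (each size must be >= 1)."""
    observed = len(clone_sizes)   # number of clones actually seen
    freq = Counter(clone_sizes)
    f1, f2 = freq[1], freq[2]     # clones seen exactly once / exactly twice
    if bias_corrected or f2 == 0:
        # The bias-corrected form is also the usual fallback when there are no doubletons.
        return observed + f1 * (f1 - 1) / (2 * (f2 + 1))
    return observed + f1 ** 2 / (2 * f2)


print(chao1([1] * 33))                                   # 33 singleton clones -> 561.0
print(chao1([1, 1, 1, 2, 2, 5]))                         # 7.0
print(chao1([1, 1, 1, 2, 2, 5], bias_corrected=False))   # 8.25
```

The estimate grows quickly when most clones are singletons, which is why shallowly sampled repertoires can yield Chao1 values far above the number of clones observed.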

For Shannon Entropy, we can calculate a normalized (inspired by scirpy’s function) and non-normalized value.

[28]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'shannon', update_obs_meta = False)
Calculating Shannon entropy
 finished (0:00:01)
[28]:
clone_size_normalized_shannon
sc5p_v2_hs_PBMC_1k 1.000000
vdj_nextgem_hs_pbmc3 0.989883
vdj_v1_hs_pbmc3 1.000000
sc5p_v2_hs_PBMC_10k 0.999849
[29]:
ddl.tl.clone_diversity(vdj, groupby = 'sample_id', method = 'shannon', update_obs_meta = False, normalize = False)
Calculating Shannon entropy
 finished (0:00:01)
[29]:
clone_size_shannon
sc5p_v2_hs_PBMC_1k 5.044394
vdj_nextgem_hs_pbmc3 8.285998
vdj_v1_hs_pbmc3 5.209453
sc5p_v2_hs_PBMC_10k 8.712926
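
The relationship between the two Shannon values can be sketched as follows: the non-normalized entropy is computed from clone size proportions, and the normalized value divides it by the maximum possible entropy for that number of clones. The base-2 logarithm and the handling of a single clone (returned as 0) are assumptions made for this example.

```python
from math import log2


def shannon_entropy(clone_sizes, normalize=True):
    """Shannon entropy (base 2) of clone size proportions, optionally scaled to [0, 1]."""
    total = sum(clone_sizes)
    if total == 0:
        return 0.0
    probs = [s / total for s in clone_sizes if s > 0]
    entropy = -sum(p * log2(p) for p in probs)
    if not normalize:
        return entropy
    # log2(number of clones) is the entropy when all clones are equally sized.
    return entropy / log2(len(probs)) if len(probs) > 1 else 0.0


even_sizes = [1] * 33          # hypothetical sample: 33 clones of one cell each
skewed_sizes = [30, 1, 1, 1]   # hypothetical sample dominated by one clone

print(round(shannon_entropy(even_sizes, normalize=False), 6))  # 5.044394
print(round(shannon_entropy(even_sizes), 6))                   # 1.0
print(shannon_entropy(skewed_sizes) < shannon_entropy(even_sizes))  # True
```

A normalized value close to 1 therefore means the clones are nearly equally sized, whereas the non-normalized value also grows with the number of clones.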

That sums it up for now! Let me know if you have any ideas at [kt16@sanger.ac.uk] and I can try to implement them, or we can work something out to collaborate on!
