dandelion.preprocessing.filter_bcr

dandelion.preprocessing.filter_bcr(data, adata, filter_bcr=True, filter_rna=True, filter_poorqualitybcr=False, rescue_igh=True, umi_foldchange_cutoff=5, filter_lightchains=True, filter_missing=True, productive_only=True, parallel=True, ncpu=None, save=None)

Filters doublets and poor-quality cells, along with their corresponding contigs, based on the provided V(D)J DataFrame and AnnData objects. Depends on an AnnData.obs slot populated with a ‘filter_rna’ column.

If the aligned sequence is an exact match between contigs, the contigs are merged into the one with the highest UMI count, and the summed UMI count of the duplicated contigs is added to the duplicate_count column. After this check, if there are still multiple contigs, cells with multiple IGH contigs are filtered unless rescue_igh is True, in which case the UMI counts of the IGH contigs are compared. The contig with the highest UMI count is retained if it exceeds the lowest by more than umi_foldchange_cutoff-fold (default is empirically set at 5). If multiple contigs survive the ‘rescue’, all contigs are filtered.

By default, cells with multiple light chains are also filtered, but this may sometimes be a true biological occurrence; setting filter_lightchains to False rescues the multiplet light chains. Lastly, contigs with no corresponding cell barcode in the AnnData object are filtered if filter_missing is True. Setting filter_missing to False may be useful if more contigs should be kept or when integrating with bulk repertoire-seq data.
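
The duplicate-merging step described above can be illustrated with a minimal, standalone sketch. This is not the library's implementation: the helper name merge_duplicate_contigs, the dictionary keys, and the choice to store the total UMI count of the whole group (including the retained contig) in duplicate_count are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_duplicate_contigs(contigs):
    """Collapse contigs of one cell that share an identical aligned sequence.

    `contigs` is a list of dicts with the hypothetical keys
    'sequence_id', 'sequence_alignment' and 'umi'.
    """
    # Group contigs by their aligned sequence (exact string match).
    groups = defaultdict(list)
    for contig in contigs:
        groups[contig["sequence_alignment"]].append(contig)

    merged = []
    for group in groups.values():
        # Keep the contig with the highest UMI count as the representative.
        best = max(group, key=lambda c: c["umi"])
        kept = dict(best)
        # Assumption: duplicate_count holds the summed UMI count of the group.
        kept["duplicate_count"] = sum(c["umi"] for c in group)
        merged.append(kept)
    return merged


contigs = [
    {"sequence_id": "c1", "sequence_alignment": "ACGT", "umi": 10},
    {"sequence_id": "c2", "sequence_alignment": "ACGT", "umi": 4},
    {"sequence_id": "c3", "sequence_alignment": "TTGA", "umi": 7},
]
merged = merge_duplicate_contigs(contigs)
```

Here c1 and c2 share an aligned sequence, so only c1 (the higher UMI count) is kept and carries the combined count, while c3 passes through unchanged.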

Parameters
  • data (Dandelion, pd.DataFrame, str) – V(D)J airr/changeo data to filter. Can be a Dandelion object, a pandas DataFrame object, or a file path as a string.

  • adata (AnnData) – AnnData object to filter.

  • filter_bcr (bool) – If True, V(D)J DataFrame object returned will be filtered. Default is True.

  • filter_rna (bool) – If True, AnnData object returned will be filtered. Default is True.

  • filter_poorqualitybcr (bool) – If True, barcodes marked with poor quality BCR contigs will be filtered. Default is False; only relevant contigs are removed and RNA barcodes are kept.

  • rescue_igh (bool) – If True, rescues the IGH contig with the highest UMI count, provided it passes the umi_foldchange_cutoff threshold. In addition, the summed UMI count of all the heavy chain contigs must be greater than 3, or all contigs will be filtered. Default is True.

  • umi_foldchange_cutoff (int) – Minimum fold change in UMI count required to rescue heavy chain contigs/barcode; otherwise they will be marked as doublets. Default is empirically set at 5-fold.

  • filter_lightchains (bool) – If True, cells with multiple light chains will be marked to filter. Default is True.

  • productive_only (bool) – Whether or not to retain only productive contigs. Default is True.

  • filter_missing (bool) – If True, cells in the V(D)J data not found in the AnnData object will be marked to filter. Default is True. Setting this to False may be useful when integrating with bulk data.

  • parallel (bool) – whether or not to use parallelization. Default is True.

  • ncpu (int) – Number of cores to use if parallel is True. Default is None, which uses all available cores minus one.

  • save (str, optional) – Only used if a pandas DataFrame or Dandelion object is provided. Specifying it will save the formatted V(D)J table.
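
The interaction between rescue_igh and umi_foldchange_cutoff can be sketched as a small decision function. This is one possible reading of the description, not the library's code: the function name rescue_heavy_chain, the min_total_umi argument, and the rule that a contig "survives" when its UMI count is more than umi_foldchange_cutoff times the lowest count in the cell are assumptions.

```python
def rescue_heavy_chain(umi_counts, umi_foldchange_cutoff=5, min_total_umi=3):
    """Return the index of the IGH contig to keep, or None to filter them all.

    `umi_counts` holds the UMI count of each IGH contig found in one cell.
    """
    if not umi_counts:
        return None  # nothing to rescue
    if len(umi_counts) == 1:
        return 0  # a single heavy chain needs no rescue
    # The summed UMI count must be greater than min_total_umi.
    if sum(umi_counts) <= min_total_umi:
        return None
    lowest = min(umi_counts)
    # A contig survives if it exceeds the lowest count by more than the cutoff.
    survivors = [
        i for i, umi in enumerate(umi_counts)
        if umi > umi_foldchange_cutoff * lowest
    ]
    # Exactly one survivor is rescued; zero or several means a likely doublet.
    return survivors[0] if len(survivors) == 1 else None


clear_winner = rescue_heavy_chain([30, 2])       # 15-fold over the lowest
too_close = rescue_heavy_chain([30, 10])         # only 3-fold
two_survivors = rescue_heavy_chain([100, 50, 2])
low_total = rescue_heavy_chain([2, 1])           # summed UMI count is 3
```

Under these assumptions, only the first cell keeps a heavy chain contig (index 0); the other three are marked for filtering.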

Returns

Return type

V(D)J DataFrame object in airr/changeo format and AnnData object.