Single cell CRISPR screen gRNA library mutation profiling

Configuration

Prepare ConfigFile for girafr gRNA_mutation.

See the example ConfigFile.

gRNA_bam_file

gRNA library alignment bam file, mapped to a custom reference that contains each gRNA as an artificial chromosome (starting ‘_chrom’).

filtered_barcode

Filtered barcode list from the expression library mapping results (unzipped). Cellranger saves its corrected cell barcodes (CB tag) as the filtered barcodes in its output. Because the expression library and the gRNA library are aligned separately, cellranger may correct the same cell barcode differently in the two alignments, so fewer cells may pass the bam filter.

min_reads

Integer. Default: 1. UMIs supported by fewer than min_reads reads are filtered out.

auto

Boolean. Default: True. When True, girafr fits a model to determine the UMI threshold for each gRNA. When False, the fixed min_umi is used.

min_umi

Integer. Default: 3. A cell with UMIs >= min_umi for a gRNA is assigned that gRNA. Not used when auto is True.

pool

Boolean. Default: True. When True, girafr pools intact UMIs and UMIs with mutations together to determine the UMI threshold for each gRNA. When False, girafr determines the UMI threshold separately for intact and mutated gRNAs.

ref_fasta

Reference gRNA cassette sequence.

genome_gtf

gRNA cassette annotation in gtf format. ref_fasta and genome_gtf should also be used to build the reference genome to which the gRNA library reads are mapped. See build for more information.
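As a rough illustration of how the threshold options interact, the sketch below collects the documented options and defaults in a Python dataclass. It does not show the ConfigFile syntax (see the example ConfigFile for that). The class MutationConfig and its method threshold_mode are hypothetical names invented for this sketch and are not part of girafr. Treating pool as irrelevant when auto is False is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class MutationConfig:
    """Hypothetical container mirroring the options described above."""
    gRNA_bam_file: str
    filtered_barcode: str
    ref_fasta: str
    genome_gtf: str
    min_reads: int = 1   # documented default
    auto: bool = True    # documented default
    min_umi: int = 3     # documented default, ignored when auto is True
    pool: bool = True    # documented default

    def threshold_mode(self):
        """Describe which UMI threshold strategy the options select."""
        if not self.auto:
            # Assumption: pool has no effect with a fixed threshold.
            return f"fixed threshold: min_umi = {self.min_umi}"
        if self.pool:
            return "fitted per gRNA, intact and mutated UMIs pooled"
        return "fitted per gRNA, intact and mutated UMIs separate"
```

With the defaults, threshold_mode() reports the pooled, fitted strategy; setting auto to False switches to the fixed min_umi.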

Requirements

  • samtools

  • 2bit genome downloaded from UCSC

gRNA mutation profile

girafr gRNA_mutation -f absolute_path/ConfigFile

Simplified process

  • Step 1: gRNA bam file filtration

utils.gRNA_bam_filter(filename, samtools_path)

The script removes secondary alignments and alignments that do not map to the designed gRNA cassette. This step is time-consuming.

Parameters
  • filename (string) – input file of gRNA library alignments produced by cellranger or dropseq_tools

  • samtools_path (string) – path to samtools

Returns

Generates a filtered bam file in the output folder, named gRNA.sorted.mapped.removedSecondaryAlignment.onlyMappedToGrnaChrom.bam
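The filtering rule can be sketched on SAM-formatted text lines, which keeps the example free of external dependencies. This is not the girafr implementation, which works on the bam file through samtools. The function names keep_alignment and filter_sam and the set grna_chroms (the names of the gRNA artificial chromosomes) are assumptions made for illustration. The FLAG bits 0x4 (unmapped) and 0x100 (secondary alignment) come from the SAM format.

```python
UNMAPPED = 0x4      # SAM FLAG bit: read is unmapped
SECONDARY = 0x100   # SAM FLAG bit: secondary alignment


def keep_alignment(sam_line, grna_chroms):
    """Return True for a mapped, non-secondary record on a gRNA chromosome."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    rname = fields[2]       # column 3: reference sequence name
    if flag & UNMAPPED or flag & SECONDARY:
        return False
    return rname in grna_chroms


def filter_sam(lines, grna_chroms):
    """Yield header lines unchanged and only the alignments that pass."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, grna_chroms):
            yield line
```

In practice the same selection would be applied to the bam file itself; the text-based version only shows which records survive.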

  • Step 2: Construct a consensus sequence for each UMI-cell barcode combination

consensus_sequence.generate_consensus_sequence_gRNA(bam_in, barcodes)

Next, the script constructs a consensus sequence for each UMI. The sequence supported by the most reads is taken as the consensus sequence for that UMI. For more details, see the methods in the citation.

Parameters
  • bam_in (string) – path to the bam file of gRNA library alignments after secondary alignments and alignments outside the gRNA reference have been removed: gRNA.sorted.mapped.removedSecondaryAlignment.onlyMappedToGrnaChrom.bam

  • barcodes (string) – filtered barcode list, unzipped file.

Returns

file:

  • consensus.sequence.gRNA.txt: consensus sequences supported by more than min_reads (default = 1) reads.

  • consensus.seqeunce.gRNA.all_umi.txt: consensus sequences of all detected UMIs, including UMIs with only min_reads (default = 1) read.

  • consensus.bam: consensus sequences in bam format.

  • Non-consensus.bam: alignments that differ from the consensus sequence.
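The "most reads supported" rule can be sketched as follows. This is a minimal illustration, not the girafr code: the function name consensus_per_umi and the input layout (tuples of cell barcode, UMI and read sequence) are assumptions. The sketch filters on the total number of reads per UMI, which is one reading of min_reads. When two sequences tie, it keeps whichever was seen first; how girafr resolves ties is not stated here.

```python
from collections import Counter, defaultdict


def consensus_per_umi(reads, barcodes, min_reads=1):
    """Pick the most frequent sequence for each (cell barcode, UMI) pair.

    reads: iterable of (cell_barcode, umi, sequence) tuples.
    barcodes: set of cell barcodes that passed filtering.
    Returns {(cell_barcode, umi): (consensus_sequence, supporting_reads)}.
    """
    groups = defaultdict(Counter)
    for cell_barcode, umi, sequence in reads:
        if cell_barcode in barcodes:          # drop reads from unlisted cells
            groups[(cell_barcode, umi)][sequence] += 1

    consensus = {}
    for key, counts in groups.items():
        if sum(counts.values()) < min_reads:  # UMI has too few reads
            continue
        sequence, support = counts.most_common(1)[0]
        consensus[key] = (sequence, support)
    return consensus
```

Grouping by the (cell barcode, UMI) pair rather than by UMI alone keeps identical UMIs from different cells apart.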

  • Step 3: Call mutations from consensus.bam

variant.call_gRNA_variant(consensus_seq_file, ref_fasta, structure_gtf)

Then the consensus sequence of each UMI is compared with its reference, and the position of each mutation is annotated using the structure annotation. Variants are encoded in a way similar to the CIGAR string in sam format.

Parameters
  • consensus_seq_file (string) – consensus.sequence.gRNA.txt generated by the previous step

  • ref_fasta (string) – path to file: oligo_pool_plasmid.fa, specified in ConfigFile

  • structure_gtf (string) – path to file oligo_pool_plasmid_structure.gtf, specified in ConfigFile

Returns

file: consensus.sequence.gRNA.variant.txt

  • Step 4: Assign gRNAs to cells

assign_gRNA.assign_gRNA_to_cell()

Finally, the detected guides are assigned to cells. A cell must carry at least min_umi UMIs of a gRNA for the script to assign that gRNA to the cell. This UMI threshold can be set with min_umi as a fixed threshold for all guides, or it can be calculated automatically by fitting a two-component Gaussian mixture model when auto is set to True in the configuration file.

Parameters
  • in_file (string) – file consensus.sequence.gRNA.variant.txt generated by the previous step

  • min_umi (integer) – minimum number of UMIs, default is 3

  • auto (boolean) – whether to use automatic detection or the fixed min_umi, default is False

  • pool (boolean) – whether to calculate the minimum UMI threshold together with the variant gRNAs of the same guide, default is False

Returns

file: cells.gRNA.txt, cells.gRNA.single.txt, consensus.sequence.matrix, gRNA.umi.threshold.txt
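The two thresholding modes can be sketched in plain Python. The code below fits a two-component Gaussian mixture to the UMI counts of one gRNA with a basic EM loop, then takes the smallest integer count at which the upper component is the more likely one. This is an illustration under assumptions, not girafr's method: the function names, the initialisation, fitting on raw rather than transformed counts, and the rule for turning the fitted model into a threshold are all choices made for this sketch.

```python
import math


def upper_posterior(x, weights, means, sds):
    """Probability that x belongs to the upper (second) component."""
    dens = [w / s * math.exp(-0.5 * ((x - m) / s) ** 2)
            for w, m, s in zip(weights, means, sds)]
    total = dens[0] + dens[1]
    if total == 0.0:  # both densities underflowed: use the nearer mean
        return 1.0 if abs(x - means[1]) < abs(x - means[0]) else 0.0
    return dens[1] / total


def fit_two_gaussians(values, n_iter=50):
    """Fit a 1-D two-component Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [float(lo), float(hi)]            # start at the extremes
    sds = [max((hi - lo) / 4.0, 1e-3)] * 2
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of the upper component for each value.
        resp = [upper_posterior(x, weights, means, sds) for x in values]
        # M-step: update weight, mean and spread of each component.
        for k in (0, 1):
            r = [p if k == 1 else 1.0 - p for p in resp]
            n_k = sum(r)
            if n_k < 1e-9:
                continue
            means[k] = sum(ri * x for ri, x in zip(r, values)) / n_k
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / n_k
            sds[k] = max(math.sqrt(var), 1e-3)
            weights[k] = n_k / len(values)
    return weights, means, sds


def auto_threshold(umi_counts):
    """Smallest integer count that the upper component explains better."""
    weights, means, sds = fit_two_gaussians(umi_counts)
    for t in range(math.floor(means[0]), math.ceil(means[1]) + 1):
        if upper_posterior(t, weights, means, sds) > 0.5:
            return t
    return math.ceil(means[1])


def assign_guides(umi_per_cell_guide, threshold):
    """Keep (cell, guide) pairs whose UMI count reaches the threshold."""
    return sorted(pair for pair, n in umi_per_cell_guide.items() if n >= threshold)
```

With a fixed threshold, assign_guides is called directly with min_umi. In the automatic mode, the threshold would first be derived per gRNA with auto_threshold.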

  • Step 5 (optional):

    Functions assign_gRNA.add_variant_type and profile_MT_pattern.py add mutation details for downstream analysis.

Output files

See the section on output file formats.

Additional information

Build custom reference (optional):

Oligo_pool.csv: two columns, oligo_name and sequence, with no header. prepare.py: generates oligo_pool_plasmid.fa, oligo_pool_plasmid.gtf and oligo_pool_plasmid_structure.gtf.

This part explains how to build a custom CellRanger reference with the designed cassette as an artificial chromosome. The utils.write_annotation function generates oligo_pool_plasmid.fa and oligo_pool_plasmid.gtf, which are used to generate the cellranger reference (see the build note for an example), and oligo_pool_plasmid_structure.gtf, which is used to profile where the mutations are on the cassette. This script is modified from https://github.com/epigen/crop-seq
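To make the input layout concrete, the sketch below reads a headerless two-column oligo table and writes one FASTA record per oligo. It is not prepare.py or utils.write_annotation. The function name oligos_to_fasta and the upstream and downstream arguments (placeholders for the cassette sequence around each guide) are hypothetical, and the record names used by the real reference may differ.

```python
import csv
import io


def oligos_to_fasta(csv_text, upstream="", downstream=""):
    """Turn 'oligo_name,sequence' rows (no header) into FASTA text.

    upstream/downstream are placeholder flanks, empty by default.
    """
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:          # skip blank or malformed rows
            continue
        name = row[0].strip()
        sequence = row[1].strip().upper()
        records.append(f">{name}\n{upstream}{sequence}{downstream}\n")
    return "".join(records)
```

Each record then plays the role of one artificial chromosome in the custom reference.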

CIGAR-like string:

  • Digits give the number of exact matches, and the nucleotides that follow are the mutated bases. 0 represents no nucleotide.

  • Digits followed by insertions (I), deletions (D) and soft clippings (S) give the number of nucleotides in those events. Hard clippings (H) are not included. The major difference from a CIGAR string is that this string replaces matches (M) with mismatches and encodes the mutated nucleotides [ATGC] in the string.
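For substitutions only, the encoding described above can be sketched as follows. Several points are assumptions of this sketch rather than confirmed girafr behaviour: the function name encode_substitutions, the choice to write the observed base (not the reference base) after each run, and writing a run of 0 between adjacent mismatches, which is borrowed from the SAM MD tag. Insertions, deletions and soft clippings are not handled.

```python
def encode_substitutions(reference, consensus):
    """Encode mismatches between two equal-length sequences.

    Output alternates match-run lengths and observed mismatched bases.
    """
    if len(reference) != len(consensus):
        raise ValueError("sketch handles equal-length sequences only")
    parts = []
    run = 0  # number of consecutive exact matches seen so far
    for ref_base, obs_base in zip(reference, consensus):
        if ref_base == obs_base:
            run += 1
        else:
            parts.append(str(run))   # matches before this mismatch (may be 0)
            parts.append(obs_base)   # the mutated base
            run = 0
    parts.append(str(run))           # trailing run of matches
    return "".join(parts)
```

Under these assumptions, an 8-base consensus with one substitution at the fourth position is encoded as three matches, the mutated base, then four matches.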

Mutation structure annotation:

  • Each annotation begins with an oligo structure, such as gRNA, consistent with the user-supplied oligo_pool_plasmid_structure.gtf. The mutation annotations follow the oligo structure, with a semicolon as separator. Commas separate individual mutation events. Digits give the distance from the beginning of the structure, and the nucleotides that follow are the mutated bases. 0 represents no nucleotide. Digits in brackets that accompany insertions (I), deletions (D) and soft clippings (S) give the number of nucleotides in those events.