Single cell CRISPR screen gRNA library mutation profiling
Configuration
Prepare ConfigFile for girafr gRNA_mutation.
See the example ConfigFile.
gRNA_bam_file
BAM file of gRNA library alignments, mapped to a custom reference with the gRNA as an artificial chromosome (starting ‘_chrom’).
filtered_barcode
Filtered barcode list from the expression library mapping results (unzipped). Cell Ranger saves its corrected cell barcodes (CB tag) as the filtered barcodes in its output. Two separate alignments can lead Cell Ranger to correct the same cell barcode in different ways, which causes fewer cells to pass the BAM filter.
min_reads
Integer. Default is 1. UMIs supported by fewer than min_reads reads are filtered out.
auto
Boolean. Default is True. When True, girafr fits a model to determine the UMI threshold for each gRNA. When False, the fixed min_umi is used.
min_umi
Integer. Default is 3. A gRNA is assigned to a cell when the cell has UMIs >= min_umi for that gRNA. Not used when auto is True.
pool
Boolean. Default is True. When True, girafr pools intact UMIs and UMIs with mutations together to determine the UMI threshold for each gRNA. When False, girafr determines the UMI threshold separately for intact and mutated gRNAs.
ref_fasta
Reference gRNA cassette sequence.
genome_gtf
gRNA cassette annotation in gtf format.
ref_fasta and genome_gtf should also be used to build the reference genome to which the gRNA library sequences are mapped. See build for more information.
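The defaults listed above can be summarised in a small sketch. This is not girafr's own configuration parser: the class name `GrnaMutationConfig` and the methods `validate` and `effective_min_umi` are hypothetical, and the checks only restate what the parameter descriptions say. For the actual file syntax, refer to the example ConfigFile.

```python
from dataclasses import dataclass


@dataclass
class GrnaMutationConfig:
    """Hypothetical container mirroring the documented ConfigFile keys."""
    gRNA_bam_file: str
    filtered_barcode: str
    ref_fasta: str
    genome_gtf: str
    min_reads: int = 1   # documented default
    auto: bool = True    # documented default
    min_umi: int = 3     # documented default, ignored when auto is True
    pool: bool = True    # documented default

    def validate(self):
        """Return a list of human-readable problems (empty if none found)."""
        problems = []
        if self.min_reads < 1:
            problems.append("min_reads should be a positive integer")
        if self.min_umi < 1:
            problems.append("min_umi should be a positive integer")
        if self.filtered_barcode.endswith(".gz"):
            problems.append("filtered_barcode should be an unzipped file")
        return problems

    def effective_min_umi(self):
        """The fixed threshold only applies when auto is False."""
        return None if self.auto else self.min_umi
```

Keeping `auto` and `min_umi` in one place makes the documented rule explicit: the fixed threshold is consulted only when automatic detection is switched off.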
Requirements
samtools
2bit genome downloaded from UCSC
gRNA mutation profile
girafr gRNA_mutation -f absolute_path/ConfigFile
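The command above is written with an absolute path to the ConfigFile. When launching girafr from a Python pipeline, the argument list can be built as in the sketch below. The helper name `girafr_mutation_command` is hypothetical, and only the subcommand and the `-f` option shown above are used.

```python
from pathlib import Path


def girafr_mutation_command(config_file):
    """Build the argument list for `girafr gRNA_mutation -f <absolute path>`."""
    # resolve() turns a relative path into an absolute one.
    return ["girafr", "gRNA_mutation", "-f", str(Path(config_file).resolve())]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where girafr is installed.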
Simplified process
Step1: gRNA BAM file filtering
- utils.gRNA_bam_filter(filename, samtools_path)
The script removes secondary alignments and alignments that do not map to the designed gRNA cassette. This step is time-consuming.
- Parameters
filename (string) – input BAM file of gRNA library alignments produced by cellranger or dropseq_tools
samtools_path (string) – path to samtools
- Returns
Writes a filtered BAM file to the output folder, named
gRNA.sorted.mapped.removedSecondaryAlignment.onlyMappedToGrnaChrom.bam
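The filtering rule can be illustrated on individual alignment records. This is a sketch of the criteria only, not the girafr implementation, which works on the BAM file through samtools. The function name `keep_alignment` and the chromosome names in the usage example are hypothetical. The flag bits 0x4 (unmapped) and 0x100 (secondary alignment) come from the SAM specification.

```python
UNMAPPED = 0x4      # SAM flag bit: read is unmapped
SECONDARY = 0x100   # SAM flag bit: secondary alignment


def keep_alignment(flag, reference_name, grna_chroms):
    """Return True if a record passes the three filters named in the output file."""
    if flag & UNMAPPED:
        return False   # "mapped"
    if flag & SECONDARY:
        return False   # "removedSecondaryAlignment"
    # "onlyMappedToGrnaChrom": keep only hits on the artificial gRNA chromosomes.
    return reference_name in grna_chroms
```

For example, with `grna_chroms = {"cassette_1"}`, a record with flag 16 on `cassette_1` is kept, while the same record with flag 272 (reverse strand plus secondary) is dropped.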
Step2: Construct a consensus sequence for each UMI-cell barcode combination
- consensus_sequence.generate_consensus_sequence_gRNA(bam_in, barcodes):
Next, the script constructs a consensus sequence for each UMI. The sequence supported by the most reads is taken as the consensus sequence for that UMI. For more details, see the methods in the citation.
- Parameters
bam_in (string) – path to the BAM file of gRNA library alignments, after removal of secondary alignments and of alignments that do not map to the gRNA reference:
gRNA.sorted.mapped.removedSecondaryAlignment.onlyMappedToGrnaChrom.bam
barcodes (string) – filtered barcode list, unzipped file.
- Returns
file:
consensus.sequence.gRNA.txt: consensus sequences supported by more than min_reads (default = 1) reads.
consensus.seqeunce.gRNA.all_umi.txt: consensus sequences of all detected UMIs, including UMIs with only min_reads (default = 1) read.
consensus.bam: consensus sequences in BAM format.
Non-consensus.bam: alignments that differ from the consensus sequence.
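The "most reads supported" rule from this step can be sketched as follows. The input is simplified to (cell barcode, UMI, sequence) tuples instead of BAM records, and the function name `build_consensus` is hypothetical. How girafr counts reads against min_reads is an assumption here: the sketch compares the total number of reads of a UMI with min_reads.

```python
from collections import Counter, defaultdict


def build_consensus(reads, min_reads=1):
    """Pick the most frequent sequence per (cell barcode, UMI) pair.

    reads: iterable of (cell_barcode, umi, sequence) tuples.
    Returns {(cell_barcode, umi): (consensus, supporting_reads, total_reads)}.
    """
    grouped = defaultdict(Counter)
    for cell_barcode, umi, sequence in reads:
        grouped[(cell_barcode, umi)][sequence] += 1

    consensus = {}
    for key, counts in grouped.items():
        total = sum(counts.values())
        if total < min_reads:
            continue  # assumption: UMIs with too few reads are dropped
        sequence, support = counts.most_common(1)[0]
        consensus[key] = (sequence, support, total)
    return consensus
```

Reads whose sequence differs from the chosen consensus correspond to what the step writes to Non-consensus.bam.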
Step3: Call mutations from consensus.bam
- variant.call_gRNA_variant(consensus_seq_file, ref_fasta, structure_gtf):
Then, the consensus sequence of each UMI is compared with its reference, and each mutation is located using the structure annotation. Variants are encoded in a way similar to the CIGAR string in the SAM format.
- Parameters
consensus_seq_file (string) – consensus.sequence.gRNA.txt generated by the previous step
ref_fasta (string) – path to file oligo_pool_plasmid.fa, specified in ConfigFile
structure_gtf (string) – path to file oligo_pool_plasmid_structure.gtf, specified in ConfigFile
- Returns
file:
consensus.sequence.gRNA.variant.txt
Step4: Assign gRNAs to cells
- assign_gRNA.assign_gRNA_to_cell():
Finally, the detected guides are assigned to cells. A cell must have more than min_umi molecules of a gRNA for the script to assign that gRNA to the cell. This UMI threshold is either the fixed min_umi, applied to all guides, or it is calculated automatically by fitting a two-component Gaussian mixture model when auto is set to true in the configuration file.
- Parameters
in_file (string) – file consensus.sequence.gRNA.variant.txt generated by the previous step
min_umi (integer) – minimum number of UMIs, default is 3
auto (boolean) – whether to use automatic detection or the fixed min_umi, default is false
pool (boolean) – whether to calculate the minimum UMI threshold together with the variant gRNAs of the same guide, default is false
- Returns
file:
cells.gRNA.txt
cells.gRNA.single.txt
consensus.sequence.matrix
gRNA.umi.threshold.txt
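The two thresholding modes can be sketched in plain Python. This is an illustration under stated assumptions, not girafr's code. The mixture is fitted on raw UMI counts with a basic EM loop. The rule used to turn the fitted mixture into a threshold (the smallest count at which the higher component is more likely) is an assumption. The assignment uses UMIs >= threshold, as in the min_umi description. All function names are hypothetical.

```python
import math
from collections import defaultdict


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(values, n_iter=200):
    """Fit a 1-D two-component Gaussian mixture with plain EM."""
    n = len(values)
    grand_mean = sum(values) / n
    spread = sum((v - grand_mean) ** 2 for v in values) / n or 1.0
    weights = [0.5, 0.5]
    means = [float(min(values)), float(max(values))]
    variances = [spread, spread]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in (0, 1)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and variance of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-3)  # floor keeps the density finite
    return weights, means, variances


def umi_threshold(umi_counts, fallback=3):
    """Smallest count at which the higher component is the more likely one."""
    if len(set(umi_counts)) < 2:
        return fallback  # nothing to separate, use the fixed threshold
    weights, means, variances = fit_two_gaussians(umi_counts)
    high = 0 if means[0] > means[1] else 1
    low = 1 - high
    for count in range(1, max(umi_counts) + 1):
        p_high = weights[high] * normal_pdf(count, means[high], variances[high])
        p_low = weights[low] * normal_pdf(count, means[low], variances[low])
        if p_high > p_low:
            return count
    return fallback


def assign_grnas(umi_counts_by_cell, thresholds, default_min_umi=3):
    """umi_counts_by_cell: {cell: {gRNA: n_umi}}; thresholds: {gRNA: min_umi}."""
    assigned = defaultdict(list)
    for cell, per_grna in umi_counts_by_cell.items():
        for grna, n_umi in per_grna.items():
            if n_umi >= thresholds.get(grna, default_min_umi):
                assigned[cell].append(grna)
    return dict(assigned)
```

With auto switched off, `assign_grnas` is called with an empty `thresholds` dictionary so that every guide uses the fixed `default_min_umi`. With auto switched on, `umi_threshold` is evaluated once per guide on the UMI counts observed across cells.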
Step5 (optional):
The function assign_gRNA.add_variant_type and the script profile_MT_pattern.py add mutation details for downstream analysis.
Output files
See the section on output file formats.
Additional information
**Build custom reference (optional):**
Oligo_pool.csv: two columns, oligo_name and sequence, no header.
prepare.py: generates oligo_pool_plasmid.fa, oligo_pool_plasmid.gtf and oligo_pool_plasmid_structure.gtf
This part explains how to build a custom Cell Ranger reference with the designed cassette as an artificial chromosome. The utils.write_annotation function generates oligo_pool_plasmid.fa and oligo_pool_plasmid.gtf, which are used to generate the Cell Ranger reference (see the build note for an example), and oligo_pool_plasmid_structure.gtf, which is used to profile where the mutations are on the cassette. This script is modified from https://github.com/epigen/crop-seq
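The input format of Oligo_pool.csv (two columns, no header) can be checked before running prepare.py. The sketch below only reads and validates the table and writes the oligos as plain FASTA records. It does not reproduce prepare.py, which also builds the plasmid cassette and the GTF files. The function names are hypothetical, and restricting sequences to A, C, G and T is an assumption.

```python
import csv
import io


def read_oligo_pool(handle):
    """Read (oligo_name, sequence) pairs from a headerless two-column CSV."""
    oligos = []
    for row_number, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {row_number}: expected 2 columns, got {len(row)}")
        name, sequence = row[0].strip(), row[1].strip().upper()
        if not sequence or set(sequence) - set("ACGT"):
            raise ValueError(f"line {row_number}: unexpected characters in sequence")
        oligos.append((name, sequence))
    return oligos


def to_fasta(oligos):
    """Format the oligos as FASTA text, one record per oligo."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence in oligos)


# Usage with an in-memory file; the oligo names are made up for illustration.
example = io.StringIO("guide_1,ACGTACGTACGTACGTACGT\nguide_2,ttttccccggggaaaatttt\n")
print(to_fasta(read_oligo_pool(example)))
```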
CIGAR-like string:
Numbers represent runs of exact matches, and the nucleotides that follow them are the mutated bases. 0 represents no nucleotide.
Numbers followed by insertions (I), deletions (D) and soft clippings (S) give the number of nucleotides in those events. Hard clippings (H) are not included. The major difference between this string and a CIGAR string is that the match operation (M) is broken down to show mismatches explicitly, with the mutated nucleotides [ATGC] encoded in the string.
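For substitutions only, the encoding described above can be sketched as follows. Both sequences are assumed to be already aligned and of equal length. Insertions, deletions and soft clippings are left out. Writing a 0 when no matching nucleotide separates two events is an interpretation of "0 represents no nucleotide". The function name `encode_substitutions` is hypothetical, and girafr's exact output may differ.

```python
def encode_substitutions(reference, consensus):
    """Encode mismatches as <matches><mutated base><matches>... for equal-length sequences."""
    if len(reference) != len(consensus):
        raise ValueError("sequences must be aligned and of equal length")
    parts = []
    run = 0  # number of exact matches since the last mismatch
    for ref_base, observed_base in zip(reference, consensus):
        if ref_base == observed_base:
            run += 1
        else:
            parts.append(str(run))        # matches before this mismatch (may be 0)
            parts.append(observed_base)   # the mutated base seen in the consensus
            run = 0
    parts.append(str(run))                # trailing matches
    return "".join(parts)


print(encode_substitutions("ACGTACGT", "ACGTACGT"))  # 8
print(encode_substitutions("ACGTACGT", "ACCTACGT"))  # 2C5
print(encode_substitutions("ACGT", "ATTT"))          # 1T0T1
```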
**Mutation structure annotation:**
Annotations begin with the oligo structure, such as gRNA, named consistently with the user-supplied oligo_pool_plasmid_structure.gtf. Each mutation annotation then follows the oligo structure, with a semicolon as separator. Commas separate individual mutation events. Numbers represent the distance to the beginning of the structure, and the nucleotides that follow are the mutated bases. 0 represents no nucleotide. Numbers in brackets followed by insertions (I), deletions (D) and soft clippings (S) represent the number of nucleotides in those events.
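Finding the structure that contains a mutation, and its distance to the beginning of that structure, can be sketched as below. The intervals are 1-based and inclusive, as in GTF. The structure names other than gRNA, the coordinates, and the function name `locate_in_structure` are hypothetical. Counting the first base of a structure as distance 0 is an assumption.

```python
def locate_in_structure(position, structures):
    """Return (structure_name, distance_from_structure_start) or None.

    position: 1-based coordinate on the cassette.
    structures: list of (name, start, end) with 1-based inclusive coordinates.
    """
    for name, start, end in structures:
        if start <= position <= end:
            return name, position - start
    return None


# Hypothetical cassette layout for illustration only.
structures = [("upstream", 1, 10), ("gRNA", 11, 30), ("scaffold", 31, 50)]
print(locate_in_structure(15, structures))  # ('gRNA', 4)
```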