Loading a VCF file into GEMINI

Annotate with snpEff or VEP

Note

Annotate your VCF with SnpEff/VEP, prior to loading it into GEMINI, otherwise the gene/transcript features would be set to None.

GEMINI supports gene/transcript level annotations (we do not use pre-computed values here) from snpEff and VEP and hence we suggest that you first annotate your VCF with either of these tools, prior to loading it into GEMINI. The related database columns would be populated, which would otherwise be set to None if an unannotated VCF file is loaded into GEMINI.

Note

Choose the annotator as per your requirement! Some gene/transcript annotations are available with only one tool (e.g. Polyphen/Sift with VEP and amino_acid length/biotype with SnpEff). As such these values would be set to None, if an alternate annotator is used during the load step.

Instructions for installing and running these tools can be found in the following section:

Annotation with snpEff or VEP

The basics

Before we can use GEMINI to explore genetic variation, we must first load our VCF file into the GEMINI database framework. We expect you to have first annotated the functional consequence of each variant in your VCF using either VEP or snpEff (Note that v3.0+ of snpEff is required to track the amino acid length of each impacted transcript). Logically,the loading step is done with the gemini load command. Below are two examples based on a VCF file that we creatively name my.vcf. The first example assumes that the VCF has been pre-annotated with VEP and the second assumes snpEff.

# VEP-annotated VCF
$ gemini load -v my.vcf -t VEP my.db

# snpEff-annotated VCF
$ gemini load -v my.vcf -t snpEff my.db

As each variant is loaded into the GEMINI database framework, it is being compared against several annotation files that come installed with the software. We have developed an annotation framework that leverages tabix, bedtools, and pybedtools to make things easy and fairly performant. The idea is that, by augmenting VCF files with many informative annotations, and converting the information into a sqlite database framework, GEMINI provides a flexible database-driven API for data exploration, visualization, population genomics and medical genomics. We feel that this ability to integrate variation with the growing wealth of genome annotations is the most compelling aspect of GEMINI. Combining this with the ability to explore data with SQL using a database design that can scale to 1000s of individuals (genotypes too!) makes for a nice, standardized data exploration system.

Many variant callers set filter flags in the VCF file to flag possible problem variants. By default GEMINI will leave these variants in the database during loading but they can be filtered out during the loading step by passing the --passonly flag to load.

Using multiple CPUS for loading

Now, the loading step is very computationally intensive and thus can be very slow with just a single core. However, if you have more CPUs in your arsenal, you specify more cores. This provides a roughly linear increase in speed as a function of the number of cores. On our local machine, we are able to load a VCF file derived from the exomes of 60 samples in about 10 minutes. With a single core, it takes a few hours.

Note

Using multiple cores requires that you have both the bgzip tool from tabix and the grabix tool installed in your PATH.

$ gemini load -v my.vcf -t snpEff --cores 20 my.db

Using LSF, SGE, SLURM and Torque schedulers

One can load VCF files into GEMINI in parallel using many cores on LSF, SGE, SLURM or Torque clusters. One must simply specify the type of job scheduler your cluster uses and the queue name to which your jobs should be submitted.

For example, let’s assume you use LSF and a queue named preempt_everyone. Here is all you need to do:

$ gemini load -v my.vcf \
         -t snpEff \
         --cores 50 \
         --queue preempt_everyone \
         --scheduler lsf \
         my.db

Describing samples with a PED file

GEMINI also accepts PED files in order to establish the familial relationships and phenotypic information of the samples in the VCF file.

$ gemini load -v my.vcf -p my.ped -t snpEff my.db

The PED file format is documented here: PED. An example PED file looks like this:

1 M10475 None None 1 1
1 M10478 M10475 M10500 2 2
1 M10500 None None 2 2
1 M128215 M10475 M10500 1 1

The columns are family_id, name, paternal_id, maternal_id, sex and phenotype.

You can also provide a PED file with a heading starting with #, and include extra fields, like this:

#family_id name paternal_id maternal_id sex phenotype hair_color
1 M10475 None None 1 1 brown
1 M10478 M10475 M10500 2 2 brown
1 M10500 None None 2 2 black
1 M128215 M10475 M10500 1 1 blue

This will add the extra columns to the samples table and allow for you to use those extra columns during queries.

Load GERP base pair conservation scores

By default, GERP scores at base pair resolution are not computed owing to the roughly 2X increasing in loading time. However, one can optionally ask GEMINI to compute these scores by using the --load-gerp-bp option.

$ gemini load -v my.vcf --load-gerp-bp -t snpEff my.db

Updating the samples table in a database

If, after loading a database, you find more information about your samples or want to add a column to the samples table to query on, you can reload the samples table with a new PED file with gemini amend --sample. This is also useful if you forgot to load a PED file when initially loading your database. This file must have the standard first six columns of a PED file, but after that other columns can be added. The top of the PED file also must have a header starting with # which names all of the columns if there are more than the standard six PED file columns:

$ gemini amend --sample your_new_ped_file your.db

Loading VCFs without genotypes.

To do.

comments powered by Disqus

Edit and improve this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to Loading a VCF file into GEMINI on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.