Note
Annotate your VCF with snpEff or VEP before loading it into GEMINI; otherwise the gene/transcript features will be set to None.
GEMINI reads gene- and transcript-level annotations from snpEff and VEP (we do not use pre-computed values here), so we suggest that you first annotate your VCF with one of these tools before loading it into GEMINI. If an unannotated VCF file is loaded, the related database columns are set to None.
Note
Choose the annotator that fits your requirements. Some gene/transcript annotations are available from only one tool (e.g. PolyPhen/SIFT from VEP, and amino acid length/biotype from snpEff). These values are set to None if the other annotator is used for the load step.
Instructions for installing and running these tools can be found in the following section:
Before we can use GEMINI to explore genetic variation, we must first load our VCF file into the GEMINI database framework. We expect you to have first annotated the functional consequence of each variant in your VCF using either VEP or snpEff (note that v3.0+ of snpEff is required to track the amino acid length of each impacted transcript). Logically, the loading step is done with the gemini load command. Below are two examples based on a VCF file that we creatively name my.vcf. The first example assumes that the VCF has been pre-annotated with VEP and the second assumes snpEff.
# VEP-annotated VCF
$ gemini load -v my.vcf -t VEP my.db
# snpEff-annotated VCF
$ gemini load -v my.vcf -t snpEff my.db
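Because the -t value has to match the tool that annotated the VCF, it can help to check the VCF header before loading. The following is a minimal sketch, not part of GEMINI: detect_annotator is a hypothetical helper, and it assumes an uncompressed VCF in which VEP declares a CSQ INFO field and snpEff declares an EFF (or, in newer releases, ANN) INFO field. Whether your GEMINI version parses ANN-style snpEff output is something to verify separately.

```shell
# Hypothetical helper (not a GEMINI command): guess the annotator from the
# VCF header. Assumes VEP declares INFO field CSQ and snpEff declares EFF
# or ANN; prints the matching value for "gemini load -t".
detect_annotator() {
    # Collect only the "##" meta lines, stopping at the first other line.
    header=$(awk '/^##/ {print; next} {exit}' "$1") || return 1
    case "$header" in
        *'##INFO=<ID=CSQ,'*)
            echo "VEP" ;;
        *'##INFO=<ID=EFF,'*|*'##INFO=<ID=ANN,'*)
            echo "snpEff" ;;
        *)
            echo "no VEP or snpEff annotation declared in $1" >&2
            return 1 ;;
    esac
}
```

The result could then be passed straight to the load step, for example gemini load -v my.vcf -t "$(detect_annotator my.vcf)" my.db.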
As each variant is loaded into the GEMINI database framework, it is being compared against several annotation files that come installed with the software. We have developed an annotation framework that leverages tabix, bedtools, and pybedtools to make things easy and fairly performant. The idea is that, by augmenting VCF files with many informative annotations, and converting the information into a sqlite database framework, GEMINI provides a flexible database-driven API for data exploration, visualization, population genomics and medical genomics. We feel that this ability to integrate variation with the growing wealth of genome annotations is the most compelling aspect of GEMINI. Combining this with the ability to explore data with SQL using a database design that can scale to 1000s of individuals (genotypes too!) makes for a nice, standardized data exploration system.
The loading step is very computationally intensive and thus can be very slow with just a single core. However, if you have more CPUs in your arsenal, you can specify more cores with the --cores option. This provides a roughly linear increase in speed as a function of the number of cores. On our local machine, we are able to load a VCF file derived from the exomes of 60 samples in about 10 minutes. With a single core, it takes a few hours.
Note
Using multiple cores requires that you have both the bgzip tool from tabix and the grabix tool installed in your PATH.
$ gemini load -v my.vcf -t snpEff --cores 20 my.db
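Since a multi-core load depends on bgzip and grabix being in your PATH, a quick check before starting a long load can save time. This is a small sketch under that assumption; require_tools is a hypothetical helper name, not something GEMINI provides.

```shell
# Hypothetical helper: succeed only if every named program is found in PATH;
# otherwise list the missing ones on stderr and return non-zero.
require_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    if [ -n "$missing" ]; then
        echo "missing from PATH:$missing" >&2
        return 1
    fi
}
```

It could be chained with the load command, for example require_tools bgzip grabix && gemini load -v my.vcf -t snpEff --cores 20 my.db.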
Thanks to some great work from Brad Chapman and Rory Kirchner, you can also load VCF files into GEMINI in parallel using many cores on LSF, SGE or Torque clusters. You simply specify the type of job scheduler your cluster uses and the queue name to which your jobs should be submitted.
For example, let’s assume you use LSF and a queue named preempt_everyone. Here is all you need to do:
$ gemini load -v my.vcf \
-t snpEff \
--cores 50 \
--lsf-queue preempt_everyone \
my.db
If you use SGE, it would look like:
$ gemini load -v my.vcf \
-t snpEff \
--cores 50 \
--sge-queue preempt_everyone \
my.db
If you use Torque, it would look like (you guessed it):
$ gemini load -v my.vcf \
-t snpEff \
--cores 50 \
--torque-queue preempt_everyone \
my.db
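The three cluster examples differ only in the queue option, so a wrapper script can select it from the scheduler name. The sketch below is illustrative only: queue_flag and print_load_command are hypothetical helpers, the annotator is fixed to snpEff as in the examples above, and the command is printed rather than executed so it can be reviewed first.

```shell
# Hypothetical helper: map a scheduler name to the queue option used in
# the examples above.
queue_flag() {
    case "$1" in
        lsf)    echo "--lsf-queue" ;;
        sge)    echo "--sge-queue" ;;
        torque) echo "--torque-queue" ;;
        *)      echo "unknown scheduler: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the load command.
# Arguments: scheduler, queue name, number of cores, VCF file, database.
print_load_command() {
    flag=$(queue_flag "$1") || return 1
    echo "gemini load -v $4 -t snpEff --cores $3 $flag $2 $5"
}
```

For instance, print_load_command sge preempt_everyone 50 my.vcf my.db prints the SGE form of the command shown earlier on a single line.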
GEMINI also accepts PED files in order to establish the familial relationships and phenotypic information of the samples in the VCF file.
$ gemini load -v my.vcf -p my.ped -t snpEff my.db
The PED file format is documented here: PED. An example PED file looks like this:
1 M10475 None None 1 1
1 M10478 M10475 M10500 2 2
1 M10500 None None 2 2
1 M128215 M10475 M10500 1 1
The columns are family_id, name, paternal_id, maternal_id, sex and phenotype.
You can also provide a PED file with a heading starting with #, and include extra fields, like this:
#family_id name paternal_id maternal_id sex phenotype hair_color
1 M10475 None None 1 1 brown
1 M10478 M10475 M10500 2 2 brown
1 M10500 None None 2 2 black
1 M128215 M10475 M10500 1 1 blue
This will add the extra columns to the samples table and allow you to use those extra columns during queries.
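A PED file with extra fields is easy to get wrong by leaving a value out of one row. The following sketch checks that every row has the same number of whitespace-separated fields as the first line (the # header, if present) and at least the six standard columns. check_ped is a hypothetical helper, and the rules it enforces are assumptions for illustration, not GEMINI's own validation.

```shell
# Hypothetical helper: report PED rows that have fewer than 6 fields or a
# field count different from the first non-blank line. Returns non-zero
# if any row is reported.
check_ped() {
    awk '
        BEGIN { expected = 0; bad = 0 }
        NF == 0 { next }                    # skip blank lines
        expected == 0 { expected = NF }     # first non-blank line sets the width
        NF < 6 || NF != expected {
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running check_ped my.ped before gemini load -v my.vcf -p my.ped -t snpEff my.db prints nothing for a consistent file. Once loaded, the extra column should be usable in a query along the lines of gemini query -q "select name, hair_color from samples" my.db.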
By default, GERP scores at base pair resolution are not computed, owing to the roughly 2X increase in loading time. However, you can optionally ask GEMINI to compute these scores by using the --load-gerp-bp option.
$ gemini load -v my.vcf --load-gerp-bp -t snpEff my.db
To do.