GEMINI provides an automated installation script that installs GEMINI along with its required Python dependencies, third-party software, and data files. Download the script with wget:
$ wget https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py
If wget isn’t available, you can use curl instead:
$ curl \
https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py \
> gemini_install.py
Once you have downloaded the above installation script, you can proceed as follows:
$ python gemini_install.py /usr/local /usr/local/share/gemini
$ export PATH=$PATH:/usr/local/gemini/bin
# it would be wise to add the above line to your .bashrc or .bash_profile
This installs the GEMINI executable as /usr/local/bin/gemini, other required third-party dependencies in /usr/local/bin, and associated data files in /usr/local/share/gemini. Please note that this is merely an example: you can place the GEMINI executables and annotation files in any directories you wish.
Note
The automated installation script typically requires ~30 minutes, primarily owing to the time required to download the GEMINI genome annotation files. Also, please note that the annotation files require ~15 GB of storage, so ensure that the data directory (/usr/local/share/gemini in the example above) has sufficient space.
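Before starting the download, it may be worth confirming that the data directory has room for the annotation files. The following is a minimal sketch and not part of GEMINI: the helper name `has_enough_space` is hypothetical, the 15 GB threshold is taken from the note above, and the code assumes a Python 3 interpreter (independent of the Python 2.7 that the installer itself requires).

```python
import os
import shutil

REQUIRED_GB = 15  # approximate size of the annotation files, per the note above


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    # Walk up to the nearest existing ancestor, since the data directory
    # may not have been created yet.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            break
        probe = parent
    free_bytes = shutil.disk_usage(probe).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    # Data directory from the example above; substitute your own.
    print(has_enough_space("/usr/local/share/gemini"))
```

The check only reports free space at the moment it runs; other downloads to the same filesystem can still exhaust it during installation.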
Tip
Some tips and tricks for installation issues:
The installer requires:
- Python 2.7.x
- git
- a working C / C++ compiler such as gcc
- zlib (including headers)
These dependencies already exist on most UNIX/LINUX/OS X machines. However, on minimalist machines such as fresh Amazon Cloud instances, one may have to install these dependencies before running the automated installer. In the case of Amazon instances, the following command should take care of all of the above dependencies:
$ sudo yum -y install python27 git gcc gcc-c++ zlib-devel
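On other systems, a quick way to see which of the listed prerequisites are missing is to look for their executables on the PATH. This is only a sketch under stated assumptions: the helper `missing_tools` is hypothetical, the executable names (for example `gcc` and `g++`) vary between systems, and zlib headers cannot be detected this way. It assumes a Python 3 interpreter.

```python
import shutil

# Executables implied by the requirements listed above; the exact names
# are assumptions and differ between distributions.
REQUIRED_TOOLS = ["python", "git", "gcc", "g++"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing before install: " + ", ".join(missing))
    else:
        print("All listed tools were found on the PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is sufficient.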
As many users do not have root or sudo access, the automated installer also has options to install in “non-root” environments:
$ python gemini_install.py ~/gemini ~/gemini --nosudo
At this point, you will have a self-contained installation of GEMINI, including both the software and its associated genome annotations. However, if you have done a custom install in a “non-root” environment, you will first need to update your PATH environment variable to include the path to the bin directory that the automated installer just created.
For example, if, as above, you placed your custom install in ~/gemini, you would need to update your PATH as follows:
$ export PATH=$PATH:~/gemini/bin
Note that this change will only last for the life of your current terminal session. To make it permanent, add the export line to your .bashrc or .bash_profile so that the change is made each time you log in.
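To confirm that a new shell actually picked up the change, one can test whether the bin directory appears among the PATH entries. The sketch below uses a hypothetical helper, `is_on_path`, and assumes a Python 3 interpreter on a POSIX system; it is not part of GEMINI.

```python
import os


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Expand "~" and drop trailing slashes so equivalent spellings compare equal.
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


if __name__ == "__main__":
    # Directory from the non-root example above.
    print(is_on_path("~/gemini/bin"))
```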
If successful, you should be able to run the following command from anywhere on your system:
$ gemini -v
gemini 0.3.0b
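The same check can be scripted, for instance in a provisioning step. The following is a rough sketch assuming the output has the form shown above (`gemini <version>`); the function names are hypothetical and the code targets Python 3. Standard error is merged into standard output because some command-line parsers print the version there.

```python
import re
import subprocess


def parse_version(output):
    """Extract the version token from output such as 'gemini 0.3.0b'."""
    match = re.match(r"\s*gemini\s+(\S+)", output)
    return match.group(1) if match else None


def installed_version(executable="gemini"):
    """Run `<executable> -v` and return the reported version, or None."""
    try:
        result = subprocess.run(
            [executable, "-v"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError:
        return None  # executable not found on the PATH
    return parse_version(result.stdout)


if __name__ == "__main__":
    print(installed_version())
```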
Once GEMINI is installed, you can run a script of unit tests to ensure that it is working properly on your system. The script lives in the gemini directory inside the directory into which you installed the GEMINI source code. For example, if, as above, you installed GEMINI to ~/gemini, you would issue the following commands to run the unit tests:
$ cd ~/gemini/gemini
$ bash master-test.sh
Once installed with the automated installer, it is easy to upgrade the GEMINI programs and associated data files to the latest released version with:
$ gemini update
GEMINI depends upon several widely used genomics command-line tools as well as multiple Python packages. We recognize that the dependency stack is quite deep and are working on ways to minimize dependencies in the interest of the most streamlined installation process possible. Nonetheless, the following are core dependencies:
- Python 2.7.x
- grabix
- samtools
- tabix
- bedtools
- pybedtools
Once the above dependencies have been installed, one can begin installing GEMINI itself. To install you should download the latest source code from GitHub, either by going to:
http://github.com/arq5x/gemini
and clicking on “Downloads”, or by cloning the git repository with:
$ git clone https://github.com/arq5x/gemini.git
Once you have the source code, run:
$ cd gemini
$ sudo python setup.py install
to install it. If you don’t have permission to install it in the default directory, you can simply build the source in-place and use the package from the git repository:
$ python setup.py build_ext --inplace
One of the more appealing features of GEMINI is that it automatically annotates variants in a VCF file with several genome annotations. However, you must first install these data files on your system. To do so, run the following script and give it the full path to the directory in which the data files should be installed. The recommended path is /usr/local/share, but you can install the data files wherever you want.
$ python gemini/install-data.py /usr/local/share/
If you have installed GEMINI with the annotation files on a server and you can NFS mount the annotation files, you can tell a local install of GEMINI where those annotation files are by making the file ~/.gemini/gemini-config.yaml:
annotation_dir: /path/to/nfs_mounted/gemini/data
versions:
  GRCh37-gms-mappability.vcf.gz: 2
  hg19.rmsk.bed.gz: 2
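If the same configuration has to be created on several machines, the file can be generated rather than typed by hand. The sketch below is illustrative only: `render_config` and `write_config` are hypothetical helpers, the layout assumes that the per-file entries are nested under `versions` as in a conventional YAML mapping, and the text is built by hand to avoid depending on a YAML library. It assumes Python 3.

```python
import os


def render_config(annotation_dir, versions):
    """Return YAML text in the layout of the gemini-config.yaml example above."""
    lines = ["annotation_dir: " + annotation_dir, "versions:"]
    for filename in sorted(versions):
        # Each versioned annotation file is nested under the `versions` key.
        lines.append("  {}: {}".format(filename, versions[filename]))
    return "\n".join(lines) + "\n"


def write_config(text, home=None):
    """Write the text to <home>/.gemini/gemini-config.yaml and return the path."""
    home = home or os.path.expanduser("~")
    config_dir = os.path.join(home, ".gemini")
    if not os.path.isdir(config_dir):
        os.makedirs(config_dir)
    path = os.path.join(config_dir, "gemini-config.yaml")
    with open(path, "w") as handle:
        handle.write(text)
    return path
```

A call such as `write_config(render_config("/path/to/nfs_mounted/gemini/data", {"hg19.rmsk.bed.gz": 2}))` would overwrite any existing configuration file, so back it up first if it has been edited by hand.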
GEMINI comes with a full test suite to make sure that everything has installed correctly on your system. We strongly encourage you to run these tests.
$ bash master-test.sh
GEMINI depends upon external tools to predict the functional consequence of variants in a VCF file. We currently support annotations produced by both SnpEff and VEP. Recommended instructions for annotating existing VCF files with these tools are available here. In addition, we have attempted to standardize the terms used to describe the functional consequence of a given variant, as each annotation tool uses different vocabulary.
The variant consequence columns in the variant table are populated from either snpEff or VEP annotations, as selected by the user with the -t option when running gemini load. To populate these columns, the input VCF file must already have been annotated with the chosen tool:
$ gemini load -v my.vcf -t VEP -d my.db
$ gemini load -v my.vcf -t snpEFF -d my.db
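When loading many VCF files, the command can be assembled programmatically. This sketch uses only the options shown above (`-v`, `-t`, `-d`); the function names are hypothetical, the annotator string is passed through unchanged, and omitting `-t` is assumed to be permitted. It targets Python 3 and expects `gemini` to be on the PATH.

```python
import subprocess


def build_load_command(vcf_path, db_path, annotator=None):
    """Assemble the argument list for `gemini load` using the options shown above."""
    command = ["gemini", "load", "-v", vcf_path]
    if annotator is not None:
        # Passed through exactly as given, e.g. "VEP".
        command += ["-t", annotator]
    command += ["-d", db_path]
    return command


def load(vcf_path, db_path, annotator=None):
    """Run `gemini load` and raise CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_load_command(vcf_path, db_path, annotator))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.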
By default the following columns in the variant table would be set to null:
The table below maps the consequence terms used by snpEff and VEP to the standardized GEMINI terms used in SQL queries. The last column gives the impact severity that GEMINI assigns to each term:
Gemini terms | snpEff terms | VEP terms | Impact severity |
---|---|---|---|
splice_acceptor | SPLICE_SITE_ACCEPTOR | splice_acceptor_variant | HIGH |
splice_donor | SPLICE_SITE_DONOR | splice_donor_variant | HIGH |
stop_gain | STOP_GAINED | stop_gained | HIGH |
stop_loss | STOP_LOST | stop_lost | HIGH |
frame_shift | FRAME_SHIFT | frameshift_variant | HIGH |
start_loss | START_LOST | null | HIGH |
exon_deleted | EXON_DELETED | null | HIGH |
non_synonymous_start | NON_SYNONYMOUS_START | null | HIGH |
transcript_codon_change | null | initiator_codon_variant | HIGH |
non_syn_coding | NON_SYNONYMOUS_CODING | missense_variant | MED |
inframe_codon_gain | CODON_INSERTION | inframe_insertion | MED |
inframe_codon_loss | CODON_DELETION | inframe_deletion | MED |
inframe_codon_change | CODON_CHANGE | null | MED |
codon_change_del | CODON_CHANGE_PLUS_CODON_DELETION | null | MED |
codon_change_ins | CODON_CHANGE_PLUS_CODON_INSERTION | null | MED |
UTR_5_del | UTR_5_DELETED | null | MED |
UTR_3_del | UTR_3_DELETED | null | MED |
other_splice_variant | null | splice_region_variant | MED |
mature_miRNA | null | mature_miRNA_variant | MED |
regulatory_region | null | regulatory_region_variant | MED |
TF_binding_site | null | TF_binding_site_variant | MED |
regulatory_region_ablation | null | regulatory_region_ablation | MED |
regulatory_region_amplification | null | regulatory_region_amplification | MED |
TFBS_ablation | null | TFBS_ablation | MED |
TFBS_amplification | null | TFBS_amplification | MED |
synonymous_stop | SYNONYMOUS_STOP | stop_retained_variant | LOW |
synonymous_coding | SYNONYMOUS_CODING | synonymous_variant | LOW |
UTR_5_prime | UTR_5_PRIME | 5_prime_UTR_variant | LOW |
UTR_3_prime | UTR_3_PRIME | 3_prime_UTR_variant | LOW |
intron | INTRON | intron_variant | LOW |
CDS | CDS | coding_sequence_variant | LOW |
upstream | UPSTREAM | upstream_gene_variant | LOW |
downstream | DOWNSTREAM | downstream_gene_variant | LOW |
intergenic | INTERGENIC, INTERGENIC_CONSERVED | intergenic_variant | LOW |
intragenic | INTRAGENIC | null | LOW |
gene | GENE | null | LOW |
transcript | TRANSCRIPT | null | LOW |
exon | EXON | null | LOW |
start_gain | START_GAINED | null | LOW |
synonymous_start | SYNONYMOUS_START | null | LOW |
intron_conserved | INTRON_CONSERVED | null | LOW |
nc_transcript | null | nc_transcript_variant | LOW |
NMD_transcript | null | NMD_transcript_variant | LOW |
incomplete_terminal_codon | null | incomplete_terminal_codon_variant | LOW |
nc_exon | null | non_coding_exon_variant | LOW |
transcript_ablation | null | transcript_ablation | LOW |
transcript_amplification | null | transcript_amplification | LOW |
feature elongation | null | feature_elongation | LOW |
feature truncation | null | feature_truncation | LOW |
Note: “null” refers to the absence of the corresponding term in the alternate database
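To make the mapping concrete, the following sketch encodes a small excerpt of the table as lookup dictionaries. It is an illustration of how the table reads, not GEMINI's internal implementation; the function name `to_gemini` and the dictionary names are hypothetical, and only a handful of rows are included.

```python
# Excerpt of the table above: annotator term -> (GEMINI term, impact severity).
SNPEFF_TO_GEMINI = {
    "SPLICE_SITE_ACCEPTOR": ("splice_acceptor", "HIGH"),
    "STOP_GAINED": ("stop_gain", "HIGH"),
    "NON_SYNONYMOUS_CODING": ("non_syn_coding", "MED"),
    "SYNONYMOUS_CODING": ("synonymous_coding", "LOW"),
}
VEP_TO_GEMINI = {
    "splice_acceptor_variant": ("splice_acceptor", "HIGH"),
    "stop_gained": ("stop_gain", "HIGH"),
    "missense_variant": ("non_syn_coding", "MED"),
    "synonymous_variant": ("synonymous_coding", "LOW"),
}


def to_gemini(term, annotator):
    """Return (GEMINI term, severity) for an annotator term, or None if unmapped."""
    tables = {"snpeff": SNPEFF_TO_GEMINI, "vep": VEP_TO_GEMINI}
    try:
        table = tables[annotator.lower()]
    except KeyError:
        raise ValueError("unknown annotator: " + annotator)
    return table.get(term)


if __name__ == "__main__":
    print(to_gemini("missense_variant", "VEP"))
```

A result of `None` corresponds to a “null” cell in the table: the term has no counterpart for that annotator, or it is simply not in this excerpt.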