VCRU Bioinformatics - Install notes

This page was last updated on Monday, 11-Jan-2016 09:25:20 CST

Installation notes for Annovar version 2015Dec14

Home Page

Prerequisites

None

Installation

$ cd /programinstallers/
Must register to get download link
$ wget -N http://www.openbioinformatics.org/annovar/download/xxxxxx/annovar.latest.tar.gz
$ cd /usr/local/bin/
$ tar -zxvf /programinstallers/annovar.latest.tar.gz
Add to default PATH for all users
$ sudo nano /etc/profile
…
PATH="$PATH:/usr/local/bin/annovar"
…

Custom Databases

From http://annovar.openbioinformatics.org/en/latest/misc/faq/#othergenome

How to handle E. coli, Arabidopsis thaliana and other genomes not in UCSC?

For gene-based annotations (for example, -dbtype refGene), ANNOVAR requires 2 files: a refGene file specifying the gene model, and a FASTA file with the sequence of each transcript. Together with a refLink file, whose content is ignored, that makes 3 files, which you can create for your genome using the following rules:

For the refGene file, each line has 16 tab-delimited columns: $bin, $name, $chr, $dbstrand, $txstart, $txend, $cdsstart, $cdsend, $exoncount, $exonstart, $exonend, $id, $name2, $cdsstartstat, $cdsendstat, $exonframes. The only really important fields are $name (transcript name), $chr (chromosome), $dbstrand (strand of the transcript in the reference genome), $txstart and $txend (transcription start and end), $cdsstart and $cdsend (translation start and end; remember that each transcript has 5' and 3' UTRs, so $cdsstart is not the same as $txstart), $exoncount (number of exons), and $exonstart and $exonend (comma-delimited exon start and end sites). Remember that all start sites use zero-based coordinates.
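A quick sanity check of a hand-built refGene file can catch most formatting mistakes before ANNOVAR is run. The following is only a minimal sketch: check_refgene is a hypothetical helper name (not part of ANNOVAR), and the checks cover just the rules listed above (16 columns, strand, CDS inside the transcript, exon count matching the exon start list).

```shell
# check_refgene FILE: hypothetical helper, not an ANNOVAR tool.
# Prints one message per problem; exit status is 0 only if every line passes.
check_refgene() {
    awk -F'\t' '
        # Every line must have exactly 16 tab-delimited columns.
        NF != 16 {
            printf "line %d: expected 16 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        # Column 4 is $dbstrand.
        $4 != "+" && $4 != "-" {
            printf "line %d: strand must be + or -\n", NR
            bad = 1
        }
        # Columns 5/6 are $txstart/$txend; columns 7/8 are $cdsstart/$cdsend.
        ($7 + 0) < ($5 + 0) || ($8 + 0) > ($6 + 0) {
            printf "line %d: CDS lies outside the transcript\n", NR
            bad = 1
        }
        # Column 9 is $exoncount; column 10 is the comma-delimited $exonstart list.
        {
            starts = $10
            sub(/,$/, "", starts)      # tolerate a trailing comma
            if (split(starts, parts, ",") != $9 + 0) {
                printf "line %d: exon count does not match exon starts\n", NR
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, check_refgene XX_refGene.txt (where XX is a placeholder for your own file prefix) prints nothing and returns 0 when all lines pass.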

For the refLink file, you can put anything in it; the file will be ignored. (It mattered for very old genome annotations in which the name2 field was not present in refGene, but it is not really useful today because people no longer use such old genome assemblies.)

For the FASTA file, make sure that the $name in ">$name" matches the refGene file, in a case-sensitive manner. You can build the file yourself, or you can use retrieve_seq_from_db.pl in ANNOVAR to generate it directly, given a FASTA file for the genome. If you build the file yourself, make sure that the strand of the cDNA is correct.
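The name-matching requirement can be checked mechanically. The sketch below is an illustration only: missing_transcripts is a hypothetical helper, and it assumes that the transcript name in each FASTA header is the text between ">" and the first space or tab. It lists every $name from the refGene file that has no exact, case-sensitive match among the FASTA headers.

```shell
# missing_transcripts REFGENE FASTA: hypothetical helper, not an ANNOVAR tool.
# Prints refGene transcript names (column 2) that have no matching FASTA header.
missing_transcripts() {
    awk -F'\t' '
        # First input (the FASTA): remember each header name up to the first blank.
        FILENAME == ARGV[1] {
            if (substr($0, 1, 1) == ">") {
                name = substr($0, 2)
                sub(/[ \t].*$/, "", name)
                seen[name] = 1
            }
            next
        }
        # Second input (refGene): column 2 is $name; comparison is case-sensitive.
        !($2 in seen) { print $2 }
    ' "$2" "$1"
}
```

An empty result means every transcript in the refGene file has a sequence entry; any name printed needs to be fixed in one of the two files.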

After you have these three files, you can run ANNOVAR directly, specifying the -buildver argument to match your file prefix.

If you have a GFF3 file, convert it to a UCSC-compatible format first (try the http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/gff3ToGenePred tool). This is the easiest approach, and multiple users have reported success with multiple novel species.
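One follow-up step may be needed after the conversion. The sketch below assumes (it is not stated in these notes) that the converter writes 15-column extended genePred lines without the leading $bin column, so a placeholder has to be prepended to reach the 16 columns described earlier. add_bin_column is a hypothetical helper, and the value 0 is an arbitrary placeholder, since $bin is not among the important fields.

```shell
# The conversion itself is not run in this sketch; it needs the UCSC binary:
#   gff3ToGenePred input.gff3 converted.genePred

# add_bin_column FILE: hypothetical helper, not an ANNOVAR or UCSC tool.
# Prepends a placeholder bin value to 15-column lines; other lines pass through.
add_bin_column() {
    awk 'BEGIN { FS = OFS = "\t" }
        NF == 15 { print 0, $0; next }   # prepend placeholder bin
        { print }                        # leave other lines untouched
    ' "$1"
}
```

If the converted file already has 16 columns, the helper leaves it unchanged, so it is safe to run either way.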

Troubleshooting: If you can generate the variant_function annotation but not the exonic_variant_function annotation, double-check the GFF3 file. gff3ToGenePred requires gene/mRNA/CDS/exon notation, but some GFF3 files use "transcript" rather than "mRNA", which results in a lack of coding information in the output files. Manually changing "transcript" to "mRNA" in the GFF3 file will solve this problem.
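The manual change can also be scripted. This is a minimal sketch with a hypothetical helper name, transcript_to_mrna; it rewrites only the feature-type column (column 3 of a GFF3 line), so the word "transcript" inside IDs or other attributes is left alone.

```shell
# transcript_to_mrna FILE: hypothetical helper, not an ANNOVAR or UCSC tool.
# Rewrites the feature type "transcript" to "mRNA" and prints the result.
transcript_to_mrna() {
    awk 'BEGIN { FS = OFS = "\t" }
        !/^#/ && $3 == "transcript" { $3 = "mRNA" }   # column 3 is the feature type
        { print }                                     # comments and other lines unchanged
    ' "$1"
}
```

Redirect the output to a new file (for example, transcript_to_mrna input.gff3 > fixed.gff3, with both names being placeholders) and run the converter on the fixed copy.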