How to report a bug or to ask a question?
The best way is as follows. First, read the website thoroughly, especially the FAQ. I receive too many emails whose answers can easily be found in the FAQ. Second, if you cannot find the answer in the FAQ or do not understand it well, send me an email that contains (1) the command line arguments, (2) the error message on screen or in the LOG file, (3) where relevant, an example input file, and (4) a note if you use Mac or Windows. I can fix or diagnose something only if I can understand the question and reproduce the results. So do yourself and me a favor: include these details in your email, so that we do not waste each other's time exchanging multiple emails.
There is no such thing as an "ANNOVAR development team": I am the only person who replies to user emails, addresses user questions, and fixes bugs. As of April 2015, I have exchanged over 13,399 emails with ANNOVAR users. If you read FAQ #1 before sending me an email, it will save both of us a lot of valuable time.
How to annotate variants in a VCF file?
The easiest way is to use table_annovar.pl: just add the -vcfinput argument and supply a VCF file as the input file, and your output file will be in VCF format, with the INFO field populated with the ANNOVAR annotations that you specified in the -protocol argument. One additional output file called *multianno.txt will be in tab-delimited text format for easier manual examination in Excel or other programs.
It is also possible to handle a VCF file manually, retrieving a subset of records from it without altering their content. For example, suppose I want to find all novel variants (not in dbSNP135, not in 1000G and not in NHLBI-ESP5400) in a VCF file without changing the VCF format. This can be done using the -includeinfo argument, so that you convert the VCF file to an ANNOVAR input file without losing any VCF-specific information. Then annotate the input file with a series of filter operations, and convert the output file back to a VCF file using the cut -f 3- command on a Linux system.
Why can't I download the databases listed on your download page?
What is your command line? Did you add "-webfrom annovar"?
How to find frequency information from 1000 Genomes Project data?
The instructions are described in this page. One important thing to emphasize is that, for historical reasons, you must use something like -dbtype 1000g2015aug_eur (for the European population) or -dbtype 1000g2015aug_afr (for the African population), not -dbtype ALL.sites.2014_10, for annotation.
How to annotate copy number variations (CNV)?
The REF and ALT in the input file can be 0. You can then annotate the file by gene-based and region-based annotation.
What is the difference between vcf4 and vcf4old format in convert2annovar.pl?
In August 2013, I changed the VCF4 conversion subroutine in convert2annovar.pl, but I kept the vcf4old format for users who like the "old-fashioned" conversion. The difference is that nowadays people tend to do multi-sample calling or candidate variant calling, so the variants listed in a VCF4 file do not necessarily carry mutations in a specific sample. This happens when the genotype call is 0/0 (reference/reference). I received complaints from users about the inability to process multi-sample VCF files, so I decided to make this change.
By default, "vcf4" will only process the first sample, and will only print out mutations that exist in the first sample. So if you have a multi-sample VCF file, usually only a subset of lines will appear in the output file. The -format vcf4 argument can be combined with the -allsample argument, which will print out a separate output file for each sample in the VCF4 file (again, by default only the first sample in the VCF4 file is processed). More importantly, if you use -format vcf4 -allsample -withfreq, then all input lines from the VCF file are kept in the output, and each line includes an allele frequency measure giving the frequency of that variant among all the samples in the VCF file.
-format vcf4old should be considered obsolete and should not be used by most users, since -format vcf4 can now accomplish everything that -format vcf4old can do, given appropriate combinations of arguments.
How to back convert cDNA coordinate such as c.385A>G to genomic coordinate such as chr1:123456A>G?
Read the "all variants in a transcript" section of this page.
Why does my run of gene-based annotation differ slightly from the one shown on the website?
The UCSC databases are updated constantly, and so is the ANNOVAR executable, so it is expected that the ANNOVAR output format or the annotations may change slightly over time.
Why is the gene name in the ANNOVAR output wrong?
The official gene symbols for the human genome are maintained by HGNC, and they change gene names regularly. Every other database tries to synchronize with HGNC, but there is usually a delay. ANNOVAR annotation uses the gene names defined in RefSeq (default), Ensembl, UCSC Gene or GENCODE, so they may differ from the "official" gene symbol on rare occasions. Similarly, OMIM and other clinical databases will also use names that differ from the "official" names, depending on how up to date they are. For example, if you use the early 2016 version of ANNOVAR's RefSeq gene annotation, the CASC5 gene will be there, but in late 2016 this gene was renamed KNL1 in RefSeq. Similarly, the gene is called CASC5 in OMIM, with a note in the OMIM record that reads "HGNC Approved Gene Symbol: KNL1".
To make sure that you capture all OMIM genes in your results, you will have to maintain a gene name table that has both OMIM gene names and the official HGNC gene names for those OMIM genes, and then search result files generated by ANNOVAR. This is because depending on the date/version/source of ANNOVAR's database, different types of gene names could be in the output file.
Why does ANNOVAR produce different non-synonymous SNP annotations from other software?
For example, ANNOVAR may report a mutation as a W185R mutation, but another software tool may report the same mutation as an R285W mutation. This could be due to a variety of reasons: (1) The use of different gene-definition systems. Depending on your command line arguments, ANNOVAR always uses the latest refGene, knownGene or ensGene to ensure that the information is up to date. You should check which gene definition system is used by the other annotation software. (2) Even if both software tools use Ensembl, they could be using different versions of the gene definition. (3) ANNOVAR automatically excludes any transcript in the gene definition file that does not have a complete coding sequence or has a premature stop codon (since this means the protein annotation is wrong). Each gene definition (especially Ensembl) has a lot of such transcripts. (4) ANNOVAR uses precedence rules, so if a variant is intronic for one transcript but coding for another transcript, it will be reported as coding only. You need to use the -separate argument to show all annotations if this is of interest to you. (5) This could also be due to bugs in one software tool or the other. If you find a potential bug in ANNOVAR, please report it to me.
How to infer the version number for RefSeq transcripts in ANNOVAR annotation results?
Updated 2017 since UCSC changed their MySQL schema again: Run this command (for human hg19 build): mysql --user=genomep --password=password --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select distinct hg19.refGene.name,hgFixed.gbCdnaInfo.version from hg19.refGene,hgFixed.gbCdnaInfo WHERE hg19.refGene.name=hgFixed.gbCdnaInfo.acc' > refseq_version.txt
Starting from Nov 2014, when you download refGene for human (hg18/hg19/hg38), the corresponding refGeneVersion.txt file will be automatically downloaded to help users who cannot figure out how to run mysql. However, you will need to run the MySQL command manually for other species.
Starting from June 2017, we include the hg19_refGeneWithVer.txt and hg19_refGeneWithVerMrna.fa files in the ANNOVAR package. Therefore, users can directly use -dbtype refGeneWithVer to annotate genetic mutations with RefSeq version numbers.
What is the difference between comma and semicolon when they are used to separate gene names in gene annotation?
The semicolon (";") separates different annotations, for example, coding variants for one gene and splice variants for another gene (but these two genes may have the same name, since one gene may have multiple transcripts). The comma (",") separates different genes with the same annotation; for example, multiple genes may have overlapping exons, so a variant may be annotated as exonic in multiple genes.
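The two separators can be unpacked with standard text tools. The following is a minimal sketch, assuming a gene field that mixes both separators; the function name split_gene_field and the gene names are made up for illustration and are not part of ANNOVAR.

```shell
# Split an ANNOVAR-style gene field: ';' separates annotations,
# ',' separates genes that share one annotation.
# split_gene_field and the gene names are hypothetical.
split_gene_field() {
    # $1: gene field such as "GENEA,GENEB;GENEC"
    printf '%s\n' "$1" | tr ';' '\n' | awk -F',' '{
        printf "annotation %d:", NR
        for (i = 1; i <= NF; i++) printf " %s", $i
        printf "\n"
    }'
}

split_gene_field "GENEA,GENEB;GENEC"
```

Each output line corresponds to one annotation, followed by the genes that share it.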
Why can't I run ANNOVAR in my web browser such as Chrome?
ANNOVAR is command-line software that requires a Perl interpreter on your system. Typically, Linux systems already include a Perl interpreter by default, but on Windows you need to install one yourself (use Strawberry Perl or ActivePerl). table_annovar.pl is a Perl script that you execute with a Perl interpreter; it is not a URL that you visit in your web browser.
Why a very common variant has very low frequency in filter annotation in hg38?
This very rare situation happens for some ANNOVAR filter databases that were generated by lifting over the corresponding hg19 databases. At some genomic positions, the nucleotide identity differs between hg19 and hg38, which causes this problem. For example, chrX:152652814 has allele frequencies near 50:50, and from build 37 to build 38 the reference and alternative alleles were swapped. For SNPs whose ref/alt alleles were swapped in build 38, ANNOVAR does not annotate the frequency if you use a lifted-over allele frequency database (currently, the hg38 databases generated by liftOver are marked in the download page). This is a very rare event, but it is always good to examine your results with extra caution if you use one of the liftOver filter databases provided by ANNOVAR.
It should be amino acid X in this position but ANNOVAR reports Y in this position!
For example, ANNOVAR reports p.X100Z as the amino acid change, but another web resource shows that position 100 should have wildtype of Y not X.
Whichever website you use, regardless of whether it is SwissProt, RefSeq or anything else, remember that each has its own way of collecting and compiling proteome data, and this may result in slight discordance with the theoretical protein sequence. Sometimes these websites directly translate a RefSeq transcript such as NM_123456 to a protein sequence and present it; ANNOVAR uses NM_123456 as well, but it uses the "theoretical mRNA sequence" inferred by ANNOVAR as opposed to the one provided in RefSeq. By "theoretical", I mean a protein sequence that is translated from the "theoretical" mRNA sequence, which is specified by a gene model together with the whole-genome DNA sequence of a specific genome build. ANNOVAR produces this "theoretical" protein sequence, so if you want to stick with a specific genome build and a specific gene definition system, then ANNOVAR gives the correct results.
Exceptions exist when the gene model is not annotated correctly, in other words, when the exon start site, end site or splicing site has some slight error. In this case, the protein sequence produced by ANNOVAR may be wrong and may contain premature stop codons. (There are many reasons why this may happen.) If you ever encounter such a variant, just try a different gene model (for example, using -dbtype ensgene) to reannotate this variant. If you want to investigate this variant even more closely, consider using the coding_change.pl program in ANNOVAR, which will print out the theoretical protein sequence before and after mutation, and will flag any potentially wrong theoretical protein sequence with WARNING messages.
In Nov 2011, I updated ANNOVAR so that any reference transcripts with premature stop codon (potential gene model annotation error or transcript-to-genome mapping error) will no longer be used in
Why ANNOVAR reports a 3-bp deletion as frameshift deletion?
For example, "9 5720612 5720614 AGT -" (hg19 coordinate) is annotated as non-frameshift deletion by CLCbio but ANNOVAR thinks it is a frameshift deletion. Biologically, a 3-bp frameshift deletion is indeed possible: This could happen, for example, when the 3-bp deletion covers only 1 or 2 bp in exons, and indeed this is the case for this deletion. ANNOVAR knows how to handle these types of complicated situations but other software may not.
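The reasoning can be illustrated with simple arithmetic: what matters is how many deleted bases fall inside the exon, not the total length of the deletion. The following is a simplified sketch and not ANNOVAR's actual logic; it assumes a single, fully coding exon with 1-based inclusive coordinates, and the exon boundaries used in the example call are hypothetical.

```shell
# Count how many deleted bases fall inside an exon (1-based, inclusive).
exonic_overlap() {
    # $1 $2: deletion start/end; $3 $4: exon start/end
    start=$1
    end=$2
    if [ "$3" -gt "$start" ]; then start=$3; fi
    if [ "$4" -lt "$end" ]; then end=$4; fi
    if [ "$end" -lt "$start" ]; then
        echo 0
    else
        echo $(( end - start + 1 ))
    fi
}

# Classify by the number of exonic bases removed, not by deletion length.
classify_deletion() {
    n=$(exonic_overlap "$@")
    if [ "$n" -eq 0 ]; then
        echo "no exonic bases deleted"
    elif [ $(( n % 3 )) -eq 0 ]; then
        echo "nonframeshift deletion"
    else
        echo "frameshift deletion"
    fi
}

# Deletion from the example above; the exon boundaries are hypothetical.
classify_deletion 5720612 5720614 5720613 5720700
```

With these hypothetical boundaries only 2 of the 3 deleted bases are exonic, so the sketch reports a frameshift deletion.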
Why ANNOVAR reports T182A,T190A,T300A as the amino acid change but another web server reports only T300A?
Alternative splicing is prevalent in the human genome, and as a result it is best to annotate an amino acid change with respect to a specific transcript rather than a gene. Other servers or software may arbitrarily pick one transcript as the representative "gene" and give one single answer. ANNOVAR tries to be comprehensive and always accompanies annotations with transcript names; it is up to the user to decide which representative transcript to use, or whether to use all of them.
There has never been a consensus in the field on which transcript should represent a gene when multiple transcripts are available. The most popular approach nowadays is to use the longest transcript. However, in the medical genetics field, for certain specific diseases and genes, there are 'canonical' transcripts that everybody uses by default for historical reasons, and you will need to manually select this canonical transcript from the ANNOVAR output file to communicate with the rest of the field.
Why ANNOVAR reports c.C100T when my input is G to A change?
The c.C100T is a cDNA (actually, mRNA) level change. ANNOVAR input (G to A) has to be in the forward strand, and if the transcript is in the reverse strand, there will be a C to T change in the mRNA.
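The strand flip can be checked by reverse-complementing the input alleles. This is a small sketch using only awk and tr; the helper name revcomp is an assumption for illustration and is not an ANNOVAR command.

```shell
# Reverse-complement an allele given on the forward strand,
# to see how it reads on a reverse-strand transcript.
# revcomp is a hypothetical helper, not part of ANNOVAR.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ s = ""; for (i = length($0); i > 0; i--) s = s substr($0, i, 1); print s }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

ref=G
alt=A   # forward-strand alleles, as ANNOVAR input expects
echo "genomic ${ref}>${alt} reads as $(revcomp "$ref")>$(revcomp "$alt") on a reverse-strand mRNA"
```

For single bases only the complement matters; the reversal step matters for multi-base alleles.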
Why ANNOVAR reports c.T5997G when my input is T to C change in chr14:31582550-31582550 in hg19 coordinate?
First, this transcript is in the reverse strand, so the mutation is changed to "G". Second, your input is wrong: this position should be A in hg19, so c.T5997 should be the reference base. Maybe you used a wrong genome build, or your genotype calling software has a bug. ANNOVAR did it correctly. Starting from September 2011, ANNOVAR will try to print out WARNING messages telling user that they used wrong reference alleles in their input file for exonic variants.
Why my mutation gets lost by ANNOVAR?
A user reported that the input "17 16256671 16256671 C G" in hg19 coordinate was reported as a "CENPV:NM_181716:exon1:c.C80C " mutation, so the C->G change gets lost by ANNOVAR. Read the FAQ item above: the input is wrong, as this position should be a G wildtype in reference genome, so the C80C mutation is the correct mutation in the opposite strand.
Why only one isoform is in exonic_variant_function but two in variant_function?
A user reported that the input "1 14143003 14143003 A G" in hg19 coordinate was reported to hit "NM_001135610,NM_012231" in the variant_function file, but only NM_001135610 in the exonic_variant_function file (when the -transcript argument was used). If you add the -separate argument, you'll see that the change on NM_012231 is synonymous, so it is not printed out due to the precedence rule.
Why does ANNOVAR report "unknown" in exonic_variant_function?
"unknown" means that the gene structure is not correctly annotated (complete ORF information is not available). Previous versions of ANNOVAR would always give an answer such as non-synonymous SNV, but I got too many user emails complaining about "bugs" (even though ANNOVAR is innocent in this case). So after December 2011, if errors exist in the gene structure annotation (RefSeq, Ensembl, UCSC, etc.), ANNOVAR will just report unknown for exonic_variant_function; in other words, although the variant is clearly within an exon, we cannot say for sure how it affects the protein sequence, as the ORF annotation is not correct.
Why ANNOVAR reports the same function for two different mutations in two sites?
Sometimes different mutations are reported to have the same function in gene-based annotation. For example, mutations on chromosome 4 at coordinates 8945506, 8950251, 8954996, 8959741, 8964486, 8969231 and 8973977 are all reported as USP17:NM_001105662:exon1:c.A25G:p.R9G. There is nothing wrong: if you check the USP17 gene in the genome browser, you'll see that there are at least 9 copies of the gene in each haplotype. So all the mutations (if they are real) have the same function. In reality, it is likely that these mutations are not real, but are rather artifacts of base-level differences between any two copies of the same gene.
Why ANNOVAR complains "exonic SNPs have WRONG reference alleles " in gene-based annotation?
This happens when ANNOVAR thinks that the "reference allele" in your input does not match the "reference allele" in the mRNA FASTA file in ANNOVAR's database. This could be due to several reasons: (1) a wrong -buildver, (2) an incorrect reference allele in your input, or (3) an outdated mRNA FASTA file, as the gene model gets updated quite frequently by UCSC.
To solve this problem, first check (1) and (2) to make sure that your input is correct. If you cannot find an error, then update the FASTA file with the retrieve_seq_from_fasta.pl command, with more details here.
Why do FASTA sequences in ANNOVAR differ from those in public databases?
For example, the mRNA of the MYBPC3 gene (NM_000256) extracted from UCSC differs from the one extracted from ANNOVAR's hg18_refGeneMrna.txt file. The reason is simple: the FASTA in ANNOVAR is built by ANNOVAR from chr:start-end records, not copied and pasted from any public database. Any errors in chr:start-end will lead to errors in the ANNOVAR-compiled FASTA. To avoid future complaints, FASTA sequences with a premature stop codon are no longer used in exonic annotation, although they still exist in the FASTA file.
Why ANNOVAR's TFBS annotation differ from what I have from another web server?
There are MANY different transcription factor binding site (TFBS) annotations, generated by hundreds of research groups around the world. ANNOVAR uses the keyword "TFBS" for only one specific type of annotation that has a long history in the Genome Browser, but this does not mean that it is the ultimate solution for TFBS prediction. ANNOVAR can certainly take many other types of TFBS annotations, but it won't use the keyword "tfbs" for them. In fact, as you can see from the region-based annotation page, ANNOVAR can also annotate TFBS ChIP-Seq from the ENCODE project.
The take-home message is that there are many annotations of TFBS, and they may differ from each other substantially. Use caution when interpreting the data. Ultimately, it is the biologist who decides whether or not the annotation makes sense; ANNOVAR facilitates this process but cannot make the decision for you.
Where do the values for -protocol and -argument come from in table_annovar.pl?
The protocol values correspond to file names that are stored in the directory specified by the user in the command line (with a couple of exceptions such as 1000g-related files). They are generally referred to as database files, and they can come from ANNOVAR's own repository (via the -downdb -webfrom annovar argument), from UCSC's annotation databases (via the -downdb argument), or be provided/compiled by users. Therefore, there are unlimited possibilities for protocols, and there is no comprehensive list that we can provide.
The argument values correspond to each of the protocols, as optional argument that you would use for annotate_variation.pl on this specific protocol. In other words, -protocol, -operation and -arg are all parallel lists of corresponding entries and should have equal comma-delimited number of entries.
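Since the three lists must line up entry by entry, a quick sanity check before launching a long run can save time. The following is a rough sketch; check_parallel and count_entries are hypothetical helpers (not part of ANNOVAR), and the protocol, operation and argument values are only illustrative.

```shell
# Count comma-delimited entries; an empty entry between two commas still counts.
count_entries() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Verify that the -protocol, -operation and -arg values have equal length.
# check_parallel is a hypothetical helper, not an ANNOVAR command.
check_parallel() {
    # $1: -protocol value, $2: -operation value, $3: -arg value
    p=$(count_entries "$1")
    o=$(count_entries "$2")
    a=$(count_entries "$3")
    if [ "$p" -eq "$o" ] && [ "$p" -eq "$a" ]; then
        echo "ok: $p entries each"
    else
        echo "mismatch: protocol=$p operation=$o arg=$a"
        return 1
    fi
}

# Illustrative values: an optional argument for the first protocol only.
check_parallel "refGene,cytoBand,exac03" "g,r,f" "-splicing 5,,"
```

An empty entry in the third list simply means that no extra argument is passed for that protocol.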
How to handle huge multi-sample VCF files?
You can just cut out the first sample (basically the first ~10 columns), then annotate this file with table_annovar.pl, and then "paste" the annotation together with the rest. For example, run cut -f 1-10 input.vcf | grep -v -P '^#' > input1.vcf; cut -f 11- input.vcf | grep -v -P '^#' > genotype, then annotate input1.vcf to generate input1.anno.vcf, then run paste input1.anno.vcf genotype > input.anno.vcf to generate the combined output file. You may want to add the VCF header back in.
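The split-and-paste steps, including putting the header back, can be sketched on a toy file as follows. This is only an outline under stated assumptions: the annotation step is replaced by a plain copy, and the annotated file is assumed to contain the same records in the same order with no header lines of its own. A real annotated VCF carries additional ##INFO lines that would need to be merged into the header.

```shell
work=$(mktemp -d)

# Toy two-sample VCF (tab-delimited); a real file would be much larger.
printf '##fileformat=VCFv4.2\n' > "$work/input.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> "$work/input.vcf"
printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n' >> "$work/input.vcf"
printf '1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t1/1\n' >> "$work/input.vcf"

grep '^##' "$work/input.vcf" > "$work/meta"          # meta-information lines
grep '^#CHROM' "$work/input.vcf" > "$work/columns"   # column header line
grep -v '^#' "$work/input.vcf" | cut -f 1-10 > "$work/input1.vcf"
grep -v '^#' "$work/input.vcf" | cut -f 11- > "$work/genotype"

# Placeholder for the annotation step: the file is copied unchanged here.
cp "$work/input1.vcf" "$work/input1.anno.vcf"

# Reassemble: meta lines, column header, then records joined with the other samples.
{
    cat "$work/meta" "$work/columns"
    paste "$work/input1.anno.vcf" "$work/genotype"
} > "$work/input.anno.vcf"

cat "$work/input.anno.vcf"
```

Because the placeholder step changes nothing, the reassembled file here is identical to the input, which is a convenient way to confirm that no column or line was lost.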
Why the SIFT/PolyPhen scores in ANNOVAR differ from those obtained from another website?
The AVSIFT scores (now obsolete!) in ANNOVAR were based on the Ensembl55 database, and sometimes there are major differences from those computed from Ensembl63 (the default on the SIFT website). If you select Ensembl55 on the SIFT website, you'll see that the scores are consistent and identical. The LJB_SIFT scores in ANNOVAR were based on the original Liu et al. paper, so read the paper for details on how they compiled the scores. In the most recent versions we use the dbnsfp* keyword, and all scores are taken directly from the dbNSFP database.
In general, the calculation of scores depends on the version of the software, the parameters of the program, the source of the data files, the definition of gene structure, and the handling of alternative transcripts and multiple scores, so there are many reasons why scores calculated by different people differ. ANNOVAR now tries to stay synchronized with the ljb* database, so the scores may be different from those of another web server.
Can ANNOVAR identify all SNPs annotated within dbSNP in a given region (say chr1:3751541-3751607)?
In ANNOVAR, filter annotation identifies exact matches, including base pair identity, whereas region annotation identifies overlapping regions. When you use --filter, the program will tell you whether the region chr1:3751541-3751607 is itself a SNP within dbSNP (highly unlikely to be the case). More recent versions of ANNOVAR can handle snp130 in region annotation. For example, just try annotate_variation.pl ex1.human humandb/ -region -dbtype snp130. However, this command requires about 10GB of memory to run.
However, if you are only looking at one specific region, a simple script can answer this question after you use -downdb snp130 in ANNOVAR (columns 2 to 5 of the UCSC table hold the chromosome, start, end and SNP name): perl -ne '@a=split(/\t/,$_); $a[1] eq "chr1" and $a[2]>=3751541 and $a[3]<=3751607 and print $a[4],"\n"' < hg18_snp130.txt.
How to annotate simple repeat regions in human genome?
Read these pages: http://www.genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=199336701&c=chr1&g=simpleRepeat, http://www.genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=199336701&c=chr1&g=rmsk, http://www.genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=199336701&c=chr1&g=rmskRM327, then pick one that matches your goal, then annotate by ANNOVAR.
How to handle E. coli, Arabidopsis thaliana and other genomes not in UCSC?
For gene-based annotations (for example, -dbtype refGene), ANNOVAR requires 2 files: a refGene file specifying the gene model, and a FASTA file with the sequence of each transcript. You can make the files for the genome (three in total, counting the refLink file that is ignored) using the following rules:
For the refGene file, each line has 16 tab-delimited columns: $bin, $name, $chr, $dbstrand, $txstart, $txend, $cdsstart, $cdsend, $exoncount, $exonstart, $exonend, $id, $name2, $cdsstartstat, $cdsendstat, $exonframes. The only really important fields are $name (transcript name), $chr (chromosome), $dbstrand (strand of the transcript in the reference genome), $txstart, $txend (transcription start and end), $cdsstart, $cdsend (translation start and end; remember that each transcript has 5' and 3' UTRs, so $cdsstart is not the same as $txstart), $exoncount (number of exons), and $exonstart, $exonend (comma-delimited exon start and end sites). Remember that all start sites use zero-based coordinates.
For refLink file, you can make anything. The file will be ignored. (It is important for very old genome annotations when name2 field is not present in refGene, but it is not really useful today as people will not use old genome assembly nowadays).
For the FASTA file, make sure that the $name in ">$name" matches the refGene file, in a case-sensitive manner. You can build the file yourself, or you can directly use retrieve_seq_from_db.pl in ANNOVAR to generate this file, given a FASTA file for the genome. Make sure that the strand is correct in the cDNA if you build the file yourself.
After you have these three files, you can directly run ANNOVAR by specifying the -buildver argument to match your file prefix.
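Before running ANNOVAR on a hand-made gene model, it may help to check the refGene file against the rules above. The sketch below is an assumption-laden sanity check, not an ANNOVAR tool: check_refgene is a hypothetical helper, the record is invented, and only the column count, the exon list lengths and the CDS bounds are verified.

```shell
# Check a refGene-style file: 16 tab-delimited columns, exon start/end lists
# whose lengths match the exon count, and a CDS inside the transcript.
# check_refgene is a hypothetical helper; the record below is invented.
check_refgene() {
    awk -F'\t' '
        NF != 16 {
            printf "line %d: expected 16 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            starts = $10
            ends = $11
            sub(/,$/, "", starts)   # UCSC-style lists end with a comma
            sub(/,$/, "", ends)
            ns = split(starts, s, ",")
            ne = split(ends, e, ",")
            if (ns != $9 || ne != $9) {
                printf "line %d: exon count %d but %d starts, %d ends\n", NR, $9, ns, ne
                bad = 1
            }
            if ($7 < $5 || $8 > $6) {
                printf "line %d: CDS lies outside the transcript\n", NR
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}

work=$(mktemp -d)
printf '0\tTX1\tchr1\t+\t100\t900\t150\t852\t2\t100,600,\t300,900,\t0\tGENE1\tcmpl\tcmpl\t0,0,\n' > "$work/toy_refGene.txt"
check_refgene "$work/toy_refGene.txt" && echo "refGene file looks consistent"
```

The function exits non-zero and prints one message per problem when a line does not follow the expected layout.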
If you have GFF3 files, then convert them to a UCSC-compatible format first (try the http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/gff3ToGenePred tool). This is the easiest thing to do, and multiple users have reported success on multiple novel species.
Troubleshooting: If you can generate the variant_function annotation but not the exonic_variant_function annotation, then double check the GFF file. gff3ToGenePred requires gene/mRNA/CDS/exon notation, but some GFF3 files use "transcript" rather than "mRNA", resulting in a lack of coding information in the output files. Manually changing "transcript" to "mRNA" in the GFF3 file will solve this problem.
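The manual change from "transcript" to "mRNA" can be scripted. This is a minimal sketch, assuming that the feature type sits in column 3 of a tab-delimited GFF3 file; fix_gff3_types is a hypothetical helper and the records are invented. Note that it renames every "transcript" feature, including non-coding ones, so review the result before converting.

```shell
# Rename the feature type "transcript" to "mRNA" in column 3 of a GFF3 file.
# Attribute text in column 9 is left untouched.
# fix_gff3_types is a hypothetical helper; the records below are invented.
fix_gff3_types() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }               # keep directives and comments as-is
         $3 == "transcript" { $3 = "mRNA" }
         { print }' "$1"
}

work=$(mktemp -d)
printf '##gff-version 3\n' > "$work/toy.gff3"
printf 'chr1\tsrc\tgene\t101\t900\t.\t+\t.\tID=gene1\n' >> "$work/toy.gff3"
printf 'chr1\tsrc\ttranscript\t101\t900\t.\t+\t.\tID=transcript1;Parent=gene1\n' >> "$work/toy.gff3"
printf 'chr1\tsrc\texon\t101\t300\t.\t+\t.\tParent=transcript1\n' >> "$work/toy.gff3"

fix_gff3_types "$work/toy.gff3" > "$work/toy.fixed.gff3"
cut -f 3 "$work/toy.fixed.gff3"
```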
Why the total number of homozygous and heterozygous variants is more than the number of variant site (convert2annovar.pl)?
Suppose we see 30 reads in a site, and 10 are A, 10 are G and 10 are C. This is one site, but may be presented as two heterozygous mutations from the genotype calling algorithm. This could be due to a tri-allelic SNP, or a genomic duplication, or just sequencing error.
Can ANNOVAR handle IUPAC code in input?
No. ANNOVAR is a variant annotation program, not a genotype annotation program. It needs to see A, C, G or T, not an IUPAC code representing the ambiguity of an allele or an IUPAC code representing a genotype call.
Can ANNOVAR handle genotype calls in input?
No. ANNOVAR is a variant annotation program, not a genotype annotation program. You can only specify the allele of an observed variant (such as A, G, etc), not a genotype on a specific position (such as AG genotype).
How to handle two very close SNPs in the same codon?
If two SNPs are separated by only one or two nucleotides, it is best to treat them as a block substitution rather than two separate variants. Otherwise, the annotation may not be correct if the two SNPs happen to affect the same codon.
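Building the block substitution line by hand is straightforward once the reference bases between the two SNPs are known. The sketch below assumes that both SNPs are on the same haplotype and that the intervening reference bases are supplied by the user; merge_snps is a hypothetical helper and the coordinates are invented.

```shell
# Merge two nearby SNPs into one block substitution line
# (chr, start, end, ref, alt; tab-delimited).
# merge_snps is a hypothetical helper; it assumes both SNPs are in cis.
merge_snps() {
    # $1: chr; $2 $3 $4: pos, ref, alt of the first SNP;
    # $5 $6 $7: pos, ref, alt of the second SNP;
    # $8: reference bases strictly between the two positions (may be empty)
    gap=$(( $5 - $2 - 1 ))
    if [ "$gap" -lt 0 ] || [ "${#8}" -ne "$gap" ]; then
        echo "need exactly $gap reference base(s) between the two positions" >&2
        return 1
    fi
    printf '%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$5" "$3$8$6" "$4$8$7"
}

# Two invented SNPs, A>G at 1000 and C>T at 1002, with reference G in between.
merge_snps 1 1000 A G 1002 C T G
```

The resulting line can be used in place of the two separate SNP lines in the ANNOVAR input file.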
How to select the X-way phastCons conservation track in ANNOVAR?
This depends entirely on the genome build, and you need to check the genome browser for the number of tracks. For example, for the chicken genome, if you select galGal3 as the --buildver, then you'll see in the genome browser page (by hovering the mouse over "Most Conserved") that it is 7way.
Can ANNOVAR print out translated protein sequence?
annotate_variation.pl cannot do that directly, and it is very difficult to modify the existing exonic annotation subroutine to do this. Therefore, in the June 2011 version of ANNOVAR, I added the coding_change.pl program to infer the translated protein sequence before and after a mutation occurs.
Is it possible to add column names to the input file that are carried through the processing?
Some users routinely use extra columns and would like to include the column headers rather than having to edit the resulting ANNOVAR output (usually ANNOVAR will treat the line with column names as an "invalid" line and put it into the invalid_input file). This can be done with the -comment argument, which treats any input line starting with "#" as a comment line and does not discard it.
Can ANNOVAR call genotypes from sequencing data?
ANNOVAR does NOT generate "genotype calling". Dozens of other software tools can perform SNP calling from sequencing data. However, if the user refers to "assigning rs identifiers to SNPs", ANNOVAR can certainly be very helpful (see the example on filtering against dbSNP).
How to check if new version of ANNOVAR is available?
Either go to the ANNOVAR website to see what the latest version is and compare it to your current version (typing annotate_variation.pl without arguments will print out version information), or use annotate_variation.pl -downdb null . to enable automatic web-based checking for a new version without downloading any database.
How to list all annotation databases in ANNOVAR web server?
You can use -webfrom annovar -downdb avdblist to see a list of files, file sizes and time stamps. This only works for the human genome though.
How to handle OMIM data?
Many people studying Mendelian diseases perhaps are interested in annotating variants against the OMIM database. However, the 16 June 2011 News from UCSC shows that although they released newly re-engineered OMIM tracks for both hg18 and hg19, "the OMIM data are the property of Johns Hopkins University and will not be available for download from UCSC". For now, you can just go to http://omim.org/downloads, fill out the forms and get a copy of the data. I cannot make a derivative database for you, per their guideline. If you only need a gene symbol to OMIM ID mapping, you can get that data from HGNC here: http://www.genenames.org/cgi-bin/hgnc_downloads
What is the version of ENSEMBL used in ANNOVAR?
ANNOVAR retrieves ensGene definition from UCSC, so it depends on the version that UCSC has used. For human hg19 build, just go to http://www.genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=211204337&g=ensGene, and see what is the latest release for Ensembl gene prediction.
How to annotate ENSEMBL gene on the hg38 genome coordinate?
The ensGene file for hg38 is not provided by ANNOVAR because UCSC did not generate this file. However, a user pointed out that UCSC has replaced ensGene.txt with GENCODE V26 (the wgEncodeGencodeCompV26.txt track). Both files contain the same information. Therefore, if you want to annotate Ensembl genes based on hg38, you should use the GENCODE file instead. Detailed instructions are given in the gene-based annotation page.
How ANNOVAR handles different coordinate systems for mitochondria?
UCSC's build (for example, hg19) differs from NCBI's build (for example, NCBI 37) in a few subtle ways, for example, replacing contigs by chr_random, and the use of different mitochondrial assemblies. UCSC's hg19 assembly used the old version of the mitochondrial genome (NC_001807), but the 1000 Genomes consortium has replaced chrM with the latest Cambridge Reference Sequence version (NC_012920). So if you align your sequence data and call variants against NC_012920, then you cannot really annotate your variants using UCSC's gene definition. It is necessary to stick with identical coordinates. For autosomes and chrX/Y, this is not a real issue as they are quite consistent.
In addition, for most organisms the "stop codons" are “UAA”, “UAG”, and “UGA”. In vertebrate mitochondria “AGA” and “AGG” are also stop codons, but not “UGA”, which codes for tryptophan instead. “AUA” codes for isoleucine in most organisms but for methionine in vertebrate mitochondrial mRNA.
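The codon differences listed above can be written down as a small lookup. This is only an illustrative sketch that covers the codons mentioned here and nothing else; translate_codon is a hypothetical helper, and the assignment of AGA and AGG to arginine in the standard code is a fact added here, not stated above.

```shell
# Translate a few RNA codons under the standard or vertebrate mitochondrial code.
# Only the codons discussed above are covered; anything else prints "other".
# translate_codon is a hypothetical helper, not part of ANNOVAR.
translate_codon() {
    # $1: codon in RNA letters, $2: "standard" or "vertmt"
    case "$2:$1" in
        standard:UAA|standard:UAG|standard:UGA)      echo stop ;;
        standard:AGA|standard:AGG)                   echo Arg ;;
        standard:AUA)                                echo Ile ;;
        vertmt:UAA|vertmt:UAG|vertmt:AGA|vertmt:AGG) echo stop ;;
        vertmt:UGA)                                  echo Trp ;;
        vertmt:AUA)                                  echo Met ;;
        *)                                           echo other ;;
    esac
}

for codon in UGA AGA AUA; do
    echo "$codon standard=$(translate_codon "$codon" standard) vertmt=$(translate_codon "$codon" vertmt)"
done
```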
How to get -downdb to work if I am behind a proxy server?
The -downdb argument uses wget by default, without any arguments. You can add -nowget to the command line, so that Perl HTTP/FTP modules are used instead, which should handle proxies well. Or you can modify the ANNOVAR source code to use wget with proxy functionality.
How to download databases not stored in UCSC or ANNOVAR-DB?
In general, you just need to manually download these databases, reformat them to the standard ANNOVAR genericdb format (Chr, Start, End, Ref, Alt, and other information), and use them. Occasionally, you may also automate the process by supplying the URL directly; for example, to download Regulome, you can run perl annotate_variation.pl --downdb --webfrom http://www.regulomedb.org/downloads/ RegulomeDB.dbSNP141 /Users/user/Desktop/annovar/humandb.
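The reformatting step is usually a one-line column rearrangement. The sketch below assumes a hypothetical tab-delimited source table with columns chromosome, 1-based position, ref, alt and score, containing only SNVs and block substitutions; indels would need extra handling and are not covered. The helper name to_genericdb and the data are invented.

```shell
# Convert a simple table (chrom, pos, ref, alt, score) into the
# Chr, Start, End, Ref, Alt, other-information layout.
# to_genericdb is a hypothetical helper; indels are not handled.
to_genericdb() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                                   # skip comment lines
         { print $1, $2, $2 + length($3) - 1, $3, $4, $5 }' "$1"
}

work=$(mktemp -d)
printf '#chrom\tpos\tref\talt\tscore\n' > "$work/source.tsv"
printf '1\t100\tA\tG\t0.5\n' >> "$work/source.tsv"
printf '2\t200\tAC\tGT\t0.1\n' >> "$work/source.tsv"

to_genericdb "$work/source.tsv" > "$work/toy_generic.txt"
cat "$work/toy_generic.txt"
```

The End column is derived from the length of the reference allele, so a single-base change gets identical Start and End values.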
How to handle MAF files from TCGA?
You can use this script to convert MAF to ANNOVAR input format and then annotate the file.
How to further speed up ANNOVAR?
You can use the -thread argument (if your operating system and your Perl build support it), so that multi-threading is used to process the input files in parallel. However, it is extremely important that your database directory (for example, the humandb/ directory) can accommodate random disk access well. Typically, if you use a very large number of threads, you have to use an SSD to achieve satisfactory performance; mechanical drives cannot cope with this for most large databases. Additionally, borrowing an idea from an ANNOVAR user, if you have a machine with a lot of memory, you can simply create a RAM disk that treats a portion of the memory as a hard drive, and then copy humandb into this RAM disk. For example, run mount -t tmpfs -o size=100G tmpfs /tmp/newhumandb/, followed by sysctl vm.swappiness=1 to reduce swappiness, and then use /tmp/newhumandb to store databases and perform annotation.