I submitted a SeqMule job to SGE and it aborted without any error message. What went wrong?
One possible cause is that SeqMule exceeded the requested memory limit. To troubleshoot, first run qacct -j JOBID to check the job's exit status (replace JOBID with the actual job ID assigned by SGE). You will then see something like this (assuming accounting_summary is enabled for the parallel environment):
$ qacct -j 584552
==============================================================
qname        all.q
hostname     compute-0-3.local
group        wanglab
owner        user
project      NONE
department   defaultdepartment
jobname      STDIN
jobnumber    584552
taskid       undefined
account      sge
priority     0
qsub_time    Sat Jan 30 14:53:20 2016
start_time   Sat Jan 30 14:53:31 2016
end_time     Sat Jan 30 21:11:33 2016
granted_pe   smp
slots        12
failed       100 : assumedly after job
exit_status  137      #137-128=9, which is SIGKILL
ru_wallclock 22682
ru_utime     0.080
ru_stime     0.037
ru_maxrss    2760
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    16265
ru_majflt    0
ru_nswap     0
ru_inblock   16
ru_oublock   32
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     337
ru_nivcsw    62
cpu          66629.510
mem          338134.237
io           1547.102
iow          0.000
maxvmem      52.969G
arid         undefined
The exit_status line indicates that the exit code is 137; 137 - 128 = 9, which is SIGKILL. The maxvmem line tells us the maximum memory used was 52.969G. Since we asked for only 4G per CPU and were granted 12 slots, the job was supposed to use at most 4G * 12 = 48G of memory in total. This explains why SGE killed our job without letting SeqMule complain. Under such circumstances, we can request more memory by adjusting the -l h_vmem option of the qsub command, for example changing -l h_vmem=4G to -l h_vmem=6G. Alternatively, we can decrease SeqMule's memory usage via the -jmem option of the seqmule pipeline command, for example changing -jmem 1750m (the default) to -jmem 1024m. WARNING: lowering -jmem only helps if GATK is the component using too much memory, and it may have the side effect of leaving GATK with insufficient memory.
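For example, a resubmission with more memory per slot might look like this (a sketch only; the parallel environment name smp and the slot count are taken from the qacct output above, and your cluster's settings may differ):

# request 6G per slot; with 12 slots the job may use up to 72G in total
echo 'your_seqmule_command' | qsub -V -cwd -S /bin/bash -pe smp 12 -l h_vmem=6G -N sample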
Is there a log file showing the runtime error?
SeqMule only saves runtime parameters to the *.log file. If you want to capture the runtime output after running SeqMule in the background (or submitting it to a cluster), please use nohup your_seqmule_command > output.txt &. nohup keeps your command running even after you log out, and all messages that would have been printed on screen will be saved in the output.txt file. In addition, you can append 2> stderr.txt to your SeqMule command; the STDOUT and STDERR messages (all error messages go to STDERR) will then be saved in output.txt and stderr.txt, respectively.
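The two redirections can also be combined with nohup (your_seqmule_command is a placeholder for your actual command):

# STDOUT goes to output.txt, STDERR goes to stderr.txt, and the job survives logout
nohup your_seqmule_command > output.txt 2> stderr.txt &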
How to debug a runtime error?
When SeqMule aborts due to a runtime error, you will see a message like this:
----------ERROR----------
[ => SeqMule Execution Status: step 63 FAILED at Mon Apr 4 13:51:12 PDT 2016, Merge split VCF]
ERROR: command failed
/home/yunfeiguo/projects/seqmule_working/bin/secondary/../../bin/secondary/worker /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs 63 "/home/yunfeiguo/projects/seqmule_working/bin/secondary/../../bin/seqmule stats -tmpdir /tmp -ref /home/yunfeiguo/projects/seqmule_working/database/human_g1k_v37.fasta -jmem 1750m -jexe java -t 2 -u-vcf ..."
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
After fixing the problem, please execute 'cd /home/yunfeiguo/projects/variant_call/20160404' and 'seqmule run myAnalysisName.script' to resume analysis.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
This message does not tell you the specific cause of the error. To find that out, you have to look at the error messages of the failing program. In the above example, the failing step is step 63, and all runtime information is stored in /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs. Therefore, the messages for this step can be found in /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs/63. For example:
$ ls /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs/63
JOBID.625242  MSG  PID.10739  STATUS.error  stderr  stdout
JOBID.625242 shows the job ID assigned by SGE. If you are not running SeqMule via SGE, you can ignore it. Otherwise, run qacct -j 625242 (substitute 625242 with the real job ID) to check whether SGE has anything to say about this execution. Watch out for memory usage and time limits: if the program used more resources than requested, SGE will kill it silently.
MSG is a human-readable message telling us what step 63 does. Here MSG says 'Merge split VCF'.
PID.10739 records the process ID of the program.
STATUS.error tells SeqMule that this program exited with an error.
stderr contains the STDERR output from the program; it is usually the most helpful file for understanding why the program failed.
stdout contains the STDOUT output; it can show useful things such as execution progress, but most of the time it is less helpful for debugging.
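Putting it together, a typical debugging session for the example above might look like this (paths are taken from the example and will differ on your system):

# read the step description and its error output
cat /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs/63/MSG
cat /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs/63/stderr
# after fixing the underlying problem, resume the analysis
cd /home/yunfeiguo/projects/variant_call/20160404
seqmule run myAnalysisName.script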
Why did I get a 'set: Variable name must begin with a letter.' error when I tried to run the analysis script via qsub?
The SeqMule analysis script contains set -e at the beginning to make it exit at the first error. If your job scheduling system (e.g. SGE) tries to execute the script with csh or tcsh, it will return the above error. Please tell qsub to use bash instead. For example, under SGE, you can use:
echo 'your_seqmule_command' | qsub -V -cwd -S /bin/bash -N sample
How to solve the GATK
ERROR MESSAGE: Unable to parse header with error: /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub116224981.tmp (Too many open files), for input source: /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub116224981.tmp?
This happens because Linux limits the number of files a process can open. The easiest fix is to use the -gatknt option in SeqMule to specify a smaller number of CPUs (e.g. 1 or 2) for GATK genotyping. Another option is to ask your system administrator to increase the limit.
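To see how tight the limit is on your system, you can check it before rerunning (the seqmule arguments other than -gatknt are placeholders):

# show the per-process open-file limit for the current shell
ulimit -n
# rerun GATK genotyping with a single CPU so fewer files are open simultaneously
seqmule pipeline -gatknt 1 [other arguments]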
Does SeqMule support merging of BAM files?
Yes. Run seqmule pipeline -m along with the other arguments. In this case, all BAM files will be merged, and therefore only one sample prefix is needed.
What if GATK complains about insufficient memory?
Use the -jmem option to give GATK more memory. Note, however, that if you run your analysis on a computing cluster, you have to request more memory from the scheduler than you specify with -jmem, because the Java virtual machine has memory overhead.
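For example, under SGE one might pair the two options like this (the numbers are illustrative; the point is that h_vmem exceeds -jmem):

# give each GATK JVM 2500m, but request 4G per slot to cover JVM overhead
echo 'seqmule pipeline -jmem 2500m [other arguments]' | qsub -V -cwd -S /bin/bash -l h_vmem=4G -N sample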
How does the QUICK mode work?
Under QUICK mode, the total region to be called is split into N parts, where N is the number of threads. Variant calling is then performed on each part simultaneously, and the resulting VCFs are merged afterwards. For GATK, variant filtering is not done until after merging, because VQSR filtering requires a large number of variants to calculate its statistics.
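Conceptually, the procedure resembles the following shell sketch. This is only an illustration using generic tools (split, samtools, bcftools) as stand-ins for SeqMule's internal logic; file names are hypothetical:

# split the capture BED into N chunks, one per thread
N=4
split -n l/$N -d capture.bed part_
# call variants on each chunk in parallel
for bed in part_*; do
    samtools mpileup -u -f ref.fa -l "$bed" sample.bam | \
        bcftools call -mv -Oz -o "$bed".vcf.gz &
done
wait
# merge the per-chunk VCFs into one call set (for GATK, filtering happens after this)
bcftools concat part_*.vcf.gz -Oz -o merged.vcf.gz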
Why are the numbers on the Venn diagram different from the statistics I got from the consensus VCF file?
The Venn diagram is plotted this way: ANNOVAR is used to convert VCF to AVINPUT; then, for each variant, chromosome+start+end+reference allele+observed allele is used as an ID (alleles are not used for indels) to determine whether two variants overlap. Statistics for the consensus result are calculated this way: GATK CombineVariants is used to extract consensus calls from multiple VCFs; if a VCF contains multiple samples, consensus extraction is carried out on a per-sample basis; finally, VCFtools is used to calculate statistics for the resulting consensus VCF. Both the Venn diagram and the statistics consist of two parts: SNVs and indels. For SNVs, the difference is very small (usually under 0.5%). For indels, a difference is expected, since in some cases there is no unique way to represent them. In addition, for calculating statistics, indels are intervalized (ignoring alleles) and extended by 10bp towards both ends; intervalization and extension are not done when obtaining consensus results, as the alleles must be reported.
What should I do if /tmp is too small, or I get the error message No space left on device?
You can specify a larger folder for storing temporary files. There are two ways to do this: first, set $TMPDIR in your environment variables; second, use the -tmpdir YOUR_TMP_DIRECTORY option of the seqmule pipeline or seqmule stats program.
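For example (the directory is a placeholder for any volume with enough free space):

# option 1: point TMPDIR at a larger volume
export TMPDIR=/home/user/big_tmp
# option 2: pass the directory directly to SeqMule
seqmule pipeline -tmpdir /home/user/big_tmp [other arguments]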
After I finished a SeqMule analysis, I realized that I used the wrong capture file. How should I address this? Can I just manually change 'finished' to 'waiting' for that step (in the script file)?
The capture file is typically used in multiple stages of the analysis (variant calling, stats calculation), so changing only one step might not be enough. One approach is to run seqmule pipeline -norun with the correct capture file, then rerun the analysis from right after alignment with seqmule run -n X (where X is the number of the first step after alignment); see the sketch below. If you understand the script, you can also modify the BED file name yourself and rerun the steps involving that file. Note, however, that it is currently not possible to rerun one particular step in isolation (unless it is the last step): seqmule run executes the script from the specified step (or resumes where it stopped) to the end.
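A sketch of this workflow (the capture file name and the step number are placeholders):

# regenerate the analysis script with the correct capture file, without running it
seqmule pipeline -norun -capture correct_regions.bed [other original arguments]
# resume from the first step after alignment (replace 10 with your actual step number)
seqmule run -n 10 myAnalysisName.script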
How to solve the GATK ExceptionInInitializerError?
Some users may see the following error message with GATK or GATKLite:
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ExceptionInInitializerError
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.<init>(GenomeAnalysisEngine.java:160)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.<init>(CommandLineExecutable.java:53)
        at org.broadinstitute.sting.gatk.CommandLineGATK.<init>(CommandLineGATK.java:54)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:90)
Caused by: java.lang.NullPointerException
        at org.reflections.Reflections.scan(Reflections.java:220)
        at org.reflections.Reflections.scan(Reflections.java:166)
        at org.reflections.Reflections.<init>(Reflections.java:94)
        at org.broadinstitute.sting.utils.classloader.PluginManager.<clinit>(PluginManager.java:77)
        ... 4 more
This error is likely caused by Java version incompatibility. Please make sure you are using Java 1.7 and make it your default Java program (put the folder containing java at the beginning of your PATH variable).
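For example (the JDK path is a placeholder for wherever Java 1.7 is installed on your system):

# confirm which java is currently picked up and its version
java -version
# put a Java 1.7 installation first on PATH so it becomes the default
export PATH=/usr/java/jdk1.7.0/bin:$PATH
java -version   # should now report 1.7.x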
How are default exome regions defined? Where do they come from?
The default exome definitions (to use them, specify -capture default with the seqmule pipeline command) come from the UCSC Genome Browser's RefSeq Gene track (refGene table). Only exons were included, plus 5bp at each end of each region. The regions were sorted and further processed to retain only chromosomes 1 to 22, X and Y for the hg18 and hg19 genome builds, and overlapping regions were merged. Exonic variants and splicing variants can therefore be included in the results; however, variants in UTRs and other regions may not show up. If you want to restrict your analysis to a particular capture kit, please refer to seqmule download -help to download common region definitions, or use your own BED file.
How to solve the
ERROR MESSAGE: Bad input: Error during negative model training. Minimum number of variants to use in training is larger than the whole call set. One can attempt to lower the --minNumBadVariants arugment but this is unsafe. issue?
Here GATK is complaining about too few variants for model training. This statistical model is used for VQSR (variant quality score recalibration), which is a better method for variant filtering. However, when your capture region is small (e.g. only a few genes), or your average depth is very low, the input may not be sufficient for model training, and hard filtering should be used instead. SeqMule has limited support for automatically switching to hard filtering (when the input BAM is smaller than 1GB for SNPs, and 15GB for indels). The automatic detection is not guaranteed to work, so a safe option is to set forceINDELHardFilter to 1 (in advanced_config) to enforce hard filtering; see the sketch after this paragraph. Note that there is a separate pair of such flags for each GATK calling method, namely the GATK UnifiedGenotyper, GATK HaplotypeCaller and GATKLite UnifiedGenotyper; they work independently.
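To locate the relevant flags, you can simply search the configuration file (a sketch; check the file itself for the exact flag names and format):

# list every hard-filter flag, one set per GATK calling method
grep -n 'HardFilter' advanced_config
# then set the appropriate forceINDELHardFilter / forceSNPHardFilter entries to 1 in an editor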
How much space is needed for SeqMule databases?
A full database bundle (e.g. hg18all) EACH needs approximately 60GB of space after decompression, which includes the reference genome, index files for BWA, Bowtie, Bowtie2, SNAP and SOAP, and databases for GATK. However, because compressed and decompressed files must coexist during decompression, you may need another 30 to 40GB of space during download. Note that this space requirement is only for the databases: your input data typically ranges from over 10GB to 100GB depending on your situation, and two to three times additional space (relative to the raw data) is needed to store intermediate analysis files. If space is limited, one option is to download only hg19 and its BWA index (seqmule download -d hg19,hg19ibwa) and disable GATK VQSR filtering (set forceSNPHardFilter to 1 in advanced_config) so that auxiliary variant databases are not needed; see the example below. With this option, SeqMule only needs about 10GB of space for databases.
If I have to manually download all the databases due to an unstable Internet connection, how should I place them in the database/ folder?
The database/ folder should have the following structure (for hg19) after unzipping:
database/
|-- 1000G_omni2.5.b37.vcf
|-- 1000G_omni2.5.b37.vcf.idx
|-- bowtie
|   |-- human_g1k_v37.1.ebwt
|   |-- human_g1k_v37.2.ebwt
|   |-- human_g1k_v37.3.ebwt
|   |-- human_g1k_v37.4.ebwt
|   |-- human_g1k_v37.rev.1.ebwt
|   `-- human_g1k_v37.rev.2.ebwt
|-- bowtie2
|   |-- human_g1k_v37.1.bt2
|   |-- human_g1k_v37.2.bt2
|   |-- human_g1k_v37.3.bt2
|   |-- human_g1k_v37.4.bt2
|   |-- human_g1k_v37.rev.1.bt2
|   `-- human_g1k_v37.rev.2.bt2
|-- bwa
|   |-- human_g1k_v37.fasta -> /absolute_path_to/database/human_g1k_v37.fasta
|   |-- human_g1k_v37.fasta.amb
|   |-- human_g1k_v37.fasta.ann
|   |-- human_g1k_v37.fasta.bwt
|   |-- human_g1k_v37.fasta.fai
|   |-- human_g1k_v37.fasta.pac
|   `-- human_g1k_v37.fasta.sa
|-- dbsnp_hg19_138.vcf
|-- dbsnp_hg19_138.vcf.idx
|-- hapmap_3.3.b37.vcf
|-- hapmap_3.3.b37.vcf.idx
|-- human_g1k_v37.dict
|-- human_g1k_v37.fasta
|-- human_g1k_v37.fasta.fai
|-- Mills_and_1000G_gold_standard.indels.b37.vcf
|-- Mills_and_1000G_gold_standard.indels.b37.vcf.idx
|-- snap
|   |-- human_g1k_v37.fasta
|   |   |-- Genome
|   |   |-- GenomeIndex
|   |   |-- GenomeIndexHash
|   |   `-- OverflowTable
`-- soap
    |-- human_g1k_v37.fasta.index.amb
    |-- human_g1k_v37.fasta.index.ann
    |-- human_g1k_v37.fasta.index.bwt
    |-- human_g1k_v37.fasta.index.fmv
    |-- human_g1k_v37.fasta.index.hot
    |-- human_g1k_v37.fasta.index.lkt
    |-- human_g1k_v37.fasta.index.pac
    |-- human_g1k_v37.fasta.index.rev.bwt
    |-- human_g1k_v37.fasta.index.rev.fmv
    |-- human_g1k_v37.fasta.index.rev.lkt
    |-- human_g1k_v37.fasta.index.rev.pac
    |-- human_g1k_v37.fasta.index.sa
    `-- human_g1k_v37.fasta.index.sai
Why do I see 'Could not find any mapped reads in target region ...'?
It is a warning message from FreeBayes. SeqMule supplies FreeBayes with a BED file specifying which regions to look at; if some of those regions contain no mapped reads, FreeBayes emits the above warning.
I am confused about file naming, which one should I use for downstream analysis?
The file format for storing variants is VCF. The name of each VCF file indicates how it was generated. For example, suppose we have the following files:
patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call.extract.vcf
patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call_union.vcf.idx
patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call_var_stat.txt
patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call.vcf
patientX is your sample name.
0 is a number used internally.
bwamem is the aligner.
sort.rmdup means the alignments have been sorted and duplicate reads have been removed after alignment.
varscan is the variant caller.
somatic-call means the variants were generated under somatic variant calling mode.
extract means the VCF has been restricted to the regions defined by the capture file.
The *.idx file is a VCF index; you can ignore it.
The *_var_stat.txt file contains variant statistics.
If you used multiple variant callers and enabled consensus variant calling, consensus will appear in the file name.
Usually, the extract file is the one you want to use for downstream analysis.
Which file is used as the default advanced configuration?
misc/advanced_config is the default advanced configuration file. It enables BWA-MEM, GATKLite, SAMtools, and FreeBayes; please refer to the file itself for detailed parameters and steps.
Copyright 2014 USC Wang Lab