Here is a handy unix one-liner to process mutect2 output VCF files into the 5 column, tab-separated format required by Oncotator for input (Oncotator is a web-based application that annotates human genomic point mutations and indels with transcripts and consequences). The output of Oncotator is a MAF-formatted file that is compatible with MutSigCV.
#!/bin/bash
FILES='*.vcf.gz'
for file in $FILES
do
zcat $file | grep -v "GL000*" | grep -v "FILTER" | grep "PASS" | cut -d$'\t' -f 1-5 | awk '$3=$2' | awk '$1="chr"$1' > $file.tsv
done
Breaking this down we have:
“zcat $file” : read to stdout each line of a gzipped file
“grep -v “GL000*” : exclude any variant that doesn’t map to a named chromosome
“grep -v “FILTER” : exclude filter header lines
“grep “PASS””: include all lines that pass mutect2 filters
“cut -d$’\t’ -f 1-5” : cut on tabs and keep fields one through five
“awk ‘$3=$2’ : set column 3 equal to column 2, i.e., start and end position are equal
“awk $1=’chr’$1″” : set column one equal to ‘chr’ plus column one (make 1 = chr1)
Eleni asked:
Thank you for your informative and helpful blog. I was wondering whether you could explain how the grep “PASS” works on mutect2 vcf files that you recommend as part of the code for converting vcf files to Oncotator format. I am new to this area and thought that mutect 2 vcf files have a dot instead of PASS in their annotations. Is that correct?
My response:
To answer your question, the VCF output I was working with looked like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H1 H2
1 14653 . C T . PASS BaseQRankSum=1.736;ClippingRankSum=0;DP=97;ECNT=1;FS=1.576;HCNT=4;MAX_ED=.;MIN_ED=.;MQ=47.4;MQ0=8;MQRankSum=-0.952;NLOD=2.42;ReadPosRankSum=-1.204;TLOD=7.01;ANN=T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000456328|processed_transcript||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000515242|transcribed_unprocessed_pseudogene||n.*241C>T|||||241|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000518655|transcribed_unprocessed_pseudogene||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000450305|transcribed_unprocessed_pseudogene||n.*983C>T|||||983|,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000488147|unprocessed_pseudogene|10/10|n.1254-152G>A||||||,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000538476|unprocessed_pseudogene|12/12|n.1492-151G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000438504|unprocessed_pseudogene|12/12|n.1493G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000541675|unprocessed_pseudogene|9/9|n.1126G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000423562|unprocessed_pseudogene|10/10|n.1379G>A|||||| GT:AD:AF:DP 0/1:35,4:0.108:39 0/0:44,3:0.045:47
You’ll notice the FILTER field. This contained either ‘PASS’ or ‘FAIL’ if I recall. The VCF header specifies FILTER to mean:
##fileformat=VCFv4.2
##FILTER=ID=PASS,Description="All filters passed"
##FILTER=ID=alt_allele_in_normal,Description="Evidence seen in the normal sample"
##FILTER=D=clustered_events,Description="Clustered events observed in the tumor"
##FILTER=ID=germline_risk,Description="Evidence indicates this site is germline, not somatic"
##FILTER=ID=homologous_mapping_event,Description="More than three events were observed in the tumor"
##FILTER=ID=multi_event_alt_allele_in_normal,Description="Multiple events observed in tumor and normal"
##FILTER=ID=panel_of_normals,Description="Seen in at least 2 samples in the panel of normals"
##FILTER=ID=str_contraction,Description="Site filtered due to contraction of short tandem repeat region"
##FILTER=ID=t_lod_fstar,Description="Tumor does not meet likelihood threshold"
##FILTER=ID=triallelic_site,Description="Site filtered because more than two alt alleles pass tumor LOD"
I just took variants that passed these filters since we had to find a way to narrow down those we wanted to analyze further. Mutect2 is still in beta status and has probably changed in the year since I did this work, so this may no longer be current.
Hope this helps and thanks for reading!
-Michael