Here is a handy Unix one-liner to convert Mutect2 output VCF files into the five-column, tab-separated format that Oncotator requires as input. (Oncotator is a web-based application that annotates human genomic point mutations and indels with transcripts and consequences.) The output of Oncotator is a MAF-formatted file that is compatible with MutSigCV.
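for file in $FILES; do
  # -v OFS='\t' keeps the columns tab-separated when awk rebuilds each line
  zcat "$file" | grep -v "GL000*" | grep -v "FILTER" | grep "PASS" | cut -d$'\t' -f 1-5 | awk -v OFS='\t' '$3=$2' | awk -v OFS='\t' '$1="chr"$1' > "$file.tsv"
done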
Breaking this down, we have:
“zcat $file” : decompress a gzipped file and write its lines to stdout
“grep -v "GL000*"” : exclude any variant that doesn't map to a named chromosome (i.e., variants on the unplaced GL000 contigs)
“grep -v "FILTER"” : exclude the header lines that mention FILTER
“grep "PASS"” : keep only lines that pass the Mutect2 filters
“cut -d$'\t' -f 1-5” : split on tabs and keep fields one through five
“awk '$3=$2'” : set column 3 equal to column 2, i.e., start and end position are equal
“awk '$1="chr"$1'” : set column one equal to "chr" plus column one (turn 1 into chr1)
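For readers who prefer a single step, the same filtering and column rewriting can be sketched in one awk program. This is only an illustration, not the original one-liner: the function name vcf_to_oncotator is made up, gzip -dc stands in for zcat, and the sketch assumes FILTER is the seventh tab-separated column. Unlike grep "PASS", which matches the word anywhere on the line, it tests the FILTER column exactly. Setting OFS to a tab matters because awk joins output fields with OFS, which defaults to a single space.

```shell
#!/usr/bin/env bash
# Sketch: convert one gzipped VCF to five tab-separated columns
# (chr, start, end, ref, alt). The function name is hypothetical.
# Usage: vcf_to_oncotator sample.vcf.gz > sample.vcf.gz.tsv
vcf_to_oncotator() {
    # gzip -dc is equivalent to zcat
    gzip -dc "$1" | awk 'BEGIN { FS = OFS = "\t" }
        /^#/          { next }   # skip all header lines
        $1 ~ /^GL000/ { next }   # skip unplaced GL000 contigs
        $7 != "PASS"  { next }   # keep only records whose FILTER is exactly PASS
        { print "chr" $1, $2, $2, $4, $5 }'
}
```

Inside the loop it would be called as vcf_to_oncotator "$file" > "$file.tsv".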
One Reply to “Unix one-liner to convert VCF to Oncotator format”
Thank you for your informative and helpful blog. I was wondering whether you could explain how the grep “PASS” works on mutect2 vcf files that you recommend as part of the code for converting vcf files to Oncotator format. I am new to this area and thought that mutect 2 vcf files have a dot instead of PASS in their annotations. Is that correct?
To answer your question, the VCF output I was working with looked like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H1 H2
You’ll notice the FILTER field. If I recall, this contained either ‘PASS’ or the name of the filter(s) the variant failed. The VCF header defines the possible FILTER values as:
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=alt_allele_in_normal,Description="Evidence seen in the normal sample">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=germline_risk,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=homologous_mapping_event,Description="More than three events were observed in the tumor">
##FILTER=<ID=multi_event_alt_allele_in_normal,Description="Multiple events observed in tumor and normal">
##FILTER=<ID=panel_of_normals,Description="Seen in at least 2 samples in the panel of normals">
##FILTER=<ID=str_contraction,Description="Site filtered due to contraction of short tandem repeat region">
##FILTER=<ID=t_lod_fstar,Description="Tumor does not meet likelihood threshold">
##FILTER=<ID=triallelic_site,Description="Site filtered because more than two alt alleles pass tumor LOD">
I just took variants that passed these filters since we had to find a way to narrow down those we wanted to analyze further. Mutect2 is still in beta status and has probably changed in the year since I did this work, so this may no longer be current.
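Since the FILTER values can differ between Mutect2 versions, it may be worth checking what a given file actually contains before relying on grep "PASS". The following is a minimal sketch for that, assuming FILTER is the seventh tab-separated column as in the header line shown earlier; the function name filter_counts is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: tally the values found in the FILTER column (field 7) of a gzipped VCF.
# The function name is hypothetical.
# Usage: filter_counts sample.vcf.gz
filter_counts() {
    # gzip -dc is equivalent to zcat
    gzip -dc "$1" | awk 'BEGIN { FS = "\t" }
        !/^#/ { n[$7]++ }                          # count each FILTER value on non-header lines
        END   { for (f in n) print n[f] "\t" f }'  # one line per value: count, tab, value
}
```

If the tally shows only dots rather than PASS, the file has probably not been through a filtering step, and grep "PASS" would return nothing.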
Hope this helps and thanks for reading!