Unix one-liner to convert VCF to Oncotator format

Here is a handy unix one-liner to process mutect2 output VCF files into the 5 column, tab-separated format required by Oncotator for input (Oncotator is a web-based application that annotates human genomic point mutations and indels with transcripts and consequences). The output of Oncotator is a MAF-formatted file that is compatible with MutSigCV.

#!/bin/bash
FILES='*.vcf.gz'
for file in $FILES
do
zcat $file | grep -v "GL000*" | grep -v "FILTER" | grep "PASS" | cut -d$'\t' -f 1-5 | awk '$3=$2' | awk '$1="chr"$1' > $file.tsv
done

Breaking this down we have:

“zcat $file” :  read to stdout each line of a gzipped file

“grep -v “GL000*” :  exclude any variant that doesn’t map to a  named chromosome

“grep -v “FILTER” : exclude filter header lines

“grep “PASS””:  include all lines that pass mutect2 filters

“cut -d$’\t’ -f 1-5”  : cut on tabs and keep fields one through five

“awk ‘$3=$2’ :  set column 3 equal to column 2, i.e., start and end position are equal

“awk $1=’chr’$1″” : set column one equal to ‘chr’ plus column one (make 1 = chr1)

 

One Reply to “Unix one-liner to convert VCF to Oncotator format”

  1. Eleni asked:
    Thank you for your informative and helpful blog. I was wondering whether you could explain how the grep “PASS” works on mutect2 vcf files that you recommend as part of the code for converting vcf files to Oncotator format. I am new to this area and thought that mutect 2 vcf files have a dot instead of PASS in their annotations. Is that correct?

    My response:
    To answer your question, the VCF output I was working with looked like this:

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H1 H2
    1 14653 . C T . PASS BaseQRankSum=1.736;ClippingRankSum=0;DP=97;ECNT=1;FS=1.576;HCNT=4;MAX_ED=.;MIN_ED=.;MQ=47.4;MQ0=8;MQRankSum=-0.952;NLOD=2.42;ReadPosRankSum=-1.204;TLOD=7.01;ANN=T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000456328|processed_transcript||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000515242|transcribed_unprocessed_pseudogene||n.*241C>T|||||241|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000518655|transcribed_unprocessed_pseudogene||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000450305|transcribed_unprocessed_pseudogene||n.*983C>T|||||983|,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000488147|unprocessed_pseudogene|10/10|n.1254-152G>A||||||,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000538476|unprocessed_pseudogene|12/12|n.1492-151G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000438504|unprocessed_pseudogene|12/12|n.1493G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000541675|unprocessed_pseudogene|9/9|n.1126G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000423562|unprocessed_pseudogene|10/10|n.1379G>A|||||| GT:AD:AF:DP 0/1:35,4:0.108:39 0/0:44,3:0.045:47

    You’ll notice the FILTER field. This contained either ‘PASS’ or ‘FAIL’ if I recall. The VCF header specifies FILTER to mean:

    ##fileformat=VCFv4.2
    ##FILTER=ID=PASS,Description="All filters passed"
    ##FILTER=ID=alt_allele_in_normal,Description="Evidence seen in the normal sample"
    ##FILTER=D=clustered_events,Description="Clustered events observed in the tumor"
    ##FILTER=ID=germline_risk,Description="Evidence indicates this site is germline, not somatic"
    ##FILTER=ID=homologous_mapping_event,Description="More than three events were observed in the tumor"
    ##FILTER=ID=multi_event_alt_allele_in_normal,Description="Multiple events observed in tumor and normal"
    ##FILTER=ID=panel_of_normals,Description="Seen in at least 2 samples in the panel of normals"
    ##FILTER=ID=str_contraction,Description="Site filtered due to contraction of short tandem repeat region"
    ##FILTER=ID=t_lod_fstar,Description="Tumor does not meet likelihood threshold"
    ##FILTER=ID=triallelic_site,Description="Site filtered because more than two alt alleles pass tumor LOD"

    I just took variants that passed these filters since we had to find a way to narrow down those we wanted to analyze further. Mutect2 is still in beta status and has probably changed in the year since I did this work, so this may no longer be current.

    Hope this helps and thanks for reading!

    -Michael

Leave a Reply

Your email address will not be published. Required fields are marked *