Unix one-liner to convert VCF to Oncotator format

Here is a handy unix one-liner to process mutect2 output VCF files into the 5 column, tab-separated format required by Oncotator for input (Oncotator is a web-based application that annotates human genomic point mutations and indels with transcripts and consequences). The output of Oncotator is a MAF-formatted file that is compatible with MutSigCV.

#!/bin/bash
FILES='*.vcf.gz'
for file in $FILES
do
zcat $file | grep -v "GL000*" | grep -v "FILTER" | grep "PASS" | cut -d$'\t' -f 1-5 | awk '$3=$2' | awk '$1="chr"$1' > $file.tsv
done

Breaking this down we have:

“zcat $file” :  read to stdout each line of a gzipped file

“grep -v “GL000*” :  exclude any variant that doesn’t map to a  named chromosome

“grep -v “FILTER” : exclude filter header lines

“grep “PASS””:  include all lines that pass mutect2 filters

“cut -d$’\t’ -f 1-5”  : cut on tabs and keep fields one through five

“awk ‘$3=$2’ :  set column 3 equal to column 2, i.e., start and end position are equal

“awk $1=’chr’$1″” : set column one equal to ‘chr’ plus column one (make 1 = chr1)

 

A Unix one-liner to scrape GI numbers from a SAM file

I recently had a situation where I needed to scrape out all of the GI numbers from a SAM alignment file.  My first instinct was to turn to python to accomplish this, but I forced myself to find a command line tool or set of tools to quickly do this task as a one-liner.  First, here is the format of the first two lines of the file:

…..

HWI-D00635:61:C6RH0ANXX:4:1101:3770:8441 0 gi|599154892|gb|EYE94125.1| 1066 255 41M * 0 0 SNPDEMDGNILPWMVHLKRMALEVLKHLWSSKLAAFFTLSE * AS:i:88 NM:i:0 ZL:i:2534 ZR:i:217 ZE:f:3.9e-15 ZI:i:100 ZF:i:-2 ZS:i:125 MD:Z:41

HWI-D00635:61:C6RH0ANXX:4:1101:3770:8441 0 gi|115387347|ref|XP_001211179.1| 1065 255 41M * 0 0 SNPDEMDGNILPWMVHLKRMALEVLKHLWSSKLAAFFTLSE * AS:i:77 NM:i:8 ZL:i:1670 ZR:i:188 ZE:f:8.9e-12 ZI:i:80 ZF:i:-2 ZS:i:125 MD:Z:1Y3V3V11F11SSY3A1

….

You can see that what I want is the information between the pipes in the field “gi|#########|” .   Here is how I solved this with a bash script:


#!/bin/bash
for file in *.sam
do
echo ${file}
cat ${file} | egrep -o  "\|\S*\|(\S*)\|" | sed 's/|/,/g' | cut -f 2 -d ',' > ${file}.out
done

To unpack this briefly, the “cat” command outputs each line of the file to stdout, which is redirected to “egrep.”   Egrep looks for the regular expression “\|\S*\|(\S*)\|”.   This expression searches for a pipe, followed by any number of characters, with another pipe, then more characters, then another pipe.  The pipes are escaped with a backslash “\”.

The next step is to pipe to “sed”, which takes the incoming stream and replaces the pipes with commas.  This output is sent to “cut”, which uses the commas as delimiters, and takes the second field.

There are probably shorter ways to do this (cutting on pipes, for example), but already attempting this at the command line saved me a lot of time over coding this in python.