Decode SAM files with these handy references

I recently had to inspect some genomic alignments as part of a project.  Usually, I am just working with BAM files and if inspection is needed, I just visualize the pileups to see what is going on.

In this case, I just wanted a quick answer to how the reads were aligning to the reference, and I didn’t want to go through the process of subsetting and copying the BAM files to my local machine.

The SAM file is the uncompressed record of the read alignments produced by an aligner method (STAR, TopHat, BWA, etc….).   This file can get very large, and so is usually compressed into BAM (faster for machine parsing, but not human readable) and the SAM file is discarded.

In my case, I still had the SAM files around to inspect.  If you find yourself needing to read a SAM file, here are three helpful reference tools to make the process less painful:

1)  This page has an enormous amount of detail about SAM files including this helpful chart that enumerates all of the fields that you can expect to find specified within each alignment:

SAM file structure explained in this handy chart.

 

2) This post from the blog “zenfractal.com” contains a great exposition on CIGAR strings and how to decode them:

3)  And finally, if you’re trying to decode the SAM bitwise flags, you can calculate them using this tool from the Broad Institute:

Decode SAM flags with this handy online tool from the Broad Institute.

 

 

A Unix one-liner to scrape GI numbers from a SAM file

I recently had a situation where I needed to scrape out all of the GI numbers from a SAM alignment file.  My first instinct was to turn to python to accomplish this, but I forced myself to find a command line tool or set of tools to quickly do this task as a one-liner.  First, here is the format of the first two lines of the file:

…..

HWI-D00635:61:C6RH0ANXX:4:1101:3770:8441 0 gi|599154892|gb|EYE94125.1| 1066 255 41M * 0 0 SNPDEMDGNILPWMVHLKRMALEVLKHLWSSKLAAFFTLSE * AS:i:88 NM:i:0 ZL:i:2534 ZR:i:217 ZE:f:3.9e-15 ZI:i:100 ZF:i:-2 ZS:i:125 MD:Z:41

HWI-D00635:61:C6RH0ANXX:4:1101:3770:8441 0 gi|115387347|ref|XP_001211179.1| 1065 255 41M * 0 0 SNPDEMDGNILPWMVHLKRMALEVLKHLWSSKLAAFFTLSE * AS:i:77 NM:i:8 ZL:i:1670 ZR:i:188 ZE:f:8.9e-12 ZI:i:80 ZF:i:-2 ZS:i:125 MD:Z:1Y3V3V11F11SSY3A1

….

You can see that what I want is the information between the pipes in the field “gi|#########|” .   Here is how I solved this with a bash script:


#!/bin/bash
for file in *.sam
do
echo ${file}
cat ${file} | egrep -o  "\|\S*\|(\S*)\|" | sed 's/|/,/g' | cut -f 2 -d ',' > ${file}.out
done

To unpack this briefly, the “cat” command outputs each line of the file to stdout, which is redirected to “egrep.”   Egrep looks for the regular expression “\|\S*\|(\S*)\|”.   This expression searches for a pipe, followed by any number of characters, with another pipe, then more characters, then another pipe.  The pipes are escaped with a backslash “\”.

The next step is to pipe to “sed”, which takes the incoming stream and replaces the pipes with commas.  This output is sent to “cut”, which uses the commas as delimiters, and takes the second field.

There are probably shorter ways to do this (cutting on pipes, for example), but already attempting this at the command line saved me a lot of time over coding this in python.