I recently had to inspect some genomic alignments as part of a project. Usually I work with BAM files, and when inspection is needed I visualize the pileups to see what is going on.
In this case, I wanted a quick answer to how the reads were aligning to the reference, and I didn’t want to go through the process of subsetting the BAM files and copying them to my local machine.
The SAM file is the uncompressed, plain-text record of the read alignments produced by an aligner (STAR, TopHat, BWA, etc.). This file can get very large, so it is usually compressed into BAM (faster for machine parsing, but not human readable) and the SAM file is discarded.
In my case, I still had the SAM files around to inspect. If you find yourself needing to read a SAM file, here are three helpful reference tools to make the process less painful:
1) This page has an enormous amount of detail about SAM files including this helpful chart that enumerates all of the fields that you can expect to find specified within each alignment:
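Those eleven mandatory fields are tab-separated, so a quick look rarely needs more than a one-liner. As a minimal sketch (the alignment record below is made up for illustration):

```python
# The 11 mandatory SAM fields, in the order the spec defines them.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# A made-up example alignment line (optional tags after field 11 are ignored).
line = "read1\t99\tchr1\t7\t60\t8M2I4M\t=\t37\t39\tTTAGATAAAGGATA\tIIIIIIIIIIIIII"

record = dict(zip(SAM_FIELDS, line.split("\t")[:11]))
print(record["RNAME"], record["POS"], record["CIGAR"])  # chr1 7 8M2I4M
```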
2) This post from the blog “zenfractal.com” contains a great exposition on CIGAR strings and how to decode them:
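Once you know the operation codes, decoding a CIGAR string mechanically is a few lines of regex. A sketch (the operation letters and which of them consume reference bases come from the SAM spec):

```python
import re

# Decode a CIGAR string into (length, operation) pairs.
# M/I/D/N/S/H/P/=/X are the operations defined by the SAM spec.
def parse_cigar(cigar):
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def aligned_ref_span(cigar):
    # Only M, D, N, =, and X consume reference bases;
    # insertions and clipping do not.
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

print(parse_cigar("3M1I3M1D5M"))       # [(3, 'M'), (1, 'I'), (3, 'M'), (1, 'D'), (5, 'M')]
print(aligned_ref_span("3M1I3M1D5M"))  # 12
```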
3) And finally, if you’re trying to decode the SAM bitwise flags, you can calculate them using this tool from the Broad Institute:
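The flag field is just a bitmask, so you can also decode it yourself in a pinch. A sketch, with the bit meanings taken from the SAM spec:

```python
# Name the SAM FLAG bits (values from the SAM spec).
FLAG_BITS = {
    0x1:   "read paired",
    0x2:   "proper pair",
    0x4:   "read unmapped",
    0x8:   "mate unmapped",
    0x10:  "read reverse strand",
    0x20:  "mate reverse strand",
    0x40:  "first in pair",
    0x80:  "second in pair",
    0x100: "secondary alignment",
    0x200: "fails QC",
    0x400: "PCR/optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag):
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(99))
# ['read paired', 'proper pair', 'mate reverse strand', 'first in pair']
```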
With the growing popularity of single-cell RNA-Seq analysis, the t-SNE projection of multi-dimensional data is appearing more often in publications and online. If you’ve ever wanted to develop a better intuitive feel for what exactly t-SNE does and where it can go wrong, this interactive tutorial (by Martin Wattenberg and Fernanda Viegas) is extremely compelling and useful.
In addition to providing a wonderful, interactive plotting function, the authors go on to provide an informative tutorial that explains the pitfalls and challenges of the optimization and hyper-parameter tuning of t-SNE projections and how to get the most from the plots. Here is an example:
In the example above, tuning the “perplexity” of the t-SNE projection yields a correct reconstruction of the data when values are between 30 and 50, but the same method fails when the parameter falls outside that range (i.e., too small or too large).
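You can reproduce the same effect on synthetic data. A minimal sketch, assuming scikit-learn and NumPy are available (a toy dataset, not the tutorial's):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated clusters in 10 dimensions.
X = np.vstack([rng.normal(0, 1, (30, 10)),
               rng.normal(8, 1, (30, 10))])

# Run t-SNE at a too-small and a reasonable perplexity and compare
# the resulting 2-D embeddings (plot them to see the difference).
embeddings = {p: TSNE(n_components=2, perplexity=p,
                      random_state=0).fit_transform(X)
              for p in (2, 30)}
for p, emb in embeddings.items():
    print(p, emb.shape)
```

At perplexity 2 the embedding tends to shatter each cluster into small clumps; at 30 the two true clusters come out cleanly.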
Go check out this distill.pub site. It’s worth your time.
Recent blog posts by Andrej Karpathy at Medium.com and Pete Warden at PeteWarden.com have caused a paradigm shift in the way I think about neural nets. Instead of thinking of them as powerful machine learning tools, the authors instead suggest that we should think of neural nets, and in particular deep convolutional nets, as ‘self-writing programs.’ Hence the term, “Software 2.0.”
It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data than to explicitly write the program. A large portion of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze and visualize data that feeds neural networks. — Andrej Karpathy, Medium.com
I found this to be a dramatic reversal in my thinking about these techniques, but it opens up a deeper understanding and is much more intuitive. The fact is that combinations of artificial neurons can be used to model any logical operation. Therefore you can conceptualize training a neural net as searching program space for an optimal program that behaves in the way you specify. You provide the inputs and desired outputs, and the model searches for the optimal program.
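The claim that neurons can model any logical operation is easy to see concretely. A minimal sketch (hand-picked weights rather than trained ones, just to show the construction):

```python
import numpy as np

# A single artificial neuron: weighted sum plus bias, step activation.
def neuron(w, b):
    return lambda x: int(np.dot(w, x) + b > 0)

# Hand-chosen weights implement basic gates directly.
AND  = neuron([1, 1], -1.5)
OR   = neuron([1, 1], -0.5)
NAND = neuron([-1, -1], 1.5)

def XOR(x):
    # XOR is not linearly separable, so one neuron cannot compute it;
    # composing neurons into a "hidden layer" can: AND(OR(x), NAND(x)).
    return AND([OR(x), NAND(x)])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, XOR(x))  # 0, 1, 1, 0
```

Training replaces the hand-picked weights with a search over weight space, which is what makes the "searching program space" framing feel apt.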
This stands in contrast to the “Software 1.0” paradigm where the programmer uses her skill and experience to conceptualize the right combination of specific instructions to produce the desired behavior. While it seems certain that Software 1.0 and 2.0 will co-exist for a long time, this new way of understanding deep learning is crucial and exciting, in my opinion.