Excited to announce that I’ve been working with some fantastic computational biologists (that I met at ISMB2018) on a “10 simple rules” style paper for creating and promoting effective bioinformatics collaborations with wet-lab biologists. We will leverage our many years of combined bioinformatics core experience to create these “10 simple rules.”
We will touch on:
–experiment planning and design.
–data management plans, data QC, and record keeping.
–avoiding batch effects and contamination.
–managing expectations and developing clear communications.
–handling low-quality data when things do go wrong.
Hope to have this out by the end of 2018…watch this space!
tl;dr: It’s only 2X faster than kallisto; a marginal, rather than “ultra” improvement. If you’re already using kallisto in your pipelines there’s no reason to switch. Now onto the blog post:
Another RNA-seq read aligner
In July, a paper appeared in BMC Bioinformatics: “Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data.” The title seems to imply that Matataki is much faster than anything else out there. After all, what does ‘ultrafast’ mean unless it’s in comparison to the next fastest method? The authors claim it also solves a unique problem of large-scale RNA-seq reanalysis that can’t be solved any other way. Is this true? Let’s dive into the details and find out.
The authors point out in the introduction that the amount of publicly available RNA-seq data is increasing rapidly and that many would like to do large-scale reanalysis of these datasets. Reanalysis (including realignment) is necessary because of the large batch effects of using different methods for alignment and quantification.
Although these alignment-free methods such as Sailfish, Kallisto, and RNA-Skim are much faster than the alignment-based methods, the recent accumulation of large-scale sequence data requires development of an even faster method for data management and reanalysis.
Counting unique k-mers
The method itself eschews traditional alignment for a k-mer based read counting approach. This approach takes two steps: build an index and quantify the reads against it. Building the index requires the algorithm to search for all k-mers unique to a gene and also present in all isoforms of that gene (to avoid effects of differential expression of isoforms). Quantification proceeds by matching the k-mers from a read with the unique k-mers in the index. If there is only one indexed gene match from several k-mers (with a minimum cutoff set by the user) that read is assigned to the indexed gene. If there is more than one match to several genes, the read is discarded.
If this sounds a lot like how kallisto operates under the hood, I would agree with you as would the authors:
Similar to Kallisto, our method uses k-mers that appear only once in a gene and quantifies expression from the number of unique k-mers.
Of course, kallisto creates a transcriptome de bruijn graph (tDBG) and uses compatibility classes to decide whether to skip k-mers while hashing a read. Matataki does not use a tDBG and simply skips along a read using a fixed interval set by the user. The code also skips the Expectation-Maximization (EM) step found in kallisto and focuses only on gene-level quantification. These two novel approaches are what set Matataki apart from other pseudo-alignment algorithms, according to the authors.
But does it work?
Well, it appears to. The authors start by reporting the performance of the method on simulated data using optimized parameters (I’ll discuss this below). In Figure 2, they describe the parameter tuning that led to those choices. For parameter tuning the performance of Matataki on real data, the TPM values from an eXpress run on sample ERR188125 were taken as “ground truth.” The choice of ‘eXpress’ as their benchmark for real-world dataset quantification is odd, given that eXpress is defunct and has been for some time. It also takes orders of magnitude longer to complete than a method like sailfish or kallisto.
Going back to Figure 1 (below) and the performance on simulated data, we can observe accuracy similar to that of kallisto, RSEM, sailfish and other methods (note that by using a scale from 0.93 to 0.99, the differences between the methods seem exaggerated). The one simulated sample was created by RSEM from the real sample plus three other related samples. So perhaps it’s not surprising that RSEM does the best in this comparison overall.
I think it’s important to remember as well that Matataki is being run on this simulated data with optimized parameters for this sample while the other methods are run with “default” parameters. We don’t know how Figure 1a and 1b would change for other samples, other organisms, and other sequencing types (PE vs. SE, long vs. short, etc…) It would be enlightening to find out.
So how fast is ‘ultrafast’ anyway?
We finally get to the promised speed improvements near the end of the manuscript, only to find that the speedup relative to kallisto, a method that has been available publicly since at least 2015, is only 2-fold. Yep, that’s it. It’s only twice as fast. This is an incremental improvement, but not a paradigm shift. In fact, it shows how the leap from alignment to pseudo-alignment was in fact a paradigm shift when kallisto first came out. It’s been three years and nothing has substantially improved on it.
Finally, the authors end the paper with a section on ‘expected use-cases and limitations’ where they say that the method is only suitable for large-scale reanalysis and not for normal, turn-the-crank RNA-seq.
Nevertheless, Matataki is not suitable for common RNA-Seq purposes because other methods are sufficiently fast and provide better accuracy. For example, a single nucleotide substitution has larger effects in Matataki than in other methods, because even a single point substitution changes the k-mer for 2 k − 1 bases, which ultimately affects the number of k-mers in a transcript and calculation of the TPM value.
For reasons that I don’t understand, the authors promote this tool (which is only 2X faster than kallisto) as being for large-scale reanalysis, yet they performed no large-scale reanalysis to benchmark the tool. Rather, they used it to perform normal RNA-seq as a benchmark, which they go on to say not to do in the wild.
To sum up, while a 2X speedup is useful, I see no reason to abandon kallisto to pseudoalign and quantify data when doing large scale reanalysis. Unlike Matataki, kallisto is robust to read errors and it delivers transcript-level estimates that can be later summed to gene level, if you so desire. It also provides bootstrapped estimates of technical variation, which helps to understand the biological variation as well. And finally, kallisto integrates nicely with sleuth for DE testing and visualization.
I recently had to inspect some genomic alignments as part of a project. Usually, I am just working with BAM files and if inspection is needed, I just visualize the pileups to see what is going on.
In this case, I just wanted a quick answer to how the reads were aligning to the reference, and I didn’t want to go through the process of subsetting and copying the BAM files to my local machine.
The SAM file is the uncompressed record of the read alignments produced by an aligner method (STAR, TopHat, BWA, etc….). This file can get very large, and so is usually compressed into BAM (faster for machine parsing, but not human readable) and the SAM file is discarded.
In my case, I still had the SAM files around to inspect. If you find yourself needing to read a SAM file, here are three helpful reference tools to make the process less painful:
1) This page has an enormous amount of detail about SAM files including this helpful chart that enumerates all of the fields that you can expect to find specified within each alignment:
2) This post from the blog “zenfractal.com” contains a great exposition on CIGAR strings and how to decode them:
3) And finally, if you’re trying to decode the SAM bitwise flags, you can calculate them using this tool from the Broad Institute:
With the growing popularity of single-cell RNA-Seq analysis, the t-SNE projection of multi-dimensional data is appearing more often in publications and online. If you’ve ever wanted to develop a better intuitive feel for what exactly t-SNE does and where it can go wrong, this interactive tutorial (by Martin Wattenberg and Fernanda Viegas) is extremely compelling and useful.
In addition to providing a wonderful, interactive plotting function, the authors go on to provide an informative tutorial explains the pitfalls and challenges of the optimization and hyper-parameter tuning of t-SNE projections and how to get the most from the plots. Here is an example:
In the example above, tuning the “perplexity” of the t-SNE projection causes the correct reconstruction of the data when values are between 30-50, but the same method fails when the parameter falls outside those ranges (i.e., too small or too large).
Go check out this distill.pub site. It’s worth your time.