Matataki, a new ultrafast read mapper: hope or hype?

tl;dr:  It’s only 2X faster than kallisto; a marginal, rather than “ultra” improvement.  If you’re already using kallisto in your pipelines there’s no reason to switch.   Now onto the blog post: 

Another RNA-seq read aligner

In July, a paper appeared in BMC Bioinformatics: “Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data.”   The title seems to imply that Matataki is much faster than anything else out there.  After all, what does ‘ultrafast’ mean unless it’s in comparison to the next fastest method?  The authors claim it also solves a unique problem of large-scale RNA-seq reanalysis that can’t be solved any other way.  Is this true?  Let’s dive into the details and find out.

The authors point out in the introduction that the amount of publicly available RNA-seq data is increasing rapidly and that many would like to do large-scale reanalysis of these datasets.   Reanalysis (including realignment) is necessary because of the large batch effects of using different methods for alignment and quantification.

Although these alignment-free methods such as Sailfish, Kallisto, and RNA-Skim are much faster than the alignment-based methods, the recent accumulation of large-scale sequence data requires development of an even faster method for data management and reanalysis.

Counting unique k-mers

The method itself eschews traditional alignment for a k-mer based read counting approach.  This approach takes two steps:  build an index and quantify the reads against it.  Building the index requires the algorithm to search for all k-mers unique to a gene and also present in all isoforms of that gene (to avoid effects of differential expression of isoforms).  Quantification proceeds by matching the k-mers from a read with the unique k-mers in the index.  If there is only one indexed gene match from several k-mers (with a minimum cutoff set by the user) that read is assigned to the indexed gene.  If there is more than one match to several genes, the read is discarded.

If this sounds a lot like how kallisto operates under the hood, I would agree with you as would the authors:

Similar to Kallisto, our method uses k-mers that appear only once in a gene and quantifies expression from the number of unique k-mers.

Of course, kallisto creates a transcriptome de bruijn graph (tDBG) and uses compatibility classes to decide whether to skip k-mers while hashing a read.  Matataki does not use a tDBG and simply skips along a read using a fixed interval set by the user.   The code also skips the Expectation-Maximization (EM) step found in kallisto and focuses only on gene-level quantification.   These two novel approaches are what set Matataki apart from other pseudo-alignment algorithms, according to the authors.

But does it work?

Well, it appears to.  The authors start by reporting the performance of the method on simulated data using optimized parameters (I’ll discuss this below).  In Figure 2,  they describe the parameter tuning that led to those choices.  For parameter tuning the performance of Matataki on real data, the TPM values from an eXpress run on sample ERR188125 were taken as “ground truth.”  The choice of ‘eXpress’ as their benchmark for real-world dataset quantification is odd, given that eXpress is defunct and has been for some time.  It also takes orders of magnitude longer to complete than a method like sailfish or kallisto.

Fig 2. Comparison of TPM when k was varied. The x-axis shows the TPM values of eXpress, the y-axis shows the TPM values of our method, and the color indicates the indexed k-mer coverage of each gene when changing k from 16 to 56 with a step of 8

Going back to Figure 1 (below) and the performance on simulated data, we can observe accuracy similar to that of kallisto, RSEM, sailfish and other methods  (note that by using a scale from 0.93 to 0.99, the differences between the methods seem exaggerated).  The one simulated sample was created by RSEM from the real sample plus three other related samples.  So perhaps it’s not surprising that RSEM does the best in this comparison overall.

I think it’s important to remember as well that Matataki is being run on this simulated data with optimized parameters for this sample while the other methods are run with “default” parameters.   We don’t know how Figure 1a and 1b would change for other samples, other organisms, and other sequencing types (PE vs. SE, long vs. short, etc…)  It would be enlightening to find out.

Figure 1. Summary of the results using simulation data. a Spearman correlation coefficient with the expected expression and estimated expression values using each method. “Matataki” indicates the results of the proposed method, and “MatatakiSubset” indicates the results of the proposed method without uncovered genes. To compare the gene-level expression profile and transcript-level expression profile, the sum of TPM by each gene was calculated. b Means of absolute difference from the expected expression levels.

So how fast is ‘ultrafast’ anyway?

We finally get to the promised speed improvements near the end of the manuscript, only to find that the speedup relative to kallisto, a method that has been available publicly since at least 2015, is only 2-fold.  Yep, that’s it.  It’s only twice as fast.   This is an incremental improvement, but not a paradigm shift.  In fact, it shows how the leap from alignment to pseudo-alignment was in fact a paradigm shift when kallisto first came out.  It’s been three years and nothing has substantially improved on it.

Fig 3. CPU time comparison.

Limitations?

Finally, the authors end the paper with a section on ‘expected use-cases and limitations’ where they say that the method is only suitable for large-scale reanalysis and not for normal, turn-the-crank RNA-seq.

Nevertheless, Matataki is not suitable for common RNA-Seq purposes because other methods are sufficiently fast and provide better accuracy. For example, a single nucleotide substitution has larger effects in Matataki than in other methods, because even a single point substitution changes the k-mer for 2 k − 1 bases, which ultimately affects the number of k-mers in a transcript and calculation of the TPM value.

For reasons that I don’t understand, the authors promote this tool (which is only 2X faster than kallisto) as being for large-scale reanalysis, yet they performed no large-scale reanalysis to benchmark the tool.   Rather, they used it to perform normal RNA-seq as a benchmark, which they go on to say not to do in the wild.

To sum up, while a 2X speedup is useful, I see no reason to abandon kallisto to pseudoalign and quantify data when doing large scale reanalysis.   Unlike Matataki, kallisto is robust to read errors and it delivers transcript-level estimates that can be later summed to gene level, if you so desire.  It also provides bootstrapped estimates of technical variation, which helps to understand the biological variation as well.   And finally, kallisto integrates nicely with sleuth for DE testing and visualization.

Decode SAM files with these handy references

I recently had to inspect some genomic alignments as part of a project.  Usually, I am just working with BAM files and if inspection is needed, I just visualize the pileups to see what is going on.

In this case, I just wanted a quick answer to how the reads were aligning to the reference, and I didn’t want to go through the process of subsetting and copying the BAM files to my local machine.

The SAM file is the uncompressed record of the read alignments produced by an aligner method (STAR, TopHat, BWA, etc….).   This file can get very large, and so is usually compressed into BAM (faster for machine parsing, but not human readable) and the SAM file is discarded.

In my case, I still had the SAM files around to inspect.  If you find yourself needing to read a SAM file, here are three helpful reference tools to make the process less painful:

1)  This page has an enormous amount of detail about SAM files including this helpful chart that enumerates all of the fields that you can expect to find specified within each alignment:

SAM file structure explained in this handy chart.

 

2) This post from the blog “zenfractal.com” contains a great exposition on CIGAR strings and how to decode them:

3)  And finally, if you’re trying to decode the SAM bitwise flags, you can calculate them using this tool from the Broad Institute:

Decode SAM flags with this handy online tool from the Broad Institute.

 

 

Get hands-on with t-SNE plots

With the growing popularity of single-cell RNA-Seq analysis, the t-SNE projection of multi-dimensional data is appearing more often in publications and online.  If you’ve ever wanted to develop a better intuitive feel for what exactly t-SNE does and where it can go wrong, this interactive tutorial (by Martin Wattenberg and Fernanda Viegas) is extremely compelling and useful.

A screen capture of the interactive t-SNE interface.

In addition to providing a wonderful, interactive plotting function, the authors go on to provide an informative tutorial explains the pitfalls and challenges of the optimization and hyper-parameter tuning of t-SNE projections and how to get the most from the plots.  Here is an example:

An example of how hyperparameter tuning affects the final plot.

In the example above, tuning the “perplexity” of the t-SNE projection causes the correct reconstruction of the data when values are between 30-50, but the same method fails when the parameter falls outside those ranges (i.e., too small or too large).

Go check out this distill.pub site.  It’s worth your time.

Are deep neural nets “Software 2.0”?

Image from: https://cdn.edureka.co/blog/wp-content/uploads/2017/05/Deep-Neural-Network-What-is-Deep-Learning-Edureka.png

Recent blog posts by Andrej Karpathy at Medium.com and Pete Warden at PeteWarden.com have caused a paradigm shift in the way I think about neural nets.  Instead of thinking of them as powerful machine learning tools, the authors  instead suggest that we should think of neural nets, and in particular, convolution deep nets, as ‘self-writing programs.’   Hence the term, “Software 2.0.”

It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data than to explicitly write the program. A large portion of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze and visualize data that feeds neural networks.   — Andrej Karpathy, Medium.com

I found this to be a dramatic reversal in my thinking about these techniques, but it opens up a deeper understanding and is much more intuitive.  The fact is that combinations of artificial neurons can be used to model any logical operation.  Therefore you can conceptualize training a neural net as searching programming space for an optimal program that behaves in the way you specify.  You provide the inputs and desired outputs, and the model searches for the optimal program.

This stands in contrast to the “Software 1.0” paradigm where the programmer uses her skill and experience to conceptualize the right combination of specific instructions to produce the desired behavior.   While it seems certain that Software 1.0 and 2.0 will co-exist for a long time, this new way of understanding deep learning is crucial and exciting, in my opinion.