Bcbio RNA-seq ‘under the hood’

Bcbio is a configuration-based pipeline manager for common NGS workflows. It uses a YAML-config file to set all of the inputs and specifications for pipeline. I’ve used bcbio for dozens of RNA-seq projects, but I’ve never known exactly what it is doing during the pipeline itself. This is because in order to see the exact commands being run you have to either dig into the code, or dig through the log files.

Digging through code is difficult because the code base is large and there are many different pieces of code that call each other. Digging through the logs is difficult when there are dozens of samples (each command is repeated dozens of times, leading to log files with thousands of lines). Well, I finally gave in and sorted through the RNA-seq pipeline command logs to identify the unique steps that bcbio (version 1.0.8) is performing in order to produce its results. I was able to identify 21 unique steps that are performed on each sample.

The difficulty of figuring out exactly what a configuration-based pipeline like bcbio is going to do is one argument in favor of using software like snakemake or nextflow to create or adapt existing pipelines, where the actual steps in the pipeline are made very explicit in “process” blocks. I’m going to be writing more about NextFlow in upcoming posts.

Of these 21 steps, 17 steps all deal with creating a BAM file and then manipulating that BAM file or calculating something about the BAM file. The remainder mainly deal with pseudo-alignment using salmon. It’s somewhat ironic that most of the pipeline and computational time is taken up with creating and manipulating BAM files since I only ever use the salmon pseudo-alignments in my downstream analysis.

Here are the 21 steps of the bcbio RNA-seq workflow (I’ve deleted the long, user-specific file paths to show just the commands):

Step 1. Align with Hisat2

Step 2/3. Pipe to bamsormadup and redirect to sorted BAM

Step 4. Index BAM

Step 5. Samtools sort by read names

Step 6. Sambamba view to select only primary alignments

Step 7. FeatureCounts to count primary alignments in BAM

Step 8. Gffread to write a fasta file with spliced exons

Step 9. Build the salmon index

Step 10. Pseudo-alignment and quantification

Step 11. Convert salmon output to sleuth format

Step 12. Downsample BAM file with samtools view

Step 13. FASTQC on downsampled BAM

Step 14. Run Qualimap RNAseq on BAM

Step 15. A SED command (not sure exactly what it does)

Step 16. Mark duplicates on the BAM file

Step 17. Index de-duplicated BAM file

Step 18. Use Sambamba view to create duplicate metrics

Step 19. Use Sambamba to create mapping metrics

Step 20. Samtools stats on sorted BAM

Step 21. Samtools idxstats on sorted BAM

To impute or not to impute scRNA-seq datasets?

Single-cell RNA-Seq methods, which sequence and barcode the transcripts within individual cells in a sample, hold enormous promise for understanding transcriptional networks in development and disease. Single-cell investigation of biological phenomena is taking the life sciences world by storm. For example, Science magazine selected single-cell methods as the 2018 “Breakthrough of the Year.”

Closer to home, our bioinformatics group here at the University of Iowa is also seeing a rapid increase in the number of scRNA-seq projects in the research pipeline. Yet with all of this interest and funding, scRNA-seq is still an emerging field with little agreement on best practices.

We see evidence of this when considering one of the main problems of scRNA-seq datasets: dropouts. ‘Dropouts’ are zero-values in the data arising from technical and biological noise. Often the dropout rate can reach up to 90% or more, degrading the ability of the analysis to detect fine structure in the data and low- and moderately expressed DE genes between cell types.

One way to combat this problem is to borrow information across genes within a sample and use that to predict imputed expression values for the missing genes. Another related approach is called data ‘smoothing,’ that attempts to lower the noise in observed values. There are several methods (MAGIC, scImpute, DrImpute, and SAVER) that have been published recently that attempt to do one or both of these approaches. While the authors of each method focus on the advantages of imputation, there can also be drawbacks caused by an increase in false-positives and loss of specificity.

A recent paper by Andrews and Hemberg address the potential drawbacks with imputation in a very concise and clear way using both simulated and real-world data. Figure 1 (below) from this paper shows very clearly the perils of doing imputation on false positive rates and spurious gene-gene correlations.

Performance on simulated scRNA-seq data

Figure 1A. Gene-gene correlations before (left) and after imputation with five methods (right). Red bars are highly-expressed DE genes, and blue bars are lowly-expressed DE genes. Gray bar are non-DE genes in this simulated dataset.

Somewhat dramatically, DrImpute and MAGIC introduce strong false positive correlations, while SAVER only strengthens existing correlations between lowly expressed DE genes. As you can see in part B of this figure below, parameter tuning also has a dramatic effect on the false positive rate in some cases. Increasing the k-neighbors for MAGIC and KNN methods increases smoothing and also false positives. SAVER and scImpute are relatively immune to changes in FPR with parameter space.

Figure 1B. False positive gene correlation rates as a function of algorithm parameters.

You can’t have your cake and eat it, too

In this next figure, the authors look at the trade-off between sensitivity and specificity in imputation methods on simulated datasets. It shows clearly that any improvements to sensitivity of DE gene detection come at a significant cost of specificity, and vice versa.

Detection of DE genes in simulated data.

The authors go on to show that on real data, every method including SAVER generates large numbers of false positives. In summary, imputation, while potentially promising, is limited owing to the lack of an independent reference (as in the case of GWAS imputation methods) to impute from. Since single-cell imputation methods rely only on the dataset itself, one cannot escape the sensitivity/specificity tradeoff and false-positive problem.

Gene expression boxplots with ggplot2

The ubiquitous RNAseq analysis package, DESeq2, is a very useful and convenient way to conduct DE gene analyses.  However, it lacks some useful plotting tools.   For example, there is no convenience function in the library for making nice-looking boxplots from normalized gene expression data.

There are other packages one can rely on, for example ‘pcaExplorer’, but I like a simple approach sometimes to plot just a couple of genes.  So below I show you how to quickly plot your favorite gene using only ggplot2 (there is no “one weird trick” though…):

As you can see above, first we must grab the normalized counts at the row corresponding with the Traf1 Ensembl ID using the ‘counts‘ function that operates on the ‘ddsTxi’ DESeqDataSet object.

In order to create a dataframe (well, a tibble to be specific) for plotting, we first create a list (‘m’) that combines the counts (as a numeric vector) and metadata group.  These two vectors will form the columns of the tibble for plotting, and we must give them names (i.e., “counts” and “group”) so the tibble conversion doesn’t complain.

The list, m, is then converted to a tibble with ‘as.tibble‘ and plotted with ggplot2, using an ‘aes(group,counts)‘ aesthetic plus a boxplot aesthetic.  The rest of the code is just modifying axis labels and tickmarks.  The final product looks like this:

Boxplot of normalized Traf1 expression in 5 different conditions (3 replicates each).

Importing a merged Seurat dataset into Monocle

I recently ran across a situation that I think is going to be increasingly common as I do more and more single-cell analyses.   Specifically, I had a project where the investigator had several experiments in related conditions that they want to merge and evaluate with a pseudotime analysis.   I could not find any useful tools within Monocle itself for merging data (please correct me in the comments if I’m missing something).   It looks as if you have to import a pre-merged seurat dataset.

Here is the workaround that I found [please note these commands are for Seurat v2, they will likely *not* work in v3]:

Here, I am reading in 10X data using Seurat (v2) w/ the Read10X function and then creating the Seurat object with CreateSeuratObject.

Once this done I use MergeSeurat to merge the first two experiments, and then AddSamples to add in the final experiment.   Then we can take advantage of the monocle function importCDS to import the combined object into monocle.

Now there is one final problem and that is that the “orig.ident” field is blank:

 

 

 

To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (i.e., “naive_AAACTGAGAAACCGA”).   We just need to split these and recover which experiment each cell came from with:

We do a strsplit on the cellnames, splitting on underscore. The first value from the split in each case is assigned back into the ‘orig.ident’ field of the cell dataset object.

Now you’re ready to continue with the normal downstream analysis in monocle.  With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged seurat dataset):

And we get:

The PCA clusters on the tSNE plot (left) and orig.ident values on the tSNE plot (right). I have edited out the identities of the clusters on the right. This is unpublished data, I am using it here for educational purposes only. Please do not reproduce or copy this image.

 

Matataki, a new ultrafast read mapper: hope or hype?

tl;dr:  It’s only 2X faster than kallisto; a marginal, rather than “ultra” improvement.  If you’re already using kallisto in your pipelines there’s no reason to switch.   Now onto the blog post: 

Another RNA-seq read aligner

In July, a paper appeared in BMC Bioinformatics: “Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data.”   The title seems to imply that Matataki is much faster than anything else out there.  After all, what does ‘ultrafast’ mean unless it’s in comparison to the next fastest method?  The authors claim it also solves a unique problem of large-scale RNA-seq reanalysis that can’t be solved any other way.  Is this true?  Let’s dive into the details and find out.

The authors point out in the introduction that the amount of publicly available RNA-seq data is increasing rapidly and that many would like to do large-scale reanalysis of these datasets.   Reanalysis (including realignment) is necessary because of the large batch effects of using different methods for alignment and quantification.

Although these alignment-free methods such as Sailfish, Kallisto, and RNA-Skim are much faster than the alignment-based methods, the recent accumulation of large-scale sequence data requires development of an even faster method for data management and reanalysis.

Counting unique k-mers

The method itself eschews traditional alignment for a k-mer based read counting approach.  This approach takes two steps:  build an index and quantify the reads against it.  Building the index requires the algorithm to search for all k-mers unique to a gene and also present in all isoforms of that gene (to avoid effects of differential expression of isoforms).  Quantification proceeds by matching the k-mers from a read with the unique k-mers in the index.  If there is only one indexed gene match from several k-mers (with a minimum cutoff set by the user) that read is assigned to the indexed gene.  If there is more than one match to several genes, the read is discarded.

If this sounds a lot like how kallisto operates under the hood, I would agree with you as would the authors:

Similar to Kallisto, our method uses k-mers that appear only once in a gene and quantifies expression from the number of unique k-mers.

Of course, kallisto creates a transcriptome de bruijn graph (tDBG) and uses compatibility classes to decide whether to skip k-mers while hashing a read.  Matataki does not use a tDBG and simply skips along a read using a fixed interval set by the user.   The code also skips the Expectation-Maximization (EM) step found in kallisto and focuses only on gene-level quantification.   These two novel approaches are what set Matataki apart from other pseudo-alignment algorithms, according to the authors.

But does it work?

Well, it appears to.  The authors start by reporting the performance of the method on simulated data using optimized parameters (I’ll discuss this below).  In Figure 2,  they describe the parameter tuning that led to those choices.  For parameter tuning the performance of Matataki on real data, the TPM values from an eXpress run on sample ERR188125 were taken as “ground truth.”  The choice of ‘eXpress’ as their benchmark for real-world dataset quantification is odd, given that eXpress is defunct and has been for some time.  It also takes orders of magnitude longer to complete than a method like sailfish or kallisto.

Fig 2. Comparison of TPM when k was varied. The x-axis shows the TPM values of eXpress, the y-axis shows the TPM values of our method, and the color indicates the indexed k-mer coverage of each gene when changing k from 16 to 56 with a step of 8

Going back to Figure 1 (below) and the performance on simulated data, we can observe accuracy similar to that of kallisto, RSEM, sailfish and other methods  (note that by using a scale from 0.93 to 0.99, the differences between the methods seem exaggerated).  The one simulated sample was created by RSEM from the real sample plus three other related samples.  So perhaps it’s not surprising that RSEM does the best in this comparison overall.

I think it’s important to remember as well that Matataki is being run on this simulated data with optimized parameters for this sample while the other methods are run with “default” parameters.   We don’t know how Figure 1a and 1b would change for other samples, other organisms, and other sequencing types (PE vs. SE, long vs. short, etc…)  It would be enlightening to find out.

Figure 1. Summary of the results using simulation data. a Spearman correlation coefficient with the expected expression and estimated expression values using each method. “Matataki” indicates the results of the proposed method, and “MatatakiSubset” indicates the results of the proposed method without uncovered genes. To compare the gene-level expression profile and transcript-level expression profile, the sum of TPM by each gene was calculated. b Means of absolute difference from the expected expression levels.

So how fast is ‘ultrafast’ anyway?

We finally get to the promised speed improvements near the end of the manuscript, only to find that the speedup relative to kallisto, a method that has been available publicly since at least 2015, is only 2-fold.  Yep, that’s it.  It’s only twice as fast.   This is an incremental improvement, but not a paradigm shift.  In fact, it shows how the leap from alignment to pseudo-alignment was in fact a paradigm shift when kallisto first came out.  It’s been three years and nothing has substantially improved on it.

Fig 3. CPU time comparison.

Limitations?

Finally, the authors end the paper with a section on ‘expected use-cases and limitations’ where they say that the method is only suitable for large-scale reanalysis and not for normal, turn-the-crank RNA-seq.

Nevertheless, Matataki is not suitable for common RNA-Seq purposes because other methods are sufficiently fast and provide better accuracy. For example, a single nucleotide substitution has larger effects in Matataki than in other methods, because even a single point substitution changes the k-mer for 2 k − 1 bases, which ultimately affects the number of k-mers in a transcript and calculation of the TPM value.

For reasons that I don’t understand, the authors promote this tool (which is only 2X faster than kallisto) as being for large-scale reanalysis, yet they performed no large-scale reanalysis to benchmark the tool.   Rather, they used it to perform normal RNA-seq as a benchmark, which they go on to say not to do in the wild.

To sum up, while a 2X speedup is useful, I see no reason to abandon kallisto to pseudoalign and quantify data when doing large scale reanalysis.   Unlike Matataki, kallisto is robust to read errors and it delivers transcript-level estimates that can be later summed to gene level, if you so desire.  It also provides bootstrapped estimates of technical variation, which helps to understand the biological variation as well.   And finally, kallisto integrates nicely with sleuth for DE testing and visualization.