To impute or not to impute scRNA-seq datasets?

Single-cell RNA-Seq methods, which sequence and barcode the transcripts within individual cells in a sample, hold enormous promise for understanding transcriptional networks in development and disease. Single-cell investigation of biological phenomena is taking the life sciences world by storm. For example, Science magazine selected single-cell methods as the 2018 “Breakthrough of the Year.”

Closer to home, our bioinformatics group here at the University of Iowa is also seeing a rapid increase in the number of scRNA-seq projects in the research pipeline. Yet with all of this interest and funding, scRNA-seq is still an emerging field with little agreement on best practices.

We see evidence of this when considering one of the main problems of scRNA-seq datasets: dropouts. ‘Dropouts’ are zero-values in the data arising from technical and biological noise. Often the dropout rate can reach up to 90% or more, degrading the ability of the analysis to detect fine structure in the data and low- and moderately expressed DE genes between cell types.

One way to combat this problem is to borrow information across genes within a sample and use that to predict imputed expression values for the missing genes. Another related approach is called data ‘smoothing,’ that attempts to lower the noise in observed values. There are several methods (MAGIC, scImpute, DrImpute, and SAVER) that have been published recently that attempt to do one or both of these approaches. While the authors of each method focus on the advantages of imputation, there can also be drawbacks caused by an increase in false-positives and loss of specificity.

A recent paper by Andrews and Hemberg address the potential drawbacks with imputation in a very concise and clear way using both simulated and real-world data. Figure 1 (below) from this paper shows very clearly the perils of doing imputation on false positive rates and spurious gene-gene correlations.

Performance on simulated scRNA-seq data

Figure 1A. Gene-gene correlations before (left) and after imputation with five methods (right). Red bars are highly-expressed DE genes, and blue bars are lowly-expressed DE genes. Gray bar are non-DE genes in this simulated dataset.

Somewhat dramatically, DrImpute and MAGIC introduce strong false positive correlations, while SAVER only strengthens existing correlations between lowly expressed DE genes. As you can see in part B of this figure below, parameter tuning also has a dramatic effect on the false positive rate in some cases. Increasing the k-neighbors for MAGIC and KNN methods increases smoothing and also false positives. SAVER and scImpute are relatively immune to changes in FPR with parameter space.

Figure 1B. False positive gene correlation rates as a function of algorithm parameters.

You can’t have your cake and eat it, too

In this next figure, the authors look at the trade-off between sensitivity and specificity in imputation methods on simulated datasets. It shows clearly that any improvements to sensitivity of DE gene detection come at a significant cost of specificity, and vice versa.

Detection of DE genes in simulated data.

The authors go on to show that on real data, every method including SAVER generates large numbers of false positives. In summary, imputation, while potentially promising, is limited owing to the lack of an independent reference (as in the case of GWAS imputation methods) to impute from. Since single-cell imputation methods rely only on the dataset itself, one cannot escape the sensitivity/specificity tradeoff and false-positive problem.