Gene expression boxplots with ggplot2

The ubiquitous RNA-seq analysis package DESeq2 is a very useful and convenient tool for differential expression (DE) analysis.  However, it lacks some useful plotting tools.  For example, there is no convenience function in the library for making nice-looking boxplots from normalized gene expression data.

There are other packages one can rely on, for example ‘pcaExplorer’, but sometimes I prefer a simple approach for plotting just a couple of genes.  So below I show you how to quickly plot your favorite gene using only ggplot2 (there is no “one weird trick” though…):

library(DESeq2)
library(tibble)
library(ggplot2)

# Pull normalized counts for TRAF1 (ENSG00000056558) from the DESeqDataSet
traf1_counts <- counts(ddsTxi['ENSG00000056558',], normalized = TRUE)

# Combine the counts and the sample metadata group into a named list, then a tibble
m <- list(counts = as.numeric(traf1_counts), group = as.factor(samples$group))
m <- as_tibble(m)

q <- ggplot(m, aes(group, counts)) + geom_boxplot() + geom_jitter(width = 0.1)
q <- q + labs(x = "Experimental Group", y = "Normalized Counts", title = "Normalized Expression of TRAF1")
q <- q + scale_x_discrete(labels = c("00hn" = "PMN, 0hrs", "06hn" = "PMN, 6hrs",
                                     "06hy" = "PMN+Hp, 6hrs", "24hn" = "PMN, 24hrs", "24hy" = "PMN+Hp, 24hrs"))
q

As you can see above, first we grab the normalized counts at the row corresponding to the TRAF1 Ensembl ID using the ‘counts‘ function, which operates on the ‘ddsTxi’ DESeqDataSet object.

In order to create a data frame (well, a tibble to be specific) for plotting, we first create a list (‘m’) that combines the counts (as a numeric vector) and the sample metadata group (as a factor).  These two vectors will form the columns of the tibble for plotting, and we must give them names (i.e., “counts” and “group”) so the tibble conversion doesn’t complain.

The list, m, is then converted to a tibble with ‘as_tibble‘ and plotted with ggplot2, using the ‘aes(group, counts)‘ aesthetic mapping together with a boxplot geom and jittered points.  The rest of the code just modifies the axis labels and tick marks.  The final product looks like this:

Boxplot of normalized Traf1 expression in 5 different conditions (3 replicates each).
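If you find yourself making this plot for more than a couple of genes, the same steps are easy to wrap into a small helper function.  Here is a minimal sketch; the function name plot_gene_boxplot and its arguments are my own invention, not part of DESeq2 or ggplot2:

library(DESeq2)
library(tibble)
library(ggplot2)

# Hypothetical helper: boxplot of normalized counts for a single gene.
# 'dds' is a DESeqDataSet, 'gene_id' a rowname (e.g., an Ensembl gene ID),
# 'group' a factor of sample groups, and 'title' an optional plot title.
plot_gene_boxplot <- function(dds, gene_id, group, title = gene_id) {
  norm_counts <- counts(dds[gene_id, ], normalized = TRUE)
  df <- tibble(counts = as.numeric(norm_counts), group = as.factor(group))
  ggplot(df, aes(group, counts)) +
    geom_boxplot() +
    geom_jitter(width = 0.1) +
    labs(x = "Experimental Group", y = "Normalized Counts", title = title)
}

# Usage with the objects from above:
# plot_gene_boxplot(ddsTxi, "ENSG00000056558", samples$group, "Normalized Expression of TRAF1")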

Importing a merged Seurat dataset into Monocle

I recently ran across a situation that I think is going to be increasingly common as I do more and more single-cell analyses.  Specifically, I had a project where the investigator had several experiments in related conditions that they wanted to merge and evaluate with a pseudotime analysis.  I could not find any useful tools within Monocle itself for merging data (please correct me in the comments if I’m missing something).  It looks as if you have to import a pre-merged Seurat dataset.

Here is the workaround that I found [please note these commands are for Seurat v2, they will likely *not* work in v3]:

library(Seurat)   # these commands assume Seurat v2
library(monocle)

# Read each 10X run and create a Seurat object (v2 syntax)
naive.data <- Read10X(data.dir = paste0(base_dir, "naive_outs/filtered_gene_bc_matrices/mm10/"))
naive <- CreateSeuratObject(raw.data = naive.data, min.cells = 3, min.genes = 200, project = "10X_naive")

lcmv.data <- Read10X(data.dir = paste0(base_dir, "lcmv_outs/filtered_gene_bc_matrices/mm10/"))
lcmv <- CreateSeuratObject(raw.data = lcmv.data, min.cells = 3, min.genes = 200, project = "10X_lcmv")

pygp.data <- Read10X(data.dir = paste0(base_dir, "pygp_outs/filtered_gene_bc_matrices/mm10/"))
pygp <- CreateSeuratObject(raw.data = pygp.data, min.cells = 3, min.genes = 200, project = "10X_pygp")

# Merge the first two experiments, then add the third, prefixing cell names with the sample of origin
cd4.combined <- MergeSeurat(object1 = naive, object2 = lcmv, add.cell.id1 = "naive", add.cell.id2 = "lcmv")
cd4.combined <- AddSamples(object = cd4.combined, new.data = pygp.data, add.cell.id = "pygp")

# Coerce the sparse raw data to a dense matrix, then import into Monocle
cd4.combined@raw.data <- as.matrix(cd4.combined@raw.data)
cd4.monocle <- importCDS(cd4.combined, import_all = TRUE)

Here, I am reading in 10X data using Seurat (v2) with the Read10X function and then creating the Seurat objects with CreateSeuratObject.

Once this is done, I use MergeSeurat to merge the first two experiments, and then AddSamples to add in the final experiment.  Then we can take advantage of the Monocle function importCDS to import the combined object into Monocle.
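As a quick sanity check (my own addition, not something the workflow requires), you can confirm that the merged Seurat object and the imported CellDataSet have matching dimensions:

# Genes x cells in the merged Seurat raw data...
dim(cd4.combined@raw.data)

# ...should match the expression matrix of the imported CellDataSet
dim(exprs(cd4.monocle))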

Now there is one final problem: the "orig.ident" field of the imported CellDataSet is blank.

To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (e.g., “naive_AAACTGAGAAACCGA”).  We just need to split these names and recover which experiment each cell came from:

# Cell names in the CellDataSet carry the sample prefix added during the merge
cellnames <- rownames(pData(cd4.monocle))

# Split on the underscore and keep the first element (the sample of origin)
orig_ident <- sapply(strsplit(cellnames, split = "_"), '[', 1)
pData(cd4.monocle)$orig.ident <- orig_ident

We do a strsplit on the cell names, splitting on the underscore. The first value from each split is then assigned back into the ‘orig.ident’ column of the CellDataSet’s phenotype data (pData).
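A quick tabulation (again, my own addition) confirms that every cell now carries one of the three expected labels and shows how many cells came from each experiment:

# Count cells per original experiment; expect only "naive", "lcmv", and "pygp"
table(pData(cd4.monocle)$orig.ident)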

Now you’re ready to continue with the normal downstream analysis in Monocle.  With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged Seurat dataset):

library(cowplot)
p1 <- plot_cell_clusters(cd4.monocle, color_by = 'orig.ident')  # color by experiment of origin
p2 <- plot_cell_clusters(cd4.monocle)                           # color_by cluster is the default behavior
plot_grid(p1, p2)

And we get:

The PCA clusters on the tSNE plot (left) and orig.ident values on the tSNE plot (right). I have edited out the identities of the clusters on the right. This is unpublished data, I am using it here for educational purposes only. Please do not reproduce or copy this image.

Matataki, a new ultrafast read mapper: hope or hype?

tl;dr:  It’s only 2X faster than kallisto; a marginal, rather than “ultra” improvement.  If you’re already using kallisto in your pipelines there’s no reason to switch.   Now onto the blog post: 

Another RNA-seq read aligner

In July, a paper appeared in BMC Bioinformatics: “Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data.”   The title seems to imply that Matataki is much faster than anything else out there.  After all, what does ‘ultrafast’ mean unless it’s in comparison to the next fastest method?  The authors claim it also solves a unique problem of large-scale RNA-seq reanalysis that can’t be solved any other way.  Is this true?  Let’s dive into the details and find out.

The authors point out in the introduction that the amount of publicly available RNA-seq data is increasing rapidly and that many would like to do large-scale reanalysis of these datasets.   Reanalysis (including realignment) is necessary because of the large batch effects of using different methods for alignment and quantification.

Although these alignment-free methods such as Sailfish, Kallisto, and RNA-Skim are much faster than the alignment-based methods, the recent accumulation of large-scale sequence data requires development of an even faster method for data management and reanalysis.

Counting unique k-mers

The method itself eschews traditional alignment for a k-mer-based read-counting approach.  It takes two steps: build an index, then quantify the reads against it.  Building the index requires the algorithm to search for all k-mers that are unique to a gene and also present in all isoforms of that gene (to avoid effects of differential expression of isoforms).  Quantification proceeds by matching the k-mers from a read against the unique k-mers in the index.  If the k-mers (above a minimum count set by the user) match only one indexed gene, the read is assigned to that gene.  If they match more than one gene, the read is discarded.

If this sounds a lot like how kallisto operates under the hood, I would agree with you as would the authors:

Similar to Kallisto, our method uses k-mers that appear only once in a gene and quantifies expression from the number of unique k-mers.

Of course, kallisto creates a transcriptome de Bruijn graph (tDBG) and uses compatibility classes to decide whether to skip k-mers while hashing a read.  Matataki does not use a tDBG and simply skips along a read at a fixed interval set by the user.  The code also skips the Expectation-Maximization (EM) step found in kallisto and focuses only on gene-level quantification.  These two novel approaches are what set Matataki apart from other pseudo-alignment algorithms, according to the authors.
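To make the k-mer idea concrete, here is a toy sketch of this style of unique-k-mer classification in R. It is purely illustrative (my own simplification, not Matataki's code), with a tiny made-up index, a fixed skip interval, and a minimum-hit cutoff standing in for the user-set parameters:

# Toy unique-k-mer index: each k-mer maps to the single gene it is unique to
k <- 5
index <- c(ATCGA = "geneA", TCGAT = "geneA", CGATC = "geneA",
           GGCTA = "geneB", GCTAA = "geneB")

# Pull k-mers from a read, skipping along at a fixed interval
get_kmers <- function(read, k, skip = 2) {
  starts <- seq(1, nchar(read) - k + 1, by = skip)
  substring(read, starts, starts + k - 1)
}

# Assign a read to a gene only if its indexed k-mers all point to one gene
# and at least 'min_hits' of them were found in the index
classify_read <- function(read, index, k, skip = 2, min_hits = 2) {
  hits <- index[get_kmers(read, k, skip)]
  hits <- hits[!is.na(hits)]
  if (length(hits) >= min_hits && length(unique(hits)) == 1) unname(unique(hits)) else NA
}

classify_read("ATCGATCGATT", index, k)  # unambiguous hits to one gene: "geneA"
classify_read("GGCTAATCGAT", index, k)  # k-mers hit two genes, so the read is discarded: NA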

But does it work?

Well, it appears to.  The authors start by reporting the performance of the method on simulated data using optimized parameters (more on this below).  In Figure 2, they describe the parameter tuning that led to those choices.  To tune the parameters against real data, the TPM values from an eXpress run on sample ERR188125 were taken as “ground truth.”  The choice of ‘eXpress’ as the benchmark for real-world quantification is odd, given that eXpress has been defunct for some time.  It also takes orders of magnitude longer to run than a method like sailfish or kallisto.

Fig 2. Comparison of TPM when k was varied. The x-axis shows the TPM values of eXpress, the y-axis shows the TPM values of our method, and the color indicates the indexed k-mer coverage of each gene when changing k from 16 to 56 with a step of 8

Going back to Figure 1 (below) and the performance on simulated data, we can observe accuracy similar to that of kallisto, RSEM, sailfish and other methods  (note that by using a scale from 0.93 to 0.99, the differences between the methods seem exaggerated).  The one simulated sample was created by RSEM from the real sample plus three other related samples.  So perhaps it’s not surprising that RSEM does the best in this comparison overall.

I think it’s important to remember as well that Matataki is being run on this simulated data with parameters optimized for this sample, while the other methods are run with “default” parameters.  We don’t know how Figures 1a and 1b would change for other samples, other organisms, and other sequencing setups (PE vs. SE, long vs. short reads, etc.).  It would be enlightening to find out.

Figure 1. Summary of the results using simulation data. a Spearman correlation coefficient with the expected expression and estimated expression values using each method. “Matataki” indicates the results of the proposed method, and “MatatakiSubset” indicates the results of the proposed method without uncovered genes. To compare the gene-level expression profile and transcript-level expression profile, the sum of TPM by each gene was calculated. b Means of absolute difference from the expected expression levels.
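For reference, the two metrics shown in Figure 1 are simple to compute if you want to benchmark a quantifier yourself. A minimal sketch, assuming hypothetical numeric vectors expected_tpm and estimated_tpm of gene-level TPM values:

# Panel a: Spearman correlation between expected and estimated expression
cor(expected_tpm, estimated_tpm, method = "spearman")

# Panel b: mean absolute difference from the expected expression levels
mean(abs(estimated_tpm - expected_tpm))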

So how fast is ‘ultrafast’ anyway?

We finally get to the promised speed improvements near the end of the manuscript, only to find that the speedup relative to kallisto, a method that has been publicly available since at least 2015, is only 2-fold.  Yep, that’s it.  It’s only twice as fast.  This is an incremental improvement, not a paradigm shift.  If anything, it underscores what a genuine paradigm shift the leap from alignment to pseudo-alignment was when kallisto first came out.  It’s been three years and nothing has substantially improved on it.

Fig 3. CPU time comparison.

Limitations?

Finally, the authors end the paper with a section on ‘expected use-cases and limitations’ where they say that the method is only suitable for large-scale reanalysis and not for normal, turn-the-crank RNA-seq.

Nevertheless, Matataki is not suitable for common RNA-Seq purposes because other methods are sufficiently fast and provide better accuracy. For example, a single nucleotide substitution has larger effects in Matataki than in other methods, because even a single point substitution changes the k-mer for 2 k − 1 bases, which ultimately affects the number of k-mers in a transcript and calculation of the TPM value.
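To see why a single substitution is so costly for a pure k-mer method, here is a small illustration (my own toy example, not from the paper): every k-mer overlapping the substituted base changes, so up to k k-mers spanning 2k − 1 bases are affected.

# Count how many k-mers change after a single substitution
k <- 31
all_kmers <- function(s, k) {
  starts <- seq_len(nchar(s) - k + 1)
  substring(s, starts, starts + k - 1)
}

set.seed(1)
seq1 <- paste(sample(c("A", "C", "G", "T"), 200, replace = TRUE), collapse = "")
seq2 <- seq1
substr(seq2, 100, 100) <- "N"   # a single substitution in the middle of the sequence

sum(all_kmers(seq1, k) != all_kmers(seq2, k))   # 31 k-mers differ (one per overlapping start)
2 * k - 1                                       # ...and they span 61 bases of the transcript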

For reasons that I don’t understand, the authors promote this tool (which is only 2X faster than kallisto) as being for large-scale reanalysis, yet they performed no large-scale reanalysis to benchmark it.  Rather, they benchmarked it on ordinary RNA-seq quantification, the very use case they go on to warn against.

To sum up, while a 2X speedup is useful, I see no reason to abandon kallisto for pseudoalignment and quantification when doing large-scale reanalysis.  Unlike Matataki, kallisto is robust to read errors, and it delivers transcript-level estimates that can later be summed to the gene level if you so desire.  It also provides bootstrapped estimates of technical variation, which help in understanding the biological variation as well.  And finally, kallisto integrates nicely with sleuth for DE testing and visualization.

Conference report: ISMB 2018 Chicago

In early July, I attended the ISMB 2018 meeting, a computational biology-focused meeting held by the International Society for Computational Biology (ISCB).  The meeting was held in the beautiful Hyatt Regency hotel in downtown Chicago, just across the street from the river and blocks from Navy Pier and Lakeshore Drive.

ISMB 2018 was a huge meeting, with at least 1500 attendees and up to ten parallel meeting tracks (called “COSIs” in ISCB parlance) at any one time.  The meeting was so big I always felt as if I was missing something good, no matter which talk I went to (except for the keynotes, where nothing else was happening).

Here is Steven Salzberg’s excellent keynote, a historical overview of finding genes in the human genome:

Obviously, at a meeting this large, one cannot explore more than a tiny fraction of the talks and posters (but you can watch many of the ISMB 2018 talks on YouTube now if you’re interested).

I want to briefly summarize three talks that I particularly enjoyed and found very interesting:

1) Michael Seiler, H3 Biomedicine.  “Selective small molecule modulation of splicing in cancer”

Up to 60% of hematologic neoplasms (CLL/AML/MDS) contain heterozygous hotspot mutations in key splicing factor genes involved in 3′ splice site recognition.  One such gene is SF3B1, an RNA splicing factor that is recurrently mutated around its HEAT repeats.  The cryo-EM structure of SF3B1 revealed that the mutations cluster in the pre-mRNA-interacting region.  All of the observed mutations lead to alternate 3′ splice sites being chosen during splicing, the so-called “cryptic 3′ splice sites (AG)”. In particular, the K700E mutant slips the spliceosome upstream to an internal AG almost 18 nucleotides from the proper splice site.

RNA-seq of 200 CLL patients helped to discover further in-frame deletions in SF3B1, SRSF2, and U2AF1.  All of these mutations are heterozygous, and the cells rely on the WT copy for survival.

Seiler asked whether one could exploit this weakness to attack cancer cells.  The idea is to modulate the activity of SF3B1 using small molecules to disturb the WT function.  At H3, they took a natural-product compound library and optimized for chemistry and binding to find candidates.  One molecule, H3B-8800, showed promise; it was optimized toward selective induction of apoptosis in SF3B1-mutant cells in vitro. When resistance mutations occur, they are all at points of contact with the drug.  At a dose of only 13 nM, around 1% of splicing events were affected, which was enough to kill the cancer cells.  RNA-seq showed that it was mainly the spliceosome genes themselves that were affected by the altered splicing in the presence of the inhibitor, revealing a delicate feedback loop between correct splicing and spliceosome gene expression.

2) Curtis Huttenhower,  Harvard University.  “Methods for multi-omics in microbial community population studies”

Huttenhower’s group at Harvard is well known for its contributions to methods for microbiome analysis, collectively known as the “biobakery.”  In this talk, Huttenhower addressed the fact that the microbiome is increasingly of interest when looking at population health, and that chronic immune disease incidence has been rising around the world over the last few decades.

Huttenhower also spent a good amount of time describing the IBD-MDB, the inflammatory bowel disease multi-omics database.  The database contains ~200 samples that are complete with six orthogonal data types, including RNA-seq, metagenome, and metabolome data.  He talked about how they can associate bacteria enriched in IBD with covariation in the prevalence of metabolites.  For example, he showed how sphingolipids, carboximidic acids, cholesteryl esters, and more are enriched, while lactones, beta-diketones, and others are depleted, in the guts of Crohn’s disease sufferers.

3) Olga Troyanskaya, Princeton University.  “ML approaches for the data-driven study of human disease”

Troyanskaya’s group at Princeton aims to develop accurate models of the complex processes underlying cellular function.  In this talk, she described efforts in her lab to use machine learning methods to understand how single nucleotide polymorphisms (SNPs) in non-coding regions can affect gene regulation and expression.  She is also interested in how pathways and networks change in different tissues and whether tissue-specific maps can be used to identify disease genes.

In particular, Dr. Troyanskaya described the “DeepSEA” software, which uses a convolutional neural network (CNN) to attempt to predict the chromatin remodeling consequences of a SNP in its genomic context.  The model is a three-layer CNN that takes 1000 base pairs of sequence as input.  The 1000 bp region is centered on known TF binding locations (200 bp bins).  The training data consist of a length-919 vector of binary values recording the presence of a TF binding event across the genome (1 = binds, 0 = does not bind).  The output of the model is the probability that the specific sequence variant will affect TF binding at each of the 919 bins.

Schematic of the DeepSEA model, showing input data, training data, and output.
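For readers curious what such a model looks like in code, here is a rough, simplified sketch of a DeepSEA-style architecture using the R keras interface. The only details taken from the description above are the 1000 bp one-hot sequence input and the 919 binary chromatin-feature outputs; the filter counts, kernel sizes, and pooling choices below are illustrative placeholders, not the published DeepSEA hyperparameters.

library(keras)

# Input: 1000 bp of one-hot encoded sequence (4 channels: A, C, G, T)
# Output: independent probabilities for 919 binary chromatin features
model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 320, kernel_size = 8, activation = "relu",
                input_shape = c(1000, 4)) %>%
  layer_max_pooling_1d(pool_size = 4) %>%
  layer_conv_1d(filters = 480, kernel_size = 8, activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 4) %>%
  layer_conv_1d(filters = 960, kernel_size = 8, activation = "relu") %>%
  layer_flatten() %>%
  layer_dense(units = 919, activation = "sigmoid")

# Multi-label setup: each of the 919 outputs is its own binary prediction
model %>% compile(optimizer = "adam",
                  loss = "binary_crossentropy",
                  metrics = "accuracy")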

The DeepSEA model can thus be used to flag potentially important sequence variants, and even eQTLs, for further study, or to prioritize a list of known non-coding sequence variants.

Developers versus consumers of bioinformatics analysis tools

Life in the middle

As a bioinformatics applications scientist, I work in a middle ground between those who develop code for analyzing next-gen sequencing data and those who consume that analysis.   The developers are often people trained in computer science, mathematics, and statistics.   The consumers are often people trained in biology and medicine.    There is some overlap, of course, but if you’ll allow a broad generalization, I think the two groups (developers and consumers) are separated by large cultural differences in science.

Ensemble focus vs. single-gene focus

I think one of the biggest differences, in my experience, is the approach to conceptualizing the results of a next-gen sequencing experiment (e.g., RNA-seq).  People on the methods side tend to think in terms of ensembles and distributions.  They are interested in how the variation observed across all 30,000 genes can be modeled and used to estimate differential expression.  They are interested in concepts like shrinkage estimators, Bayesian priors, and hypothesis weighting.  The table of differentially expressed features is thought to have meaning mainly in the statistical analysis of pathway enrichment (another ensemble).

Conversely, biologists have a drastically different view.   Often they care about a single gene or a handful of genes and how those genes vary across conditions of interest in the system that they are studying.  This is a rational response to the complexity of biological systems; no one can keep the workings of hundreds of genes in mind.  In order to make progress, you must focus.  However, this narrow focus leads investigators to sometimes cherry pick results pertaining to their ‘pet’ genes.  This can invest those results with more meaning than is warranted from a single experiment.

The gene-focus of biologists also leads to clashes with the ensemble-focus of bioinformatics software.  For example, major DE analysis packages like DESeq2 do not have convenience functions for making volcano plots, even though I’ve found that those kinds of plots are the most useful and easiest for biologists to understand.  Sleuth does have a volcano plotting function, but it doesn’t allow for labeling genes.  In my experience, however, biologists want a high-res figure with common gene name labels (not Ensembl transcript IDs) that they can circle and consider for wet lab validation.
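Since this comes up so often, here is a minimal sketch of the kind of labeled volcano plot I mean, built with ggplot2 and the ggrepel package from a DESeq2 results table. The log2FoldChange and padj columns are standard DESeq2 output; the res_df data frame and its symbol column of common gene names are my own assumptions and would come from your results and annotation step.

library(ggplot2)
library(ggrepel)

# 'res_df' is assumed to be a data.frame of DESeq2 results with an added
# 'symbol' column of common gene names (e.g., from an annotation package)
res_df$significant <- !is.na(res_df$padj) & res_df$padj < 0.05

ggplot(res_df, aes(log2FoldChange, -log10(padj), color = significant)) +
  geom_point(alpha = 0.5) +
  # label only the strongest hits; the thresholds here are arbitrary examples
  geom_text_repel(data = subset(res_df, significant & abs(log2FoldChange) > 2),
                  aes(label = symbol), size = 3, color = "black") +
  labs(x = "log2 fold change", y = "-log10 adjusted p-value",
       title = "Volcano plot with gene symbol labels")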

New software to address the divide?

I am hopeful about the recent release of the new “bcbioRNASeq” package from the Harvard Chan School bioinformatics group (developers of the ‘bcbio’ RNA-seq pipeline software).  It appears to be a step towards making it easier for people like me to walk on both sides of this cultural divide.  The package takes all of the outputs of the ‘bcbio’ pipeline and transforms them into an accessible S4 object that can be operated on quickly and simply within R.

Most importantly, the ‘bcbioRNASeq’ package allows for improved graphics and plotting that make communication with biologists easier.  For example, the new MA and volcano plots appear to be based on ‘ggplot2’ graphics and are quite pretty (and have text label options(!), not shown here):

I still need to become more familiar with the ‘bcbioRNASeq’ package, but it looks quite promising.  ‘pcaExplorer‘ is another R/Bioconductor package that I feel does a great job of making accessible RNA-seq reports quickly and easily.  I suspect we will see continued improvements to make plots prettier and more informative, with an increasing emphasis on interactive, online plots and notebooks rather than static images.