Screencast tutorial: Sleuth for RNA-Seq
Beware of biological variability in your *-Seq experiments
From this excellent paper on biological variability in RNA-Seq experiments (bold highlights are mine):
“Biological variability has important implications for the design, analysis and interpretation of RNA-sequencing experiments. […] If only a few biological replicates are available, it will be impossible to estimate the level of biological variability in expression for each gene in a study. Supplementary Table 1 summarizes a large number of published RNA-sequencing studies over the past three years. In every case, except for the two studies we analyzed here, conclusions were based on a small number (n ≤ 2) of biological replicates. One goal of RNA-sequencing studies may be simply to identify and catalog expression of new or alternative transcripts. However, all of these studies make broader biological statements on the basis of a very small set of biological replicates.
Our analysis has two important implications for studies performed with a small number of biological replicates. First, significant results in these studies may be due to biological variation and may not be reproducible; and second, it is impossible to know whether expression patterns are specific to the individuals in the study or are a characteristic of the study populations. These ideas are now widely accepted for DNA microarray experiments, where a large number of biological replicates are now required to justify scientific conclusions. Our analysis suggests that as biological variability is a fundamental characteristic of gene expression, sequencing experiments should be subject to similar requirements.”
If you are doing RNA-Seq, be very vigilant in your experimental design and find a way to incorporate more replicates, even at the expense of testing fewer comparisons. It’s better to test one comparison (tissue X vs. Y, for example) with 5 or more replicates than to test three comparisons (Tissue X vs. Y, Y vs. Z, and X vx Z) with only 2 replicates for each tissue type.
Create a volcano plot on EBSeq output
The differential expression analysis program EBSeq produces a number of data objects as part of the workflow, but there aren’t many options for visualization of the data.
The authors suggest the use of heatmap.2 in R:
heatmap.2(NormalizedMatrix[
However, this depends on knowing ahead of time your genes of interest. It is not practical to generate a heatmap with hundreds or thousands of DE genes.
I wanted to produce something approximating a volcano plot for EBSeq results. What I came up with initially was the following:
To make this plot, I had to grab some data arrays from the large object “EBOut” that is produced by calling the “EBTest” method and from the “GeneFC” object as below:
plot(GeneFC$PostFC, EBOut$PPDE, xlim =c(0,5), ylim=c(0,1), main="Control/Experimental FC vs. PPDE", sub=GeneFC$Direction, xlab="EBSeq Posterior Fold Change", ylab="EBSeq posterior prob of DE") abline(h=0.95)
The “abline” command places a horizontal line where the PPDE is equal to or greater than 95%. This would be equivalent to an FDR of 0.05.
If you want to inspect the plot interactively in R to identify gene names above the threshold and/or with large posterior fold changes you would use:
identify(GeneFC$PostFC, EBOut$PPDE, labels=names(GeneFC$PostFC))
To make it look more like a canonical volcano plot, I then tried:
plot(log2(GeneFC$RealFC), EBOut$PPDE, xlim =c(-5,5), ylim=c(0,1), main="Log2FoldChange vs. PPDE", xlab="EBSeq Log2 Fold Change", ylab="EBSeq PPDE")
Creating the following plot:
This is good, except I want to subset the data and add colors. To do this I need to create a new dataframe from the EBOut$PPDE and GeneFC$PostFC objects:
volc_df = data.frame(EBOut$PPDE, GeneFC$PostFC)
With everything in one dataframe, plotting and subsetting the data is easier. Inspired by this post at Stephen Turner’s “Getting Genetics Done” blog, I prepared my final colored volcano plot as follows:
with(volc_df, plot(log2(PostFC), PPDE, pch=20, main="Volcano Plot EBSeq", xlim=c(-5,5))) abline(h=0.95) with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) < 2), points(log2(PostFC), PPDE, pch=20, col="orange")) with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) < 2), points(log2(PostFC), PPDE, pch=20, col="red"))
The final plot looks like this: