From our IIHG Bioinformatics Workshop series:
Seurat is a popular R/Bioconductor package for working with single-cell RNA-seq data. As part of the very first steps of filtering and quality-controlling scRNA-seq data in Seurat, you calculate the % mitochondrial gene expression in each cell, and filter out cells above a threshold. The tutorial provides the following code for doing this in human cells:
mito.genes = grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE) percent.mito = Matrix::colSums(firstname.lastname@example.org[mito.genes, ])/Matrix::colSums(email@example.com) pbmc = AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito") VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
Creating a catalog of mitochondrial genes by searching with ‘grep’ for any gene names that start with “MT-” works just fine for the human reference transcriptome. Unfortunately, it doesn’t work for mouse (at least for mm10, which is the reference assembly I’m working with). There are two workarounds for this, in my opinion.
The easiest is to change the regular expression in the “grep” command from “^MT-” to “^mt-” since a search through the mm10 reference (version 3.0.0) in the cellranger reference files reveals that for whatever reason, the MT genes are labeled with lowercase ‘mt’ instead.
A second, and perhaps more thorough, approach is to take advantage of the Broad Institute’s “Mouse Mitocarta 2.0” encyclopedia of mitochondrial genes (note that you could do this same procedure for human MT genes too).
By creating a list of the top 100-200 genes with the strongest evidence for MT expression, it seems likely that you more accurately capture true mitochondrial gene expression. Below is some code to use the “MitoCarta 2.0” (downloaded as a CSV file) for this procedure. You will need to import “tidyverse” to work with tibbles:
library(tidyverse) library(seurat) mouse_mito = as.tibble(read.csv("Mouse.MitoCarta2.0_page2.csv", header = TRUE)) mouse_mito = mouse_mito %>% select(c(Symbol, MCARTA2.0_score)) %>% slice(1:100) mito.genes = as.character(mouse_mito$Symbol) mito.genes = mito.genes[mito.genes %in% rownames(firstname.lastname@example.org)] percent.mito = Matrix::colSums(email@example.com[mito.genes,]) / Matrix::colSums(firstname.lastname@example.org)
The ubiquitous RNAseq analysis package, DESeq2, is a very useful and convenient way to conduct DE gene analyses. However, it lacks some useful plotting tools. For example, there is no convenience function in the library for making nice-looking boxplots from normalized gene expression data.
There are other packages one can rely on, for example ‘pcaExplorer’, but I like a simple approach sometimes to plot just a couple of genes. So below I show you how to quickly plot your favorite gene using only ggplot2 (there is no “one weird trick” though…):
traf1_counts <- counts(ddsTxi['ENSG00000056558',], normalized = TRUE) m <- list(counts = as.numeric(traf1_counts), group = as.factor(samples$group)) m <- as.tibble(m) q <- ggplot(m, aes(group, counts)) + geom_boxplot() + geom_jitter(width = 0.1) q <- q + labs(x = "Experimental Group", y = "Normalized Counts ", title = "Normalized Expression of TRAF1") q <- q + scale_x_discrete(labels=c("00hn" = "PMN, 0hrs", "06hn" = "PMN, 6hrs", "06hy" = "PMN+Hp, 6hrs", "24hn" = "PMN, 24hrs", "24hy" = "PMN+Hp, 24hrs")) q
As you can see above, first we must grab the normalized counts at the row corresponding with the Traf1 Ensembl ID using the ‘
counts‘ function that operates on the ‘ddsTxi’
In order to create a dataframe (well, a tibble to be specific) for plotting, we first create a list (‘m’) that combines the counts (as a numeric vector) and metadata group. These two vectors will form the columns of the tibble for plotting, and we must give them names (i.e., “counts” and “group”) so the tibble conversion doesn’t complain.
The list, m, is then converted to a tibble with ‘
as.tibble‘ and plotted with ggplot2, using an ‘
aes(group,counts)‘ aesthetic plus a boxplot aesthetic. The rest of the code is just modifying axis labels and tickmarks. The final product looks like this:
I recently ran across a situation that I think is going to be increasingly common as I do more and more single-cell analyses. Specifically, I had a project where the investigator had several experiments in related conditions that they want to merge and evaluate with a pseudotime analysis. I could not find any useful tools within Monocle itself for merging data (please correct me in the comments if I’m missing something). It looks as if you have to import a pre-merged seurat dataset.
Here is the workaround that I found [please note these commands are for Seurat v2, they will likely *not* work in v3]:
naive.data <- Read10X(data.dir = paste0(base_dir, "naive_outs/filtered_gene_bc_matrices/mm10/")) naive <- CreateSeuratObject(raw.data = naive.data, min.cells = 3, min.genes = 200, project = "10X_naive") lcmv.data <- Read10X(data.dir = paste0(base_dir, "lcmv_outs/filtered_gene_bc_matrices/mm10/")) lcmv <- CreateSeuratObject(raw.data = lcmv.data, min.cells = 3, min.genes = 200, project = "10X_lcmv") pygp.data <- Read10X(data.dir = paste0(base_dir, "pygp_outs/filtered_gene_bc_matrices/mm10/")) pygp <- CreateSeuratObject(raw.data = pygp.data, min.cells = 3, min.genes = 200, project = "10X_pygp") cd4.combined <- MergeSeurat(object1 = naive, object2 = lcmv, add.cell.id1 = "naive", add.cell.id2 = "lcmv") cd4.combined <- AddSamples(object = cd4.combined, new.data = pygp.data, add.cell.id = "pygp") email@example.com <- as.matrix(firstname.lastname@example.org) cd4.monocle <- importCDS(cd4.combined, import_all = TRUE)
Here, I am reading in 10X data using Seurat (v2) w/ the
Read10X function and then creating the Seurat object with
Once this done I use
MergeSeurat to merge the first two experiments, and then
AddSamples to add in the final experiment. Then we can take advantage of the monocle function
importCDS to import the combined object into monocle.
Now there is one final problem and that is that the “orig.ident” field is blank:
To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (i.e., “naive_AAACTGAGAAACCGA”). We just need to split these and recover which experiment each cell came from with:
cellnames <- rownames(pData(cd4.monocle)) orig_ident <- sapply(strsplit(cellnames, split = "_"), '[', 1) pData(cd4.monocle)$orig.ident <- orig_ident
We do a
strsplit on the cellnames, splitting on underscore. The first value from the split in each case is assigned back into the ‘orig.ident’ field of the cell dataset object.
Now you’re ready to continue with the normal downstream analysis in monocle. With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged seurat dataset):
library(cowplot) p1 <- plot_cell_clusters(cd4.monocle, color_by = 'orig.ident') p2 <- plot_cell_clusters(cd4.monocle) #color_by cluster is default behavior plot_grid(p1,p2)
And we get:
Life in the middle
As a bioinformatics applications scientist, I work in a middle ground between those who develop code for analyzing next-gen sequencing data and those who consume that analysis. The developers are often people trained in computer science, mathematics, and statistics. The consumers are often people trained in biology and medicine. There is some overlap, of course, but if you’ll allow a broad generalization, I think the two groups (developers and consumers) are separated by large cultural differences in science.
Ensemble focus vs. single-gene focus
I think one of the biggest differences from my experience is the approach to conceptualizing the results of a next-gen seq experiment (e.g., RNA-seq). People on the methods side tend to think in terms of ensembles and distributions. They are interested in how the variation observed across all 30,000 genes can be modeled and used to estimate differential expression. They are interested in concepts like shrinkage estimators, bayesian priors, and hypothesis weighting. The table of differentially expressed features is thought to have meaning mainly in the statistical analysis of pathway enrichment (another ensemble).
Conversely, biologists have a drastically different view. Often they care about a single gene or a handful of genes and how those genes vary across conditions of interest in the system that they are studying. This is a rational response to the complexity of biological systems; no one can keep the workings of hundreds of genes in mind. In order to make progress, you must focus. However, this narrow focus leads investigators to sometimes cherry pick results pertaining to their ‘pet’ genes. This can invest those results with more meaning than is warranted from a single experiment.
The gene-focus of biologists also leads to clashes with the ensemble-focus of bioinformatics software. For example, major DE analysis packages like DESeq2 do not have convenience functions to make volcano plots, even though I’ve found that those kinds of plots are the most useful and easiest to understand for biologists. Sleuth does have a volcano plotting function, but doesn’t allow for labeling genes. In my experience, however, biologists want a high-res figure with common name gene labels (not ensemble transcript IDs) that they can circle and consider for wet lab validation.
New software to address the divide?
I am hopeful about the recent release of the new “bcbioRNASeq” package from Harvard Chan school bioinformatics (developers of ‘bcbio’ RNA-seq pipeline software). It appears to be a step towards making it easier for people like me to walk on both sides of this cultural divide. The package takes all of the outputs of the ‘bcbio’ pipeline and transforms them into an accessible S4 object that can be operated on quickly and simply within R.
Most importantly, the ‘bcbioRNASeq’ module allows for improved graphics and plotting that make communication with biologists easier. For example, the new MA and volcano plots appear to be based on ‘ggplot2’ graphics and are quite pretty (and have text label options(!), not shown here):
I still need to become familiar with the ‘bcbioRNASeq’ package, but it looks quite promising. ‘pcaExplorer‘ is another R/Bioconductor package that I feel does a great job of making accessible RNA-seq reports quickly and easy. I suspect we will see continued improvements to make plots prettier and more informative, with an increasing emphasis on interactive, online plots and notebooks rather than static images.