Calculate % mitochondrial for mouse scRNA-seq

Seurat is a popular R/Bioconductor package for working with single-cell RNA-seq data. As part of the very first steps of filtering and quality-controlling scRNA-seq data in Seurat, you calculate the % mitochondrial gene expression in each cell, and filter out cells above a threshold. The tutorial provides the following code for doing this in human cells:

 
mito.genes = grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito = Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)


pbmc = AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)

Creating a catalog of mitochondrial genes by searching with ‘grep’ for any gene names that start with “MT-” works just fine for the human reference transcriptome. Unfortunately, it doesn’t work for mouse (at least for mm10, which is the reference assembly I’m working with). There are two workarounds for this, in my opinion.

The easiest is to change the regular expression in the “grep” command from “^MT-” to “^mt-” since a search through the mm10 reference (version 3.0.0) in the cellranger reference files reveals that for whatever reason, the MT genes are labeled with lowercase ‘mt’ instead.

A second, and perhaps more thorough, approach is to take advantage of the Broad Institute’s “Mouse Mitocarta 2.0” encyclopedia of mitochondrial genes (note that you could do this same procedure for human MT genes too).

By creating a list of the top 100-200 genes with the strongest evidence for MT expression, it seems likely that you more accurately capture true mitochondrial gene expression. Below is some code to use the “MitoCarta 2.0” (downloaded as a CSV file) for this procedure. You will need to import “tidyverse” to work with tibbles:

library(tidyverse)
library(seurat)

mouse_mito = as.tibble(read.csv("Mouse.MitoCarta2.0_page2.csv", header = TRUE))
mouse_mito = mouse_mito %>% select(c(Symbol, MCARTA2.0_score)) %>% slice(1:100)
mito.genes = as.character(mouse_mito$Symbol)
mito.genes = mito.genes[mito.genes %in% rownames(sample2@raw.data)]

percent.mito = Matrix::colSums(sample2@raw.data[mito.genes,]) / Matrix::colSums(sample2@raw.data)

Importing a merged Seurat dataset into Monocle

I recently ran across a situation that I think is going to be increasingly common as I do more and more single-cell analyses.   Specifically, I had a project where the investigator had several experiments in related conditions that they want to merge and evaluate with a pseudotime analysis.   I could not find any useful tools within Monocle itself for merging data (please correct me in the comments if I’m missing something).   It looks as if you have to import a pre-merged seurat dataset.

Here is the workaround that I found [please note these commands are for Seurat v2, they will likely *not* work in v3]:

naive.data <- Read10X(data.dir = paste0(base_dir, "naive_outs/filtered_gene_bc_matrices/mm10/"))
naive <- CreateSeuratObject(raw.data = naive.data, min.cells = 3, min.genes = 200, project = "10X_naive")

lcmv.data <- Read10X(data.dir = paste0(base_dir, "lcmv_outs/filtered_gene_bc_matrices/mm10/"))
lcmv <- CreateSeuratObject(raw.data = lcmv.data, min.cells = 3, min.genes = 200, project = "10X_lcmv")

pygp.data <- Read10X(data.dir = paste0(base_dir, "pygp_outs/filtered_gene_bc_matrices/mm10/"))
pygp <- CreateSeuratObject(raw.data = pygp.data, min.cells = 3, min.genes = 200, project = "10X_pygp")

cd4.combined <- MergeSeurat(object1 = naive, object2 = lcmv, add.cell.id1 = "naive", add.cell.id2 = "lcmv")
cd4.combined <- AddSamples(object = cd4.combined, new.data = pygp.data, add.cell.id = "pygp")

cd4.combined@raw.data <- as.matrix(cd4.combined@raw.data)
cd4.monocle <- importCDS(cd4.combined, import_all = TRUE)

Here, I am reading in 10X data using Seurat (v2) w/ the Read10X function and then creating the Seurat object with CreateSeuratObject.

Once this done I use MergeSeurat to merge the first two experiments, and then AddSamples to add in the final experiment.   Then we can take advantage of the monocle function importCDS to import the combined object into monocle.

Now there is one final problem and that is that the “orig.ident” field is blank:

 

 

 

To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (i.e., “naive_AAACTGAGAAACCGA”).   We just need to split these and recover which experiment each cell came from with:

cellnames <- rownames(pData(cd4.monocle))
orig_ident <- sapply(strsplit(cellnames, split = "_"), '[', 1)
pData(cd4.monocle)$orig.ident <- orig_ident

We do a strsplit on the cellnames, splitting on underscore. The first value from the split in each case is assigned back into the ‘orig.ident’ field of the cell dataset object.

Now you’re ready to continue with the normal downstream analysis in monocle.  With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged seurat dataset):

library(cowplot)
p1 <- plot_cell_clusters(cd4.monocle, color_by = 'orig.ident')
p2 <- plot_cell_clusters(cd4.monocle) #color_by cluster is default behavior
plot_grid(p1,p2)

And we get:

The PCA clusters on the tSNE plot (left) and orig.ident values on the tSNE plot (right). I have edited out the identities of the clusters on the right. This is unpublished data, I am using it here for educational purposes only. Please do not reproduce or copy this image.