Importing a merged Seurat dataset into Monocle

I recently ran across a situation that I think is going to be increasingly common as I do more and more single-cell analyses.   Specifically, I had a project where the investigator had several experiments in related conditions that they want to merge and evaluate with a pseudotime analysis.   I could not find any useful tools within Monocle itself for merging data (please correct me in the comments if I’m missing something).   It looks as if you have to import a pre-merged seurat dataset.

Here is the workaround that I found [please note these commands are for Seurat v2, they will likely *not* work in v3]:

naive.data <- Read10X(data.dir = paste0(base_dir, "naive_outs/filtered_gene_bc_matrices/mm10/"))
naive <- CreateSeuratObject(raw.data = naive.data, min.cells = 3, min.genes = 200, project = "10X_naive")

lcmv.data <- Read10X(data.dir = paste0(base_dir, "lcmv_outs/filtered_gene_bc_matrices/mm10/"))
lcmv <- CreateSeuratObject(raw.data = lcmv.data, min.cells = 3, min.genes = 200, project = "10X_lcmv")

pygp.data <- Read10X(data.dir = paste0(base_dir, "pygp_outs/filtered_gene_bc_matrices/mm10/"))
pygp <- CreateSeuratObject(raw.data = pygp.data, min.cells = 3, min.genes = 200, project = "10X_pygp")

cd4.combined <- MergeSeurat(object1 = naive, object2 = lcmv, add.cell.id1 = "naive", add.cell.id2 = "lcmv")
cd4.combined <- AddSamples(object = cd4.combined, new.data = pygp.data, add.cell.id = "pygp")

cd4.combined@raw.data <- as.matrix(cd4.combined@raw.data)
cd4.monocle <- importCDS(cd4.combined, import_all = TRUE)

Here, I am reading in 10X data using Seurat (v2) w/ the Read10X function and then creating the Seurat object with CreateSeuratObject.

Once this done I use MergeSeurat to merge the first two experiments, and then AddSamples to add in the final experiment.   Then we can take advantage of the monocle function importCDS to import the combined object into monocle.

Now there is one final problem and that is that the “orig.ident” field is blank:

 

 

 

To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (i.e., “naive_AAACTGAGAAACCGA”).   We just need to split these and recover which experiment each cell came from with:

cellnames <- rownames(pData(cd4.monocle))
orig_ident <- sapply(strsplit(cellnames, split = "_"), '[', 1)
pData(cd4.monocle)$orig.ident <- orig_ident

We do a strsplit on the cellnames, splitting on underscore. The first value from the split in each case is assigned back into the ‘orig.ident’ field of the cell dataset object.

Now you’re ready to continue with the normal downstream analysis in monocle.  With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged seurat dataset):

library(cowplot)
p1 <- plot_cell_clusters(cd4.monocle, color_by = 'orig.ident')
p2 <- plot_cell_clusters(cd4.monocle) #color_by cluster is default behavior
plot_grid(p1,p2)

And we get:

The PCA clusters on the tSNE plot (left) and orig.ident values on the tSNE plot (right). I have edited out the identities of the clusters on the right. This is unpublished data, I am using it here for educational purposes only. Please do not reproduce or copy this image.

 

New “10 simple rules” paper for bioinformatics collaborations coming soon…

Excited to announce that I’ve been working with some fantastic computational biologists (that I met at ISMB2018) on a “10 simple rules” style paper for creating and promoting effective bioinformatics collaborations with wet-lab biologists.  We will leverage our many years of combined bioinformatics core experience to create these “10 simple rules.”

We will touch on:

–experiment planning and design.

–data management plans, data QC, and record keeping.

–avoiding batch effects and contamination.

–managing expectations and developing clear communications.

–handling low-quality data when things do go wrong.

Hope to have this out by the end of 2018…watch this space!

Confounding in *-seq experiments

I see a lot of experimental design that is confounded.  Please read this important essay to understand why this must be avoided to have quality results from expensive *-seq experiments.

Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiments, learn about randomization.

 

Decode SAM files with these handy references

I recently had to inspect some genomic alignments as part of a project.  Usually, I am just working with BAM files and if inspection is needed, I just visualize the pileups to see what is going on.

In this case, I just wanted a quick answer to how the reads were aligning to the reference, and I didn’t want to go through the process of subsetting and copying the BAM files to my local machine.

The SAM file is the uncompressed record of the read alignments produced by an aligner method (STAR, TopHat, BWA, etc….).   This file can get very large, and so is usually compressed into BAM (faster for machine parsing, but not human readable) and the SAM file is discarded.

In my case, I still had the SAM files around to inspect.  If you find yourself needing to read a SAM file, here are three helpful reference tools to make the process less painful:

1)  This page has an enormous amount of detail about SAM files including this helpful chart that enumerates all of the fields that you can expect to find specified within each alignment:

SAM file structure explained in this handy chart.

 

2) This post from the blog “zenfractal.com” contains a great exposition on CIGAR strings and how to decode them:

3)  And finally, if you’re trying to decode the SAM bitwise flags, you can calculate them using this tool from the Broad Institute:

Decode SAM flags with this handy online tool from the Broad Institute.