Bcbio is a configuration-based pipeline manager for common NGS workflows. It uses a YAML-config file to set all of the inputs and specifications for pipeline. I’ve used bcbio for dozens of RNA-seq projects, but I’ve never known exactly what it is doing during the pipeline itself. This is because in order to see the exact commands being run you have to either dig into the code, or dig through the log files.
Digging through code is difficult because the code base is large and there are many different pieces of code that call each other. Digging through the logs is difficult when there are dozens of samples (each command is repeated dozens of times, leading to log files with thousands of lines). Well, I finally gave in and sorted through the RNA-seq pipeline command logs to identify the unique steps that bcbio (version 1.0.8) is performing in order to produce its results. I was able to identify 21 unique steps that are performed on each sample.
The difficulty of figuring out exactly what a configuration-based pipeline like bcbio is going to do is one argument in favor of using software like snakemake or nextflow to create or adapt existing pipelines, where the actual steps in the pipeline are made very explicit in “process” blocks. I’m going to be writing more about NextFlow in upcoming posts.
Of these 21 steps, 17 steps all deal with creating a BAM file and then manipulating that BAM file or calculating something about the BAM file. The remainder mainly deal with pseudo-alignment using salmon. It’s somewhat ironic that most of the pipeline and computational time is taken up with creating and manipulating BAM files since I only ever use the salmon pseudo-alignments in my downstream analysis.
Here are the 21 steps of the bcbio RNA-seq workflow (I’ve deleted the long, user-specific file paths to show just the commands):
Step 1. Align with Hisat2
hisat2 --new-summary -x bcbio-1.0.8/genomes/Hsapiens/hg38/hisat2/hg38 -p 16 --phred33 --rg-id SW872_CAMTA1_rep1 --rg PL:illumina --rg PU:1_2019-03-11_to_setup_bcbio --rg SM:SW872_CAMTA1_rep1 -1 SW872_CAMTA1_rep1_R1.fastq.gz -2 SW872_CAMTA1_rep1_R2.fastq.gz --known-splicesite-infile bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts-splicesites.txt
Step 2/3. Pipe to bamsormadup and redirect to sorted BAM
| bamsormadup inputformat=sam threads=12 tmpfile=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort-sorttmp-markdup SO=coordinate indexfilename=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam.bai > work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam
Step 4. Index BAM
samtools index -@ 16 work/align/SW872_TAZ4SA_rep3/SW872_TAZ4SA_rep3-sort.bam /work/bcbiotx/tmpsqOnZQ/SW872_TAZ4SA_rep3-sort.bam.bai
Step 5. Samtools sort by read names
samtools sort -@ 16 -m 2457M -O BAM -n -T work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted-sort -o /work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted.bam /work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam
Step 6. Sambamba view to select only primary alignments
sambamba view -t 16 -f bam -F "not secondary_alignment" work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.bam> work/bcbiotx/tmp0zhZuj/SW872_CAMTA1_rep1-sort.nsorted.primary.bam
Step 7. FeatureCounts to count primary alignments in BAM
featureCounts -a /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf -o work/bcbiotx/tmp77coEk/SW872_CAMTA1_rep1.counts -s 0 -p -B -C work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.primary.bam
Step 8. Gffread to write a fasta file with spliced exons
gffread -g /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/seq/hg38.fa -w work/bcbiotx/tmpNpBGRC/hg38.fa.tmp /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf
Step 9. Build the salmon index
salmon index -k 31 -p 16 -i /work/bcbiotx/tmpTQDS7X/hg38 -t work/inputs/transcriptome/hg38.fa
Step 10. Pseudo-alignment and quantification
salmon quant -l IU -i work/salmon/index/hg38 -p 16 --gcBias -o work/bcbiotx/tmpE_RRDN/quant -1 <(gzip -cd /merged/SW872_CAMTA1_rep1_R1.fastq.gz) -2 <(gzip -cd /merged/SW872_CAMTA1_rep1_R2.fastq.gz) --numBootstraps 30
Step 11. Convert salmon output to sleuth format
Rscript -e 'library("wasabi"); prepare_fish_for_sleuth(c("work/bcbiotx/tmpE_RRDN/quant"))'
Step 12. Downsample BAM file with samtools view
samtools view -O BAM -@ 16 -o work/bcbiotx/tmphaXqSf/SW872_CAMTA1_rep1-sort-downsample.bam -s 42.269 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam
Step 13. FASTQC on downsampled BAM
export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH && /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/fastqc -d work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -t 16 --extract -o work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -f bam work/qc/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-downsample.bam
Step 14. Run Qualimap RNAseq on BAM
unset DISPLAY && export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH && /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/qualimap rnaseq -outdir work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1 -a proportional -bam work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam -p non-strand-specific -gtf /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf --java-mem-size=59g
Step 15. A SED command (not sure exactly what it does)
sed -i 's/bam file = .*/bam file = SW872_CAMTA1_rep1.bam/' work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1/rnaseq_qc_results.txt
Step 16. Mark duplicates on the BAM file
bammarkduplicates tmpfile=work/bcbiotx/tmpNdl3wy/SW872_CAMTA1_rep1-sort-dedup-markdup markthreads=16 I=work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam O=work/bcbiotx/tmprVQeKM/SW872_CAMTA1_rep1-sort-dedup.bam
Step 17. Index de-duplicated BAM file
samtools index -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam work/bcbiotx/tmpFAzLLT/SW872_CAMTA1_rep1-sort-dedup.bam.bai
Step 18. Use Sambamba view to create duplicate metrics
sambamba view --nthreads 16 --count -F 'duplicate and not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt
Step 19. Use Sambamba to create mapping metrics
sambamba view --nthreads 16 --count -F 'not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt
Step 20. Samtools stats on sorted BAM
samtools stats -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > /work/bcbiotx/tmpUPSiOz/SW872_CAMTA1_rep1.txt
Step 21. Samtools idxstats on sorted BAM
samtools idxstats work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > work/bcbiotx/tmpSKFNZQ/SW872_CAMTA1_rep1-idxstats.txt
Single-cell RNA-Seq methods, which sequence and barcode the transcripts within individual cells in a sample, hold enormous promise for understanding transcriptional networks in development and disease. Single-cell investigation of biological phenomena is taking the life sciences world by storm. For example, Science magazine selected single-cell methods as the 2018 “Breakthrough of the Year.”
Closer to home, our bioinformatics group here at the University of Iowa is also seeing a rapid increase in the number of scRNA-seq projects in the research pipeline. Yet with all of this interest and funding, scRNA-seq is still an emerging field with little agreement on best practices.
We see evidence of this when considering one of the main problems of scRNA-seq datasets: dropouts. ‘Dropouts’ are zero-values in the data arising from technical and biological noise. Often the dropout rate can reach up to 90% or more, degrading the ability of the analysis to detect fine structure in the data and low- and moderately expressed DE genes between cell types.
One way to combat this problem is to borrow information across genes within a sample and use that to predict imputed expression values for the missing genes. Another related approach is called data ‘smoothing,’ that attempts to lower the noise in observed values. There are several methods (MAGIC, scImpute, DrImpute, and SAVER) that have been published recently that attempt to do one or both of these approaches. While the authors of each method focus on the advantages of imputation, there can also be drawbacks caused by an increase in false-positives and loss of specificity.
A recent paper by Andrews and Hemberg address the potential drawbacks with imputation in a very concise and clear way using both simulated and real-world data. Figure 1 (below) from this paper shows very clearly the perils of doing imputation on false positive rates and spurious gene-gene correlations.
Performance on simulated scRNA-seq data
Somewhat dramatically, DrImpute and MAGIC introduce strong false positive correlations, while SAVER only strengthens existing correlations between lowly expressed DE genes. As you can see in part B of this figure below, parameter tuning also has a dramatic effect on the false positive rate in some cases. Increasing the k-neighbors for MAGIC and KNN methods increases smoothing and also false positives. SAVER and scImpute are relatively immune to changes in FPR with parameter space.
You can’t have your cake and eat it, too
In this next figure, the authors look at the trade-off between sensitivity and specificity in imputation methods on simulated datasets. It shows clearly that any improvements to sensitivity of DE gene detection come at a significant cost of specificity, and vice versa.
The authors go on to show that on real data, every method including SAVER generates large numbers of false positives. In summary, imputation, while potentially promising, is limited owing to the lack of an independent reference (as in the case of GWAS imputation methods) to impute from. Since single-cell imputation methods rely only on the dataset itself, one cannot escape the sensitivity/specificity tradeoff and false-positive problem.
The ubiquitous RNAseq analysis package, DESeq2, is a very useful and convenient way to conduct DE gene analyses. However, it lacks some useful plotting tools. For example, there is no convenience function in the library for making nice-looking boxplots from normalized gene expression data.
There are other packages one can rely on, for example ‘pcaExplorer’, but I like a simple approach sometimes to plot just a couple of genes. So below I show you how to quickly plot your favorite gene using only ggplot2 (there is no “one weird trick” though…):
traf1_counts <- counts(ddsTxi['ENSG00000056558',], normalized = TRUE) m <- list(counts = as.numeric(traf1_counts), group = as.factor(samples$group)) m <- as.tibble(m) q <- ggplot(m, aes(group, counts)) + geom_boxplot() + geom_jitter(width = 0.1) q <- q + labs(x = "Experimental Group", y = "Normalized Counts ", title = "Normalized Expression of TRAF1") q <- q + scale_x_discrete(labels=c("00hn" = "PMN, 0hrs", "06hn" = "PMN, 6hrs", "06hy" = "PMN+Hp, 6hrs", "24hn" = "PMN, 24hrs", "24hy" = "PMN+Hp, 24hrs")) q
As you can see above, first we must grab the normalized counts at the row corresponding with the Traf1 Ensembl ID using the ‘
counts‘ function that operates on the ‘ddsTxi’
In order to create a dataframe (well, a tibble to be specific) for plotting, we first create a list (‘m’) that combines the counts (as a numeric vector) and metadata group. These two vectors will form the columns of the tibble for plotting, and we must give them names (i.e., “counts” and “group”) so the tibble conversion doesn’t complain.
The list, m, is then converted to a tibble with ‘
as.tibble‘ and plotted with ggplot2, using an ‘
aes(group,counts)‘ aesthetic plus a boxplot aesthetic. The rest of the code is just modifying axis labels and tickmarks. The final product looks like this:
I recently ran across a situation that I think is going to be increasingly common as I do more and more single-cell analyses. Specifically, I had a project where the investigator had several experiments in related conditions that they want to merge and evaluate with a pseudotime analysis. I could not find any useful tools within Monocle itself for merging data (please correct me in the comments if I’m missing something). It looks as if you have to import a pre-merged seurat dataset.
Here is the workaround that I found [please note these commands are for Seurat v2, they will likely *not* work in v3]:
naive.data <- Read10X(data.dir = paste0(base_dir, "naive_outs/filtered_gene_bc_matrices/mm10/")) naive <- CreateSeuratObject(raw.data = naive.data, min.cells = 3, min.genes = 200, project = "10X_naive") lcmv.data <- Read10X(data.dir = paste0(base_dir, "lcmv_outs/filtered_gene_bc_matrices/mm10/")) lcmv <- CreateSeuratObject(raw.data = lcmv.data, min.cells = 3, min.genes = 200, project = "10X_lcmv") pygp.data <- Read10X(data.dir = paste0(base_dir, "pygp_outs/filtered_gene_bc_matrices/mm10/")) pygp <- CreateSeuratObject(raw.data = pygp.data, min.cells = 3, min.genes = 200, project = "10X_pygp") cd4.combined <- MergeSeurat(object1 = naive, object2 = lcmv, add.cell.id1 = "naive", add.cell.id2 = "lcmv") cd4.combined <- AddSamples(object = cd4.combined, new.data = pygp.data, add.cell.id = "pygp") email@example.com <- as.matrix(firstname.lastname@example.org) cd4.monocle <- importCDS(cd4.combined, import_all = TRUE)
Here, I am reading in 10X data using Seurat (v2) w/ the
Read10X function and then creating the Seurat object with
Once this done I use
MergeSeurat to merge the first two experiments, and then
AddSamples to add in the final experiment. Then we can take advantage of the monocle function
importCDS to import the combined object into monocle.
Now there is one final problem and that is that the “orig.ident” field is blank:
To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (i.e., “naive_AAACTGAGAAACCGA”). We just need to split these and recover which experiment each cell came from with:
cellnames <- rownames(pData(cd4.monocle)) orig_ident <- sapply(strsplit(cellnames, split = "_"), '[', 1) pData(cd4.monocle)$orig.ident <- orig_ident
We do a
strsplit on the cellnames, splitting on underscore. The first value from the split in each case is assigned back into the ‘orig.ident’ field of the cell dataset object.
Now you’re ready to continue with the normal downstream analysis in monocle. With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged seurat dataset):
library(cowplot) p1 <- plot_cell_clusters(cd4.monocle, color_by = 'orig.ident') p2 <- plot_cell_clusters(cd4.monocle) #color_by cluster is default behavior plot_grid(p1,p2)
And we get: