I’m thrilled to report that I’ve been promoted to the position of Director of our bioinformatics group here at the University of Iowa. We are within the Iowa Institute of Human Genetics (IIHG) and we support clinical activities in the institute, but also a wide array of research collaborations across the University.
I have a lot of goals and ideas for the group and look forward to working to implement those going forward. I may not be able to write posts here as often, but I’ll try to keep up with it. We also have a new twitter account: @iowabioinfo. Please follow us there.
Bcbio is a configuration-based pipeline manager for common NGS workflows. It uses a YAML-config file to set all of the inputs and specifications for pipeline. I’ve used bcbio for dozens of RNA-seq projects, but I’ve never known exactly what it is doing during the pipeline itself. This is because in order to see the exact commands being run you have to either dig into the code, or dig through the log files.
Digging through code is difficult because the code base is large and there are many different pieces of code that call each other. Digging through the logs is difficult when there are dozens of samples (each command is repeated dozens of times, leading to log files with thousands of lines). Well, I finally gave in and sorted through the RNA-seq pipeline command logs to identify the unique steps that bcbio (version 1.0.8) is performing in order to produce its results. I was able to identify 21 unique steps that are performed on each sample.
The difficulty of figuring out exactly what a configuration-based pipeline like bcbio is going to do is one argument in favor of using software like snakemake or nextflow to create or adapt existing pipelines, where the actual steps in the pipeline are made very explicit in “process” blocks. I’m going to be writing more about NextFlow in upcoming posts.
Of these 21 steps, 17 steps all deal with creating a BAM file and then manipulating that BAM file or calculating something about the BAM file. The remainder mainly deal with pseudo-alignment using salmon. It’s somewhat ironic that most of the pipeline and computational time is taken up with creating and manipulating BAM files since I only ever use the salmon pseudo-alignments in my downstream analysis.
Here are the 21 steps of the bcbio RNA-seq workflow (I’ve deleted the long, user-specific file paths to show just the commands):
Step 1. Align with Hisat2
hisat2 --new-summary -x bcbio-1.0.8/genomes/Hsapiens/hg38/hisat2/hg38 -p 16 --phred33 --rg-id SW872_CAMTA1_rep1 --rg PL:illumina --rg PU:1_2019-03-11_to_setup_bcbio --rg SM:SW872_CAMTA1_rep1 -1 SW872_CAMTA1_rep1_R1.fastq.gz -2 SW872_CAMTA1_rep1_R2.fastq.gz --known-splicesite-infile bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts-splicesites.txt
Step 2/3. Pipe to bamsormadup and redirect to sorted BAM
| bamsormadup inputformat=sam threads=12 tmpfile=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort-sorttmp-markdup SO=coordinate indexfilename=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam.bai > work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam
Step 4. Index BAM
samtools index -@ 16 work/align/SW872_TAZ4SA_rep3/SW872_TAZ4SA_rep3-sort.bam /work/bcbiotx/tmpsqOnZQ/SW872_TAZ4SA_rep3-sort.bam.bai
Step 5. Samtools sort by read names
samtools sort -@ 16 -m 2457M -O BAM -n -T work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted-sort -o /work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted.bam /work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam
Step 6. Sambamba view to select only primary alignments
sambamba view -t 16 -f bam -F "not secondary_alignment" work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.bam> work/bcbiotx/tmp0zhZuj/SW872_CAMTA1_rep1-sort.nsorted.primary.bam
Step 7. FeatureCounts to count primary alignments in BAM
featureCounts -a /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf -o work/bcbiotx/tmp77coEk/SW872_CAMTA1_rep1.counts -s 0 -p -B -C work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.primary.bam
Step 8. Gffread to write a fasta file with spliced exons
gffread -g /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/seq/hg38.fa -w work/bcbiotx/tmpNpBGRC/hg38.fa.tmp /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf
Step 9. Build the salmon index
salmon index -k 31 -p 16 -i /work/bcbiotx/tmpTQDS7X/hg38 -t work/inputs/transcriptome/hg38.fa
Step 10. Pseudo-alignment and quantification
salmon quant -l IU -i work/salmon/index/hg38 -p 16 --gcBias -o work/bcbiotx/tmpE_RRDN/quant -1 <(gzip -cd /merged/SW872_CAMTA1_rep1_R1.fastq.gz) -2 <(gzip -cd /merged/SW872_CAMTA1_rep1_R2.fastq.gz) --numBootstraps 30
Step 11. Convert salmon output to sleuth format
Rscript -e 'library("wasabi"); prepare_fish_for_sleuth(c("work/bcbiotx/tmpE_RRDN/quant"))'
Step 12. Downsample BAM file with samtools view
samtools view -O BAM -@ 16 -o work/bcbiotx/tmphaXqSf/SW872_CAMTA1_rep1-sort-downsample.bam -s 42.269 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam
Step 13. FASTQC on downsampled BAM
export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH && /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/fastqc -d work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -t 16 --extract -o work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -f bam work/qc/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-downsample.bam
Step 14. Run Qualimap RNAseq on BAM
unset DISPLAY && export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH && /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/qualimap rnaseq -outdir work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1 -a proportional -bam work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam -p non-strand-specific -gtf /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf --java-mem-size=59g
Step 15. A SED command (not sure exactly what it does)
sed -i 's/bam file = .*/bam file = SW872_CAMTA1_rep1.bam/' work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1/rnaseq_qc_results.txt
Step 16. Mark duplicates on the BAM file
bammarkduplicates tmpfile=work/bcbiotx/tmpNdl3wy/SW872_CAMTA1_rep1-sort-dedup-markdup markthreads=16 I=work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam O=work/bcbiotx/tmprVQeKM/SW872_CAMTA1_rep1-sort-dedup.bam
Step 17. Index de-duplicated BAM file
samtools index -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam work/bcbiotx/tmpFAzLLT/SW872_CAMTA1_rep1-sort-dedup.bam.bai
Step 18. Use Sambamba view to create duplicate metrics
sambamba view --nthreads 16 --count -F 'duplicate and not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt
Step 19. Use Sambamba to create mapping metrics
sambamba view --nthreads 16 --count -F 'not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt
Step 20. Samtools stats on sorted BAM
samtools stats -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > /work/bcbiotx/tmpUPSiOz/SW872_CAMTA1_rep1.txt
Step 21. Samtools idxstats on sorted BAM
samtools idxstats work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > work/bcbiotx/tmpSKFNZQ/SW872_CAMTA1_rep1-idxstats.txt
Seurat is a popular R/Bioconductor package for working with single-cell RNA-seq data. As part of the very first steps of filtering and quality-controlling scRNA-seq data in Seurat, you calculate the % mitochondrial gene expression in each cell, and filter out cells above a threshold. The tutorial provides the following code for doing this in human cells:
mito.genes = grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE) percent.mito = Matrix::colSums(email@example.com[mito.genes, ])/Matrix::colSums(firstname.lastname@example.org) pbmc = AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito") VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
Creating a catalog of mitochondrial genes by searching with ‘grep’ for any gene names that start with “MT-” works just fine for the human reference transcriptome. Unfortunately, it doesn’t work for mouse (at least for mm10, which is the reference assembly I’m working with). There are two workarounds for this, in my opinion.
The easiest is to change the regular expression in the “grep” command from “^MT-” to “^mt-” since a search through the mm10 reference (version 3.0.0) in the cellranger reference files reveals that for whatever reason, the MT genes are labeled with lowercase ‘mt’ instead.
A second, and perhaps more thorough, approach is to take advantage of the Broad Institute’s “Mouse Mitocarta 2.0” encyclopedia of mitochondrial genes (note that you could do this same procedure for human MT genes too).
By creating a list of the top 100-200 genes with the strongest evidence for MT expression, it seems likely that you more accurately capture true mitochondrial gene expression. Below is some code to use the “MitoCarta 2.0” (downloaded as a CSV file) for this procedure. You will need to import “tidyverse” to work with tibbles:
library(tidyverse) library(seurat) mouse_mito = as.tibble(read.csv("Mouse.MitoCarta2.0_page2.csv", header = TRUE)) mouse_mito = mouse_mito %>% select(c(Symbol, MCARTA2.0_score)) %>% slice(1:100) mito.genes = as.character(mouse_mito$Symbol) mito.genes = mito.genes[mito.genes %in% rownames(email@example.com)] percent.mito = Matrix::colSums(firstname.lastname@example.org[mito.genes,]) / Matrix::colSums(email@example.com)
Single-cell RNA-Seq methods, which sequence and barcode the transcripts within individual cells in a sample, hold enormous promise for understanding transcriptional networks in development and disease. Single-cell investigation of biological phenomena is taking the life sciences world by storm. For example, Science magazine selected single-cell methods as the 2018 “Breakthrough of the Year.”
Closer to home, our bioinformatics group here at the University of Iowa is also seeing a rapid increase in the number of scRNA-seq projects in the research pipeline. Yet with all of this interest and funding, scRNA-seq is still an emerging field with little agreement on best practices.
We see evidence of this when considering one of the main problems of scRNA-seq datasets: dropouts. ‘Dropouts’ are zero-values in the data arising from technical and biological noise. Often the dropout rate can reach up to 90% or more, degrading the ability of the analysis to detect fine structure in the data and low- and moderately expressed DE genes between cell types.
One way to combat this problem is to borrow information across genes within a sample and use that to predict imputed expression values for the missing genes. Another related approach is called data ‘smoothing,’ that attempts to lower the noise in observed values. There are several methods (MAGIC, scImpute, DrImpute, and SAVER) that have been published recently that attempt to do one or both of these approaches. While the authors of each method focus on the advantages of imputation, there can also be drawbacks caused by an increase in false-positives and loss of specificity.
A recent paper by Andrews and Hemberg address the potential drawbacks with imputation in a very concise and clear way using both simulated and real-world data. Figure 1 (below) from this paper shows very clearly the perils of doing imputation on false positive rates and spurious gene-gene correlations.
Performance on simulated scRNA-seq data
Somewhat dramatically, DrImpute and MAGIC introduce strong false positive correlations, while SAVER only strengthens existing correlations between lowly expressed DE genes. As you can see in part B of this figure below, parameter tuning also has a dramatic effect on the false positive rate in some cases. Increasing the k-neighbors for MAGIC and KNN methods increases smoothing and also false positives. SAVER and scImpute are relatively immune to changes in FPR with parameter space.
You can’t have your cake and eat it, too
In this next figure, the authors look at the trade-off between sensitivity and specificity in imputation methods on simulated datasets. It shows clearly that any improvements to sensitivity of DE gene detection come at a significant cost of specificity, and vice versa.
The authors go on to show that on real data, every method including SAVER generates large numbers of false positives. In summary, imputation, while potentially promising, is limited owing to the lack of an independent reference (as in the case of GWAS imputation methods) to impute from. Since single-cell imputation methods rely only on the dataset itself, one cannot escape the sensitivity/specificity tradeoff and false-positive problem.