Bcbio RNA-seq ‘under the hood’

Bcbio is a configuration-based pipeline manager for common NGS workflows. It uses a YAML-config file to set all of the inputs and specifications for pipeline. I’ve used bcbio for dozens of RNA-seq projects, but I’ve never known exactly what it is doing during the pipeline itself. This is because in order to see the exact commands being run you have to either dig into the code, or dig through the log files.

Digging through code is difficult because the code base is large and there are many different pieces of code that call each other. Digging through the logs is difficult when there are dozens of samples (each command is repeated dozens of times, leading to log files with thousands of lines). Well, I finally gave in and sorted through the RNA-seq pipeline command logs to identify the unique steps that bcbio (version 1.0.8) is performing in order to produce its results. I was able to identify 21 unique steps that are performed on each sample.

The difficulty of figuring out exactly what a configuration-based pipeline like bcbio is going to do is one argument in favor of using software like snakemake or nextflow to create or adapt existing pipelines, where the actual steps in the pipeline are made very explicit in “process” blocks. I’m going to be writing more about NextFlow in upcoming posts.

Of these 21 steps, 17 steps all deal with creating a BAM file and then manipulating that BAM file or calculating something about the BAM file. The remainder mainly deal with pseudo-alignment using salmon. It’s somewhat ironic that most of the pipeline and computational time is taken up with creating and manipulating BAM files since I only ever use the salmon pseudo-alignments in my downstream analysis.

Here are the 21 steps of the bcbio RNA-seq workflow (I’ve deleted the long, user-specific file paths to show just the commands):

Step 1. Align with Hisat2

Step 2/3. Pipe to bamsormadup and redirect to sorted BAM

Step 4. Index BAM

Step 5. Samtools sort by read names

Step 6. Sambamba view to select only primary alignments

Step 7. FeatureCounts to count primary alignments in BAM

Step 8. Gffread to write a fasta file with spliced exons

Step 9. Build the salmon index

Step 10. Pseudo-alignment and quantification

Step 11. Convert salmon output to sleuth format

Step 12. Downsample BAM file with samtools view

Step 13. FASTQC on downsampled BAM

Step 14. Run Qualimap RNAseq on BAM

Step 15. A SED command (not sure exactly what it does)

Step 16. Mark duplicates on the BAM file

Step 17. Index de-duplicated BAM file

Step 18. Use Sambamba view to create duplicate metrics

Step 19. Use Sambamba to create mapping metrics

Step 20. Samtools stats on sorted BAM

Step 21. Samtools idxstats on sorted BAM

Leave a Reply

Your email address will not be published.