Bcbio is a configuration-based pipeline manager for common NGS workflows. It uses a YAML-config file to set all of the inputs and specifications for pipeline. I’ve used bcbio for dozens of RNA-seq projects, but I’ve never known exactly what it is doing during the pipeline itself. This is because in order to see the exact commands being run you have to either dig into the code, or dig through the log files.
Digging through code is difficult because the code base is large and there are many different pieces of code that call each other. Digging through the logs is difficult when there are dozens of samples (each command is repeated dozens of times, leading to log files with thousands of lines). Well, I finally gave in and sorted through the RNA-seq pipeline command logs to identify the unique steps that bcbio (version 1.0.8) is performing in order to produce its results. I was able to identify 21 unique steps that are performed on each sample.
The difficulty of figuring out exactly what a configuration-based pipeline like bcbio is going to do is one argument in favor of using software like snakemake or nextflow to create or adapt existing pipelines, where the actual steps in the pipeline are made very explicit in “process” blocks. I’m going to be writing more about NextFlow in upcoming posts.
Of these 21 steps, 17 steps all deal with creating a BAM file and then manipulating that BAM file or calculating something about the BAM file. The remainder mainly deal with pseudo-alignment using salmon. It’s somewhat ironic that most of the pipeline and computational time is taken up with creating and manipulating BAM files since I only ever use the salmon pseudo-alignments in my downstream analysis.
Here are the 21 steps of the bcbio RNA-seq workflow (I’ve deleted the long, user-specific file paths to show just the commands):
Step 1. Align with Hisat2
hisat2 --new-summary -x bcbio-1.0.8/genomes/Hsapiens/hg38/hisat2/hg38 -p 16 --phred33 --rg-id SW872_CAMTA1_rep1 --rg PL:illumina --rg PU:1_2019-03-11_to_setup_bcbio --rg SM:SW872_CAMTA1_rep1 -1 SW872_CAMTA1_rep1_R1.fastq.gz -2 SW872_CAMTA1_rep1_R2.fastq.gz --known-splicesite-infile bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts-splicesites.txt
Step 2/3. Pipe to bamsormadup and redirect to sorted BAM
| bamsormadup inputformat=sam threads=12 tmpfile=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort-sorttmp-markdup SO=coordinate indexfilename=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam.bai > work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam
Step 4. Index BAM
samtools index -@ 16 work/align/SW872_TAZ4SA_rep3/SW872_TAZ4SA_rep3-sort.bam /work/bcbiotx/tmpsqOnZQ/SW872_TAZ4SA_rep3-sort.bam.bai
Step 5. Samtools sort by read names
samtools sort -@ 16 -m 2457M -O BAM -n -T work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted-sort -o /work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted.bam /work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam
Step 6. Sambamba view to select only primary alignments
sambamba view -t 16 -f bam -F "not secondary_alignment" work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.bam> work/bcbiotx/tmp0zhZuj/SW872_CAMTA1_rep1-sort.nsorted.primary.bam
Step 7. FeatureCounts to count primary alignments in BAM
featureCounts -a /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf -o work/bcbiotx/tmp77coEk/SW872_CAMTA1_rep1.counts -s 0 -p -B -C work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.primary.bam
Step 8. Gffread to write a fasta file with spliced exons
gffread -g /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/seq/hg38.fa -w work/bcbiotx/tmpNpBGRC/hg38.fa.tmp /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf
Step 9. Build the salmon index
salmon index -k 31 -p 16 -i /work/bcbiotx/tmpTQDS7X/hg38 -t work/inputs/transcriptome/hg38.fa
Step 10. Pseudo-alignment and quantification
salmon quant -l IU -i work/salmon/index/hg38 -p 16 --gcBias -o work/bcbiotx/tmpE_RRDN/quant -1 <(gzip -cd /merged/SW872_CAMTA1_rep1_R1.fastq.gz) -2 <(gzip -cd /merged/SW872_CAMTA1_rep1_R2.fastq.gz) --numBootstraps 30
Step 11. Convert salmon output to sleuth format
Rscript -e 'library("wasabi"); prepare_fish_for_sleuth(c("work/bcbiotx/tmpE_RRDN/quant"))'
Step 12. Downsample BAM file with samtools view
samtools view -O BAM -@ 16 -o work/bcbiotx/tmphaXqSf/SW872_CAMTA1_rep1-sort-downsample.bam -s 42.269 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam
Step 13. FASTQC on downsampled BAM
export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH && /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/fastqc -d work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -t 16 --extract -o work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -f bam work/qc/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-downsample.bam
Step 14. Run Qualimap RNAseq on BAM
unset DISPLAY && export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH && /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/qualimap rnaseq -outdir work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1 -a proportional -bam work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam -p non-strand-specific -gtf /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf --java-mem-size=59g
Step 15. A SED command (not sure exactly what it does)
sed -i 's/bam file = .*/bam file = SW872_CAMTA1_rep1.bam/' work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1/rnaseq_qc_results.txt
Step 16. Mark duplicates on the BAM file
bammarkduplicates tmpfile=work/bcbiotx/tmpNdl3wy/SW872_CAMTA1_rep1-sort-dedup-markdup markthreads=16 I=work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam O=work/bcbiotx/tmprVQeKM/SW872_CAMTA1_rep1-sort-dedup.bam
Step 17. Index de-duplicated BAM file
samtools index -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam work/bcbiotx/tmpFAzLLT/SW872_CAMTA1_rep1-sort-dedup.bam.bai
Step 18. Use Sambamba view to create duplicate metrics
sambamba view --nthreads 16 --count -F 'duplicate and not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt
Step 19. Use Sambamba to create mapping metrics
sambamba view --nthreads 16 --count -F 'not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt
Step 20. Samtools stats on sorted BAM
samtools stats -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > /work/bcbiotx/tmpUPSiOz/SW872_CAMTA1_rep1.txt
Step 21. Samtools idxstats on sorted BAM
samtools idxstats work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > work/bcbiotx/tmpSKFNZQ/SW872_CAMTA1_rep1-idxstats.txt
2 Replies to “Bcbio RNA-seq ‘under the hood’”
I have a few questions regarding the bam files. Which folder has the final .bam files ready to move forward (to see visualization, coverage)? I see there is a work directory created which has align folder with sample_1 folder. It contains sample_1-downsample-dedup.bam and sample_1-downsample-dedup.bam.bai and other log files. What does downsample and dedup in files mean here? I had run read 1 and read 2 of fastq files while processing.
It generated two other folders sample_1_star and sample_1_STARgenome which contains sample_1.nsorted.bam, sample_1.nsorted.primary.bam, sample_1.bam and sample_1.transcriptome.bam.
Could you please specify what does each file mean and what’s the correct file to move forward with?
Hi, thanks for your interest. I am not a developer or maintainer of ‘bcbio-nextgen’. I would direct these questions to the github issues page or via email to the authors.