Bcbio RNA-seq ‘under the hood’

Bcbio is a configuration-based pipeline manager for common NGS workflows. It uses a YAML-config file to set all of the inputs and specifications for pipeline. I’ve used bcbio for dozens of RNA-seq projects, but I’ve never known exactly what it is doing during the pipeline itself. This is because in order to see the exact commands being run you have to either dig into the code, or dig through the log files.

Digging through code is difficult because the code base is large and there are many different pieces of code that call each other. Digging through the logs is difficult when there are dozens of samples (each command is repeated dozens of times, leading to log files with thousands of lines). Well, I finally gave in and sorted through the RNA-seq pipeline command logs to identify the unique steps that bcbio (version 1.0.8) is performing in order to produce its results. I was able to identify 21 unique steps that are performed on each sample.

The difficulty of figuring out exactly what a configuration-based pipeline like bcbio is going to do is one argument in favor of using software like snakemake or nextflow to create or adapt existing pipelines, where the actual steps in the pipeline are made very explicit in “process” blocks. I’m going to be writing more about NextFlow in upcoming posts.

Of these 21 steps, 17 steps all deal with creating a BAM file and then manipulating that BAM file or calculating something about the BAM file. The remainder mainly deal with pseudo-alignment using salmon. It’s somewhat ironic that most of the pipeline and computational time is taken up with creating and manipulating BAM files since I only ever use the salmon pseudo-alignments in my downstream analysis.

Here are the 21 steps of the bcbio RNA-seq workflow (I’ve deleted the long, user-specific file paths to show just the commands):

Step 1. Align with Hisat2

hisat2 --new-summary -x bcbio-1.0.8/genomes/Hsapiens/hg38/hisat2/hg38 -p 16 --phred33 --rg-id SW872_CAMTA1_rep1 --rg PL:illumina --rg PU:1_2019-03-11_to_setup_bcbio --rg SM:SW872_CAMTA1_rep1 -1 SW872_CAMTA1_rep1_R1.fastq.gz -2 SW872_CAMTA1_rep1_R2.fastq.gz
--known-splicesite-infile bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts-splicesites.txt 

Step 2/3. Pipe to bamsormadup and redirect to sorted BAM

| bamsormadup inputformat=sam threads=12 tmpfile=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort-sorttmp-markdup 
SO=coordinate indexfilename=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam.bai >  work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam

Step 4. Index BAM

samtools index -@ 16 work/align/SW872_TAZ4SA_rep3/SW872_TAZ4SA_rep3-sort.bam /work/bcbiotx/tmpsqOnZQ/SW872_TAZ4SA_rep3-sort.bam.bai

Step 5. Samtools sort by read names

samtools sort -@ 16 -m 2457M -O BAM -n -T work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted-sort -o /work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted.bam /work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam

Step 6. Sambamba view to select only primary alignments

sambamba view -t 16 -f bam -F "not secondary_alignment" work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.bam> work/bcbiotx/tmp0zhZuj/SW872_CAMTA1_rep1-sort.nsorted.primary.bam

Step 7. FeatureCounts to count primary alignments in BAM

featureCounts -a /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf -o work/bcbiotx/tmp77coEk/SW872_CAMTA1_rep1.counts -s 0 -p -B -C work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.primary.bam

Step 8. Gffread to write a fasta file with spliced exons

gffread -g /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/seq/hg38.fa -w work/bcbiotx/tmpNpBGRC/hg38.fa.tmp /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf

Step 9. Build the salmon index

salmon index -k 31 -p 16 -i /work/bcbiotx/tmpTQDS7X/hg38 -t work/inputs/transcriptome/hg38.fa

Step 10. Pseudo-alignment and quantification

salmon quant -l IU -i work/salmon/index/hg38 -p 16 --gcBias -o work/bcbiotx/tmpE_RRDN/quant   -1 <(gzip -cd /merged/SW872_CAMTA1_rep1_R1.fastq.gz) -2 <(gzip -cd /merged/SW872_CAMTA1_rep1_R2.fastq.gz) --numBootstraps 30

Step 11. Convert salmon output to sleuth format

Rscript -e 'library("wasabi"); prepare_fish_for_sleuth(c("work/bcbiotx/tmpE_RRDN/quant"))'

Step 12. Downsample BAM file with samtools view

samtools view -O BAM -@ 16 -o work/bcbiotx/tmphaXqSf/SW872_CAMTA1_rep1-sort-downsample.bam -s 42.269 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam

Step 13. FASTQC on downsampled BAM

export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH &&  /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/fastqc -d work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -t 16 --extract -o work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -f bam work/qc/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-downsample.bam

Step 14. Run Qualimap RNAseq on BAM

unset DISPLAY && export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH &&  /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/qualimap rnaseq -outdir work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1 -a proportional -bam work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam -p non-strand-specific -gtf /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf --java-mem-size=59g

Step 15. A SED command (not sure exactly what it does)

sed -i 's/bam file = .*/bam file = SW872_CAMTA1_rep1.bam/' work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1/rnaseq_qc_results.txt

Step 16. Mark duplicates on the BAM file

bammarkduplicates tmpfile=work/bcbiotx/tmpNdl3wy/SW872_CAMTA1_rep1-sort-dedup-markdup markthreads=16 I=work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam O=work/bcbiotx/tmprVQeKM/SW872_CAMTA1_rep1-sort-dedup.bam

Step 17. Index de-duplicated BAM file

samtools index -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam work/bcbiotx/tmpFAzLLT/SW872_CAMTA1_rep1-sort-dedup.bam.bai

Step 18. Use Sambamba view to create duplicate metrics

sambamba view --nthreads 16 --count -F 'duplicate and not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt

Step 19. Use Sambamba to create mapping metrics

sambamba view --nthreads 16 --count -F 'not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt

Step 20. Samtools stats on sorted BAM

samtools stats -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > /work/bcbiotx/tmpUPSiOz/SW872_CAMTA1_rep1.txt

Step 21. Samtools idxstats on sorted BAM

samtools idxstats work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > work/bcbiotx/tmpSKFNZQ/SW872_CAMTA1_rep1-idxstats.txt

Sequencing depth for accurate SNP calling: bcbio case study

Intuitively, it is easy to grasp that the more sequencing depth (i.e., the greater the number of reads covering any given position in the genome) the more accurate the calling of SNPs and indels (insertions/deletions).   But how much difference does this actually make in the real world?  Is 20X coverage dramatically worse than 30X (considered a standard coverage depth on genomes)?

To find out, I conducted an experiment with the bcbio pipeline, a bioinformatics pipeline solution built in python that allows for automated and reproducible analyses on high-performance computing clusters.  One feature of bcbio is that it can perform validation surveys using high-confidence consensus calls from reference genomes like the NA12878 Coriell sample (from the Genome in a Bottle project).
For NA12878, researchers collated consensus SNP and indel calls from a large variety of sequencing technologies and calling methods to produce a very high-confidence callset for training other methods or validating a sequencing workflow.  bcbio includes these variant calls and can easily be setup to validate these calls against a sequenced NA12878 genome.

The sequencing depth experiment

I started with a NA12878 genome sequenced to 30X sequencing depth.  To compare shallower depths, I subsampled the data to generate 20X, 10X, etc…  [Please note: data was not subsetted randomly, rather “slices” were taken from the 30X dataset] To look at a 60X coverage datapoint, I combined data from two sequencing runs on both flow cells of a HiSeq4000 instrument.

The results after validation are shown in Figure 1 (depth of coverage is along the x-axis):

sequencing depth
Fig 1. SNP discovery as a function of increasing coverage of the GIAB validation sample.

 

The figure shows that, as expected, when sequencing depth decreases  the error rate increases, and SNP discovery declines.   It also makes the case for the commonly held view that 30X is enough coverage for genomes, since going to 60X leads to almost unnoticeable improvement in the % found and a slight increase in error.  Performance really degrades at 12X and below, with poor discovery rates and unacceptably high error rates.

I will be submitting a short manuscript to biorxiv.org soon describing this work in more detail.