Perturbation analysis of spatial single cell RNA-seq with ‘augur’

Spatial single cell RNA-seq data are essentially regular single-cell RNA-seq data that have spatial coordinates associated through localization on a special capture slide. I had previously used so-called “perturbation” analysis successfully with 10X single-cell data, and I wanted to apply the technique to spatial data to understand how a treatment affects the spatially resolved clusters.

Here, I want to briefly describe the steps I went through to perform ‘augur’ perturbation analysis of 10X Visium spatial single cell RNA-seq data. Augur works as follows:

Augur is an R package to prioritize cell types involved in the response to an experimental perturbation within high-dimensional single-cell data. The intuition underlying Augur is that cells undergoing a profound response to a given experimental stimulus become more separable, in the space of molecular measurements, than cells that remain unaffected by the stimulus. Augur quantifies this separability by asking how readily the experimental sample labels associated with each cell (e.g., treatment vs. control) can be predicted from molecular measurements alone. This is achieved by training a machine-learning model specific to each cell type, to predict the experimental condition from which each individual cell originated. The accuracy of each cell type-specific classifier is evaluated in cross-validation, providing a quantitative basis for cell type prioritization.

I followed both the Seurat 10X Visium vignette and a dataset integration protocol to combine two samples (S1 and S2): one treatment (a gene knockout, in this case) and one control. Normalization was performed with “SCTransform”, as recommended for spatial RNA-seq data, prior to integration. PCA, K-nearest neighbors, clustering, and UMAP were calculated as described in the Seurat vignette using default values. Cell types were assigned in collaboration with the experimentalists.

With the integrated, clustered, and assigned dataset in hand, I was ready to enter the “augur” workflow as described in the paper, with some minor tweaks. First, because this is spatial and not regular scRNA-seq, there is no “RNA” default assay to set after integration. I chose to set “SCT” as the assay instead, because it holds the normalized and scaled data, which is what you want as input to an ML model.

```{r, celltype_priority}
library(Seurat)
library(Augur)

# Use the SCTransform-normalized assay as classifier input
DefaultAssay(s1s2.int) <- "SCT"

# Train one classifier per cell type to predict the sample label
augur <- Augur::calculate_auc(s1s2.int,
                              label_col = "orig.ident",
                              cell_type_col = "cell_type",
                              n_threads = 6,
                              rf_params = list(trees = 15, mtry = 2, min_n = NULL, importance = "accuracy"),
                              n_subsamples = 25)
```

Above, you can see the actual call to Augur’s “calculate_auc” function. I found that by specifying ‘rf_params’ and reducing the number of trees, I got better separation between cell types in the AUC readout. The calculation takes about 20 minutes to run on a 13-inch 2018 MacBook Pro.

When the algorithm completes, you can visualize your results. Using the vignette for regular scRNA-seq you can do this:

library(ggplot2)    # geom_point() and ggtitle()
library(patchwork)  # combine plots with `+`
# UMAP colored by Augur AUC
p1 <- plot_umap(augur, s1s2.int, mode = "default", palette = "Spectral")
p1 <- p1 + geom_point(size = 0.1) + ggtitle("Augur Perturbation by Type (Red = Most)")
# UMAP colored by assigned cell type
p2 <- DimPlot(s1s2.int, reduction = "umap", group.by = "cell_type") + ggtitle("S1/S2 Integrated Cell Types")
p1 + p2

The resulting plot looks like this:

Augur perturbation analysis by AUC (red is more perturbed; left) and UMAP plot of cell types (right).

This is great and helpful, but it doesn’t take advantage of the spatially resolved nature of the data. To do that, you have to add the augur results to the integrated Seurat object as metadata:

### Load the helpers used below 
library(dplyr)
library(tibble)

### Make a dataframe of AUC results 
### (the rank assumes augur$AUC is already sorted from highest to lowest AUC) 
auc_tab <- augur$AUC
auc_tab$rank <- seq_len(nrow(auc_tab))

### Grab the cells by type and barcode 
tib <- s1s2.int$cell_type %>% as_tibble(rownames = "Barcode") %>% rename(cell_type = value)

### Join the AUC information to the barcode on cell_type 
tib <- tib %>% left_join(auc_tab, by = "cell_type")

### Sanity check: stop if the barcodes are not in the same order as the object 
stopifnot(identical(colnames(s1s2.int), tib$Barcode))

### Update the seurat object with new augur metadata 
s1s2.int$AUC <- round(tib$auc, 3) 
s1s2.int$RANK <- tib$rank

Here, I am simply pulling out the AUC results into a table by cell type. Then I get the cell type information from the Seurat object and merge the AUC information into it. I do a sanity check to make sure the barcodes match (they do, as expected). Finally, I set new metadata on the Seurat object to transfer the AUC and rank to each barcode (i.e., cell).

Now you can plot the spatially resolved AUC information:

SpatialDimPlot(s1s2.int, group.by = "AUC", cols = rev(c("#D73027", "#F46D43", "#FDAE61", "#FEE090", "#FFFFBF", "#E0F3F8", "#ABD9E9", "#74ADD1", "#4575B4")))

This takes advantage of the “group.by” argument of SpatialDimPlot to use the AUC metadata. I’m also using a custom color scheme from ColorBrewer, reversed so that it shades the cell types from low AUC (blue) to high AUC (red) for ease of viewing. The plot looks like this:

Spatially-resolved perturbation (AUC) of cell clusters in the WT (left) and knockout (right) samples.

New job: Director of IIHG Bioinformatics

I’m thrilled to report that I’ve been promoted to the position of Director of our bioinformatics group here at the University of Iowa. We are within the Iowa Institute of Human Genetics (IIHG) and we support clinical activities in the institute, but also a wide array of research collaborations across the University.

I have a lot of goals and ideas for the group and look forward to implementing them. I may not be able to write posts here as often, but I’ll try to keep up with it. We also have a new Twitter account: @iowabioinfo. Please follow us there.

Bcbio RNA-seq ‘under the hood’

Bcbio is a configuration-based pipeline manager for common NGS workflows. It uses a YAML config file to set all of the inputs and specifications for the pipeline. I’ve used bcbio for dozens of RNA-seq projects, but I’ve never known exactly what it is doing during the pipeline itself. This is because in order to see the exact commands being run you have to either dig into the code or dig through the log files.

Digging through code is difficult because the code base is large and there are many different pieces of code that call each other. Digging through the logs is difficult when there are dozens of samples (each command is repeated dozens of times, leading to log files with thousands of lines). Well, I finally gave in and sorted through the RNA-seq pipeline command logs to identify the unique steps that bcbio (version 1.0.8) is performing in order to produce its results. I was able to identify 21 unique steps that are performed on each sample.
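To give a flavor of that sorting, here is a minimal sketch of collapsing a command log down to its unique steps. The log excerpt, its "[timestamp] command" layout, and the /opt/tools path are all made up for illustration; the real bcbio command log may be laid out differently, so the parsing would need adjusting.

```shell
# Hypothetical excerpt of a commands log: "[timestamp] command ..." per line.
log=$(mktemp)
cat > "$log" <<'EOF'
[2019-03-11T15:02Z] samtools index -@ 16 work/align/sampleA/sampleA-sort.bam
[2019-03-11T15:03Z] samtools index -@ 16 work/align/sampleB/sampleB-sort.bam
[2019-03-11T15:04Z] /opt/tools/bin/salmon quant -l IU -i work/salmon/index/hg38
[2019-03-11T15:05Z] samtools stats -@ 16 work/align/sampleA/sampleA-sort.bam
[2019-03-11T15:06Z] /opt/tools/bin/salmon quant -l IU -i work/salmon/index/hg38
EOF

# Drop the timestamp, reduce each command to "program subcommand"
# (stripping any leading directory from the program), then de-duplicate.
steps=$(
  sed 's/^\[[^]]*\] *//' "$log" |
  awk '{ n = split($1, parts, "/"); print parts[n], $2 }' |
  sort -u
)
n_unique=$(printf '%s\n' "$steps" | wc -l | tr -d ' ')

echo "$steps"
echo "unique steps: $n_unique"
rm -f "$log"
```

Keying on the first two words is crude (for a tool without subcommands the second word is just its first flag), but it is enough to turn thousands of repeated lines into a short list that can be read by eye.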

The difficulty of figuring out exactly what a configuration-based pipeline like bcbio is going to do is one argument in favor of using software like Snakemake or Nextflow to create or adapt existing pipelines, where the actual steps in the pipeline are made very explicit in rules (Snakemake) or “process” blocks (Nextflow). I’m going to be writing more about Nextflow in upcoming posts.

Of these 21 steps, 17 deal with creating a BAM file and then manipulating that BAM file or calculating something about it. The remaining four mainly deal with pseudo-alignment using salmon. It’s somewhat ironic that most of the pipeline and computational time is taken up with creating and manipulating BAM files, since I only ever use the salmon pseudo-alignments in my downstream analysis.

Here are the 21 steps of the bcbio RNA-seq workflow (I’ve deleted the long, user-specific file paths to show just the commands):

Step 1. Align with Hisat2

hisat2 --new-summary -x bcbio-1.0.8/genomes/Hsapiens/hg38/hisat2/hg38 -p 16 --phred33 --rg-id SW872_CAMTA1_rep1 --rg PL:illumina --rg PU:1_2019-03-11_to_setup_bcbio --rg SM:SW872_CAMTA1_rep1 -1 SW872_CAMTA1_rep1_R1.fastq.gz -2 SW872_CAMTA1_rep1_R2.fastq.gz --known-splicesite-infile bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts-splicesites.txt

Step 2/3. Pipe to bamsormadup and redirect to sorted BAM

| bamsormadup inputformat=sam threads=12 tmpfile=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort-sorttmp-markdup SO=coordinate indexfilename=work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam.bai > work/bcbiotx/tmplsr55j/SW872_CAMTA1_rep1-sort.bam

Step 4. Index BAM

samtools index -@ 16 work/align/SW872_TAZ4SA_rep3/SW872_TAZ4SA_rep3-sort.bam /work/bcbiotx/tmpsqOnZQ/SW872_TAZ4SA_rep3-sort.bam.bai

Step 5. Samtools sort by read names

samtools sort -@ 16 -m 2457M -O BAM -n -T work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted-sort -o /work/bcbiotx/tmpqFmCaf/SW872_CAMTA1_rep1-sort.nsorted.bam /work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam

Step 6. Sambamba view to select only primary alignments

sambamba view -t 16 -f bam -F "not secondary_alignment" work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.bam > work/bcbiotx/tmp0zhZuj/SW872_CAMTA1_rep1-sort.nsorted.primary.bam

Step 7. FeatureCounts to count primary alignments in BAM

featureCounts -a /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf -o work/bcbiotx/tmp77coEk/SW872_CAMTA1_rep1.counts -s 0 -p -B -C work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.nsorted.primary.bam

Step 8. Gffread to write a fasta file with spliced exons

gffread -g /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/seq/hg38.fa -w work/bcbiotx/tmpNpBGRC/hg38.fa.tmp /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf

Step 9. Build the salmon index

salmon index -k 31 -p 16 -i /work/bcbiotx/tmpTQDS7X/hg38 -t work/inputs/transcriptome/hg38.fa

Step 10. Pseudo-alignment and quantification

salmon quant -l IU -i work/salmon/index/hg38 -p 16 --gcBias -o work/bcbiotx/tmpE_RRDN/quant   -1 <(gzip -cd /merged/SW872_CAMTA1_rep1_R1.fastq.gz) -2 <(gzip -cd /merged/SW872_CAMTA1_rep1_R2.fastq.gz) --numBootstraps 30

Step 11. Convert salmon output to sleuth format

Rscript -e 'library("wasabi"); prepare_fish_for_sleuth(c("work/bcbiotx/tmpE_RRDN/quant"))'

Step 12. Downsample BAM file with samtools view

samtools view -O BAM -@ 16 -o work/bcbiotx/tmphaXqSf/SW872_CAMTA1_rep1-sort-downsample.bam -s 42.269 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam
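
The "-s 42.269" argument packs two values into one number: samtools uses the integer part as the random seed and the part after the decimal point as the fraction of reads to keep. A small sketch of how to read it (the 10 million read total is a made-up figure, only there to show what the fraction implies):

```shell
# The -s value from step 12
subsample_arg="42.269"

# Integer part: seed for the random number generator
seed="${subsample_arg%%.*}"
# Part after the decimal point: fraction of read pairs to keep
fraction="0.${subsample_arg#*.}"

# Hypothetical library size, just to illustrate the effect of the fraction
total_reads=10000000
expected_kept=$(awk -v f="$fraction" -v n="$total_reads" 'BEGIN { printf "%.0f", n * f }')

echo "seed=${seed} fraction=${fraction} expected_kept=${expected_kept}"
```

So this sample is downsampled to roughly 27% of its alignments before the QC steps that follow.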

Step 13. FastQC on downsampled BAM

export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH &&  /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/fastqc -d work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -t 16 --extract -o work/qc/SW872_CAMTA1_rep1/bcbiotx/tmpgOv610 -f bam work/qc/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-downsample.bam

Step 14. Run Qualimap RNAseq on BAM

unset DISPLAY && export PATH=/Dedicated/IIHG-argon/bcbio-1.0.8/anaconda/bin:$PATH &&  /Dedicated/IIHG-argon/bcbio-1.0.8/galaxy/../anaconda/bin/qualimap rnaseq -outdir work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1 -a proportional -bam work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam -p non-strand-specific -gtf /Dedicated/IIHG-argon/bcbio-1.0.8/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf --java-mem-size=59g

Step 15. A sed command that rewrites the “bam file” line in the Qualimap results

sed -i 's/bam file = .*/bam file = SW872_CAMTA1_rep1.bam/' work/bcbiotx/tmpACJXgn/SW872_CAMTA1_rep1/rnaseq_qc_results.txt
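
The substitution itself is easy to see on a toy file. The sketch below applies the same pattern to a made-up stand-in for rnaseq_qc_results.txt (the file contents and the sample1 name are hypothetical) and prints the result instead of editing in place. My guess, and it is only a guess, is that the point is to have the report show a clean sample name rather than a long path.

```shell
# Hypothetical stand-in for Qualimap's rnaseq_qc_results.txt
report=$(mktemp)
cat > "$report" <<'EOF'
>>>>>>> Input

    bam file = /long/tmp/path/sample1-sort.bam
    gff file = ref-transcripts.gtf
EOF

# Same substitution as step 15, written to stdout rather than in place (-i):
# everything from "bam file = " to the end of that line is replaced.
cleaned=$(sed 's/bam file = .*/bam file = sample1.bam/' "$report")

echo "$cleaned"
rm -f "$report"
```

Only the "bam file" line changes; the rest of the report passes through untouched.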

Step 16. Mark duplicates on the BAM file

bammarkduplicates tmpfile=work/bcbiotx/tmpNdl3wy/SW872_CAMTA1_rep1-sort-dedup-markdup markthreads=16 I=work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam O=work/bcbiotx/tmprVQeKM/SW872_CAMTA1_rep1-sort-dedup.bam

Step 17. Index de-duplicated BAM file

samtools index -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam work/bcbiotx/tmpFAzLLT/SW872_CAMTA1_rep1-sort-dedup.bam.bai

Step 18. Use Sambamba view to create duplicate metrics

sambamba view --nthreads 16 --count -F 'duplicate and not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt

Step 19. Use Sambamba to create mapping metrics

sambamba view --nthreads 16 --count -F 'not unmapped' work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort-dedup.bam >> work/bcbiotx/tmpJS4s1r/dup_metrics.txt
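
Because steps 18 and 19 both append to the same dup_metrics.txt, that file ends up holding two counts: duplicate mapped reads first, then all mapped reads. A duplication rate can be derived from the pair, as sketched below. The counts are invented, and I am assuming (not confirming from the bcbio code) that the ratio is what the file is ultimately used for.

```shell
# Hypothetical dup_metrics.txt after steps 18 and 19 have each appended a count
metrics=$(mktemp)
printf '%s\n' 1200 >> "$metrics"   # step 18: duplicate and not unmapped
printf '%s\n' 8000 >> "$metrics"   # step 19: not unmapped

# Line 1 is the duplicate count, line 2 the mapped count
dup_rate=$(awk 'NR == 1 { dup = $1 } NR == 2 { mapped = $1 } END { printf "%.3f", dup / mapped }' "$metrics")

echo "duplication rate: $dup_rate"
rm -f "$metrics"
```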

Step 20. Samtools stats on sorted BAM

samtools stats -@ 16 work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > /work/bcbiotx/tmpUPSiOz/SW872_CAMTA1_rep1.txt

Step 21. Samtools idxstats on sorted BAM

samtools idxstats work/align/SW872_CAMTA1_rep1/SW872_CAMTA1_rep1-sort.bam > work/bcbiotx/tmpSKFNZQ/SW872_CAMTA1_rep1-idxstats.txt