P values in practice

Found a great article from Andrew Gelman at Columbia on how to think about p values from a Bayesian perspective.  Although I don’t really understand Bayesian statistics at this point, the article still had some very nice explanations about how to think about traditional p values as they are used in practice.

Some key points:

The P value is a measure of discrepancy of the fit of a model or “null hypothesis” H to data y. Mathematically, it is defined as Pr(T(yrep)>T(y)|H), where yrep represents a hypothet- ical replication under the null hypothesis and T is a test statis- tic (ie, a summary of the data, perhaps tailored to be sensitive to departures of interest from the model).”

“[…] the P value is itself a statistic and can be a noisy measure of evidence. This is a problem not just with P values but with any mathematically equivalent procedure, such as summarizing results by whether the 95% confidence interval includes zero.”

“[…] we cannot interpret a nonsignificant result as a claim that the null hypothesis was true or even as a claimed probability of its truth. Rather, nonsignificance revealed the data to be compatible with the null hypothesis;”

“we accept that sample size dictates how much we can learn with confidence; when data are weaker, it can be possible to find reliable patterns by averaging.”

“The focus on P values seems to have both weakened [the] study (by encouraging the researcher to present only some of his data so as to draw attention away from nonsignificant results) and to have led reviewers to inappropriately view a low P value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis […] rather than other, perhaps more scientifically plausible, alternatives such as measurement error and selection bias.”

Exploratory analysis of human splice-altering variants

Single splice-altering variants can alter mRNA structure and cause disease

The splicing of introns and joining of exons to form mRNA is dependent on complex cellular machinery and conserved sequences within introns to be performed correctly.  Single-nucleotide variants in splicing consensus regions, or “scSNVs” (defined as −3 to +8 at the 5’ splice site and −12 to +2 at the 3’ splice site)  have the potential to alter the normal pattern of mRNA splicing in deleterious ways.  Even those variants that are exonic and synonymous (i.e., they do not alter the amino acid incorporated into a polypeptide) can potentially affect splicing.  Altered splicing can have important downstream effects in human disease such as cancer.

Using machine-learning to predict splice-altering variants

In the paper “In silico prediction of splice-altering single nucleotide variants in the human genome,”  the researchers took on the problem of predicting which single-nucleotide variants (SNVs) have the potential to be splice-altering by computing “ensemble scores” for potential variants, combining the results of several popular splicing prediction software tools into one probability score.

They did this by using “random forest” (rf) and “adaptive-boosting” (adaboost) classifiers from machine-learning methods to give improved ensemble predictions that are demonstrated to do better than predictions from an individual tool, leading to improvements in the sensitivity and specificity of the predictions.

As part of their supplementary material, the authors pre-computed rf and adaboost scores for every SNV in a library of nearly ~16 million such sites collated from human RefSeq and Ensembl databases.   The scores are probabilities of a particular SNV being splice-altering (0 to 1).

Exploratory analysis of the database

I performed an exploratory data analysis of chromosome 1 (chr1) SNVs from the  database that was made available with the paper.

First, I just looked at where the SNVs on chrom 1 were located as classified by Ensembl region:

Fig 1. Number of variants in the scSNV database on chromosome 1  by Ensembl region. Not surprisingly, ‘intronic’, ‘exonic’, and ‘splicing’ are the most common regions for potential splice-altering SNPs.

As can be seen from Fig 1, most of the SNVs are located in introns, exons, and splicing consensus sites according to their Ensembl records.

Next, I created histograms for the chrom 1 SNVs by their Ensembl classification, looking at rf scores only (keep in mind that the scale on the  y-axis for the plots in Fig 2 and 3 differs dramatically between regions).  The x-axis is the probability of being splice-altering according to the pre-computed rf score.

Fig 2. Random-forest (rf) score by ensembl region for the ~15 M scSNVs in the database.

I noticed the fact that within ‘exonic’ regions on chrom 1, the rf scores take on a range of values from 0.0 to 1.0 in a broad distribution, while in other regions like ‘UTR3’, ‘UTR5’, ‘downstream’, etc… the distributions are narrowly skewed towards zero.  For the ‘intronic’ region, the majority of sites have low probability of being splice-altering, while at the ‘splicing’ consensus sites, the vast majority are predicted to be splice-altering variants.  This appears to make intuitive sense.

I performed the same analysis for the adaboost scores, as shown in Fig 3 (below).  You can see that the adaboost scores take on a more binary distribution than the rf scores, with any individual SNV likely to be classified as ~1 or 0 according to the adaptive boosting method.  Just like the rf scores, SNVs in ‘exonic’ regions are equally likely to be splice-altering as not, while those in ‘splicing’ regions are highly likely to be splice-altering.  An SNV in an ‘intronic’ regions is ~3X more likely to have no effect on splicing.

Fig 3. Ada score by Ensembl region for the scSNV database.


Finally, I looked at the relationship between the two scoring methods for the SNVs that fall within the Ensembl-characterized ‘splicing’ regions on chrom 1.  That scatter plot is shown below in Fig 4.

I suppose I was expecting a tight linear correlation between the two approaches, however the data show that the rf and adaboost methods differ substantially in their assessment of the collection of SNVs in these regions.

It is obvious from the plot below that there are many SNVs that the rf method considers to have low probability of being splice-altering that are found to have very high (>0.9) probability by the adaboost method.

Fig 4. Scatter plot of rf score vs. ada score for SNVs within 'splicing' regions on chrom 1.
Fig 4. Scatter plot of rf score vs. ada score for SNVs within ‘splicing’ regions on chrom 1.

This result would appear to suggest that if one is going to classify variants as “splice-altering” from this database, it would be best to consider both predictions or some combination of them rather than relying on either score alone if the goal is not to miss any potentially important sites.  Conversely, if the goal is to only consider sites with very high likelihood of being splice-altering, a threshold could be set such that both scores need to be above 0.8, for example.

Concatenate several lanes of Illumina HiSeq reads quickly

If you have raw Illumina HiSeq reads or MiSeq run across several lanes of data, you may need to concatenate them together at the command line before trimming and merging them together. Raw data from Illumina sequencers generally follows a standard naming convention. The files might look like this:


Where ’16’ is the sample ID, followed by the barcode sequence, ‘L00X’ is the lane number, ‘R1’ means forward reads.

An easy way to script this quickly is as follows:

cat `find . -maxdepth 1 | egrep '16.*R1'` > 16_TAGGCGA_R1.fastq &

There’s a lot going on here, so let’s unpack this briefly.

The backquotes (` ) mean that the output of the command is presented as command-line parameters to the enclosing command.

So in this case the ‘find’ command is listing all file names in the current directory at a depth of 1 (no recursion into lower directories).

Next, this is pipe’d (with |) to egrep which searches the list of filenames for a regular expression that indicates a string starting with 16, followed by any number of any characters, then an ‘R1’. Since the expression matches whole strings, this will find the above files.

These filenames are then passed to ‘cat’ as command line arguments. The concatenated files are then redirected with the greater than (‘>’) to the new fastq file. The ‘&’ indicates that the shell run this in the background.

I hope this is useful. I spent about an hour tracking this all down, so I wouldn’t have to process dozens of samples by hand.

Create a volcano plot on EBSeq output

The differential expression analysis program EBSeq produces a number of data objects as part of the workflow, but there aren’t many options for visualization of the data.

The authors suggest the use of heatmap.2 in R:

heatmap.2(NormalizedMatrix[GenesOfInterest,], scale=”row”, trace=”none”, Colv=F)

However, this depends on knowing ahead of time your genes of interest.  It is not practical to generate a heatmap with hundreds or thousands of DE genes.

I wanted to produce something approximating a volcano plot for EBSeq results.  What I came up with initially was the following:

A pseudo-volcano plot for EBSeq results.  The y-axis is posterior probability of differential expression (PPDE).
A pseudo-volcano plot for EBSeq results. The y-axis is posterior probability of differential expression (PPDE).

To make this plot, I had to grab some data arrays from the large object “EBOut” that is produced by calling the “EBTest” method and from the “GeneFC” object as below:

plot(GeneFC$PostFC, EBOut$PPDE, 
xlim =c(0,5), ylim=c(0,1), 
main="Control/Experimental FC vs. PPDE",
sub=GeneFC$Direction, xlab="EBSeq Posterior Fold Change", ylab="EBSeq posterior prob of DE")


The “abline” command places a horizontal line where the PPDE is equal to or greater than 95%.  This would be equivalent to an FDR of 0.05.

If you want to inspect the plot interactively in R to identify gene names above the threshold and/or with large posterior fold changes you would use:

identify(GeneFC$PostFC, EBOut$PPDE, labels=names(GeneFC$PostFC))

To make it look more like a canonical volcano plot, I then tried:

plot(log2(GeneFC$RealFC), EBOut$PPDE, 
xlim =c(-5,5), 
main="Log2FoldChange vs. PPDE", 
xlab="EBSeq Log2 Fold Change", ylab="EBSeq PPDE")

Creating the following plot:

Fig 2.  Log(2) Fold Change on the X-axis.
Fig 2. Log(2) Fold Change on the X-axis.

This is good, except I want to subset the data and add colors. To do this I need to create a new dataframe from the EBOut$PPDE and GeneFC$PostFC objects:

volc_df = data.frame(EBOut$PPDE, GeneFC$PostFC)

With everything in one dataframe, plotting and subsetting the data is easier. Inspired by this post at Stephen Turner’s “Getting Genetics Done” blog, I prepared my final colored volcano plot as follows:

with(volc_df, plot(log2(PostFC), PPDE, pch=20, main="Volcano Plot EBSeq", xlim=c(-5,5)))


with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) < 2), points(log2(PostFC), PPDE, pch=20, col="orange")) 

with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) < 2), points(log2(PostFC), PPDE, pch=20, col="red"))

The final plot looks like this:

Fig 3.  The final plot, with colored points for the DE genes, and higher fold change indicated with red.
Fig 3. The final plot, with colored points for the DE genes, and higher fold change indicated with red.

New position at the University of Iowa

I am very excited to have begun a new position this July as a Bioinformatics Specialist within the Iowa Institute for Human Genetics (IIHG).

I am fortunate to be working  with a very talented team of expert bioinformaticians, researchers, and programmers.   I have a lot to learn about my new field, but I’m enthusiastic about the challenge and looking forward to participating in the fast-moving and cutting-edge research at the nexus of computer science, data analysis, genetics, and biology.