Single splice-altering variants can alter mRNA structure and cause disease
The splicing of introns and joining of exons to form mRNA is dependent on complex cellular machinery and conserved sequences within introns to be performed correctly. Single-nucleotide variants in splicing consensus regions, or “scSNVs” (defined as −3 to +8 at the 5’ splice site and −12 to +2 at the 3’ splice site) have the potential to alter the normal pattern of mRNA splicing in deleterious ways. Even those variants that are exonic and synonymous (i.e., they do not alter the amino acid incorporated into a polypeptide) can potentially affect splicing. Altered splicing can have important downstream effects in human disease such as cancer.
Using machine-learning to predict splice-altering variants
In the paper “In silico prediction of splice-altering single nucleotide variants in the human genome,” the researchers took on the problem of predicting which single-nucleotide variants (SNVs) have the potential to be splice-altering by computing “ensemble scores” for potential variants, combining the results of several popular splicing prediction software tools into one probability score.
They did this by using “random forest” (rf) and “adaptive-boosting” (adaboost) classifiers from machine-learning methods to give improved ensemble predictions that are demonstrated to do better than predictions from an individual tool, leading to improvements in the sensitivity and specificity of the predictions.
As part of their supplementary material, the authors pre-computed rf and adaboost scores for every SNV in a library of nearly ~16 million such sites collated from human RefSeq and Ensembl databases. The scores are probabilities of a particular SNV being splice-altering (0 to 1).
Exploratory analysis of the database
I performed an exploratory data analysis of chromosome 1 (chr1) SNVs from the database that was made available with the paper.
First, I just looked at where the SNVs on chrom 1 were located as classified by Ensembl region:
As can be seen from Fig 1, most of the SNVs are located in introns, exons, and splicing consensus sites according to their Ensembl records.
Next, I created histograms for the chrom 1 SNVs by their Ensembl classification, looking at rf scores only (keep in mind that the scale on the y-axis for the plots in Fig 2 and 3 differs dramatically between regions). The x-axis is the probability of being splice-altering according to the pre-computed rf score.
I noticed the fact that within ‘exonic’ regions on chrom 1, the rf scores take on a range of values from 0.0 to 1.0 in a broad distribution, while in other regions like ‘UTR3’, ‘UTR5’, ‘downstream’, etc… the distributions are narrowly skewed towards zero. For the ‘intronic’ region, the majority of sites have low probability of being splice-altering, while at the ‘splicing’ consensus sites, the vast majority are predicted to be splice-altering variants. This appears to make intuitive sense.
I performed the same analysis for the adaboost scores, as shown in Fig 3 (below). You can see that the adaboost scores take on a more binary distribution than the rf scores, with any individual SNV likely to be classified as ~1 or 0 according to the adaptive boosting method. Just like the rf scores, SNVs in ‘exonic’ regions are equally likely to be splice-altering as not, while those in ‘splicing’ regions are highly likely to be splice-altering. An SNV in an ‘intronic’ regions is ~3X more likely to have no effect on splicing.
Finally, I looked at the relationship between the two scoring methods for the SNVs that fall within the Ensembl-characterized ‘splicing’ regions on chrom 1. That scatter plot is shown below in Fig 4.
I suppose I was expecting a tight linear correlation between the two approaches, however the data show that the rf and adaboost methods differ substantially in their assessment of the collection of SNVs in these regions.
It is obvious from the plot below that there are many SNVs that the rf method considers to have low probability of being splice-altering that are found to have very high (>0.9) probability by the adaboost method.
This result would appear to suggest that if one is going to classify variants as “splice-altering” from this database, it would be best to consider both predictions or some combination of them rather than relying on either score alone if the goal is not to miss any potentially important sites. Conversely, if the goal is to only consider sites with very high likelihood of being splice-altering, a threshold could be set such that both scores need to be above 0.8, for example.