Conference report: ISMB 2018 Chicago

In early July, I attended ISMB 2018, a computational biology meeting held by the International Society for Computational Biology (ISCB). The meeting took place in the beautiful Hyatt Regency hotel in downtown Chicago, just across the street from the river and blocks from Navy Pier and Lake Shore Drive.

ISMB 2018 was a huge meeting, with at least 1500 attendees and up to ten parallel meeting tracks (called “COSIs” in ISCB parlance) at any one time. The meeting was so big that I always felt as if I were missing something good, no matter which talk I went to (except for the keynotes, where nothing else was happening).

Here is Steven Salzberg’s excellent keynote on a historical overview of finding genes in the human genome:

Obviously, at a meeting this large, one cannot explore more than a tiny fraction of the talks and posters (but you can watch many of the ISMB 2018 talks on YouTube now if you’re interested).

I want to briefly summarize three talks that I particularly enjoyed and found very interesting:

1) Michael Seiler, H3 Biomedicine. “Selective small molecule modulation of splicing in cancer”

Up to 60% of hematologic neoplasms (CLL/AML/MDS) contain heterozygous hotspot mutations in key splicing factor genes involved in 3′ splice site recognition. One such gene is SF3B1, an RNA splicing factor, which is recurrently mutated around the HEAT repeats. The cryo-EM structure of SF3B1 revealed that the mutations cluster in the pre-mRNA-interacting region. All of the observed mutations lead to alternate 3′ splice sites, so-called “cryptic 3′ splice sites (AG)”, being chosen during splicing. In particular, the K700E mutant slips the spliceosome upstream to an internal AG almost 18 nucleotides from the proper splice site.
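To make the idea of a cryptic 3′ splice site concrete, here is a minimal sketch that scans the end of an intron for AG dinucleotides upstream of the canonical one. The function name, the 30-nt window, and the toy sequence are my own assumptions for illustration; this is not how the speaker's group identified the sites.

```python
def upstream_ag_sites(intron_seq, window=30):
    """Return distances (nt) from the canonical 3' AG to each upstream AG.

    Assumes `intron_seq` is written 5'->3' and ends with the canonical AG.
    Only AGs that start within `window` nt upstream are reported.
    """
    seq = intron_seq.upper()
    if not seq.endswith("AG"):
        raise ValueError("intron is expected to end with the canonical AG")
    canonical = len(seq) - 2          # index of the 'A' in the canonical AG
    start = max(0, canonical - window)
    distances = []
    for i in range(start, canonical):
        if seq[i:i + 2] == "AG":
            distances.append(canonical - i)
    return sorted(distances)


# Toy intron (hypothetical): one internal AG placed 18 nt upstream
toy_intron = "GT" + "C" * 20 + "AG" + "T" * 16 + "AG"
print(upstream_ag_sites(toy_intron))
```

A real analysis would also have to consider the branch point and polypyrimidine tract, which this sketch ignores.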

RNA-seq of 200 CLL patients helped to discover further in-frame deletions in SF3B1, SRSF2, and U2AF1. All of these mutations are heterozygous (HET), and the cells rely on the wild-type (WT) copy for survival.

Seiler asked whether one could exploit this weakness to attack cancer cells. The idea is to modulate the activity of SF3B1 with small molecules that disturb the WT function. At H3 they took a natural product compound library and optimized for chemistry and binding to find candidates. One molecule, H3B-8800, showed promise. The compound was optimized toward selective induction of apoptosis in SF3B1 mutant cells in vitro. When resistance mutations occur, they are all at points of contact with the drug. At a dose of only 13 nM, around 1% of splicing events were affected, which was enough to be lethal to the cancer cell. RNA-seq demonstrated that it was mainly the spliceosome genes themselves that were affected by the alternate splicing in the presence of the inhibitor. This demonstrated a delicate feedback loop between correct splicing and spliceosome gene expression.

2) Curtis Huttenhower,  Harvard University.  “Methods for multi-omics in microbial community population studies”

Huttenhower’s group at Harvard is well known for their contributions to methods for microbiome analysis, collectively known as the “biobakery.” In this talk, Huttenhower addressed the fact that the microbiome is increasingly of interest when looking at population health, and that the incidence of chronic immune disease has been rising around the world over the last few decades.

Huttenhower also spent a good amount of time describing the IBDMDB, the Inflammatory Bowel Disease Multi’omics Database. The database contains ~200 samples that are complete with six orthogonal datatypes, including RNA-seq, metagenome, and metabolome. He talked about how they can associate bacteria enriched in IBD with covariation in the prevalence of metabolites. For example, he showed how sphingolipids, carboximidic acids, cholesteryl esters and more are enriched while lactones, beta-diketones, and others are depleted in the guts of Crohn’s disease sufferers.

3) Olga Troyanskaya, Princeton University.  “ML approaches for the data-driven study of human disease”

Troyanskaya’s group at Princeton aims to develop accurate models of the complex processes underlying cellular function.  In this talk, she described efforts in her lab to use machine learning methods to understand how single nucleotide changes (SNPs) in non-coding regions can affect gene regulation and expression.  She is also interested in how pathways and networks change in different tissues and whether tissue-specific maps can be used to identify disease genes.

In particular, Dr. Troyanskaya described the “DeepSEA” software, which uses a convolutional neural network (CNN) to attempt to predict the chromatin remodeling consequences of a SNP in a genomic context. The model is a three-layer CNN that takes 1000 base pairs of sequence as input. The 1000 bp region is centered around known TF binding locations (200 bp bins). The training data consists of a length-919 vector that contains binary values for the presence of a TF binding event across the genome (1 = binds, 0 = does not bind). The output of the model is the probability that the specific sequence variant will affect TF binding at each of the 919 bins.

Schematic of the DeepSEA model, showing input data, training data, and output.
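As a rough illustration of how a sequence model like this can be used to score a variant, the sketch below one-hot encodes a reference and an alternate sequence and compares the model outputs. It is not the DeepSEA code: `toy_predict` is a made-up stand-in with two outputs instead of 919, and scoring by absolute difference is only one simple choice that I am assuming here.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T); N becomes all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def variant_effect(predict, ref_seq, pos, alt):
    """Per-feature absolute change in predicted probability caused by a SNV.

    `predict` is any callable mapping a one-hot matrix to a list of probabilities
    (a trained CNN in the real setting). `pos` is a 0-based index into `ref_seq`.
    """
    if alt.upper() == ref_seq[pos].upper():
        raise ValueError("alt allele equals the reference base")
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    p_ref = predict(one_hot(ref_seq))
    p_alt = predict(one_hot(alt_seq))
    return [abs(a - r) for r, a in zip(p_ref, p_alt)]


def toy_predict(x):
    """Hypothetical stand-in model with 2 outputs: GC fraction and A fraction."""
    n = len(x)
    gc = sum(row[1] + row[2] for row in x) / n
    a = sum(row[0] for row in x) / n
    return [gc, a]


ref = "A" * 1000                      # 1000 bp input window, as in the talk
print(variant_effect(toy_predict, ref, 500, "G"))
```

Swapping `toy_predict` for a trained network would leave the scoring function unchanged, which is the main point of the sketch.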

The DeepSEA model can be used to predict important sequence variants and even eQTLs for further study or to prioritize a list of known non-coding sequence variants.

Exploratory analysis of human splice-altering variants

Single splice-altering variants can alter mRNA structure and cause disease

The splicing of introns and joining of exons to form mRNA is dependent on complex cellular machinery and conserved sequences within introns to be performed correctly.  Single-nucleotide variants in splicing consensus regions, or “scSNVs” (defined as −3 to +8 at the 5’ splice site and −12 to +2 at the 3’ splice site)  have the potential to alter the normal pattern of mRNA splicing in deleterious ways.  Even those variants that are exonic and synonymous (i.e., they do not alter the amino acid incorporated into a polypeptide) can potentially affect splicing.  Altered splicing can have important downstream effects in human disease such as cancer.
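The window definition above can be written down as a small coordinate check. This is a sketch under my own assumptions: 1-based inclusive intron coordinates on the plus strand, and the usual convention that negative offsets lie upstream of the exon/intron boundary, positive offsets lie downstream, and there is no position 0.

```python
def splice_consensus_offset(pos, intron_start, intron_end):
    """Return ("5ss" | "3ss", offset) if `pos` is in a splicing consensus region.

    Coordinates are 1-based and inclusive; the intron is on the plus strand.
    5' splice site window: -3..+8 (last 3 exonic, first 8 intronic bases).
    3' splice site window: -12..+2 (last 12 intronic, first 2 exonic bases).
    Returns None if the position is outside both windows.
    """
    # Offset relative to the 5' splice site (+1 is the first intronic base)
    d = pos - intron_start
    donor = d + 1 if d >= 0 else d
    if -3 <= donor <= 8:
        return ("5ss", donor)

    # Offset relative to the 3' splice site (-1 is the last intronic base)
    a = pos - intron_end
    acceptor = a if a > 0 else a - 1
    if -12 <= acceptor <= 2:
        return ("3ss", acceptor)
    return None


# Hypothetical intron spanning positions 101..200
print(splice_consensus_offset(98, 101, 200))    # last window base on the exon side
print(splice_consensus_offset(189, 101, 200))   # first base of the 3' window
print(splice_consensus_offset(150, 101, 200))   # deep intronic
```

For very short introns the two windows can overlap; this sketch simply reports the 5′ site first. Minus-strand introns would need the coordinates mirrored.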

Using machine-learning to predict splice-altering variants

In the paper “In silico prediction of splice-altering single nucleotide variants in the human genome,”  the researchers took on the problem of predicting which single-nucleotide variants (SNVs) have the potential to be splice-altering by computing “ensemble scores” for potential variants, combining the results of several popular splicing prediction software tools into one probability score.

They did this by using “random forest” (rf) and “adaptive-boosting” (adaboost) classifiers from machine-learning methods to give improved ensemble predictions that are demonstrated to do better than predictions from an individual tool, leading to improvements in the sensitivity and specificity of the predictions.
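To give a feel for what an ensemble over tool scores means, here is a minimal pure-Python AdaBoost with decision stumps. It is a toy sketch, not the authors' pipeline: the two input columns stand for scores from two hypothetical tools, the training data are invented, and mapping the boosted margin to a 0 to 1 score with a logistic function is just one common convention.

```python
import math


def train_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign


def adaboost_fit(X, y, rounds=10):
    """Fit AdaBoost on labels in {-1, +1}; returns a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Up-weight misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_score(model, row):
    """Map the weighted vote to a 0..1 score via a logistic transform."""
    margin = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1.0 / (1.0 + math.exp(-2.0 * margin))


# Invented training set: columns are scores from hypothetical tool_a and tool_b
X = [[0.9, 0.8], [0.8, 0.2], [0.7, 0.9], [0.2, 0.1], [0.1, 0.7], [0.3, 0.3]]
y = [1, 1, 1, -1, -1, -1]             # +1 = splice-altering, -1 = not
model = adaboost_fit(X, y, rounds=5)
print(adaboost_score(model, [0.95, 0.5]), adaboost_score(model, [0.05, 0.5]))
```

A random forest follows the same spirit (many weak tree-based learners combined into one score) but averages trees grown on bootstrap samples instead of reweighting examples.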

As part of their supplementary material, the authors pre-computed rf and adaboost scores for every SNV in a library of ~16 million such sites collated from human RefSeq and Ensembl databases. The scores are probabilities of a particular SNV being splice-altering (0 to 1).

Exploratory analysis of the database

I performed an exploratory data analysis of chromosome 1 (chr1) SNVs from the  database that was made available with the paper.

First, I just looked at where the SNVs on chrom 1 were located as classified by Ensembl region:

Fig 1. Number of variants in the scSNV database on chromosome 1 by Ensembl region. Not surprisingly, ‘intronic’, ‘exonic’, and ‘splicing’ are the most common regions for potential splice-altering SNVs.

As can be seen from Fig 1, most of the SNVs are located in introns, exons, and splicing consensus sites according to their Ensembl records.
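A tally like the one in Fig 1 takes only a few lines. The sketch below assumes a tab-delimited table with a header row and a region column named `Ensembl_region`; that column name is my assumption, so check it against the header of the file you actually download.

```python
import csv
import io
from collections import Counter


def count_by_region(lines, region_col="Ensembl_region"):
    """Count variants per region from tab-delimited lines with a header row.

    `lines` can be an open file handle or any iterable of text lines.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row[region_col] for row in reader)


# Tiny in-memory example (made-up rows, not real database content)
example = io.StringIO(
    "chr\tpos\tEnsembl_region\n"
    "1\t100\tintronic\n"
    "1\t200\texonic\n"
    "1\t300\tintronic\n"
    "1\t400\tsplicing\n"
)
for region, n in count_by_region(example).most_common():
    print(region, n)
```

For the real chr1 file, passing an open file handle in place of the `StringIO` object gives the counts that feed a bar chart.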

Next, I created histograms for the chrom 1 SNVs by their Ensembl classification, looking at rf scores only (keep in mind that the y-axis scale in the plots of Figs 2 and 3 differs dramatically between regions). The x-axis is the probability of being splice-altering according to the pre-computed rf score.

Fig 2. Random-forest (rf) score by Ensembl region for the ~15 M scSNVs in the database.

I noticed that within ‘exonic’ regions on chrom 1, the rf scores are spread broadly across the full range from 0.0 to 1.0, while in other regions like ‘UTR3’, ‘UTR5’, and ‘downstream’, the distributions are concentrated near zero. For the ‘intronic’ region, the majority of sites have low probability of being splice-altering, while at the ‘splicing’ consensus sites, the vast majority are predicted to be splice-altering variants. This appears to make intuitive sense.
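The per-region histograms described above boil down to binning scores by region. Here is a minimal sketch, assuming each row is a dict with columns named `Ensembl_region` and `rf_score` (both names are assumptions on my part) and that missing scores appear as non-numeric placeholders such as ".".

```python
from collections import defaultdict


def score_histograms(rows, score_col="rf_score",
                     region_col="Ensembl_region", n_bins=10):
    """Return {region: [count per bin]} for scores in the range 0..1.

    Rows with missing or non-numeric scores are skipped. A score of exactly
    1.0 is placed in the last bin.
    """
    hists = defaultdict(lambda: [0] * n_bins)
    for row in rows:
        try:
            score = float(row.get(score_col, ""))
        except (TypeError, ValueError):
            continue                       # e.g. "." or an empty field
        if not 0.0 <= score <= 1.0:
            continue                       # also drops NaN
        idx = min(int(score * n_bins), n_bins - 1)
        hists[row[region_col]][idx] += 1
    return dict(hists)


# Made-up rows for illustration only
rows = [
    {"Ensembl_region": "splicing", "rf_score": "0.95"},
    {"Ensembl_region": "splicing", "rf_score": "1.0"},
    {"Ensembl_region": "intronic", "rf_score": "0.05"},
    {"Ensembl_region": "intronic", "rf_score": "."},
]
for region, counts in score_histograms(rows).items():
    print(region, counts)
```

Passing a different `score_col` produces the matching histograms for the adaboost scores.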

I performed the same analysis for the adaboost scores, as shown in Fig 3 (below). You can see that the adaboost scores take on a more binary distribution than the rf scores, with any individual SNV likely to be classified as ~1 or 0 according to the adaptive boosting method. As with the rf scores, SNVs in ‘exonic’ regions are equally likely to be splice-altering as not, while those in ‘splicing’ regions are highly likely to be splice-altering. An SNV in an ‘intronic’ region is ~3X more likely to have no effect on splicing than to be splice-altering.

Fig 3. Adaboost (ada) score by Ensembl region for the scSNV database.

Finally, I looked at the relationship between the two scoring methods for the SNVs that fall within the Ensembl-characterized ‘splicing’ regions on chrom 1.  That scatter plot is shown below in Fig 4.

I was expecting a tight linear correlation between the two approaches; however, the data show that the rf and adaboost methods differ substantially in their assessment of the collection of SNVs in these regions.

It is obvious from the plot below that there are many SNVs that the rf method considers to have low probability of being splice-altering that are found to have very high (>0.9) probability by the adaboost method.

Fig 4. Scatter plot of rf score vs. ada score for SNVs within ‘splicing’ regions on chrom 1.
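The disagreement visible in the scatter plot can also be put into numbers. The sketch below computes a Pearson correlation and pulls out the pairs where the rf score is low but the adaboost score is above 0.9. The 0.5 cut-off for a "low" rf score is my own assumption, and the example pairs are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def discordant(pairs, rf_max=0.5, ada_min=0.9):
    """Return (rf, ada) pairs with a low rf score but a very high ada score."""
    return [(rf, ada) for rf, ada in pairs if rf < rf_max and ada > ada_min]


# Invented (rf_score, ada_score) pairs for illustration
pairs = [(0.1, 0.95), (0.9, 0.99), (0.2, 0.1), (0.8, 0.85)]
rf_scores = [p[0] for p in pairs]
ada_scores = [p[1] for p in pairs]
print(round(pearson(rf_scores, ada_scores), 3))
print(discordant(pairs))
```

A rank-based measure such as Spearman correlation might suit the near-binary adaboost scores better, but Pearson is enough to show whether the relationship is close to linear.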

This result would appear to suggest that if one is going to classify variants as “splice-altering” from this database, it would be best to consider both predictions or some combination of them rather than relying on either score alone if the goal is not to miss any potentially important sites.  Conversely, if the goal is to only consider sites with very high likelihood of being splice-altering, a threshold could be set such that both scores need to be above 0.8, for example.
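The two strategies in the paragraph above amount to an "either" rule and a "both" rule. Here is a minimal sketch; the function name and the idea of sharing one threshold between the two scores are my own simplifications.

```python
def is_splice_altering(rf, ada, threshold=0.8, mode="both"):
    """Classify a variant from its rf and ada scores.

    mode="both":   strict rule, both scores must exceed the threshold.
    mode="either": permissive rule, one score above the threshold is enough.
    """
    if mode == "both":
        return rf > threshold and ada > threshold
    if mode == "either":
        return rf > threshold or ada > threshold
    raise ValueError("mode must be 'both' or 'either'")


# A variant on which the two methods disagree (invented scores)
print(is_splice_altering(0.3, 0.95, mode="both"))
print(is_splice_altering(0.3, 0.95, mode="either"))
```

The "either" rule favors sensitivity, the "both" rule favors specificity, and the threshold itself can be tuned separately for each goal.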