Artificial neural network classification of copy number variation, part 2

Recap:

Welcome to the second part of this post series on building artificial neural network models for copy number classification.  In the first part, I described the problem with interpreting copy-ratio plots to find clinically-relevant CNV events.  The data from targeted capture deep sequencing are noisy and biased, and finding clinically-relevant genotypes in genes that have CNVs requires the analyst to visualize the CNV event and assign a classification on the basis of experience and expert knowledge.

The LASSO model

Once my training data were in place (see part 1), I used a multiple linear regression LASSO model as a machine-learning benchmark.  I did this to determine whether a more powerful neural network model would be warranted.   The LASSO model uses an “L1” prior to perform feature selection, setting some coefficients to zero as warranted by the data.   There is ample precedent for applying this type of model in bioinformatics settings where the goal is maximize predictive power without overfitting.

I fit the LASSO to the data, with 33% held out for validation.  The best fit was obtained with the alpha parameter set to 0.001.  k-fold cross validation (where k=10 and alpha=0.001) yielded an accuracy of 76%.   These results are surprisingly good, given the complexity of the CNV signals in the noisy data.  Unfortunately, 76% accuracy is simply not good enough for an automated method that will be used to predict genotypes in clinical data.

The ANN model

Next, I decided to construct an artificial neural network model.  My goal was to keep the model as simple as possible, while reaching a very high classification accuracy needed for clinical work.   To that end I constructed a one hidden-layer model with 19 input nodes corresponding to the 19 copy-ratio probes in the CNV data.  The output layer contained five nodes, corresponding to the five classes of defined CNV event or other event (for example, a very distinct sequencing artifact that kept appearing in the data):

 

In between the input and output layers I constructed a 10-node hidden layer.   A one hidden-layer neural network is the simplest form of the ANN model, and I tried to keep the number of hidden-layer nodes to a minimum as well.  Specific details about the model, hyper-parameter tuning, and the code will be available in the near future when I put a pre-print of this work on biorxiv.

Model training and cross-validation

I trained the model on the 175 sample dataset and on a 350 sample “synthetic” dataset created by adding gaussian “noise” to the real data.  The results are shown below, across 250 training epochs.

When the ANN model was tested with 10-fold cross validation, the accuracy reached a level of 96.5% (+/- 5.4%).   This is obviously a big improvement on the LASSO model, and reaches a level of accuracy that is good enough for clinical pipelines (with the caveat that low confidence predictions will still be checked “by hand.”)

Below, I’m showing a sample of the model output (left) and ground truth (right) from the test data.  The numbers (and colors) of the boxes correspond to the model’s probability in that classification.  You can see that most CNV events are called with high probability, but several (yellow boxes) are called correctly but with lower probability.   One event (red box) is called incorrectly with high probability.

Conclusions and caveats

Going into this project, I had no idea if the ANN model would be able to make predictions on the basis of so few examples in the training set.  The classic examples you see about ANN/CNN models rely on handwriting training sets with 10,000 or more images.   So I was surprised when the model did very well with extremely limited training data.   Since this method was developed for a clinical pipeline, it can be improved as the pipeline generates new training data with each new patient sample.  We would need many thousands of samples through our “legacy” pipeline to see enough examples of the rare star allele events in CYP2D6 that we could then classify them.  That is why I limited my CNV calling to three star alleles.

The low confidence, true positive predictions concern me less than the high confidence false negative.  Missing a real CNV that has impact on CYP2D6 function and therefore clinical relevance is very dangerous.  This can lead to incorrect prescribing and adverse drug reactions for the patient.   I really want to understand why the method makes predictions like this, and how to fix it.  Unfortunately, I believe it will require a lot more training data to solve this problem and that is something I lack.

My goals for this project now are 1) to publish a preprint on biorxiv describing this work and 2) to obtain some additional training/test datasets.   Because our pharmacogenomics test is not generating the kind of volume we expected, I may have to look around for another gene with clinically-relevant CNV events to test this method further.   For example, we do have an NGS-based test of hearing and deafness genes with thousands of validated patient samples.  One gene, STRC, has relevant CNVs that are complex and require analyst visualization to detect.  This may be a good system for follow up refinement of this type of model.

Matataki, a new ultrafast read mapper: hope or hype?

tl;dr:  It’s only 2X faster than kallisto; a marginal, rather than “ultra” improvement.  If you’re already using kallisto in your pipelines there’s no reason to switch.   Now onto the blog post: 

Another RNA-seq read aligner

In July, a paper appeared in BMC Bioinformatics: “Matataki: an ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data.”   The title seems to imply that Matataki is much faster than anything else out there.  After all, what does ‘ultrafast’ mean unless it’s in comparison to the next fastest method?  The authors claim it also solves a unique problem of large-scale RNA-seq reanalysis that can’t be solved any other way.  Is this true?  Let’s dive into the details and find out.

The authors point out in the introduction that the amount of publicly available RNA-seq data is increasing rapidly and that many would like to do large-scale reanalysis of these datasets.   Reanalysis (including realignment) is necessary because of the large batch effects of using different methods for alignment and quantification.

Although these alignment-free methods such as Sailfish, Kallisto, and RNA-Skim are much faster than the alignment-based methods, the recent accumulation of large-scale sequence data requires development of an even faster method for data management and reanalysis.

Counting unique k-mers

The method itself eschews traditional alignment for a k-mer based read counting approach.  This approach takes two steps:  build an index and quantify the reads against it.  Building the index requires the algorithm to search for all k-mers unique to a gene and also present in all isoforms of that gene (to avoid effects of differential expression of isoforms).  Quantification proceeds by matching the k-mers from a read with the unique k-mers in the index.  If there is only one indexed gene match from several k-mers (with a minimum cutoff set by the user) that read is assigned to the indexed gene.  If there is more than one match to several genes, the read is discarded.

If this sounds a lot like how kallisto operates under the hood, I would agree with you as would the authors:

Similar to Kallisto, our method uses k-mers that appear only once in a gene and quantifies expression from the number of unique k-mers.

Of course, kallisto creates a transcriptome de bruijn graph (tDBG) and uses compatibility classes to decide whether to skip k-mers while hashing a read.  Matataki does not use a tDBG and simply skips along a read using a fixed interval set by the user.   The code also skips the Expectation-Maximization (EM) step found in kallisto and focuses only on gene-level quantification.   These two novel approaches are what set Matataki apart from other pseudo-alignment algorithms, according to the authors.

But does it work?

Well, it appears to.  The authors start by reporting the performance of the method on simulated data using optimized parameters (I’ll discuss this below).  In Figure 2,  they describe the parameter tuning that led to those choices.  For parameter tuning the performance of Matataki on real data, the TPM values from an eXpress run on sample ERR188125 were taken as “ground truth.”  The choice of ‘eXpress’ as their benchmark for real-world dataset quantification is odd, given that eXpress is defunct and has been for some time.  It also takes orders of magnitude longer to complete than a method like sailfish or kallisto.

Fig 2. Comparison of TPM when k was varied. The x-axis shows the TPM values of eXpress, the y-axis shows the TPM values of our method, and the color indicates the indexed k-mer coverage of each gene when changing k from 16 to 56 with a step of 8

Going back to Figure 1 (below) and the performance on simulated data, we can observe accuracy similar to that of kallisto, RSEM, sailfish and other methods  (note that by using a scale from 0.93 to 0.99, the differences between the methods seem exaggerated).  The one simulated sample was created by RSEM from the real sample plus three other related samples.  So perhaps it’s not surprising that RSEM does the best in this comparison overall.

I think it’s important to remember as well that Matataki is being run on this simulated data with optimized parameters for this sample while the other methods are run with “default” parameters.   We don’t know how Figure 1a and 1b would change for other samples, other organisms, and other sequencing types (PE vs. SE, long vs. short, etc…)  It would be enlightening to find out.

Figure 1. Summary of the results using simulation data. a Spearman correlation coefficient with the expected expression and estimated expression values using each method. “Matataki” indicates the results of the proposed method, and “MatatakiSubset” indicates the results of the proposed method without uncovered genes. To compare the gene-level expression profile and transcript-level expression profile, the sum of TPM by each gene was calculated. b Means of absolute difference from the expected expression levels.

So how fast is ‘ultrafast’ anyway?

We finally get to the promised speed improvements near the end of the manuscript, only to find that the speedup relative to kallisto, a method that has been available publicly since at least 2015, is only 2-fold.  Yep, that’s it.  It’s only twice as fast.   This is an incremental improvement, but not a paradigm shift.  In fact, it shows how the leap from alignment to pseudo-alignment was in fact a paradigm shift when kallisto first came out.  It’s been three years and nothing has substantially improved on it.

Fig 3. CPU time comparison.

Limitations?

Finally, the authors end the paper with a section on ‘expected use-cases and limitations’ where they say that the method is only suitable for large-scale reanalysis and not for normal, turn-the-crank RNA-seq.

Nevertheless, Matataki is not suitable for common RNA-Seq purposes because other methods are sufficiently fast and provide better accuracy. For example, a single nucleotide substitution has larger effects in Matataki than in other methods, because even a single point substitution changes the k-mer for 2 k − 1 bases, which ultimately affects the number of k-mers in a transcript and calculation of the TPM value.

For reasons that I don’t understand, the authors promote this tool (which is only 2X faster than kallisto) as being for large-scale reanalysis, yet they performed no large-scale reanalysis to benchmark the tool.   Rather, they used it to perform normal RNA-seq as a benchmark, which they go on to say not to do in the wild.

To sum up, while a 2X speedup is useful, I see no reason to abandon kallisto to pseudoalign and quantify data when doing large scale reanalysis.   Unlike Matataki, kallisto is robust to read errors and it delivers transcript-level estimates that can be later summed to gene level, if you so desire.  It also provides bootstrapped estimates of technical variation, which helps to understand the biological variation as well.   And finally, kallisto integrates nicely with sleuth for DE testing and visualization.

Developers versus consumers of bioinformatics analysis tools

Life in the middle

As a bioinformatics applications scientist, I work in a middle ground between those who develop code for analyzing next-gen sequencing data and those who consume that analysis.   The developers are often people trained in computer science, mathematics, and statistics.   The consumers are often people trained in biology and medicine.    There is some overlap, of course, but if you’ll allow a broad generalization, I think the two groups (developers and consumers) are separated by large cultural differences in science.

Ensemble focus vs. single-gene focus

I think one of the biggest differences from my experience is the approach to conceptualizing the results of a next-gen seq experiment (e.g., RNA-seq).    People on the methods side tend to think in terms of ensembles and distributions.   They are interested in how the variation observed across all 30,000 genes can be modeled and used to estimate differential expression.   They are interested in concepts like shrinkage estimators, bayesian priors, and hypothesis weighting.  The table of differentially expressed features is thought to have meaning mainly in the statistical analysis of pathway enrichment (another ensemble).

Conversely, biologists have a drastically different view.   Often they care about a single gene or a handful of genes and how those genes vary across conditions of interest in the system that they are studying.  This is a rational response to the complexity of biological systems; no one can keep the workings of hundreds of genes in mind.  In order to make progress, you must focus.  However, this narrow focus leads investigators to sometimes cherry pick results pertaining to their ‘pet’ genes.  This can invest those results with more meaning than is warranted from a single experiment.

The gene-focus of biologists also leads to clashes with the ensemble-focus of bioinformatics software.  For example, major DE analysis packages like DESeq2 do not have convenience functions to make volcano plots, even though I’ve found that those kinds of plots are the most useful and easiest to understand for biologists.   Sleuth does have a volcano plotting function, but doesn’t allow for labeling genes.  In my experience, however, biologists want a high-res figure with common name gene labels (not ensemble transcript IDs) that they can circle and consider for wet lab validation.

New software to address the divide?

I am hopeful about the recent release of the new “bcbioRNASeq” package from Harvard Chan school bioinformatics (developers of ‘bcbio’ RNA-seq pipeline software).   It appears to be a step towards making it easier for people like me to walk on both sides of this cultural divide.    The package takes all of the outputs of the ‘bcbio’ pipeline and transforms them into an accessible S4 object that can be operated on quickly and simply within R.

Most importantly, the ‘bcbioRNASeq’ module allows for improved graphics and plotting that make communication with biologists easier.  For example, the new MA and volcano plots appear to be based on ‘ggplot2’ graphics and are quite pretty (and have text label options(!), not shown here):

I still need to become familiar with the ‘bcbioRNASeq’ package, but it looks quite promising.   ‘pcaExplorer‘ is another R/Bioconductor package that I feel does a great job of making accessible RNA-seq reports quickly and easy.  I suspect we will see continued improvements to make plots prettier and more informative, with an increasing emphasis on interactive, online plots and notebooks rather than static images.

Artificial neural network classification of copy number variation, part 1

In this series of posts, I want to describe some work I’ve been doing attempting to apply an artificial neural network model to the problem of classifying copy number variation in a clinically-important gene, CYP2D6.   I will be presenting a poster-paper on this topic at the 2018 ISMB/ISCB meeting in Chicago.

Here is the abstract from the poster:

Pharmacogenomics is a rapidly developing field that aims to deliver on the promise of personalized medicine by guiding pharmacological intervention using an understanding of a patient’s individual genotype for drug metabolism.  By avoiding ineffective or dangerous treatments, patient outcomes could be dramatically improved and hospital costs reduced. We have developed a sequencing-based pharmacogenomics screening panel that uses targeted capture to perform deep sequencing of 200+ critical drug metabolism genes.  Assessing copy-number variation is a critical part of correctly interpreting genotypes in key drug metabolism genes, such as CYP2D6.  Historically, this has been a time-consuming step in our clinical pipeline involving large amounts of expert analyst time and data visualization.  This study presents a novel application of an artificial neural network (ANN) machine learning algorithm to learn the complex patterns in CNV data.  The result is a trained network that can quickly and accurately classify copy number events according to known training categories in the CYP2D6 gene.  We show that a simple, one hidden-layer network is sufficient to achieve the extremely high accuracy and low false-positive rate required in a high-throughput clinical setting.

Motivating factors

Motivating this work is the fact that interpreting copy number variation data from targeted capture sequencing is difficult owing to several factors.  First, the data are noisy owing to biases in capture efficiency and GC content.  Second, the copy number variation events in a gene like CYP2D6 are complex and subtle, but have dramatic impacts on the functional status of the gene.  Third, most copy number variation detection methods are optimized for whole genome sequencing with smooth and even coverage across each chromosome.

The result is that, although most of the variant calling and interpretation is automated, a person still has to sit down with the copy number variation copy-ratio plots and make a manual determination of genotype for the CYP2D6 gene (and other genes if they contain clinically-actionable CNVs).   Below is one such copy-ratio plot that illustrates the problems we face:

copy ratio plot
A copy-ratio plot for 9 clinical samples from our targeted sequencing pipeline at the CYP2D6 gene and CYP2D7 pseudogene.  Note the noise in the data.

You can see from the plot that one sample (red) has what appears to be a deletion (copy ratio ~ 0.5).   We do look at these plots individually, but the problem with noise and complexity remains.  This means that the analyst must be, in effect, a domain expert on each gene with a clinically-actionable copy number variation.   This limits pipeline throughput and is impractical if the goal is to process hundreds or thousands of patient samples per month.

A role for machine learning?

It occurred to me that the copy ratio data fall into distinct patterns that a human (like myself) learns by eye and with experience.  These patterns could be classifiable by machine-learning methods.   I considered applying a convolution neural network (CNN) to the plot above.

However, I only had 175 samples in my test/train set drawn from real patients and CEPH/Coriell depositories.   Therefore, I thought it would be best to start with simple models and go from there.  To establish the ground truth CYP2D6 genotype, the samples had all be processed by our current pipeline methods, with many (but not all) confirmed by Taqman assay.

It began to dawn on me that I first needed a better representation of the training data.   I liked the output of the CNV-kit method (below) better than the method from our pipeline (above):

Visualization of CNV-kit output at CYP2D6. The gray and orange lines are not important to this discussion.

The important part of the plot above are the gray dots, representing copy ratio values across 19 different bins or segments along the CYP2D6/2D7 gene and pseudogene.   CNV-kit also had the advantage of being well-documented, fast to run, and available as open-source software on github.

Tidying the training set

The data plotted above looked like this after I did some wrangling and tidying in python:

Copy ratio data from CNV-kit output at the CYP2D6 and 2D7 gene/pseudogene.  There are five columns to the right (four not shown) that contain “one-hot” encoded ground-truth for model training.

With the data in “tidy” form (each column a variable, each row an observation, thank you Hadley Wickham), I was ready to train some machine learning models to see if I could classify common copy number variation events in CYP2D6 and bypass the time-consuming visualization step.

I had no idea if this would work, given that there are only 175 samples (total) to train on and some of the more rare copy number variation events only have a handful of examples.  In part 2, I will talk more about how I tried a LASSO regression model, which, while performing well, failed to yield the high accuracy needed for an automated clinical pipeline.  I then tried a simple one-layer neural network approach.  I’ll talk about the surprisingly good performance of neural network approach and also future directions for this project.