Artificial neural network classification of copy number variation, part 2

Recap:

Welcome to the second part of this post series on building artificial neural network models for copy number classification.  In the first part, I described the problem with interpreting copy-ratio plots to find clinically-relevant CNV events.  The data from targeted capture deep sequencing are noisy and biased, and finding clinically-relevant genotypes in genes that have CNVs requires the analyst to visualize the CNV event and assign a classification on the basis of experience and expert knowledge.

The LASSO model

Once my training data were in place (see part 1), I used a multiple linear regression LASSO model as a machine-learning benchmark. I did this to determine whether a more powerful neural network model would be warranted. The LASSO model uses an "L1" penalty to perform feature selection, setting some coefficients to zero as warranted by the data. There is ample precedent for applying this type of model in bioinformatics settings where the goal is to maximize predictive power without overfitting.

I fit the LASSO to the data, with 33% held out for validation. The best fit was obtained with the alpha parameter set to 0.001. Ten-fold cross-validation (alpha = 0.001) yielded an accuracy of 76%. These results are surprisingly good, given the complexity of the CNV signals in the noisy data. Unfortunately, 76% accuracy is simply not good enough for an automated method that will be used to predict genotypes in clinical data.
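
For readers who want to try something similar, here is a minimal sketch of this kind of LASSO benchmark using scikit-learn. The placeholder data, variable names, and the rounding of the regression output to class labels are my illustration, not necessarily the exact approach I used:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the real set:
# 175 samples x 19 copy-ratio probes, 5 integer-coded CNV classes
rng = np.random.default_rng(0)
X = rng.normal(size=(175, 19))
y = rng.integers(0, 5, size=175)

# Hold out 33% of the data for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = Lasso(alpha=0.001)  # the L1 penalty zeroes out uninformative probes
model.fit(X_train, y_train)

# Round the continuous regression output to the nearest class label
pred = np.clip(np.rint(model.predict(X_test)), 0, 4).astype(int)
print("held-out accuracy: {:.2f}".format((pred == y_test).mean()))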

The ANN model

Next, I decided to construct an artificial neural network model. My goal was to keep the model as simple as possible while still reaching the very high classification accuracy needed for clinical work. To that end, I constructed a model with a single hidden layer and 19 input nodes, corresponding to the 19 copy-ratio probes in the CNV data. The output layer contained five nodes, corresponding to the five classes of defined CNV events or other events (for example, a very distinct sequencing artifact that kept appearing in the data).


In between the input and output layers I constructed a 10-node hidden layer. A single-hidden-layer network is the simplest form of ANN, and I tried to keep the number of hidden-layer nodes to a minimum as well. Specific details about the model, hyper-parameter tuning, and the code will be available in the near future when I put a pre-print of this work on bioRxiv.
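
Until the pre-print is available, here is a rough sketch of what a 19-10-5 network can look like in Keras. The framework, activations, and optimizer shown are my assumptions for illustration, not necessarily what I used:

from tensorflow import keras

# 19 copy-ratio inputs -> one 10-node hidden layer -> 5 output classes
model = keras.Sequential([
    keras.Input(shape=(19,)),                     # 19 copy-ratio probes
    keras.layers.Dense(10, activation="relu"),    # single 10-node hidden layer
    keras.layers.Dense(5, activation="softmax"),  # probabilities over 5 classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()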

Model training and cross-validation

I trained the model on the 175-sample dataset and on a 350-sample "synthetic" dataset created by adding Gaussian noise to the real data. The results are shown below, across 250 training epochs.
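
The augmentation step works roughly along these lines (a sketch; the arrays X and y stand in for the real data, and the noise scale is an arbitrary placeholder rather than the value I used):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(175, 19))   # stand-in for the 175 x 19 copy-ratio matrix
y = rng.integers(0, 5, size=175) # stand-in for the 175 class labels

# Two jittered copies of each real sample -> 350 synthetic samples
noise_scale = 0.05  # placeholder; tune to match the noise in the real data
X_synth = np.vstack([X + rng.normal(scale=noise_scale, size=X.shape)
                     for _ in range(2)])
y_synth = np.concatenate([y, y])
print(X_synth.shape)  # (350, 19)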

When the ANN model was tested with 10-fold cross-validation, the accuracy reached 96.5% (+/- 5.4%). This is obviously a big improvement on the LASSO model, and it reaches a level of accuracy that is good enough for clinical pipelines (with the caveat that low-confidence predictions will still be checked "by hand").

Below, I’m showing a sample of the model output (left) and ground truth (right) from the test data. The numbers (and colors) of the boxes correspond to the model’s predicted probability for that classification. You can see that most CNV events are called with high probability, but several (yellow boxes) are called correctly with lower probability. One event (red box) is called incorrectly with high probability.

Conclusions and caveats

Going into this project, I had no idea whether the ANN model would be able to make predictions on the basis of so few examples in the training set. The classic examples you see for ANN/CNN models rely on handwriting-recognition training sets with 10,000 or more images. So I was surprised when the model did very well with extremely limited training data. Since this method was developed for a clinical pipeline, it can be improved as the pipeline generates new training data with each new patient sample. We would need many thousands of samples through our "legacy" pipeline to see enough examples of the rare star-allele events in CYP2D6 to classify them, which is why I limited my CNV calling to three star alleles.

The low-confidence, true-positive predictions concern me less than the high-confidence false negative. Missing a real CNV that affects CYP2D6 function, and is therefore clinically relevant, is very dangerous: it can lead to incorrect prescribing and adverse drug reactions for the patient. I really want to understand why the method makes predictions like this, and how to fix it. Unfortunately, I believe solving this problem will require a lot more training data, and that is something I lack.

My goals for this project now are 1) to publish a preprint on bioRxiv describing this work and 2) to obtain some additional training/test datasets. Because our pharmacogenomics test is not generating the kind of volume we expected, I may have to look around for another gene with clinically-relevant CNV events to test this method further. For example, we do have an NGS-based test of hearing and deafness genes with thousands of validated patient samples. One gene, STRC, has relevant CNVs that are complex and require analyst visualization to detect. This may be a good system for follow-up refinement of this type of model.

Are deep neural nets “Software 2.0”?


Recent blog posts by Andrej Karpathy at Medium.com and Pete Warden at PeteWarden.com have caused a paradigm shift in the way I think about neural nets. Instead of thinking of them as powerful machine-learning tools, the authors suggest that we should think of neural nets, and in particular deep convolutional nets, as 'self-writing programs.' Hence the term "Software 2.0."

It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data than to explicitly write the program. A large portion of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze and visualize data that feeds neural networks.   — Andrej Karpathy, Medium.com

I found this to be a dramatic reversal in my thinking about these techniques, but it opens up a deeper understanding and is much more intuitive. The fact is that combinations of artificial neurons can be used to model any logical operation. You can therefore conceptualize training a neural net as searching program space for an optimal program that behaves in the way you specify: you provide the inputs and desired outputs, and the model searches for the optimal program.
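
To make the "any logical operation" point concrete, here is a toy example (mine, not from the posts) showing that a single artificial neuron computes AND or OR depending only on its weights and bias:

import numpy as np

def neuron(x, w, b):
    # A single artificial neuron with a step activation function
    return int(np.dot(w, x) + b > 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x)
    and_out = neuron(x, np.array([1.0, 1.0]), -1.5)  # fires only for (1, 1)
    or_out = neuron(x, np.array([1.0, 1.0]), -0.5)   # fires for any input of 1
    print(tuple(x), "AND:", and_out, "OR:", or_out)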

This stands in contrast to the “Software 1.0” paradigm where the programmer uses her skill and experience to conceptualize the right combination of specific instructions to produce the desired behavior.   While it seems certain that Software 1.0 and 2.0 will co-exist for a long time, this new way of understanding deep learning is crucial and exciting, in my opinion.


Hands-on with cancer mutational signatures, part 2

In this second part of the "Hands On" series, I want to address how to create the input for the MatLab mutational signature framework from the output of my Python code that prepares the SNP data for analysis.

First, let's create a MatLab .mat file for input to the program. The code expects an input file that contains a set of mutational catalogues and metadata about the cancer type and the mutational types and subtypes represented in the data.

Fig 1. The required data types within one .mat file to run the framework.

As you can see from Fig 1, you need to provide a 96 by X matrix, where X is the number of samples in your mutational catalogue. You also need an X by 1 cell array specifying sample names, a 96 by 1 cell array specifying the subtypes (ACA, ACC, ACG, etc.), and a 96 by 1 cell array specifying the types (C>A, C>A, C>A, etc.). These must be in the same order as the rows of the "originalGenomes" matrix or the results won't make sense.

My code outputs .csv files for all of these needed inputs. For example, when you run my Python code on your SNP list, you will get "subtypes.csv", "types.csv", and "samples.csv" files, plus a "samples_by_counts.csv" matrix (i.e., originalGenomes), corresponding to the cell arrays in Fig 1.

Now, the challenge is to get those CSV files into MatLab.  You should have downloaded and installed MatLab on your PC.  Open MatLab and select “Import Data.”

Fig 2. Select the “Import Data” button.

Browse to one of the output CSV files and select it.  It will open in a new window like in Fig 3 below:

Fig 3. The data import window from MatLab.

Be sure to select the correct data type in the “imported data” section.  Also, select only row 1 for import (row 2 is empty).  Once you’re finished, click Import Selection.  It will create a 1×96 cell called “types.”  It looks like this:

Fig 4. The new imported cell data “types.”

We’re almost done, but we have to transpose the cell so that it is 96×1 rather than 1×96. To do this, just double-click it and select “transpose” in the variable editor. Now repeat this process for the other CSV input files, being sure to select “matrix” as the data type for the “samples_by_counts” file. Pay special attention to make sure the dimensions and data types are correct.
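
If you would rather script this step than click through the import GUI, the same .mat file can be assembled from the CSVs in Python with scipy. This is a sketch; double-check that the variable names match those in the framework's example .mat files, since I am reproducing them from Fig 1:

import numpy as np
import scipy.io as sio

def as_cell_column(path):
    # Read a one-row CSV of strings and shape it as an N x 1 column
    # (object-dtype arrays are written out as MatLab cell arrays by savemat)
    vals = np.loadtxt(path, delimiter=",", dtype=str)
    return vals.astype(object).reshape(-1, 1)

mat = {
    "originalGenomes": np.loadtxt("samples_by_counts.csv", delimiter=","),  # 96 x X
    "sampleNames": as_cell_column("samples.csv"),  # X x 1
    "subtypes": as_cell_column("subtypes.csv"),    # 96 x 1 (ACA, ACC, ...)
    "types": as_cell_column("types.csv"),          # 96 x 1 (C>A, C>A, ...)
    "cancerType": "my_cancer_type",                # free-text label
}
sio.savemat("my_catalogue.mat", mat)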

Once you have everything in place, you should be ready to run the mutational analysis framework from the paper. To do this, open the "example2.m" MatLab script included with the download. In the "Define parameters" section, change the file paths to the .mat file you just created:

Fig 5. Define your parameters for the signature analysis.


Here you can see in Fig 5 that I've specified 100 iterations per core, a range of possible signatures from 1 to 8, and the relative paths to the input and output files. The authors say that ~1000 iterations are necessary for accuracy, but I've found little difference in the predictions between 50 and 500 iterations. I would perform as many iterations as possible, given your constraints on time and computational power.

Note also that you may need to change the part of the code that sets up the parallel computing pool. Since MatLab 2014, I believe, the "matlabpool" command has been deprecated. Substituting the "parpool" command works fine for me (Mac OS X 10.10, 8-core MacBook Pro Retina), as follows:

if isempty(gcp('nocreate'))  % no parallel pool is currently open
    parpool('local');        % open the default local pool (replaces matlabpool)
end

This post is getting lengthy, so I will stop here and post one more part later about how to compare the signatures you calculate with the COSMIC database using the cosine similarity metric.

Hands-on with cancer mutational signatures

In my last post, I wrote about the biological context of mutational signatures in cancer and how a recent series of papers has addressed this notion by creating an algorithm for extracting signatures from a catalogue of tumor SNPs.


In this follow-up post, I wanted to offer practical advice, based on my experience, about how to prepare data for mutational signature analysis and how to interpret the results.

First, in order to analyze your SNPs for mutational signatures, you need to derive the trinucleotide (trimer) context of each SNP (i.e., the upstream base and downstream base flanking the SNP), for technical reasons described in the paper linked above.

In order to do this, a reference genome must be queried with the positions of the SNPs to find the bases adjacent to them. One could either query a remote database, like Entrez Nucleotide, or download an entire ~40 GB reference genome and search it locally.

In my code, I opted for the former option: querying the NCBI/Entrez Nucleotide database using HTTP requests. The advantage of this approach is that I can reuse the same code to query multiple reference genomes (e.g., hg38 vs. hg19), depending on the changing needs of future projects.

The relevant section of code is as follows:

from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # placeholder; NCBI requires a contact email

# Example mapping of chromosome names to Entrez Nucleotide (RefSeq) IDs for
# GRCh37/hg19; extend this dictionary to cover all chromosomes
hg19_chrom = {"1": "NC_000001.10", "2": "NC_000002.11"}

def get_record(chrom, pos):
    '''Fetch record, and minus/plus one base from Entrez'''
    try:
        handle = Entrez.efetch(db="nucleotide",
                               id=hg19_chrom[chrom],
                               rettype="fasta",
                               strand=1,
                               seq_start=int(pos) - 1,  # one base upstream
                               seq_stop=int(pos) + 1)   # one base downstream

        record = SeqIO.read(handle, "fasta")
        handle.close()
    except Exception:  # e.g., network errors or an unknown chromosome
        return None

    return record


You can see from the code that I am using a dictionary (‘hg19_chrom’)  to translate between chromosome numbers and their Entrez Nucleotide IDs in the eFetch request.

The disadvantage of this approach is that the Entrez HTTP tools limit the user to 3 queries per second (in fact, this limit is hard-coded into Biopython). Even with my mediocre coding skills, this turns out to be the rate-limiting step. Thus, this code has to run pretty much overnight for any sizable number of SNPs (~50,000 SNPs would take ~4.6 hrs). However, it's easy to let the script run overnight, so this wasn't a deal breaker for me.
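
Once each three-base record comes back, it still has to be converted into one of the 96 trinucleotide categories. Here is a minimal sketch of that conversion; the helper name is hypothetical, and the convention (reporting relative to the pyrimidine of the mutated base pair) follows the paper:

from Bio.Seq import Seq

def trimer_to_category(trimer, alt):
    '''Return (type, subtype), e.g. ("C>T", "ACA"), for a reference trimer and alt base.'''
    ref = trimer[1]  # the mutated base sits in the middle of the trimer
    if ref in "AG":  # purine reference: switch to the reverse-complement strand
        trimer = str(Seq(trimer).reverse_complement())
        alt = str(Seq(alt).reverse_complement())
        ref = trimer[1]
    return "{}>{}".format(ref, alt), trimer

print(trimer_to_category("ACA", "T"))  # ('C>T', 'ACA')
print(trimer_to_category("TGT", "A"))  # ('C>T', 'ACA'), via reverse complement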

In the next two posts on this topic, I will address how to create the MatLab .mat file from the output of this script and how to compare the signatures generated by the MatLab framework to the COSMIC reference database in a non-biased way.


Spot the dancing gorilla to code better Python

OK, OK, I know the title of this post falls into the gray area between informative and “click-bait.” 

However, now that you’re here, watching the following talks by Python Core Developer and coding guru Raymond Hettinger will be both immediately useful and highly entertaining! 

PyCon 2015 — Beyond PEP 8

Can you spot the dancing gorilla in your code?

PyCon 2013 — Class Development Toolkit

From Mom’s basement to a loft in SOMA, Python classes solve your startup woes