Recent blog posts by Andrej Karpathy at Medium.com and Pete Warden at PeteWarden.com have caused a paradigm shift in the way I think about neural nets. Instead of thinking of them as powerful machine learning tools, the authors instead suggest that we should think of neural nets, and in particular, convolution deep nets, as ‘self-writing programs.’ Hence the term, “Software 2.0.”
It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data than to explicitly write the program. A large portion of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze and visualize data that feeds neural networks. — Andrej Karpathy, Medium.com
I found this to be a dramatic reversal in my thinking about these techniques, but it opens up a deeper understanding and is much more intuitive. The fact is that combinations of artificial neurons can be used to model any logical operation. Therefore you can conceptualize training a neural net as searching programming space for an optimal program that behaves in the way you specify. You provide the inputs and desired outputs, and the model searches for the optimal program.
This stands in contrast to the “Software 1.0” paradigm where the programmer uses her skill and experience to conceptualize the right combination of specific instructions to produce the desired behavior. While it seems certain that Software 1.0 and 2.0 will co-exist for a long time, this new way of understanding deep learning is crucial and exciting, in my opinion.
In this second part of the “Hands On” series, I want to address how to create the input for the MatLab mutational signature framework from the output of my python code to prepare the SNP data for analysis.
First, creating a Matlab .mat file for input to the program. The code is expecting an input file that contains a set of mutational catalogues and metadata information about the cancer type and the mutational types and subtypes represented in the data.
As you can see from Fig 1., you need to provide a 96 by X matrix, where X is the number of samples in your mutational catalogue. You also need an X by 1 cell array specifying sample names, a 96 by 1 cell array specifying the subtypes (ACA, ACC, ACG, etc…) and a 96 by 1 cell array specifying the types (C>A, C>A, C>A, etc…). These must correspond to the same order as specified in the “originalGenomes” matrix or the results won’t make sense.
My code outputs .csv files for all of these needed inputs. For example, when you run my python code on your SNP list, you will get a “subtypes.csv”, “types.csv”, “samples.csv”, and “samples_by_counts.csv” matrix (i.e., originalGenomes) corresponding to the above cell arrays in Fig 1.
Now, the challenge is to get those CSV files into MatLab. You should have downloaded and installed MatLab on your PC. Open MatLab and select “Import Data.”
Browse to one of the output CSV files and select it. It will open in a new window like in Fig 3 below:
Be sure to select the correct data type in the “imported data” section. Also, select only row 1 for import (row 2 is empty). Once you’re finished, click Import Selection. It will create a 1×96 cell called “types.” It looks like this:
We’re almost done, but we have to switch the cell to be 96×1 rather than 1×96. To do this, just double-click it and select “transpose” in the variable editor. Now you should repeat this process for the other CSV input files, being sure to select “matrix” as the data type for the “samples_by_counts” file. Pay special attention to make sure the dimensions and data types are correct.
Once you have everything in place you should be ready do run the mutational analysis framework from the paper. To do this, open the “example2.m” Matlab script included with the download. In the “Define parameters” section, change the file paths to the .mat file you just created:
Here you can see in Fig 5, I’ve specified 100 iterations per core, a number of possible signatures from 1-8, and the relative path to the input and output files. The authors say that ~1000 iterations is necessary for accuracy, but I’ve found little difference in the predictions between 50-500 iterations. I would perform as many iterations as possible, given constraints from time and computational power.
Note also that you may need to change the part of the code that deals with setting up a parallel computing pool. Since MatLab 2014, I believe, the “matlabpool” processing command has been deprecated. Substituting the “parpool” command seems to work fine for me (Mac OS X 10.10, 8 core Macbook Pro Retina) as follows:
if ( parpool('local') == 0 )
parpool open; % opens the default matlabpool, if it is not already opened
This post is getting lengthy, so I will stop here and post one more part later about how to compare the signatures you calculate with the COSMIC database using the cosine similarity metric.
In my last post, I wrote about the biological context of mutational signatures in cancer and how a recent series of papers has addressed this notion by creating an algorithm for extracting signatures from a catalogue of tumor SNPs.
In this follow-up post, I wanted to offer practical advice, based on my experience, about how to prepare data for mutational signature analysis and how to interpret the results.
First, in order to analyze your SNPs for mutational signatures, one needs to derive the trimer context of each SNP (i.e., the upstream base and downstream base flanking the SNP) for technical reasons described in the paper linked above.
In order to do this, a reference genome must be queried with the positions of the SNPs in order to find those bases adjacent to them. One could either query a remote database, like Entrez Nucleotide, or download an entire 40GB reference genome and search it locally.
In my code, I opted for the former option: querying the NCBI/Entrez Nucleotide database using HTTP requests. The advantage of this approach is that I can reuse the same code to query multiple reference genomes (e.g., hg38 vs. hg19), depending on the changing needs of future projects.
The relevant section of code is as follows:
def get_record(chrom, pos):
'''Fetch record, and minus/plus one base from Entrez'''
handle = Entrez.efetch(db="nucleotide",
seq_start=int(pos) - 1,
seq_stop=int(pos) + 1)
record = SeqIO.read(handle, "fasta")
You can see from the code that I am using a dictionary (‘hg19_chrom’) to translate between chromosome numbers and their Entrez Nucleotide IDs in the eFetch request.
The disadvantage of this approach is that Entrez HTTP tools limits the user to 3 queries per second (in fact this limitation is hard-coded into Biopython). Even with my mediocre coding skills, this turns out to be the rate-limiting step. Thus, this code would have to be run pretty much overnight for any sizable number of SNPs (~50,000 SNPs would take ~4.6 hrs). However, it’s easy to let the script run overnight, so this wasn’t a deal breaker for me.
In the third and final post next two posts on this topic I will address how to create the MatLab .mat file from the output of this script and how to compare the signatures generated by the MatLab framework to the COSMIC reference database in a non-biased way.
In this post I want to focus on an aspect of using the IPython.parallel implementation that may be confusing to new users.
In the IPython.parallel documentation, one of the first things you do to show that you have started the parallel python engines is a call to python’s “map” method with a lambda function that takes x to the 10th power over a range of x.
In serial (non-parallel) form that is as follows:
serial_result = map(lambda x:x**10, range(100))
Then, you do the same in parallel with the python engines you’ve started:
This works fine, but there is a problem. You would probably never actually use an IPython.parallel client for work like this. Given that the documentation is aimed at introducing new users, it is a bit confusing to present this simple example without the caveat that this is not a typical use case.
Here is why you’d never actually code this calculation in parallel:
In : %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 9.91 ms per loop
In : %timeit lview.map(lambda x:x**10, range(3000))
1 loops, best of 3: 22.8 s per loop
Notice that the parallel version of this calculation over a range of just 3000, took 22 secs to complete! That is 2,300 times slower than just using one core and the built-in map method.
Apparently, this surprising result is because there is a huge amount of overhead associated with distributing the 3000 small, very fast jobs in the way I’ve written statement  above. Every time the job is distributed to an engine, the function and data have to be serialized and deserialized (“pickled”), if my understanding is correct.
In : %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 3.17 ms per loop
In : %timeit lview.map(lambda i:[x**10 for x in range(i * 500)], range(6)) # range(6) owing to 6 cores available for work
100 loops, best of 3: 11.4 ms per loop
In : %timeit lview.map(lambda i:[x**10 for x in range(i * 1500)], range(2))
100 loops, best of 3: 5.76 ms per loop
Note that what Univerio is doing in line  is to distribute equal shares of the work across 6 cores. Now the time to complete the task is within the same order of magnitude as the single-threaded version. If you use just two tasks in example , the time is cut in half again owing to less overhead.
The take-home message is that if you’re going to expend the overhead necessary to setup and start multiple IPython.parallel engines and distribute jobs to them, the jobs need to be more resource-consuming than just a few ms each. And you should try to make as few function calls as possible. Each call should do as much work as possible.