Category Archives: python

Hands-on with cancer mutational signatures, part 2

In this second part of the “Hands On” series, I want to address how to create the input for the MatLab mutational signature framework from the output of my python code to prepare the SNP data for analysis.

First, creating a Matlab .mat file for input to the program.   The code is expecting an input file that contains a set of mutational catalogues and metadata information about the cancer type and the mutational types and subtypes represented in the data.

Fig 1. The required data types within one .mat file to run the framework.
Fig 1. The required data types within one .mat file to run the framework.

As you can see from Fig 1., you need to provide a 96 by X matrix, where X is the number of samples in your mutational catalogue.  You also need an X by 1 cell array specifying sample names, a 96 by 1 cell array specifying the subtypes (ACA, ACC, ACG, etc…) and a 96 by 1 cell array specifying the types (C>A, C>A, C>A, etc…).  These must correspond to the same order as specified in the “originalGenomes” matrix or the results won’t make sense.

My code outputs .csv files for all of these needed inputs.  For example, when you run my python code on your SNP list, you will get a “subtypes.csv”, “types.csv”, “samples.csv”, and “samples_by_counts.csv” matrix (i.e., originalGenomes) corresponding to the above cell arrays in Fig 1.

Now, the challenge is to get those CSV files into MatLab.  You should have downloaded and installed MatLab on your PC.  Open MatLab and select “Import Data.”

Fig 2. Select the "Import Data" button.
Fig 2. Select the “Import Data” button.

Browse to one of the output CSV files and select it.  It will open in a new window like in Fig 3 below:

Fig 3. The data import window from MatLab.
Fig 3. The data import window from MatLab.

Be sure to select the correct data type in the “imported data” section.  Also, select only row 1 for import (row 2 is empty).  Once you’re finished, click Import Selection.  It will create a 1×96 cell called “types.”  It looks like this:

Fig 4. The new imported cell data "types."
Fig 4. The new imported cell data “types.”

We’re almost done, but we have to switch the cell to be 96×1 rather than 1×96.  To do this, just double-click it and select “transpose” in the variable editor.   Now you should repeat this process for the other CSV input files, being sure to select “matrix” as the data type for the “samples_by_counts” file.   Pay special attention to make sure the dimensions and data types are correct.

Once you have everything in place you should be ready do run the mutational analysis framework from the paper.   To do this, open the “example2.m” Matlab script included with the download.  In the “Define parameters” section, change the file paths to the .mat file you just created:

Fig 5. Define your parameters for the signature analysis.
Fig 5. Define your parameters for the signature analysis.

 

Here you can see in Fig 5, I’ve specified 100 iterations per core, a number of possible signatures from 1-8, and the relative path to the input and output files.  The authors say that ~1000 iterations is necessary for accuracy, but I’ve found little difference in the predictions between 50-500 iterations.   I would perform as many iterations as possible, given constraints from time and computational power.

Note also that you may need to change the part of the code that deals with setting up a parallel computing pool.  Since MatLab 2014, I believe, the “matlabpool” processing command has been deprecated.   Substituting the “parpool” command seems to work fine for me (Mac OS X 10.10, 8 core Macbook Pro Retina) as follows:

if ( parpool('local') == 0 )

parpool open; % opens the default matlabpool, if it is not already opened

end

This post is getting lengthy, so I will stop here and post one more part later about how to compare the signatures you calculate with the COSMIC database using the cosine similarity metric.

Hands-on with cancer mutational signatures

In my last post, I wrote about the biological context of mutational signatures in cancer and how a recent series of papers has addressed this notion by creating an algorithm for extracting signatures from a catalogue of tumor SNPs.

nature12477-f2

In this follow-up post, I wanted to offer practical advice, based on my experience, about how to prepare data for mutational signature analysis and how to interpret the results.

First, in order to analyze your SNPs for mutational signatures, one needs to derive the trimer context of each SNP (i.e., the upstream base and downstream base flanking the SNP) for technical reasons described in the paper linked above.

In order to do this,  a reference genome must be queried with the positions of the SNPs in order to find those bases adjacent to them.  One could either query a remote database, like Entrez Nucleotide, or download an entire 40GB reference genome and search it locally.

In my code, I opted for the former option:  querying the NCBI/Entrez Nucleotide database using HTTP requests.   The advantage of this approach is that I can reuse the same code to query multiple reference genomes (e.g., hg38 vs. hg19), depending on the changing needs of future projects.

The relevant section of code is as follows:

def get_record(chrom, pos):               
    '''Fetch record, and minus/plus one base from Entrez'''
    try:
        handle = Entrez.efetch(db="nucleotide", 
                    id=hg19_chrom[chrom], 
                    rettype="fasta", 
                    strand=1, 
                    seq_start=int(pos) - 1, 
                    seq_stop=int(pos) + 1)
                       
        record = SeqIO.read(handle, "fasta")
        handle.close()
    except BaseException:
        return None
    
    return record

 

You can see from the code that I am using a dictionary (‘hg19_chrom’)  to translate between chromosome numbers and their Entrez Nucleotide IDs in the eFetch request.

The disadvantage of this approach is that Entrez HTTP tools limits the user to 3 queries per second (in fact this limitation is hard-coded into Biopython).  Even with my mediocre coding skills, this turns out to be the rate-limiting step.   Thus, this code would have to be run pretty much overnight for any sizable number of SNPs (~50,000 SNPs would take ~4.6 hrs).  However, it’s easy to let the script run overnight, so this wasn’t a deal breaker for me.

In the third and final post next two posts on this topic I will address how to create the MatLab .mat file from the output of this script and how to compare the signatures generated by the MatLab framework to the COSMIC reference database in a non-biased way.

 

Spot the dancing gorilla to code better python

OK, OK, I know the title of this post falls into the gray area between informative and “click-bait.” 

However, now that you’re here, watching the following talks by Python Core Developer and coding guru Raymond Hettinger will be both immediately useful and highly entertaining! 

PyCon 2015 — Beyond PEP8

Can you spot the dancing gorilla in your code?

PyCon 2013 — Class Development toolkit

From Mom’s basement to a loft in SOMA, Python classes solve your startup woes

 

How not to use IPython.parallel on a laptop

In this post I want to focus on an aspect of using the IPython.parallel implementation that may be confusing to new users.

In the IPython.parallel documentation, one of the first things you do to show that you have started the parallel python engines is a call to python’s “map” method with a lambda function that takes x to the 10th power over a range of x.

In serial (non-parallel) form that is as follows:

serial_result = map(lambda x:x**10, range(100))

Then, you do the same in parallel with the python engines you’ve started:

parallel_result = lview.map(lambda x:x**10, range(100))

Then, you assert that the results are the same:

assert serial_result == parallel_result

 

This works fine, but there is a problem. You would probably never actually use an IPython.parallel client for work like this. Given that the documentation is aimed at introducing new users, it is a bit confusing to present this simple example without the caveat that this is not a typical use case.

Here is why you’d never actually code this calculation in parallel:

In [8]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 9.91 ms per loop

In [9]: %timeit lview.map(lambda x:x**10, range(3000))
1 loops, best of 3: 22.8 s per loop

 

Notice that the parallel version of this calculation over a range of just 3000, took 22 secs to complete! That is 2,300 times slower than just using one core and the built-in map method.

Apparently, this surprising result is because there is a huge amount of overhead associated with distributing the 3000 small, very fast jobs in the way I’ve written statement [9] above.   Every time the job is distributed to an engine, the function and data have to be serialized and deserialized (“pickled”), if my understanding is correct.

In response to my StackOverflow question on this issue, Univerio helpfully suggested the following more clever use of parallel resources (he is using 6 cores in this example):

In [7]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 3.17 ms per loop

In [8]: %timeit lview.map(lambda i:[x**10 for x in range(i * 500)], range(6))  # range(6) owing to 6 cores available for work
100 loops, best of 3: 11.4 ms per loop

In [9]: %timeit lview.map(lambda i:[x**10 for x in range(i * 1500)], range(2))
100 loops, best of 3: 5.76 ms per loop

Note that what Univerio is doing in line [8] is to distribute equal shares of the work across 6 cores. Now the time to complete the task is within the same order of magnitude as the single-threaded version. If you use just two tasks in example [9], the time is cut in half again owing to less overhead.

The take-home message is that if you’re going to expend the overhead necessary to setup and start multiple IPython.parallel engines and distribute jobs to them, the jobs need to be more resource-consuming than just a few ms each.  And you should try to make as few function calls as possible.  Each call should do as much work as possible.

The power of pandas; an example

I wanted to demonstrate further how powerful and straightforward the pandas library is for data analysis.  A good example comes from the book “Bioinformatics Programming using Python,” by Mitchell Model.   While this is an excellent reference book on Python programming, it was written before pandas was in widespread use as a library.

In the “Extended Examples” on p. 158 of Chapter 4, the author demonstrates some code to read in a text file containing the names of enzymes, their restriction sites, and the patterns that they match.  The code takes the text file, cleans it up, and makes a dictionary that is searchable by key.  This is done using core python tools only and it looks like this (note: I am using Python 2.7 hence the need to import “print_function” from “__future__”):

Screen Shot 2015-01-02 at 12.01.39 PM

The last few lines of output from calling test() is as follows:Screen Shot 2015-01-02 at 12.12.34 PM

Hold onto your seats because you can do all of that and more with just 5 lines of code using pandas (if you don’t count the imports):

Screen Shot 2015-01-02 at 12.10.27 PM

The read_table function can take regex separators (in this case “any number of white spaces”) when using the “python” engine option.  We skip the first 8 rows because they have no information.  The header is set as the second row after the skipped rows.

I then use a boolean mask to find the places where the condition “is_null” is true looking down the “pattern” column. This is because some rows lack a “site” entry, so pandas found only two data fields when separated on whitespace and thus left the third column empty, not knowing there was missing data.   Wherever the pattern column is null, I assign the missing values into the pattern column from the site column.  I then replace the site column values with “NaNs”.

The first few lines of the ‘rebase’ dataframe object look like this:

Screen Shot 2015-01-02 at 12.19.16 PMTechnically, what I just did in pandas is not quite the same thing as the core python version above.  It is in many ways far better.  First, all of the blank spaces in the second column are now “NaN” instead of blanks.  This makes data analysis easier.  Second, the object “rebase” is a dataframe that allows access to all of the dataframe methods.  It is also indexed by row and has named columns for easier interpretation.  The dataframe also automatically “pretty prints” for easy reading, whereas the table created using core python has to be formatted with additional function definitions to print to stdout or to file in a readable way.

 

 

 

Book Review: Python for Data Analysis

Python for data analysis

Introduction

The book “Python for Data Analysis” (O’Reilly Media 2013) by author Wes McKinney is a guide to using the NumPy, matplotlib, and pandas Python libraries for data analysis. The author sets out to provide a template for Python programmers to gain working knowledge of the rapidly maturing Python technologies for data analysis and visualization tasks.   The tone of the book is conversational and focused, with no fluff or filler. The book accomplishes its purpose admirably by providing a concise, meaty, and highly readable tutorial through the essential features of doing data analysis in Python.

McKinney does a skillful job of bringing the Python novice through the requisite background and quickly up to speed doing useful work with pandas without becoming bogged down in introductory Python minutia. In fact, the opening chapter is titled “Introductory Examples” and includes several relatively complex data analysis examples that serve to demonstrate the capabilities of pandas. I found this approach provided me with the motivation to read on into the more detailed and technical chapters.

Why you should listen to Wes McKinney

The author is uniquely suited to write this book, having been the creator and first developer of pandas in the course of his own work as a quantitative analyst at a hedge fund back in 2008. I could tell that the author has a mastery of the subject; he provides many useful insights that could only be gained through real-world experience. The book focuses mainly on the pandas library and its core technologies, the Series and the Dataframe. Both are important because they build on the speed and precision of numpy arrays, while allowing richer, more intuitive and powerful manipulation of data tables.

pandas: it just works the way it should

Another aspect of this book that is so enjoyable is that pandas itself just works the way I would expect it to work. The tools, in my opinion, are constructed to be as convenient and intuitive as possible. I find that pandas behaves very predictably, despite being extremely powerful. Oftentimes, I was able to invent an expression in pandas that behaved exactly as I intended without knowing a priori whether it was possible to do so. There is something very satisfying about a tool that just works and doesn’t require a lot of boilerplate code.

The publisher also provides downloadable iPython notebooks containing the code examples for each chapter. Using these notebooks it was very easy to follow along, running code while reading the chapter. The illustrations in the book also consist almost entirely of matplotlib plots prepared using the code examples. I was able to work up many of the figures, giving me a sense of having gained practical, working knowledge in each chapter.

Python for data analysis? Yes!

I really have nothing negative to say about “Python for Data Analysis”. If forced to find something to change, it would be that the author could have left out the highly-condensed chapter on introductory Python programming found at the end of the book, using the extra space instead to include even more examples of pandas in practical, real-world applications.

For instance, an example on building a data analysis model with interactive graphics for the web would have been welcome. Similarly, a demonstration of approaches for making matplotlib, with its rather utilitarian graphics, more closely resemble the stylistically attractive plots of ggplot2 (the well-known R plotting library) would also have been useful.

After reading this book, however, I have been convinced to transition my data analysis workflow entirely into Python and largely abandon R, which now seems somewhat esoteric and unnecessarily complex by comparison. Overall, I would highly recommend this book to anyone seeking to learn how to use Python for data analysis. It is a valuable reference for scientists, engineers, data analysts, and others who want to leverage the power of Python (and specifically numpy and pandas) for dealing with their data.

The Python ecosystem for beginners, part 2

Welcome to Part 2 of my post on the scientific Python ecosystem (Part 1 is here).   I will describe a few more of the most common and useful libraries that make up the typical Python scientific computing stack.   This is not an exhaustive list by any means, and new libraries are being continually developed by the open source community.

Matplotlib – high quality 2D and 3D plotting

Matplotlib_3d

Matplotlib is a plotting library that aims to make it easy to produce publication quality plots.  In typical Python style, Matplotlib code can be very succinct and yet yield complete, high-quality plots.  The library can generate many types of 2D graphs: regular plots, histograms, scatterplots, pie charts, statistical plots, and contour plots, to name a few.

Matplotlib is organized in a hierarchical manner that allows the user to quickly and easily create plots using high-level commands, while simultaneously allowing power users to delve into the object-oriented programming layer to control minute details of individual plots, should they choose to do so.

Traits – interactive class instances and GUI building

Traits is a powerful package that extends Python type attributes in interesting and useful ways.  For instance, python objects such as classes can have attribute “traits”  that allow for initialization (set an default value), notification (tell another part of the program that a value has changed) and visualization (respond to GUI inputs).   Although it is possible to achieve this using Python properties, Traits reduces a lot of the boilerplate code and streamlines the process.

Chaco – interactive 2D plotting

Chaco is a plotting application toolkit for building rich, interactive plots.   Chaco works with Traits to build object-oriented models of plots that can accept and react to inputs from the GUI.

Cython – speed up your code with C

The easiest way to think about Cython is to imagine it as a superset of the Python language.   That is, all of the normal Python language is there, along with additional commands that allow code that calls back and forth to C/C++ libraries seamlessly.  In Cython, you can also add static type declarations to python functions to get C-level speedups in computation.  Cython code is compiled into C code for execution.  Unlike weave, which allows inline C code but requires that the python code be re-compiled for C during every execution, Cython code is compiled only once (unless there are changes later) meaning that an end user does need to bother with recompiling to run the code as a standalone program.

Using Cython for numerical computation in Python, speedups of 2000X or more above the pure Python equivalent are not uncommon.

SciKit Learn – interactive machine learning

SciKit Learn is a machine-learning library for Python.  It is based on NumPy, SciPy and matplotlib.  There are many algorithms available for performing machine-learning tasks, falling into four main areas: classification, clustering, regression, and dimensionality reduction (principle component analysis).

The Python ecosystem for beginners

When first starting to learn Python, I found the array of package names and libraries a bit bewildering and confusing.  In this post I will enumerate many of the most common and useful parts of the Python computing “ecosystem” and attempt to describe them very briefly.

My aim is to provide some clarity on the situation for new users, as I would have liked to have seen the “30,000 ft view” when I started learning not long ago.   So without further ado, part 1 of the python ecosystem overview:

Python -high-level, interpreted language

ImageCache

Python is an interpreted programming language (itself written in C) that allows you to write very clean and simple code in a fast and human-readable form as compared to lower-level compiled languages (C++, Fortran).

The language is simple, self-consistent and beautiful.   It is relatively easy to learn and is gaining in popularity every year.  The simplicity and ease of use does come with a price as code written in pure Python is generally slower to execute than compiled C/C++ code.

iPython -get code written faster

iPython is an enhanced, interactive Python shell designed to make code development faster.   According to Wes McKinney in his excellent book, “Python for Data Analysis,” iPython is designed to encourage “an execute-explore workflow instead of the typical edit-compile-run workflow of many other programming languages.”

iPython contains the Python interpreter and is ready to execute commands as you enter them.  It is where you run code snippets, examine the outputs, and make iterative improvements.  In that sense, it is like kind of like the UNIX command line.  You don’t write full programs here, you do that in a text editor or IDE (integrated development environment).

It also contains useful features called “magic commands.”  These are commands that are unique to the iPython command line, and are not valid Python code (i.e., you cannot use these commands in stand-alone programs).  Magic commands provide productivity speedups in many useful ways, such as recalling command history, running parts of scripts, timing code, and debugging code interactively.

If you have iPython installed you can invoke it from a regular python shell; however I find it easiest to use it within an IDE that supports iPython.

iPython Notebook -share code and ideas over the web

The iPython notebook is an interactive python format that runs in a web browser or in an IDE.  The Notebook is a flexible and powerful document format that allows python code, text markdown, mathematics equations, and figures to be displayed together in a coherent, inline way.  An iPython notebook could be used to provide all of the steps in a data analysis project, for example, or to teach a programming concept.  It is very useful for sharing code and visualizing results in an interactive, portable document.

NumPy -implement very fast vectorized computation

NumPy (numerical python) is a powerful and fast library for doing numerical computation in Python.  It is based on a data structure called an ndarray.  This lower level structure is faster for computation than regular higher-level python structures like lists and dictionaries, but it is less flexible and behaves in somewhat unintuitive ways.   Functions can be applied across a numpy array all at once and “in place”; this is known as vectorization.  NumPy contains a number of vectorized built-in functions known as “ufuncs” for doing transformations on ndarrrays.

pandas -powerful library for data analysis

pandas is a data analysis package for python; conceived and built initially by Wes McKinney.  It was developed to allow python users to access some of the powerful features of the R statistics language while staying in the python ecosystem.  Prior to the development of pandas, data analysis had to be carried out using the NumPy ndarray structures which are rather difficult for handling messy real world data.

Pandas achieves very fast speeds and efficiency because it is built on top of NumPy and therefore takes advantage of the built-in speed advantage of the low-level ndarray data structures.   However, pandas allows users to create higher-level structures called Series (1D), Dataframes (2D) and Panels (3D), that are more flexible and useful for regular data analysis than raw numpy arrays owing to their ability to contain mixed data types, headers, and indexes.

Pandas also contains many built-in methods for operations on Series, Dataframes, and Panels that allow users to quickly and easily do data aggregation, reductions, and “split-apply-combine” strategies.

SciPy -collection of libraries for a variety of computing applications

SciPy is a collection of scientific algorithms for doing scientific computing.   It is an open-source project under active development.  Some of the packages available include scipy.linalg for doing linear algebra, scipy.stats for statistics, scipy.cluster for clustering (K-means and others), scipy.fftpack for doing fourier transform analysis, scipy.optimize for doing curve fitting and minimization, and scipy.signal for doing signal processing.  There are many more packages in SciPy;  what you end up using will depend on your application and area of interest.

In part 2, I will describe even more libraries and packages that you will encounter as you learn scientific computing with Python.

Importance of list comprehensions in Python

A beginner to python programming is usually taught to use for loops to do a lot things. The temptation  is to bust out a for loop whenever you need to modify a list or string object, but this quickly leads to complex “loops within loops” code architectures that are hard to read (by humans), even though they may work OK.

A simple example:

>>>test_list = [2,4,6,8]

>>>for x in test_list:

…     new_list.append(x  + 1)

>>>new_list

[3,5,7,9]

A better approach is to take advantage of Python’s built-in list comprehension expressions, with the form ‘x for x in y’.

Example:

>>>new_list = [x+2 for x in test_list]

>>>new_list

[4,6,8,10]

This can be expanded to include conditionals, for example:

>>>stripped_list = [line.strip() for line in line_list if line !=””]

You can also loop over multiple elements like this:

>>>seq1=’abc’

>>>seq2=(1,2,3)

>>>[(x,y) for x in seq1 for y in seq2]

[(‘a’,1),(‘a’,2),(‘a’,3),(‘b’,1),(‘b’,2)….(‘c’,3)]