How not to use IPython.parallel on a laptop

In this post I want to focus on an aspect of using IPython.parallel that may be confusing to new users.

In the IPython.parallel documentation, one of the first things you do to show that the parallel Python engines have started is to call Python’s built-in “map” function with a lambda that raises x to the 10th power over a range of x values.

In serial (non-parallel) form that is as follows:

serial_result = map(lambda x:x**10, range(100))

Then, you do the same in parallel with the python engines you’ve started:

parallel_result = lview.map(lambda x:x**10, range(100))

Then, you assert that the results are the same:

assert serial_result == parallel_result

 

This works fine, but there is a problem. You would probably never actually use an IPython.parallel client for work like this. Given that the documentation is aimed at introducing new users, it is a bit confusing to present this simple example without the caveat that this is not a typical use case.

Here is why you’d never actually code this calculation in parallel:

In [8]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 9.91 ms per loop

In [9]: %timeit lview.map(lambda x:x**10, range(3000))
1 loops, best of 3: 22.8 s per loop

 

Notice that the parallel version of this calculation, over a range of just 3000, took 22.8 seconds to complete! That is roughly 2,300 times slower than just using one core and the built-in map function.

This surprising result comes from the huge amount of overhead associated with distributing 3000 small, very fast jobs the way I’ve written statement [9] above. Every time a job is distributed to an engine, the function and its data have to be serialized and deserialized (“pickled”), if my understanding is correct.
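To get a feel for the serialization part of that overhead, here is a small sketch. Plain pickle stands in for whatever serializer IPython.parallel actually uses, and it only counts bytes, not network round-trips, so this understates the real gap:

```python
import pickle

data = list(range(3000))

# One serialized message per item, as in the naive per-element map call:
per_item_bytes = sum(len(pickle.dumps(x)) for x in data)

# One serialized message for the whole batch:
one_shot_bytes = len(pickle.dumps(data))
```

The per-item total is several times larger than the one-shot total, and each per-item message in the real system also carries the pickled function plus a round trip to an engine.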

In response to my StackOverflow question on this issue, Univerio helpfully suggested the following more clever use of parallel resources (he is using 6 cores in this example):

In [7]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 3.17 ms per loop

In [8]: %timeit lview.map(lambda i:[x**10 for x in range(i * 500)], range(6))  # range(6) owing to 6 cores available for work
100 loops, best of 3: 11.4 ms per loop

In [9]: %timeit lview.map(lambda i:[x**10 for x in range(i * 1500)], range(2))
100 loops, best of 3: 5.76 ms per loop

Note that what Univerio is doing in statement [8] is distributing the work across the 6 cores as a handful of larger chunks rather than 3000 tiny jobs (strictly speaking the chunks are not equal, since range(i * 500) grows with i, but the principle holds: fewer, larger tasks amortize the distribution overhead). Now the time to complete the task is within the same order of magnitude as the single-threaded version. With just two tasks in example [9], the time is roughly cut in half again, owing to even less overhead.

The take-home message is that if you’re going to expend the overhead necessary to set up and start multiple IPython.parallel engines and distribute jobs to them, each job needs to be more resource-consuming than a few milliseconds. Make as few calls to the engines as possible, and have each call do as much work as possible.
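Here is a sketch of that batching principle with equal-sized chunks. The helper names are my own, and the built-in map stands in for lview.map so the snippet runs without any engines; with IPython.parallel you would pass chunk_power and the chunk list to lview.map instead:

```python
def chunk_power(args):
    # One "job": raise every x in [start, stop) to the 10th power.
    start, stop = args
    return [x ** 10 for x in range(start, stop)]

def make_chunks(n, n_engines):
    # Split range(n) into n_engines contiguous, near-equal pieces.
    step = -(-n // n_engines)  # ceiling division
    return [(i, min(i + step, n)) for i in range(0, n, step)]

chunks = make_chunks(3000, 6)  # 6 chunks of 500, one per core

# Serial stand-in for lview.map(chunk_power, chunks):
partial_results = map(chunk_power, chunks)
parallel_result = [y for part in partial_results for y in part]
```

Six pickled messages instead of 3000, and each engine does a meaningful amount of work per call.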

Practical Fragments blog has reviewed our paper!

Our latest fragment-based drug discovery paper against the p97 ATPase has been noticed and reviewed favorably by the widely-read Practical Fragments blog.

Here is an excerpt from that review:

“The protein p97 is important in regulating protein homeostasis, and thus a potential anti-cancer target. But this is no low-hanging fruit: the protein has three domains and assembles into a hexamer. Two domains, D1 and D2, are ATPases. The third (N) domain binds to other proteins in the cell. All the domains are dynamic and interdependent. Oh, and crystallography is tough. Previous efforts have identified inhibitors of the D2 domain, but not the others. Not to be put off by difficult challenges, a group of researchers at the University of California San Francisco (UCSF) led by Michelle Arkin and Mark Kelly have performed fragment screening against the D1 and N domains, and report their adventures in J. Biomol. Screen.”

High performance computing versus high throughput

[Image: Xserve G5 supercomputer. Image credit: Christopher Bowns, Flickr]

 

Two approaches to scientific computing

The terms “high performance computing” (HPC) and “high throughput computing” (HTC) might sound interchangeable to those not familiar with scientific computing, but they denote two very different approaches to computing.  I’m going to describe the difference below (with the caveat that I have only a layman’s understanding of this field).

High throughput computing is for many smaller tasks

HTC is a computing approach that aims to make available a large number of computers to quickly accomplish tasks that are easily broken up into smaller, independent components.  For example, if you have to process 100 video clips, and each one takes ~1 hr, then you would need ~100 hrs of computing time on your laptop.

However, if you had 100 laptops, you could theoretically do the task in 1 hr assuming that you could instantly command each one to begin the processing task (in reality, of course, you’d have to run around setting up the task on each computer which could take longer than the compute time).  The point is this: each video processing task is independent of the others.

It is these types of tasks that HTC aims to address. By providing many hundreds or thousands of networked CPUs in a cluster, along with a software application that can automatically track and distribute hundreds of tasks (called a DRM, or distributed-resource manager), an HTC system lets a user submit a job like the video-processing example above and have it automatically farmed out to 100 compute nodes (in the HTC world this is called a “pleasantly parallel” problem). Once each node completes, the data are copied back into the user’s home folder, and it appears to the user that they have just used one extremely fast computer, when in fact they have used 100 computers working simultaneously.
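The pleasantly parallel pattern can be sketched in a few lines. The clip-processing function below is a hypothetical stand-in for an hour-long encode, and a local worker pool stands in for the DRM farming tasks out to cluster nodes; the point is the same in miniature: the tasks share no state, so they need no coordination.

```python
from concurrent.futures import ThreadPoolExecutor

def process_clip(clip_id):
    # Hypothetical stand-in for one independent video-processing job;
    # on an HTC cluster this would run on its own node.
    return "clip %d processed" % clip_id

# The pool plays the role of the DRM: it hands each independent task
# to whichever worker is free and collects the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_clip, range(10)))
```

Because no task ever waits on another, adding workers (or cluster nodes) scales the throughput almost linearly, which is exactly the HTC value proposition.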

High performance computing is for difficult computational problems

Now, however, consider the case of a computational task where each subunit is not independent of the others. One that I am intimately familiar with is Molecular Dynamics (MD) simulation of protein structure and dynamics. In MD simulations, an algorithm simulates the atomic motions of a protein molecule immersed in a box of water molecules on a very short timescale (somewhere on the order of a microsecond). Even with the short timescale, this is a very compute-intensive task. But because each atom in the protein interacts with many other atoms in the system, the task can’t be neatly broken down into independent components the way video processing can. You can’t simply give each atom to a separate compute node. In effect, an MD simulation is a single, extremely resource-intensive computation.

Enter high performance computing.  In HPC (also called supercomputing), the aim is to build hardware and software that are focused on peak computing capability (i.e., speed) and extremely fast interconnectedness, rather than on the number of simultaneous tasks that can be accomplished.   The “high performance” part of HPC comes about from the technological focus on networking the computational nodes together with extremely fast connections so that communicating data and messages back and forth does not become a significant bottleneck to completing a large-scale computation.

On the software side of HPC, code libraries like MPI have been developed that allow simulations to be “parallelized” (i.e., broken down into smaller pieces; for MD simulation this splitting is called “domain decomposition”). These smaller pieces are then farmed out to the compute nodes of an HPC supercomputer, and they exchange data in real time so that each part of the simulation “knows” about the results from every other part. In this way, the velocities and positions of certain atoms of a protein can be influenced by all of the other atoms’ velocities and positions, even if they are being simulated on different CPU nodes.
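Here is a toy, plain-Python illustration of that idea (no real MPI; each list stands in for the slice of atoms owned by one rank, and a simple smoothing step stands in for a force update). It shows that a decomposed update with boundary (“halo”) exchange reproduces the single-node computation exactly:

```python
def step_global(vals):
    # One smoothing step over the whole chain (endpoints held fixed);
    # a stand-in for one timestep computed on a single node.
    return [vals[0]] + [
        (vals[i - 1] + vals[i] + vals[i + 1]) / 3.0
        for i in range(1, len(vals) - 1)
    ] + [vals[-1]]

def step_decomposed(vals, n_nodes):
    # Domain decomposition: split the chain into contiguous slices,
    # one per "rank".
    size = len(vals) // n_nodes
    domains = [vals[r * size:(r + 1) * size] for r in range(n_nodes)]
    new_domains = []
    for r, dom in enumerate(domains):
        # Halo exchange: each rank fetches one edge value from each
        # neighbouring rank before it can update its own atoms.
        left = domains[r - 1][-1] if r > 0 else None
        right = domains[r + 1][0] if r < n_nodes - 1 else None
        new = []
        for j, v in enumerate(dom):
            lo = dom[j - 1] if j > 0 else left
            hi = dom[j + 1] if j < len(dom) - 1 else right
            # The global chain endpoints have no neighbour on one side:
            new.append(v if lo is None or hi is None else (lo + v + hi) / 3.0)
        new_domains.append(new)
    return [x for d in new_domains for x in d]

positions = [float(i * i) for i in range(12)]
```

In a real MD code the halo exchange is an MPI message between nodes every timestep, which is why HPC interconnect speed matters so much: the ranks cannot proceed until their neighbours’ boundary data arrive.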

 

Cancer immunotherapy and the role of tryptophan

 

[Image: DL-1-Methyltryptophan (1-MT), a cancer immunotherapy drug candidate]

 

Background

In the body, L-tryptophan is catabolized by an enzyme called Indoleamine 2,3-dioxygenase (IDO) to form a class of molecules known as kynurenines. These compounds have been shown to be immunosuppressive, preventing inflammation and T-cell mobilization. Additionally, depletion of cellular stores of L-tryptophan also appears to induce down-regulation of the immune response.

What does this have to do with cancer immunotherapy?  Interestingly, cancer actively hijacks the IDO pathway to promote immune system suppression and tolerance to tumor cell antigens by overexpressing IDO in the tumor, at host cells in the immediate area of the tumor, and at tumor-draining lymph nodes where T-cells could normally become activated against tumor antigens.

Think of it like a beekeeper using smoke to keep the bees calm as the keeper removes honey from the hive. By upregulating the expression and activity of the IDO pathway, tumors effectively “hide” from the immune system while they grow out of control in the host tissue. But this exploitation of the body’s own immune regulation by cancer also presents a weakness that can be leveraged in the fight against tumor progression.

Inhibiting IDO to enable tumor recognition

Enter 1-methyl-DL-tryptophan (1MT), pictured above.  1MT is known to be an inhibitor of IDO that works presumably by mimicking the natural substrate (although I believe this has not been shown explicitly).   IDO inhibition by 1MT has been shown to work in combination with chemotherapy approaches to limit tumor progression in mouse models.

Adding 1MT to chemotherapy treatments allows the host immune system to mediate a response to the tumor cells, especially in the presence of dying tumor cells undergoing apoptosis and releasing antigen.  By taking away tumor-induced immune tolerance, 1MT inhibition of IDO allows the T-cell system to recognize, attack and destroy cancer cells in synergy with chemotherapy.

Early clinical trials involving 1MT appear to be ongoing, with work being done by NewLink Genetics in Ames, IA.

————–

References

https://en.wikipedia.org/wiki/Indoleamine_2,3-dioxygenase

 

The power of pandas: an example

I wanted to demonstrate further how powerful and straightforward the pandas library is for data analysis.  A good example comes from the book “Bioinformatics Programming using Python,” by Mitchell Model.   While this is an excellent reference book on Python programming, it was written before pandas was in widespread use as a library.

In the “Extended Examples” on p. 158 of Chapter 4, the author demonstrates some code to read in a text file containing the names of enzymes, their restriction sites, and the patterns that they match. The code takes the text file, cleans it up, and builds a dictionary that is searchable by key. This is done using core Python tools only, and it looks like this (note: I am using Python 2.7, hence the need to import “print_function” from “__future__”):

[Screenshot: the core-Python implementation]

The last few lines of output from calling test() are as follows:

[Screenshot: the last few lines of test() output]

Hold onto your seats because you can do all of that and more with just 5 lines of code using pandas (if you don’t count the imports):

[Screenshot: the pandas implementation]

The read_table function can take a regex separator (in this case “any number of whitespace characters”) when using the “python” engine option. We skip the first 8 rows because they carry no information. The header is set as the second row after the skipped rows.

I then use a boolean mask to find the rows where isnull() is true in the “pattern” column. Some rows lack a “site” entry, so pandas found only two data fields when splitting on whitespace and left the third column empty, not knowing that the missing value was in the middle. Wherever the pattern column is null, I copy the missing values into the pattern column from the site column, and then replace those site column values with NaN.
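For readers without the book, here is a minimal sketch of the same approach. The sample rows and column names are my own guesses at the file layout, not the actual rebase data, and I read from an in-memory buffer instead of the file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical excerpt mimicking the enzyme file layout: enzyme name,
# site, and pattern, where some rows lack the middle entry.
sample = io.StringIO(
    "AanI PsiI TTA^TAA\n"
    "AarI CACCTGCNNNN^\n"          # no site entry on this row
    "AasI DrdI GACNNNN^NNGTC\n"
)

# Regex separator ("any number of whitespace characters") requires
# the python parsing engine.
rebase = pd.read_table(sample, sep=r"\s+", engine="python",
                       names=["enzyme", "site", "pattern"])

# Rows with only two fields had their pattern parsed into 'site',
# leaving 'pattern' null; move the value over, then blank out 'site'.
mask = rebase["pattern"].isnull()
rebase.loc[mask, "pattern"] = rebase.loc[mask, "site"]
rebase.loc[mask, "site"] = np.nan
```

The boolean mask plus .loc assignment is the idiomatic pandas way to patch a column conditionally, without writing any explicit loop over rows.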

The first few lines of the ‘rebase’ dataframe object look like this:

[Screenshot: the first few rows of the “rebase” dataframe]

Technically, what I just did in pandas is not quite the same thing as the core Python version above. It is in many ways far better. First, all of the blanks in the second column are now NaN instead of empty strings, which makes data analysis easier. Second, the object “rebase” is a dataframe, with access to all of the dataframe methods. It is also indexed by row and has named columns for easier interpretation. And the dataframe automatically “pretty prints” for easy reading, whereas the table created using core Python has to be formatted with additional function definitions to print readably to stdout or to a file.