Spot the dancing gorilla to code better Python

OK, OK, I know the title of this post falls into the gray area between informative and “click-bait.” 

However, now that you’re here, watching the following talks by Python Core Developer and coding guru Raymond Hettinger will be both immediately useful and highly entertaining! 

PyCon 2015 — Beyond PEP8

Can you spot the dancing gorilla in your code?

PyCon 2013 — Class Development toolkit

From Mom’s basement to a loft in SOMA, Python classes solve your startup woes

 

How not to use IPython.parallel on a laptop

In this post I want to focus on an aspect of using the IPython.parallel implementation that may be confusing to new users.

In the IPython.parallel documentation, one of the first things you do to show that you have started the parallel Python engines is call Python’s built-in “map” function with a lambda function that raises x to the 10th power over a range of x.

In serial (non-parallel) form that is as follows:

serial_result = map(lambda x:x**10, range(100))

Then, you do the same in parallel with the python engines you’ve started:

parallel_result = lview.map(lambda x:x**10, range(100))

Then, you assert that the results are the same:

assert serial_result == parallel_result
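A side note for readers on Python 3: the snippets above rely on Python 2 behavior, where the built-in map returns a list. In Python 3 it returns a lazy iterator, so the serial result has to be wrapped in list() before comparing. The sketch below shows this without a running cluster; `parallel_result` is a plain list used as a stand-in for whatever the engines would return, not real IPython.parallel output.

```python
power = lambda x: x ** 10

# In Python 3, map() returns a lazy iterator, not a list.
lazy_result = map(power, range(100))

# Materialize the serial result so it can be compared with a list.
serial_result = list(map(power, range(100)))

# Stand-in for the list returned by the parallel engines (assumption:
# no cluster is running here, so this is computed locally).
parallel_result = [power(x) for x in range(100)]

assert serial_result == parallel_result
# A bare map object never compares equal to a list.
assert lazy_result != parallel_result
```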

 

This works fine, but there is a problem. You would probably never actually use an IPython.parallel client for work like this. Given that the documentation is aimed at introducing new users, it is a bit confusing to present this simple example without the caveat that this is not a typical use case.

Here is why you’d never actually code this calculation in parallel:

In [8]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 9.91 ms per loop

In [9]: %timeit lview.map(lambda x:x**10, range(3000))
1 loops, best of 3: 22.8 s per loop

 

Notice that the parallel version of this calculation, over a range of just 3000, took 22.8 seconds to complete! That is about 2,300 times slower than using one core and the built-in map function.
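The arithmetic behind that comparison can be checked directly from the two %timeit figures quoted above. The variable names are mine; the numbers are the ones in the transcript.

```python
serial_seconds = 9.91e-3     # In [8]: 9.91 ms per loop
parallel_seconds = 22.8      # In [9]: 22.8 s per loop
n_items = 3000               # size of the range in both runs

# Overall slowdown of the parallel call relative to the serial one.
slowdown = parallel_seconds / serial_seconds

# Average wall time spent per item in each version.
parallel_per_item = parallel_seconds / n_items
serial_per_item = serial_seconds / n_items

print("slowdown: about {:.0f}x".format(slowdown))
print("parallel: {:.1f} ms per item".format(parallel_per_item * 1e3))
print("serial:   {:.1f} us per item".format(serial_per_item * 1e6))
```

In other words, each item costs a few microseconds to compute but several milliseconds to hand out, which is where the factor of roughly 2,300 comes from.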

Apparently, this surprising result comes from the large overhead of distributing 3000 small, very fast jobs the way statement [9] above does. If my understanding is correct, every time a job is sent to an engine, the function and data have to be serialized (“pickled”) and then deserialized on the other side.
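To get a feel for the serialization side of that overhead, here is a minimal sketch using only the standard library's pickle module. It is a stand-in, not what IPython.parallel does internally: it simply compares sending 3000 values as 3000 separate messages with sending them as one batch, and it ignores the function, the network, and the scheduler entirely.

```python
import pickle

values = list(range(3000))

# One message per value: 3000 separate pickles, each with its own framing.
per_item_messages = [pickle.dumps(v) for v in values]
per_item_bytes = sum(len(m) for m in per_item_messages)

# One message for the whole batch.
batch_message = pickle.dumps(values)
batch_bytes = len(batch_message)

# Both approaches round-trip to the same data.
assert [pickle.loads(m) for m in per_item_messages] == values
assert pickle.loads(batch_message) == values

print("per item:", len(per_item_messages), "messages,", per_item_bytes, "bytes")
print("batched:  1 message,", batch_bytes, "bytes")
```

The byte counts are the smaller part of the story; the number of messages matters more, because every message also pays a fixed dispatch cost.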

In response to my StackOverflow question on this issue, Univerio helpfully suggested the following more clever use of parallel resources (he is using 6 cores in this example):

In [7]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 3.17 ms per loop

In [8]: %timeit lview.map(lambda i:[x**10 for x in range(i * 500)], range(6))  # range(6) owing to 6 cores available for work
100 loops, best of 3: 11.4 ms per loop

In [9]: %timeit lview.map(lambda i:[x**10 for x in range(i * 1500)], range(2))
100 loops, best of 3: 5.76 ms per loop

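Note that what Univerio is doing in line [8] is batching the work into 6 tasks, one per core, instead of 3000 tiny ones. (As written, range(i * 500) gives task i the first i * 500 values rather than its own slice of 500, so the shares are not exactly equal and the output differs from the serial result, but the effect of batching on the timing is clear.) Now the time to complete the task is within the same order of magnitude as the single-process version. If you use just two tasks, as in example [9], the time is cut in half again, owing at least in part to less overhead.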
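One way to give every task its own equal, contiguous slice of the range is sketched below. The helper names (`split_evenly`, `power_chunk`, `chunked_map`) are hypothetical, and the built-in map is used as the map function so the example runs without a cluster. The assumption is that a blocking parallel map (for example a view's map called in blocking mode) could be passed in its place.

```python
from itertools import chain

def split_evenly(n_items, n_chunks):
    """Return (start, stop) bounds for n_chunks contiguous, near-equal slices."""
    base, extra = divmod(n_items, n_chunks)
    bounds = []
    start = 0
    for k in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def power_chunk(bounds):
    """Do one task's worth of work: x**10 for every x in [start, stop)."""
    start, stop = bounds
    return [x ** 10 for x in range(start, stop)]

def chunked_map(map_fn, n_items, n_chunks):
    """Apply power_chunk to each slice with map_fn, then flatten in order."""
    chunks = map_fn(power_chunk, split_evenly(n_items, n_chunks))
    return list(chain.from_iterable(chunks))

serial_result = [x ** 10 for x in range(3000)]
chunked_result = chunked_map(map, 3000, 6)   # 6 tasks of 500 items each
assert chunked_result == serial_result
```

With this split, only 6 function calls are dispatched, and the flattened output matches the serial result item for item.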

The take-home message is that if you’re going to pay the overhead needed to set up and start multiple IPython.parallel engines and distribute jobs to them, each job needs to consume far more than a few ms of compute time. Make as few function calls as possible, and have each call do as much work as possible.
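A rough cost model makes the trade-off concrete. This is only a sketch: the function name and both constants are hypothetical values picked for illustration, not measurements, and the model assumes a fixed dispatch cost per call plus perfectly divided computation.

```python
def estimated_seconds(n_calls, overhead_per_call, n_items, seconds_per_item, n_engines):
    """Rough wall time: a fixed cost per dispatched call, plus the
    computation itself divided evenly across the engines."""
    return n_calls * overhead_per_call + (n_items * seconds_per_item) / n_engines

OVERHEAD = 5e-3   # assumed: 5 ms to dispatch one call (hypothetical)
PER_ITEM = 3e-6   # assumed: 3 microseconds of real work per item (hypothetical)

serial = 3000 * PER_ITEM
one_call_per_item = estimated_seconds(3000, OVERHEAD, 3000, PER_ITEM, 6)
one_call_per_engine = estimated_seconds(6, OVERHEAD, 3000, PER_ITEM, 6)

print("serial:              {:.4f} s".format(serial))
print("one call per item:   {:.4f} s".format(one_call_per_item))
print("one call per engine: {:.4f} s".format(one_call_per_engine))
```

Under these assumptions batching removes almost all of the overhead, yet the serial version still wins, because the total work is tiny compared with even a handful of dispatches. Parallelism only pays off once the per-call work dwarfs the per-call overhead.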

High performance computing versus high throughput

Xserve G5 supercomputer. Image credit: Christopher Bowns, Flickr

 

Two approaches to scientific computing

The terms “high performance computing” (HPC) and “high throughput computing” (HTC) might sound interchangeable to those not familiar with scientific computing, but they denote two very different approaches to computing.  I’m going to describe the difference below (with the caveat that I have only a layman’s understanding of this field).

High throughput computing is for many smaller tasks

HTC is a computing approach that aims to make available a large number of computers to quickly accomplish tasks that are easily broken up into smaller, independent components.  For example, if you have to process 100 video clips, and each one takes ~1 hr, then you would need ~100 hrs of computing time on your laptop.

However, if you had 100 laptops, you could theoretically do the task in 1 hr assuming that you could instantly command each one to begin the processing task (in reality, of course, you’d have to run around setting up the task on each computer which could take longer than the compute time).  The point is this: each video processing task is independent of the others.
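The back-of-the-envelope estimate above generalizes to any number of machines. The sketch below uses a hypothetical helper, `wall_hours`, and assumes every task takes the same time and that setup and data transfer cost nothing.

```python
import math

def wall_hours(n_tasks, hours_per_task, n_machines):
    """Idealized wall-clock time for independent, equal-length tasks.

    Tasks run in "waves": each machine takes one task at a time, so the
    number of waves is n_tasks / n_machines rounded up.
    """
    waves = math.ceil(n_tasks / n_machines)
    return waves * hours_per_task

print(wall_hours(100, 1, 1))     # one laptop
print(wall_hours(100, 1, 100))   # one laptop per clip
print(wall_hours(100, 1, 30))    # 30 laptops need four waves
```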

It is these types of tasks that HTC aims to address.  An HTC system provides many hundreds or thousands of networked CPUs in a cluster, plus a software application that can automatically track and distribute hundreds of tasks (called a DRM, or distributed resource manager). An HTC user can submit a task such as the video processing example described above and have it automatically farmed out to 100 compute nodes for processing (in the HTC world this is called a “pleasantly parallel” problem, more often known as “embarrassingly parallel”).  Once each node completes, the data are copied back into the user’s home folder, and it appears to the user that they have just used an extremely fast computer, when in fact they have used 100 computers working simultaneously.
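The farm-out pattern can be imitated on a single machine with the standard library. This is a sketch of the idea only, not of a real DRM: a thread pool stands in for the cluster, and `process_clip` is a hypothetical placeholder for the hour-long video job.

```python
from concurrent.futures import ThreadPoolExecutor

def process_clip(clip_name):
    """Hypothetical placeholder for an expensive, independent job."""
    return clip_name.upper()

# 100 independent inputs; no task needs another task's result.
clips = ["clip_{:03d}".format(i) for i in range(100)]

# The pool plays the role of the scheduler: it hands tasks to free
# workers and collects the results in the original order.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(process_clip, clips))

print(len(results), results[:3])
```

Because the tasks never talk to each other, the only coordination needed is handing out inputs and gathering outputs, which is exactly what makes this class of problem a good fit for HTC.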

High performance computing is for difficult computational problems

Now, however, consider the case of a computational task where each subunit is not independent of all of the others.  One that I am intimately familiar with is Molecular Dynamics (MD) simulations of protein structure and dynamics.  In MD simulations, an algorithm simulates the atomic motions of a protein molecule immersed in a box of water molecules over a very short timescale (somewhere on the order of a microsecond).  Even with the short timescale, this is a very compute-intensive task.  But because each atom in the protein interacts with many other atoms in the system, it is a task that can’t be neatly broken down into independent components in the way that video processing can be.   You can’t simply give each atom to a separate compute node.  In effect, MD simulation is a single, extremely resource-intensive computation.

Enter high performance computing.  In HPC (also called supercomputing), the aim is to build hardware and software that are focused on peak computing capability (i.e., speed) and extremely fast interconnectedness, rather than on the number of simultaneous tasks that can be accomplished.   The “high performance” part of HPC comes about from the technological focus on networking the computational nodes together with extremely fast connections so that communicating data and messages back and forth does not become a significant bottleneck to completing a large-scale computation.

On the software side of HPC, message-passing libraries that implement the MPI standard have been developed that allow simulations to be “parallelized” (i.e., broken down into smaller pieces that run at the same time).  These smaller pieces (for MD simulation, the splitting scheme is called “domain decomposition”) are then farmed out to the compute nodes of an HPC supercomputer, where they exchange data in real time so that each part of the simulation “knows” about the results from every other part.  In this way, the velocities and positions of certain atoms of a protein can be influenced by all of the other velocities and positions of atoms even if they are being simulated on different compute nodes.
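A toy example shows why the pieces have to keep exchanging data. It is a deliberately simplified sketch, not how an MD package or an MPI program is written: particles sit on a line, the "force" is a made-up inverse-square attraction, and the two "domains" are just index ranges processed one after the other in the same process.

```python
def forces(positions, indices):
    """Toy 1D pairwise force on each particle in `indices`.

    Each particle feels every other particle, so computing the force on
    one particle requires the positions of ALL particles.
    """
    result = {}
    for i in indices:
        total = 0.0
        for j, xj in enumerate(positions):
            if j != i:
                d = xj - positions[i]
                total += d / abs(d) ** 3   # inverse-square attraction, toy units
        result[i] = total
    return result

positions = [0.0, 1.0, 2.5, 4.0, 6.0, 7.5]

# Reference: compute everything in one piece.
all_at_once = forces(positions, range(len(positions)))

# "Domain decomposition": each domain owns half of the particles...
domains = [range(0, 3), range(3, 6)]
decomposed = {}
for domain in domains:
    # ...but still needs the current positions from the other domain.
    # On a real machine this is the data that crosses the interconnect.
    decomposed.update(forces(positions, domain))

assert decomposed == all_at_once
```

Unlike the video clips, neither domain can finish a step without data from the other, and that exchange repeats at every time step, which is why the speed of the interconnect matters so much in HPC.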