How not to use IPython.parallel on a laptop

In this post I want to focus on an aspect of using the IPython.parallel implementation that may be confusing to new users.

In the IPython.parallel documentation, one of the first things you do to show that you have started the parallel python engines is a call to python’s “map” method with a lambda function that takes x to the 10th power over a range of x.

In serial (non-parallel) form that is as follows:

serial_result = map(lambda x:x**10, range(100))

Then, you do the same in parallel with the python engines you’ve started:

parallel_result = x:x**10, range(100))

Then, you assert that the results are the same:

assert serial_result == parallel_result


This works fine, but there is a problem. You would probably never actually use an IPython.parallel client for work like this. Given that the documentation is aimed at introducing new users, it is a bit confusing to present this simple example without the caveat that this is not a typical use case.

Here is why you’d never actually code this calculation in parallel:

In [8]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 9.91 ms per loop

In [9]: %timeit x:x**10, range(3000))
1 loops, best of 3: 22.8 s per loop


Notice that the parallel version of this calculation over a range of just 3000, took 22 secs to complete! That is 2,300 times slower than just using one core and the built-in map method.

Apparently, this surprising result is because there is a huge amount of overhead associated with distributing the 3000 small, very fast jobs in the way I’ve written statement [9] above.   Every time the job is distributed to an engine, the function and data have to be serialized and deserialized (“pickled”), if my understanding is correct.

In response to my StackOverflow question on this issue, Univerio helpfully suggested the following more clever use of parallel resources (he is using 6 cores in this example):

In [7]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 3.17 ms per loop

In [8]: %timeit i:[x**10 for x in range(i * 500)], range(6))  # range(6) owing to 6 cores available for work
100 loops, best of 3: 11.4 ms per loop

In [9]: %timeit i:[x**10 for x in range(i * 1500)], range(2))
100 loops, best of 3: 5.76 ms per loop

Note that what Univerio is doing in line [8] is to distribute equal shares of the work across 6 cores. Now the time to complete the task is within the same order of magnitude as the single-threaded version. If you use just two tasks in example [9], the time is cut in half again owing to less overhead.

The take-home message is that if you’re going to expend the overhead necessary to setup and start multiple IPython.parallel engines and distribute jobs to them, the jobs need to be more resource-consuming than just a few ms each.  And you should try to make as few function calls as possible.  Each call should do as much work as possible.