Category Archives: reviews

Book Review: Python for Data Analysis

Python for data analysis


The book “Python for Data Analysis” (O’Reilly Media 2013) by author Wes McKinney is a guide to using the NumPy, matplotlib, and pandas Python libraries for data analysis. The author sets out to provide a template for Python programmers to gain working knowledge of the rapidly maturing Python technologies for data analysis and visualization tasks.   The tone of the book is conversational and focused, with no fluff or filler. The book accomplishes its purpose admirably by providing a concise, meaty, and highly readable tutorial through the essential features of doing data analysis in Python.

McKinney does a skillful job of bringing the Python novice through the requisite background and quickly up to speed doing useful work with pandas without becoming bogged down in introductory Python minutia. In fact, the opening chapter is titled “Introductory Examples” and includes several relatively complex data analysis examples that serve to demonstrate the capabilities of pandas. I found this approach provided me with the motivation to read on into the more detailed and technical chapters.

Why you should listen to Wes McKinney

The author is uniquely suited to write this book, having been the creator and first developer of pandas in the course of his own work as a quantitative analyst at a hedge fund back in 2008. I could tell that the author has a mastery of the subject; he provides many useful insights that could only be gained through real-world experience. The book focuses mainly on the pandas library and its core technologies, the Series and the Dataframe. Both are important because they build on the speed and precision of numpy arrays, while allowing richer, more intuitive and powerful manipulation of data tables.

pandas: it just works the way it should

Another aspect of this book that is so enjoyable is that pandas itself just works the way I would expect it to work. The tools, in my opinion, are constructed to be as convenient and intuitive as possible. I find that pandas behaves very predictably, despite being extremely powerful. Oftentimes, I was able to invent an expression in pandas that behaved exactly as I intended without knowing a priori whether it was possible to do so. There is something very satisfying about a tool that just works and doesn’t require a lot of boilerplate code.

The publisher also provides downloadable iPython notebooks containing the code examples for each chapter. Using these notebooks it was very easy to follow along, running code while reading the chapter. The illustrations in the book also consist almost entirely of matplotlib plots prepared using the code examples. I was able to work up many of the figures, giving me a sense of having gained practical, working knowledge in each chapter.

Python for data analysis? Yes!

I really have nothing negative to say about “Python for Data Analysis”. If forced to find something to change, it would be that the author could have left out the highly-condensed chapter on introductory Python programming found at the end of the book, using the extra space instead to include even more examples of pandas in practical, real-world applications.

For instance, an example on building a data analysis model with interactive graphics for the web would have been welcome. Similarly, a demonstration of approaches for making matplotlib, with its rather utilitarian graphics, more closely resemble the stylistically attractive plots of ggplot2 (the well-known R plotting library) would also have been useful.

After reading this book, however, I have been convinced to transition my data analysis workflow entirely into Python and largely abandon R, which now seems somewhat esoteric and unnecessarily complex by comparison. Overall, I would highly recommend this book to anyone seeking to learn how to use Python for data analysis. It is a valuable reference for scientists, engineers, data analysts, and others who want to leverage the power of Python (and specifically numpy and pandas) for dealing with their data.