I wanted to demonstrate further how powerful and straightforward the pandas library is for data analysis. A good example comes from the book “Bioinformatics Programming using Python,” by Mitchell Model. While this is an excellent reference book on Python programming, it was written before pandas was in widespread use as a library.
In the “Extended Examples” on p. 158 of Chapter 4, the author demonstrates some code to read in a text file containing the names of enzymes, their restriction sites, and the patterns that they match. The code takes the text file, cleans it up, and makes a dictionary that is searchable by key. This is done using only core Python tools, and it looks like this (note: I am using Python 2.7, hence the need to import “print_function” from “__future__”):
Hold onto your seats because you can do all of that and more with just 5 lines of code using pandas (if you don’t count the imports):
The read_table function can take regex separators (in this case “any run of whitespace”) when using the “python” engine option. We skip the first 8 rows because they contain no information. The header is then set as the second row after the skipped rows.
I then use a boolean mask to find the rows where “isnull” is true in the “pattern” column. Some rows lack a “site” entry, so when splitting on whitespace pandas found only two data fields and left the third column empty, not knowing there was missing data. Wherever the pattern column is null, I copy the value from the site column into the pattern column, and then replace those site column values with “NaN”.
The first few lines of the ‘rebase’ dataframe object look like this:
Technically, what I just did in pandas is not quite the same thing as the core Python version above. It is in many ways far better. First, all of the blank spaces in the second column are now “NaN” instead of blanks, which makes data analysis easier. Second, the object “rebase” is a dataframe, with access to all of the dataframe methods. It is also indexed by row and has named columns for easier interpretation. The dataframe also automatically “pretty prints” for easy reading, whereas the table created using core Python has to be formatted with additional function definitions to print to stdout or to file in a readable way.
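To make that last point concrete, here is a toy illustration of both styles of access (the two rows below are invented stand-ins for the real REBASE data):

```python
import pandas as pd

# a tiny stand-in for the real "rebase" dataframe (invented rows)
rebase = pd.DataFrame({
    "enzyme": ["AarI", "AatI"],
    "site": ["CACCTGCNNNN", None],
    "pattern": ["CACCTGC", "AGGCCT"],
})

# dict-style lookup by enzyme name, as in the core-Python version
pattern = rebase.set_index("enzyme").loc["AatI", "pattern"]

# but dataframe methods come for free, e.g. counting missing sites
n_missing_sites = rebase["site"].isnull().sum()
```

The dictionary version supports the first kind of query, but anything like the second requires writing another loop by hand.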