Five Common Problems with Messy Data

Real-world datasets are often quite messy and not well organized for the available data analysis tools.  The data scientist’s job often begins with whipping these messy datasets into shape for analysis.

Listed below are five of the most common problems with messy datasets, according to an excellent paper on “tidy data” by Hadley Wickham:

1) Column headers are values, not variable names

Tables built for presentation often fall into this category, where the column headers are themselves values of a variable rather than variable names.  For example, a table with median income by percentile in columns and US states in rows.
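As a rough sketch of the fix, base R's reshape() can turn the percentile columns into a single "percentile" variable.  The table, its column names (p25, p50, p75) and the numbers are all made up for illustration:

```r
# Hypothetical wide table: one row per state, one column per income percentile.
# All values are invented for illustration only.
wide <- data.frame(
  state = c("Ohio", "Utah"),
  p25   = c(30, 35),
  p50   = c(55, 60),
  p75   = c(90, 95),
  stringsAsFactors = FALSE
)

# Reshape to long form: one row per (state, percentile) pair.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("p25", "p50", "p75"),  # columns whose headers are really values
  v.names   = "income",                # name of the new measurement column
  timevar   = "percentile",            # name of the new variable column
  times     = c(25, 50, 75),           # values taken from the old headers
  idvar     = "state"
)
rownames(long) <- NULL
print(long)
```

The result has three columns (state, percentile, income) and six rows, one per observation.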

2) Multiple variables are stored in one column

An example here would be storing data in columns that combine two variables, like gender and age range.  Better to make two separate columns for gender and age range.
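A minimal sketch of the split in base R, assuming the combined values look like “m_0-14” with an underscore separator (the column names and counts are hypothetical):

```r
# Hypothetical table where one column mixes gender and age range.
combined <- data.frame(
  group = c("m_0-14", "f_0-14", "m_15-24", "f_15-24"),
  count = c(10, 12, 7, 9),
  stringsAsFactors = FALSE
)

# Split each value on the underscore and bind the pieces into a 2-column matrix.
parts <- do.call(rbind, strsplit(combined$group, "_", fixed = TRUE))

# Rebuild the table with one column per variable.
tidy <- data.frame(
  gender    = parts[, 1],
  age_range = parts[, 2],
  count     = combined$count,
  stringsAsFactors = FALSE
)
print(tidy)
```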

3) Variables are stored in both rows and columns

The most complex form of messy data.  For example, a dataset in which measurements from a weather station are stored according to date and time, with the various measurement types (temp, pressure, etc.) in a column called “measurements”.
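One way to untangle this, sketched in base R with made-up readings, is to spread the measurement types out so that each becomes its own column.  The column names here are assumptions, not taken from any real weather dataset:

```r
# Hypothetical weather table: the measurement type is stored as a value in a column.
messy <- data.frame(
  date        = rep(c("2014-01-01", "2014-01-02"), each = 2),
  measurement = rep(c("temp", "pressure"), times = 2),
  value       = c(3.5, 1012, 4.1, 1009),
  stringsAsFactors = FALSE
)

# Spread the measurement types into columns: one row per date.
tidy <- reshape(
  messy,
  direction = "wide",
  idvar     = "date",
  timevar   = "measurement",
  v.names   = "value"
)

# reshape() names the new columns value.temp, value.pressure; drop the prefix.
names(tidy) <- sub("^value\\.", "", names(tidy))
rownames(tidy) <- NULL
print(tidy)
```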

4) Multiple types of observational units are stored in the same table

A dataset that combines multiple unrelated observations or facts in one table.  For example, a clinical trial dataset that puts both treatment outcomes and diet choices in one large table by patient and date.
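A small sketch of the usual remedy, which is to give each kind of observation its own table.  The patients, dates and values below are invented, and the split shown is only one reasonable way to do it:

```r
# Hypothetical combined table: the daily outcome is repeated for every meal.
combined <- data.frame(
  patient = c(1, 1, 1, 2),
  date    = c("2014-03-01", "2014-03-01", "2014-03-02", "2014-03-01"),
  outcome = c("improved", "improved", "stable", "stable"),
  meal    = c("oatmeal", "salad", "pasta", "soup"),
  stringsAsFactors = FALSE
)

# Table 1: one row per patient and date, holding the treatment outcome.
outcomes <- unique(combined[, c("patient", "date", "outcome")])

# Table 2: one row per meal eaten.
diet <- combined[, c("patient", "date", "meal")]

print(outcomes)
print(diet)
```

The two tables can still be joined on patient and date with merge() when an analysis needs both.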

5) A single observational unit is stored in multiple tables

Measurements recorded in different tables split up by person, location, or time.  For example, a separate table of an individual’s medical history for each year of their life. 
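The fix is usually to stack the tables and record what used to be the table name as a new column.  A minimal base R sketch, with two invented yearly tables standing in for real records:

```r
# Hypothetical per-year tables for the same person (values are made up).
history_2012 <- data.frame(visit = "checkup", weight_kg = 70,
                           stringsAsFactors = FALSE)
history_2013 <- data.frame(visit = c("checkup", "flu shot"), weight_kg = c(71, 71),
                           stringsAsFactors = FALSE)

# Keep the tables in a named list so the year travels with each table.
tables <- list("2012" = history_2012, "2013" = history_2013)

# Add a year column to each table, then stack them into one.
combined <- do.call(rbind, lapply(names(tables), function(yr) {
  cbind(year = as.integer(yr), tables[[yr]])
}))
print(combined)
```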

Using R to automate ROC analysis

ROC analysis is used in many types of research.  I use it to examine the ability of molecular docking to enrich a list of poses for experimental hits.  This is a pretty standard way to compare the effectiveness of docking methodologies and make adjustments in computational parameters.

An example ROC plot on randomized data

Normally this kind of plot would take at least an hour to make by hand in Excel, so I wrote a function in R that generates a publication-quality ROC plot on the fly.  This is handy if you want to play around with the hit threshold of the data (i.e., the binding affinity) or experiment with different scoring functions.

According to Wikipedia:

a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the total actual positives (TPR = true positive rate) vs. the fraction of false positives out of the total actual negatives (FPR = false positive rate), at various threshold settings.

There are already several ROC plot calculators on the web.  But I wanted to write my own using the R statistical language owing to its ability to produce very high-quality, clean graphics.  You can find the code here:

The function takes a simple two-column input in CSV format.  One column is “score,” the other is “hit” (1 or 0).  In the context of docking analysis, “score” is the docking score and “hit” is whether or not the molecule was an experimental binder.  The area under the curve is calculated using the “trapz” function from the “pracma” (practical mathematics) package.
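For readers who just want the idea, here is a minimal sketch of the calculation in base R.  It is not the function described above: the names roc_curve and trapezoid_auc are hypothetical, the trapezoid rule is written by hand instead of calling pracma's trapz, tied scores are not treated specially, and the six data points are made up.

```r
# Build ROC points by ranking molecules from best to worst score.
# higher_is_better = FALSE would suit docking scores where lower is better.
roc_curve <- function(score, hit, higher_is_better = TRUE) {
  ord <- order(score, decreasing = higher_is_better)
  hit <- hit[ord]
  data.frame(
    fpr = c(0, cumsum(hit == 0) / sum(hit == 0)),  # false positive rate
    tpr = c(0, cumsum(hit == 1) / sum(hit == 1))   # true positive rate
  )
}

# Area under a curve by the trapezoid rule.
trapezoid_auc <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Made-up input with the same two columns as the CSV described above;
# in practice this would come from something like read.csv("yourfile.csv").
dat <- data.frame(
  score = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4),
  hit   = c(1, 1, 0, 1, 0, 0)
)

roc <- roc_curve(dat$score, dat$hit)
auc <- trapezoid_auc(roc$fpr, roc$tpr)

plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = sprintf("ROC curve (AUC = %.3f)", auc))
abline(0, 1, lty = 2)  # diagonal = random ranking
```

Changing the hit threshold then just means recomputing the “hit” column and calling the two functions again.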


The peril of big (flu) data

There is an interesting new post at “In the Pipeline” that summarizes the performance of Google’s “big data” project to track flu trends from search terms.  In short, the predictive performance appears to be pretty bad so far, at least compared to what you might have expected given the hype around “big data.”  The author raises some key points, including the importance of high-quality data, even in very large datasets.  I particularly like this analogy:

“The quality of the data matters very, very, much, and quantity is no substitute. You can make a very large and complex structure out of toothpicks and scraps of wood, because those units are well-defined and solid. You cannot do the same with a pile of cotton balls and dryer lint, not even if you have an entire warehouse full of the stuff.”  –In the Pipeline, March 24, 2014

Data filtering and modeling approaches will likely continue to improve, however, and I think this project is worth watching in the future.


Why you should think exponentially to grasp the future of medicine

People often assume that the world tomorrow will be pretty much like the world today.  We all have an in-built bias towards linear thinking when we ponder the future.  Although a linear bias was helpful for thousands of years of our evolution, technology today is changing at an exponential pace.  To better anticipate future market opportunities and technology’s impact on society, it is crucial to think in terms of exponential trends.  This is a point that renowned futurist Ray Kurzweil has made in his many books and speeches over the last several decades.


One example of an exponential trend in biology (among many) is the cost per genome sequence (graph below).  As recently as 2001, the cost to sequence a genome was an astronomical $100M.  Between 2001 and 2007, the cost decreased exponentially (a straight line on a log plot), to the point where a genome in 2007 cost only $10M to sequence.  Around 2007, a paradigm shift in technology massively accelerated this exponential process, and the cost decreased even faster than before, hitting just $10K in 2012.
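To put rough numbers on those two phases, the halving time of the cost can be worked out from the figures quoted above.  This is a back-of-the-envelope sketch that assumes a constant exponential decline within each phase; halving_time is a hypothetical helper, not something from the original analysis.

```r
# Years needed for the cost to halve, assuming steady exponential decline
# from cost_start to cost_end over the given number of years.
halving_time <- function(cost_start, cost_end, years) {
  years * log(2) / log(cost_start / cost_end)
}

# Figures quoted in the text: $100M (2001), $10M (2007), $10K (2012).
early <- halving_time(100e6, 10e6, 2007 - 2001)
late  <- halving_time(10e6, 10e3, 2012 - 2007)

cat(sprintf("2001-2007: cost halved about every %.1f years\n", early))
cat(sprintf("2007-2012: cost halved about every %.1f years\n", late))
```

Under these assumptions the cost halved roughly every 1.8 years in the first phase and roughly every six months in the second.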



As economists are fond of saying, when the price falls, more is demanded.  As a result of this massively reduced sequencing price, many more partial and complete genomes are being sequenced than ever before.  The dramatic, exponential gains in price/performance of sequencing technology have unleashed a tidal wave of sequence data.

Using R to create a dotplot with jittered x values

If you need to create a plot where you have several groups of data that you want to distribute along the ‘y’ axis, but bin into one of several categories in ‘x’, then you can do the following:

1) create a .csv file with your data in columns (you can use headers)

2) import the .csv file into R with: TEST <- read.table("yourfile.csv", sep=",", header=TRUE)

3) do the dotplot (dotplot comes from the lattice package, so load it first with library(lattice)): dotplot(values ~ ind, data=stack(TEST), jitter.x=TRUE)

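The important point here is the use of the “stack” function.  It concatenates the columns into a single “values” column and adds a factor column, “ind”, that records which original column each value came from.  This lets you create the type of dotplot where the data is plotted along ‘y’ while all points in a group share the same ‘x’ value.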
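Here is a small self-contained version of the steps above, using an inline data frame with made-up numbers in place of the .csv file.  The column names “control” and “treated” are just placeholders:

```r
library(lattice)  # provides dotplot()

# Stand-in for the imported .csv: one column per group (values are made up).
TEST <- data.frame(
  control = c(1.2, 1.5, 1.1, 1.4),
  treated = c(2.3, 2.1, 2.6, 2.2)
)

# stack() returns a data frame with "values" (all the numbers) and
# "ind" (a factor naming the column each number came from).
long <- stack(TEST)
str(long)

# One x position per group, with horizontal jitter to separate the points.
p <- dotplot(values ~ ind, data = long, jitter.x = TRUE)
print(p)
```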