What is tidy data?

A warehouse of tidy data (in paper form).

What is “tidy” data?

What is meant by the term “tidy” data, as opposed to “messy” data?  In my last post I listed five of the most common problems encountered with messy datasets.  Logically, “tidy” data must not have any of these problems.  So just what does tidy data look like?

Let’s take a look at an example of tidy data. Below are the first 20 rows of R’s built-in “airquality” dataset:

Figure 1. The “airquality” dataset.

R programmer and professor of statistics Hadley Wickham defines tidy data by three rules:

1) Each variable forms a column

2) Each observation forms a row

3) Each type of observational unit forms a table

That’s it. “Airquality” is tidy because each row corresponds to exactly one month/day combination and holds the four weather variables measured on that day (ozone, solar radiation, wind, and temperature), each in its own column.
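To make rule 2 concrete, here is a minimal sketch of that check. It is written in plain Python purely for illustration (this post otherwise works in R), and the helper name `is_one_row_per_observation` is hypothetical. It hard-codes the first five rows of “airquality” and confirms that no month/day combination appears twice.

```python
# First five rows of R's built-in airquality dataset; None stands in for NA.
COLUMNS = ("Ozone", "Solar.R", "Wind", "Temp", "Month", "Day")
ROWS = [
    (41, 190, 7.4, 67, 5, 1),
    (36, 118, 8.0, 72, 5, 2),
    (12, 149, 12.6, 74, 5, 3),
    (18, 313, 11.5, 62, 5, 4),
    (None, None, 14.3, 56, 5, 5),
]


def is_one_row_per_observation(rows, columns, key_columns):
    """Return True if no two rows share the same values in key_columns."""
    positions = [columns.index(name) for name in key_columns]
    keys = [tuple(row[i] for i in positions) for row in rows]
    return len(keys) == len(set(keys))


# Hypothetical helper: each (Month, Day) pair should identify exactly one row.
unique_days = is_one_row_per_observation(ROWS, COLUMNS, ("Month", "Day"))
print(unique_days)
```

A duplicated month/day pair would make the check fail, which is a quick hint that a table mixes observations or repeats them.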

What about messy data?

As a counterexample, let’s look at a messy weather dataset (data examples are from this paper by H. Wickham):

Figure 2. A messy weather station dataset.  Not all columns are shown for the sake of clarity.

There are multiple “messy” data problems with this table. First, an identifying variable, the day of the month, is stored in the column headers (“d1”, “d2”, etc.) rather than in the rows. Second, there are a lot of missing values, which complicates analysis and makes the table harder to read. Third, the “element” column holds variable names (“tmin” and “tmax”) instead of values, so those two variables do not have columns of their own. This violates rule 1 of tidy data.

How to use R tools to transform this table into tidy form is beyond the scope of this post, so I will just show the tidy version of this dataset in Figure 3.

Figure 3. The weather station data in tidy form.

Each column now holds exactly one variable. The date information has been condensed into a single date column, and each row contains the measurements for only one day. The two variables that were stored in the “element” column now form their own columns, “tmax” and “tmin.” With the data in this form it is far easier to prepare plots, aggregate the data, and perform statistical analysis.
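For readers curious about the logic behind the reshaping, here is a rough sketch of the two steps involved: moving the day headers into rows, then spreading “tmax” and “tmin” into their own columns. It is plain Python for illustration only, not the R workflow, which remains out of scope here. The station id “S1”, the three day columns, the values, and the function name `tidy_weather` are all made up for this example and are not the real dataset.

```python
from collections import defaultdict
from datetime import date

# Toy stand-in for the messy layout; the values are illustrative, not real data.
# One row per (id, year, month, element); the day of month lives in headers d1..d3.
MESSY = [
    {"id": "S1", "year": 2010, "month": 1, "element": "tmax", "d1": None, "d2": 27.8, "d3": None},
    {"id": "S1", "year": 2010, "month": 1, "element": "tmin", "d1": None, "d2": 14.5, "d3": None},
    {"id": "S1", "year": 2010, "month": 2, "element": "tmax", "d1": 27.3, "d2": None, "d3": 24.1},
    {"id": "S1", "year": 2010, "month": 2, "element": "tmin", "d1": 14.4, "d2": None, "d3": 14.4},
]


def tidy_weather(messy_rows):
    """Return one dict per (id, date), with tmax and tmin as separate columns."""
    by_day = defaultdict(dict)
    for row in messy_rows:
        for header, value in row.items():
            # Only the d<N> headers hold measurements; skip missing values.
            if not (header.startswith("d") and header[1:].isdigit()) or value is None:
                continue
            day = date(row["year"], row["month"], int(header[1:]))
            # The element name (tmax/tmin) becomes a column of the tidy row.
            by_day[(row["id"], day)][row["element"]] = value
    return [
        {"id": station, "date": day.isoformat(),
         "tmax": values.get("tmax"), "tmin": values.get("tmin")}
        for (station, day), values in sorted(by_day.items())
    ]


TIDY = tidy_weather(MESSY)
for record in TIDY:
    print(record)
```

Days with no measurements simply disappear from the result, which is why the tidy table is so much shorter than the messy one.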