There is an interesting new post at “In the Pipeline” that summarizes the performance of Google’s “big data” project to track flu trends from search terms. In short, its predictive performance has been poor so far, at least compared to what you might have expected given the hype around “big data.” The author raises some key points, including the importance of high-quality data, even in very large datasets. I particularly like this analogy:
“The quality of the data matters very, very, much, and quantity is no substitute. You can make a very large and complex structure out of toothpicks and scraps of wood, because those units are well-defined and solid. You cannot do the same with a pile of cotton balls and dryer lint, not even if you have an entire warehouse full of the stuff.” –In the Pipeline, March 24, 2014
That said, data filtering and modeling approaches will likely continue to improve, and I think this project is worth watching.