Using R to automate ROC analysis

ROC analysis is used in many types of research.  I use it to examine the ability of molecular docking to enrich a list of poses for experimental hits.  This is a pretty standard way to compare the effectiveness of docking methodologies and make adjustments in computational parameters.

An example ROC plot on a randomly generated dataset

Normally this kind of plot would take at least an hour to make by hand in Excel, so I wrote a function in R that generates a publication-quality ROC plot on the fly.  This is handy if you want to play around with the hit threshold of the data (i.e., the binding affinity) or experiment with different scoring functions.

According to Wikipedia:

a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the total actual positives (TPR = true positive rate) vs. the fraction of false positives out of the total actual negatives (FPR = false positive rate), at various threshold settings.
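To make that definition concrete, here is a minimal R sketch that computes TPR and FPR at each threshold. The data are made up for illustration, the helper name `rates_at` is hypothetical, and the sketch assumes a higher score means a more likely hit. Docking scores are often "lower is better," so you may need to flip the sign first.

```r
# Toy data, invented for illustration: six molecules with a score and a 0/1 hit label.
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)
hit   <- c(1,   1,   0,   1,   0,   0)

# Hypothetical helper: call everything scoring >= threshold a predicted positive,
# then compute the true positive rate and false positive rate.
rates_at <- function(threshold, score, hit) {
  predicted <- score >= threshold
  tpr <- sum(predicted & hit == 1) / sum(hit == 1)  # true positives / actual positives
  fpr <- sum(predicted & hit == 0) / sum(hit == 0)  # false positives / actual negatives
  c(threshold = threshold, TPR = tpr, FPR = fpr)
}

# Sweep the threshold over every observed score, from strictest to most lenient.
thresholds <- sort(unique(score), decreasing = TRUE)
roc_points <- t(sapply(thresholds, rates_at, score = score, hit = hit))
print(roc_points)
```

Each row of `roc_points` is one point on the ROC curve; plotting FPR on the x-axis against TPR on the y-axis traces it out.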

There are already several ROC plot calculators on the web, but I wanted to write my own in the R statistical language because it produces very high-quality, clean graphics.  You can find the code here:

https://github.com/mchimenti/data-science-coursera/blob/master/roc_plot_gen.R

The function takes a simple two-column input in CSV format.  One column is “score,” the other is “hit” (1 or 0).  In the context of docking analysis, “score” is the docking score and “hit” indicates whether or not the molecule was an experimental binder.  The area under the curve (AUC) is calculated using the “trapz” function from the “pracma” (practical mathematics) package.
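For readers who want the general shape without opening the repository, here is a rough sketch of the idea. It is not the code from the link above: `roc_from_csv` is a hypothetical name, the data are invented, and I assume that higher scores rank first and that there are no tied scores. It calls `pracma::trapz(x, y)` when pracma is installed and otherwise falls back to the same trapezoid rule written in base R.

```r
# Hypothetical helper, NOT the function from the linked repository.
# Expects a CSV with a numeric "score" column and a 0/1 "hit" column.
roc_from_csv <- function(path) {
  d <- read.csv(path)
  # Assumption: higher score = better. Negate the score first if lower is better.
  d <- d[order(d$score, decreasing = TRUE), ]

  # Walk down the ranked list, accumulating hits and non-hits (ties not handled).
  tpr <- c(0, cumsum(d$hit == 1) / sum(d$hit == 1))
  fpr <- c(0, cumsum(d$hit == 0) / sum(d$hit == 0))

  # Area under the curve by the trapezoid rule.
  auc <- if (requireNamespace("pracma", quietly = TRUE)) {
    pracma::trapz(fpr, tpr)
  } else {
    sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  }
  list(fpr = fpr, tpr = tpr, auc = auc)
}

# Usage on made-up data written to a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(score = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4),
                     hit   = c(1, 1, 0, 1, 0, 0)),
          tmp, row.names = FALSE)
roc <- roc_from_csv(tmp)

plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = sprintf("ROC (AUC = %.3f)", roc$auc))
abline(0, 1, lty = 2)  # diagonal = performance expected from random ranking
```

Changing which molecules count as hits only means rewriting the “hit” column and calling the function again, which is what makes experimenting with the hit threshold quick.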