Tuesday, January 16, 2018

Receiver Operating Characteristic (ROC Curve) (SPSS)

What is a Receiver Operating Characteristic, or ROC curve, as it is more commonly known, and why should you care? The information provided within this article requires a firm understanding of the prior articles pertaining to logistic regression.

The ROC curve was first utilized during World War II for the analysis of radar signals.* Today, the ROC curve is utilized to illustrate, through the use of a diagram, a model's true positive rate plotted against its false positive rate. If this sounds complicated, don’t be nervous; the concept of the ROC curve is rather synthetic, meaning that it is not inherently intuitive. Therefore, I will do my best to explain its interpretation throughout this entry.

This article utilizes the data set example and output from the prior entry.

To create an ROC curve, select “Analyze” from the drop-down menu above the data sheet, then select “ROC Curve”.

Once this has been accomplished, you must specify your “Test Variable” and “State Variable” within the menu interface. In our case, the Test Variable will be: “Predicted Probability”, and the State Variable will be: “Cancer”. You must specify the positive value of the “State Variable”, meaning, that you must specify the value of the “State Variable” which identifies a positive result. This value will typically be “1”, which I have entered into the interface example below. For “Display” options, I have selected: “ROC Curve”, “With diagonal reference line”, “Coordinate points of the ROC Curve”.

This produces the output:

“Case Processing Summary” – This output demonstrates the number of actual positive and negative results contained within the original data set.

“ROC Curve” - What you are witnessing in the diagram above is sensitivity vs. specificity as it pertains to points within the “Coordinates of the Curve” table. Y = Sensitivity and
X = 1 – Specificity. I am aware that the graphical co-ordinates are plotted backwards on the graphic; however, I wanted to maintain the order in which they were presented within the table. What the diagram seeks to illustrate is the predictive capacity of our model at various probability thresholds. Ideally, we would like to see an ROC curve which contains a single point in the upper leftmost corner of the graph. The green diagonal line illustrates the performance of pure random chance (a 50/50 guess). If our blue line passes below the green line, the point at which this occurs indicates a model cutoff which would provide probability results worse than random chance.

The ROC Curve graphic is providing an illustration of the "Coordinates of the Curve".

What is occurring in this table is the measurement of true positive results and false positive results at each predicted probability cutoff provided by the logistic regression model.

So, to break this down in simpler terms, assume that we utilized the value on the right, “Positive if Greater Than or Equal To”, as a cutoff by which to deem all cases of that value or higher as positive results (1). “Sensitivity” would then indicate the proportion of cases which were actually positive and identified as such (when 1 is predicted and the result is actually 1), and “1-Specificity” would indicate the proportion of cases which were identified as positive by the model but were actually negative (when 1 is predicted but the result is actually 0).

As an example: While identifying outcomes based on a model cutoff of .806, we will correctly identify 25% of positive outcomes. Therefore, 75% of the total positive outcomes will be overlooked. Additionally, 14.3% of the actual negative outcomes will be identified as positive while they are actually negative. That is to say, this would be the result if we applied our model to our sample data and considered any result with a predicted probability below .806 as being negative (0), and any result with a predicted probability of .806 or greater as being positive (1).
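The cutoff arithmetic above can be sketched in a few lines of Python. The probabilities and outcomes below are invented for illustration (they are not the article's data set); the function simply mirrors what each row of the “Coordinates of the Curve” table reports when every case at or above the cutoff is classified as positive.

```python
# Minimal sketch of the "Coordinates of the Curve" logic.
# The data below are hypothetical, not the article's data set.

def roc_point(probs, actual, cutoff):
    """Sensitivity and 1-specificity when every predicted
    probability >= cutoff is classified as positive (1)."""
    tp = sum(1 for p, y in zip(probs, actual) if p >= cutoff and y == 1)
    fp = sum(1 for p, y in zip(probs, actual) if p >= cutoff and y == 0)
    pos = sum(actual)            # actual positives
    neg = len(actual) - pos      # actual negatives
    sensitivity = tp / pos       # true positive rate
    one_minus_spec = fp / neg    # false positive rate
    return sensitivity, one_minus_spec

# Hypothetical predicted probabilities and true outcomes
probs  = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
actual = [1,    1,    0,    1,    0,    1,    0,    0]

# One table row per candidate cutoff, highest cutoff first
for cutoff in sorted(set(probs), reverse=True):
    sens, fpr = roc_point(probs, actual, cutoff)
    print(f"cutoff >= {cutoff:.2f}  sensitivity={sens:.2f}  1-specificity={fpr:.2f}")
```

Lowering the cutoff can only keep or increase both quantities, which is why the plotted curve always moves up and to the right.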

With this in mind, it would not be an exaggeration to refer to “Sensitivity” as the percentage of actual positive cases correctly identified by the model. Additionally, we could refer to “1-Specificity” as the percentage of actual negative cases falsely identified as positive by the model.

“Area Under the Curve” – Commonly abbreviated as AUC, this value is representative of exactly what the name indicates. AUC, if we were shading in our graphic, would resemble the following:

For this reason, you can probably surmise why you would prefer a higher AUC value, as opposed to the alternative. AUC values cannot, for obvious reasons, exceed the value of 1, and an AUC of .5 corresponds to the diagonal reference line (random chance).
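As a rough sketch of what the shaded region measures, the area can be approximated with the trapezoidal rule over the curve's coordinate points. The two toy curves below are assumptions for illustration, not SPSS output: a perfect classifier that hugs the upper-left corner, and the diagonal reference line.

```python
# Trapezoidal approximation of AUC from ROC coordinate points.
# The two curves below are hypothetical illustrations.

def auc_trapezoid(points):
    """points: (1-specificity, sensitivity) pairs sorted by
    ascending 1-specificity, spanning (0, 0) to (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # one trapezoid slice
    return area

# A perfect classifier hugs the upper-left corner -> AUC = 1.0
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# The diagonal reference line (random chance) -> AUC = 0.5
chance = [(0.0, 0.0), (1.0, 1.0)]

print(auc_trapezoid(perfect))  # 1.0
print(auc_trapezoid(chance))   # 0.5
```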

I hope that this entry de-mystified the concept of the ROC curve, and hopefully, has provided you with the confidence and information needed to implement its usage into your own work. Until next time, stay tuned, Data Heads!

* https://en.wikipedia.org/wiki/Receiver_operating_characteristic
