The ROC curve was first utilized during World War II for the analysis of radar signals.* Today, the ROC curve is used to illustrate, through a diagram, a model's true positive results measured against its false positive results. If this sounds complicated, don't be nervous; the concept of the ROC curve is rather synthetic, meaning that it is not inherently intuitive. Therefore, I will do my best to explain its interpretation throughout this entry.

This article utilizes the prior entry's data set example and output.

To begin, select **“Analyze”** from the drop down menu above the data sheet, then select **“ROC Curve”**.

Next, designate the **“Test Variable”** and the **“State Variable”** within the menu interface. In our case, the Test Variable will be **“Predicted Probability”**, and the State Variable will be **“Cancer”**. You must specify the positive value of the **“State Variable”**, meaning, the value of the **“State Variable”** which identifies a positive result. This value will typically be “1”, which I have entered into the interface example below.

For the **“Display”** options, I have selected: **“ROC Curve”**, **“With diagonal reference line”**, and **“Coordinate points of the ROC Curve”**.

This produces the output:

__“Case Processing Summary”__ **– This output demonstrates the number of actual positive and negative results contained within the original data set.**

__“ROC Curve”__ **– What you are witnessing in the diagram above is sensitivity vs. specificity as it pertains to points within the** **“Coordinates of the Curve”** table. Y = Sensitivity and X = 1 – Specificity. I am aware that the graphical co-ordinates are plotted backwards on the graphic; however, I wanted to maintain the order in which they were presented within the table. What the diagram seeks to illustrate is the predictive capacity of our model at various percentage confidence thresholds. Ideally, we would like to see an ROC curve which contains a single point in the upper leftmost corner of the graph. The green diagonal line illustrates random chance (a 50/50 outcome). If our blue line passes below the green line, the point at which this occurs indicates a model cutoff which would provide probability results worse than random chance.
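The coordinate pairs that SPSS tabulates can be reproduced outside of SPSS with a few lines of Python. Below is a minimal sketch using invented labels and probabilities, not the article's actual data set:

```python
def roc_points(y_true, y_prob):
    """Return (cutoff, 1-specificity, sensitivity) for each distinct cutoff."""
    pos = sum(y_true)            # count of actual positives (1s)
    neg = len(y_true) - pos      # count of actual negatives (0s)
    points = []
    for cut in sorted(set(y_prob), reverse=True):
        # classify as positive when the predicted probability >= cutoff
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= cut and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= cut and y == 0)
        points.append((cut, fp / neg, tp / pos))
    return points

# illustrative data: 1 = cancer, 0 = no cancer
y_true = [0, 0, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.80]
for cut, x, y in roc_points(y_true, y_prob):
    print(f"cutoff={cut:.2f}  1-specificity={x:.2f}  sensitivity={y:.2f}")
```

Plotting sensitivity (Y) against 1 – specificity (X) for these points traces out the blue ROC curve.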


__The ROC Curve graphic is providing an illustration of the__ **“Coordinates of the Curve”**. What is occurring in this table is the measurement of positive results and false positive results at each percentage output provided by the logistic regression model.

So, to break this down in simpler terms: assuming that we utilized the value on the right, **“Positive if Greater Than or Equal to”**, **as a cutoff in which to deem all cases of that value or higher as positive results (1),** **“Sensitivity”** would then indicate the number of cases which were actually positive and identified as such (when 1 is predicted and the result is actually 1), and **“1-Specificity”** would indicate the number of cases which were identified as positive by the model, but were actually negative (when 1 is predicted but the result is actually 0).

As an example: using a cutoff value of .806, we will correctly identify 25% of positive outcomes. Therefore, 75% of the total positive outcomes will be overlooked. Additionally, 14.3% of negative outcomes will be identified as positive while they are actually negative. That is to say, this would be the result if we applied our model to our sample data and considered any result with a predicted probability below .806 as being negative (0), and any result with a predicted probability of .806 or greater as being positive (1).
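The arithmetic at a single cutoff can be sketched in a few lines. The data below is invented to reproduce the figures discussed above (4 positives with 1 at or above the cutoff, 7 negatives with 1 at or above it); it is not the article's actual sample:

```python
def rates_at_cutoff(y_true, y_prob, cutoff):
    """Sensitivity and 1-specificity when p >= cutoff is called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= cutoff and y == 1)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < cutoff and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= cutoff and y == 0)
    tn = sum(1 for y, p in zip(y_true, y_prob) if p < cutoff and y == 0)
    sensitivity = tp / (tp + fn)     # share of actual positives caught
    one_minus_spec = fp / (fp + tn)  # share of actual negatives flagged
    return sensitivity, one_minus_spec

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.85, 0.50, 0.40, 0.30, 0.90, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
sens, fpr = rates_at_cutoff(y_true, y_prob, 0.806)
print(sens, round(fpr, 3))  # 0.25 0.143
```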

With this in mind, it would not be an exaggeration to refer to **“Sensitivity”** as the number of positive cases identified by the model (as a percentage). Additionally, we could refer to **“1-Specificity”** as the number of false positive cases identified by the model (as a percentage).

__“Area Under the Curve”__ **– Commonly abbreviated as AUC, this value is representative of exactly what the name indicates. AUC, if we were shading in our graphic, would resemble the following:**

For this reason, you can probably assume why you would prefer a higher AUC value, as opposed to the alternative. AUC values cannot, for obvious reasons, exceed the value of 1.
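AUC is simply that shaded area, and it can be approximated from the coordinate points with the trapezoidal rule. A minimal sketch, using illustrative points rather than the article's output:

```python
def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve given points sorted by increasing 1-specificity."""
    area = 0.0
    for i in range(1, len(fpr)):
        # width of each step times the average height of its two endpoints
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

# a perfect classifier hugs the upper-left corner: AUC = 1.0
print(auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
# the diagonal reference line (random chance): AUC = 0.5
print(auc_trapezoid([0.0, 1.0], [0.0, 1.0]))            # 0.5
```

This makes the preference for higher AUC concrete: the closer the curve hugs the upper-left corner, the closer the area gets to its maximum of 1.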

I hope that this entry de-mystified the concept of the ROC curve, and hopefully, has provided you with the confidence and information needed to implement its usage into your own work. Until next time, stay tuned, Data Heads!

* https://en.wikipedia.org/wiki/Receiver_operating_characteristic
