Saturday, May 19, 2018

(R) Odds Ratio, Relative Risk and Cohen’s Kappa

In this article, we will be discussing the odds ratio, relative risk, and Cohen's kappa statistic, and their relation to data contained within contingency tables. Additionally, I will be sharing some R code which I have created to simplify the process of generating these figures.

Example:

We will address the typical scenario of smoking as it relates to cancer diagnosis.

We will create three data vectors, each containing one value per individual. The first data vector will contain an identification variable for each individual. The next will contain a binary variable which identifies smokers and non-smokers. The final vector will indicate which individuals are afflicted with cancer.

# Create Data Vectors #

id <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)

smoker <- c(0,0,0,1,1,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0)

cancer <- c(0,0,1,1,0,0,1,1,1,1,1,1,0,0,1,1,1,0,0,0)

smokerscancer <- data.frame(id, smoker, cancer)


This data frame now must be transformed into a contingency table before it can be analyzed.

This can be achieved through the following code:

# Organize Data into Contingency Table #

x <- smokerscancer$smoker

y <- smokerscancer$cancer

cases <- data.frame(x,y)

# The 'a' and 'b' prefixes force alphabetical ordering, so that the TRUE values occupy the first row and column of the table #

cases$y <- ifelse(cases$y == 1, 'aTrue', 'bFalse')

cases$x <- ifelse(cases$x == 1, 'aTrue', 'bFalse')

# Remove any incomplete rows #

cases <- cases[complete.cases(cases), ]

ctable <- table(cases)
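
As an aside, the 'aTrue'/'bFalse' prefixes exist only to force the TRUE values into the first row and column. A minimal alternative sketch (my own variation, not part of the original code) achieves the same ordering with factor levels and more readable labels; the remainder of this article continues with the prefixed version:

# Alternative: control ordering with factor levels (a sketch) #

x <- factor(smokerscancer$smoker, levels = c(1, 0), labels = c("Smoker", "Non-smoker"))

y <- factor(smokerscancer$cancer, levels = c(1, 0), labels = c("Cancer", "No cancer"))

table(x, y)

Applying chisq.test() to either table produces an identical result, since only the labels differ.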

# Apply Chi-Squared Test of Independence #

chisq.test(ctable)

Which produces the following output:

Pearson's Chi-squared test with Yates' continuity correction

data: ctable
X-squared = 3.2323, df = 1, p-value = 0.0722


For more information on how to interpret Chi-squared output, please consult this article.
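
As a side note not covered in the original walkthrough: with cell counts this small, Fisher's exact test is a common alternative to the chi-squared test, and it is available in base R:

# Fisher's exact test, a small-sample alternative (a sketch) #

fisher.test(ctable)

Be aware that the odds ratio which fisher.test() reports is a conditional maximum likelihood estimate, so it will differ slightly from the sample odds ratio computed later in this article.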

The following code produces additional output, along with the original contingency table:

# Produce Odds Ratio / Relative Risk Outputs #

# Row totals and within-row percentages #
totala <- ctable[1,1] + ctable[1,2]
totalb <- ctable[2,1] + ctable[2,2]
percent1 <- (ctable[1,1] / totala) * 100
percent2 <- (ctable[1,2] / totala) * 100
percent3 <- (ctable[2,1] / totalb) * 100
percent4 <- (ctable[2,2] / totalb) * 100

# Column totals and column percentages #
totalc <- ctable[1,2] + ctable[2,2]
totald <- ctable[1,1] + ctable[2,1]
totale <- totalc + totald
rowtotala <- totald/(totalc + totald) * 100
rowtotalb <- totalc/(totalc + totald) * 100
rowtotalc <- (rowtotala + rowtotalb)

# Assemble an SPSS-style summary table #
keycol <- c("Count", "(%)", "Count", "(%)", "SumCount", "(%)")
col1 <- c(ctable[1,1], percent1, ctable[2,1], percent3, totald, rowtotala)
col2 <- c(ctable[1,2], percent2, ctable[2,2], percent4, totalc, rowtotalb)
rowtotals <- c((ctable[1,1] + ctable[1,2]), (percent1 + percent2), (ctable[2,1] + ctable[2,2]), (percent3 + percent4), (totalc + totald), rowtotalc)

data <- data.frame(keycol, col1, col2, rowtotals)

# Odds ratio: the odds of the outcome in row 1 divided by the odds in row 2 #
vala <- percent3 / percent1
valb <- percent4 / percent2
valab <- valb/vala

# Relative risk: the risk of the outcome in row 1 divided by the risk in row 2 #
num <- (ctable[1,1]) / (ctable[1,1] + ctable[1,2])
den <- (ctable[2,1]) / (ctable[2,1] + ctable[2,2])
relrisk <- num/den


ctable

data

valab # (Odds Ratio) col1 (%/%) / col2 (%/%) #

relrisk # Relative Risk = TRUE col1 (%/%) #


Which produces the following output:

        y
x        aTrue bFalse
  aTrue      8      2
  bFalse     3      7

    keycol col1 col2 rowtotals
1    Count    8    2        10
2      (%)   80   20       100
3    Count    3    7        10
4      (%)   30   70       100
5 SumCount   11    9        20
6      (%)   55   45       100

valab # (Odds Ratio) col1 (%/%) / col2 (%/%) #

[1] 9.333333

relrisk # Relative Risk = TRUE col1 (%/%) #

[1] 2.666667


This output was created to resemble the crosstabulation table that SPSS generates for chi-squared contingency data.



But how are the results within this table interpreted?

If your data are structured in the manner of the contingency table within the example, the following statements can be made.

The risk ratio indicates that the risk of the outcome variable (cancer) within the smoker group increased by 167%* relative to the non-smoker group. Put another way, the outcome was 2.667 times as likely to occur within the smoker group.

The odds ratio indicates that the odds of finding cancer within an individual who smokes, as compared to an individual who does not smoke, are 9.333 times greater.

*= Absolute value of (1 – RR), multiplied by 100.
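
For reference, both figures can also be computed directly from the cell counts of ctable. In the sketch below (my own shorthand, not part of the original code), a through d denote the four cells in the standard 2 x 2 layout:

# Direct computation from the 2x2 cell counts (a sketch) #

a <- ctable[1,1]   # smokers with cancer (8)
b <- ctable[1,2]   # smokers without cancer (2)
c <- ctable[2,1]   # non-smokers with cancer (3)
d <- ctable[2,2]   # non-smokers without cancer (7)

(a * d) / (b * c)               # odds ratio: 9.333333
(a / (a + b)) / (c / (c + d))   # relative risk: 2.666667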

Cohen’s Kappa

Cohen’s Kappa, like the odds ratio and the risk ratio, is utilized to analyze the relationships of data within a contingency table. However, while the previously mentioned methods are broadly applicable, Cohen’s Kappa can only be applied to contingency tables which cross-tabulate categorical ratings from multiple raters.

Example:

Two film critics are rating a series of 20 movies. Each critic can either vote “Yes”, indicating that he enjoyed the film, or “No”, indicating that he did not. The results were recorded in a binary format (1 = “Yes”, 0 = “No”).

# Create Vectors #

movieid <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)

criticoneratings <- c(0,0,0,1,1,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0)

critictworatings <- c(0,0,1,1,0,0,1,1,1,1,1,1,0,0,1,1,1,0,0,0)

movieratings <- data.frame(movieid, criticoneratings, critictworatings)


With this information, we wish to measure the level of agreement between the two raters. One of the main ways in which this technique differs from traditional probabilistic methodologies is that it accounts for chance agreement.
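
Formally, kappa is defined as K = (p0 - pe) / (1 - pe), where p0 is the observed proportion of agreement between the raters and pe is the proportion of agreement expected by chance alone. The manual computation at the end of this article walks through each of these terms.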

Now that we have our data values correctly input, we can proceed with the analysis.

# Organize Data into Contingency Table #

x <- movieratings$criticoneratings

y <- movieratings$critictworatings

cases <- data.frame(x,y)

cases$y <- c(ifelse(cases$y == 1, 'aTrue', 'bFalse'))

cases$x <- c(ifelse(cases$x == 1, 'aTrue', 'bFalse'))

cases <- cases[complete.cases(cases), ]

ctable <- table(cases)

# With the “psych” package installed and enabled #

library(psych)

cohen.kappa(ctable, alpha=.05)

This produces the output:


Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)

Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
                 lower estimate upper
unweighted kappa  0.12      0.5  0.88
weighted kappa    0.12      0.5  0.88

Number of subjects = 20


The kappa statistic (.5) appears in the estimate column of the output.

Opinion as to how to interpret the kappa statistic varies. Below are two well-known sets of guidelines:

“Landis and Koch characterized values < 0 as indicating no agreement, and 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement.

Fleiss's equally arbitrary guidelines characterize kappa over 0.75 as excellent, 0.40 to 0.75 as fair to good, and below 0.40 as poor.” **


**- https://en.wikipedia.org/wiki/Cohen%27s_kappa

If we were to work through this model manually, for the sake of examining its inner workings, the following code could be applied:

# Observed Proportionate Agreement #

# Define Cell Entries #

a <- ctable[1,1]
b <- ctable[1,2]

c <- ctable[2,1]
d <- ctable[2,2]

# Observed Proportionate Agreement (p0) #

p0 <- (a + d) / (a + b + c + d)

# Expected probability that both would say yes at random (pYes) #

pyes <- ((a + b) / (a + b + c + d)) * ((a + c) / (a + b + c + d))

# Expected probability that both would say no at random (pNo) #

pno <- ((c + d) / (a + b + c + d)) * ((b + d) / (a + b + c + d))

# Overall random agreement probability is the probability that they #
# agreed on either Yes or No (pe) #

pe <- pyes + pno

# Cohen's Kappa #

K <- (p0 - pe) / (1 - pe)
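
Printing K for the example data confirms the result: p0 = 15/20 = 0.75, pe = 0.275 + 0.225 = 0.50, and therefore K = (0.75 - 0.50) / (1 - 0.50) = 0.5, matching the estimate produced by cohen.kappa().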

That’s all for now, Data Heads! Stay tuned for more awesome articles!
