Monday, October 22, 2018

(R) Making Predictions with predict()

In today’s article, we will be discussing a function which has been utilized within previous entries but which, within the context of this website, has never been fully covered in depth: predict().

What predict() achieves is rather simple: it applies a fitted model to a set of independent variable observations and returns the values which the model predicts for those observations.

Let’s delve right in with a few examples of this application.

Linear Regression

# Model Creation #

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)

y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)

linregress <- lm(y ~ x)

# Build Predictive Structure #

predictdataframe <- data.frame(x)

# Print Predicted Values to Console #

predict(linregress, predictdataframe)

# Organize predicted values in a variable column adjacent to observed values #

# Store Predicted Values in Variable #

predictedvalues <- predict(linregress, predictdataframe)

# Add Variables to Data Frame #

predictdataframe$y <- y

predictdataframe$predictedvalues <- predictedvalues

# View Results #

predictdataframe


# Console Output #

x y predictedvalues
1 27 70 75.60686
2 34 80 81.56332
3 22 73 71.35224
4 30 77 78.15963
5 17 60 67.09763
6 32 93 79.86148
7 25 85 73.90501
8 34 72 81.56332
9 46 90 91.77441
10 37 85 84.11609
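
A brief aside, not part of the original walkthrough above: when predict() is called on a fitted lm object without a new data frame, it simply returns the fitted values for the observations used to build the model, which is the same vector that fitted() produces. A minimal sketch:

# Predictions for the original observations (no newdata supplied) #

predict(linregress)

# Equivalent fitted values #

fitted(linregress)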


Loglinear Analysis

# Model Creation #

Obese <- c("Yes", "Yes", "No", "No")

Smoking <- c("Yes", "No", "Yes", "No")

Count <- c(5, 1, 2, 2)

DataModel <- glm(Count ~ Obese + Smoking , family = poisson)

# Build Predictive Structure #

predictdataframe <- data.frame(Obese, Smoking)

# Print Predicted Values to Console #

exp(predict(DataModel, predictdataframe))

# Organize predicted values in a variable column adjacent to observed values #

# Store Predicted Values in Variable #

predictedvalues <- predict(DataModel, predictdataframe)

# Add Variables to Data Frame #

predictdataframe$Obese <- Obese

predictdataframe$Smoking <- Smoking

predictdataframe$Count <- Count

predictdataframe$predictedvalues <- exp(predictedvalues)

# View Results #

predictdataframe

# Console Output #


Obese Smoking Count predictedvalues
1 Yes Yes 5 4.2
2 Yes No 1 1.8
3 No Yes 2 2.8
4 No No 2 1.2
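
A small note on the exp() step above: because the model utilizes a log link, predict() returns values on the log scale by default. The optional argument type = "response" instructs predict() to apply the inverse link internally, which produces the same fitted counts without the explicit exponentiation. A minimal sketch:

# Response-scale predictions: equivalent to exp() of the default link-scale output #

predict(DataModel, predictdataframe, type = "response")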

Probit Regression 


# Create data vectors #

age <- c(55.00, 45.00, 33.00, 22.00, 34.00, 56.00, 78.00, 47.00, 38.00, 68.00, 49.00, 34.00, 28.00, 61.00, 26.00)

obese <- c(1.00, .00, .00, .00, 1.00, 1.00, .00, 1.00, 1.00, .00, 1.00, 1.00, .00, 1.00, .00)

smoking <- c(1.00, .00, .00, 1.00, 1.00, 1.00, .00, .00, 1.00, .00, .00, 1.00, .00, 1.00, 1.00)

cancer <- c(1.00, .00, .00, 1.00, .00, 1.00, .00, .00, 1.00, 1.00, .00, 1.00, 1.00, 1.00, .00)

# Combine data vectors into a single data frame #

cancerdata <- data.frame(cancer, smoking, obese, age)

# Create Probit Model #

probitmodel <- glm(cancer ~ smoking + obese + age, family=binomial(link= "probit"), data=cancerdata)

# Build Predictive Structure #

predictdataframe <- data.frame(smoking, obese, age)

# Print Predicted Values to Console #

plogis(predict(probitmodel, predictdataframe ))

# Organize predicted values in a variable column adjacent to observed values #

# Store Predicted Values in Variable #

predictedvalues <- predict(probitmodel, predictdataframe )

# Add Variables to Data Frame #

predictdataframe$smoking <- smoking

predictdataframe$obese <- obese

predictdataframe$age <- age

predictdataframe$cancer <- cancer

predictdataframe$predictedvalues <- plogis(predictedvalues)

# View Results #

predictdataframe

# Console Output #


smoking obese age cancer predictedvalues
1 1 1 55 1 0.7098209
2 0 0 45 0 0.3552599
3 0 0 33 0 0.3076726
4 1 0 22 1 0.6338307
5 1 1 34 0 0.6267316
6 1 1 56 1 0.7134978
7 0 0 78 0 0.4988303
8 0 1 47 0 0.3088181
9 1 1 38 1 0.6433412
10 0 0 68 1 0.4541625
11 0 1 49 0 0.3165195
12 1 1 34 1 0.6267316
13 0 0 28 1 0.2889239
14 1 1 61 1 0.7314569
15 1 0 26 0 0.6503007
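
One caveat worth stating plainly: plogis() is the cumulative distribution function of the logistic distribution, whereas the inverse link of a probit model is the cumulative distribution function of the standard normal distribution. The probabilities shown above were generated with plogis(); to obtain the probabilities implied by the probit link itself, pnorm() can be substituted, or predict() can be asked for response-scale output directly. A minimal sketch:

# Probit inverse link: the standard normal CDF #

pnorm(predict(probitmodel, predictdataframe))

# Equivalently, let predict() apply the inverse link internally #

predict(probitmodel, predictdataframe, type = "response")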


Logistic Regression Analysis (Non-Binary Categorical Variables) 

# Non-Binary Categorical Variables #

Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)

Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0)

Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1)

Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0)

White <- c(1,1,1,0,0,0,0,0,0,0,0,0,0,0,0)

African_American <- c(0,0,0,1,1,1,0,0,0,0,0,0,0,0,0)

Asian <- c(0,0,0,0,0,0,1,1,1,0,0,0,0,0,0)

Indian <- c(0,0,0,0,0,0,0,0,0,1,1,1,0,0,0)

Native_American <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,1,1)

CancerModelLogII <- glm(Cancer~ Age + Obese + Smoking + White + African_American + Asian + Indian + Native_American, family=binomial)

# Build Predictive Structure #

predictdataframe <- data.frame(Age, Obese, Smoking, White, African_American, Asian, Indian, Native_American)

# Print Predicted Values to Console #

plogis(predict(CancerModelLogII, predictdataframe ))

# Organize predicted values in a variable column adjacent to observed values #

# Store Predicted Values in Variable #

predictedvalues <- predict(CancerModelLogII, predictdataframe )

# Add Variables to Data Frame #

predictdataframe$Age <- Age

predictdataframe$Obese <- Obese

predictdataframe$Smoking <- Smoking

predictdataframe$White <- White

predictdataframe$African_American <- African_American

predictdataframe$Asian <- Asian

predictdataframe$Indian <- Indian

predictdataframe$Native_American <- Native_American

predictdataframe$Cancer <- Cancer

predictdataframe$predictedvalues <- plogis(predictedvalues)

# View Results #


predictdataframe

# Console Output #


Age Obese Smoking White African_American Asian Indian Native_American Cancer
1 55 1 1 1 0 0 0 0 1
2 45 0 0 1 0 0 0 0 0
3 33 0 0 1 0 0 0 0 0
4 22 0 1 0 1 0 0 0 1
5 34 1 1 0 1 0 0 0 0
6 56 1 1 0 1 0 0 0 1
7 78 0 0 0 0 1 0 0 0
8 47 1 0 0 0 1 0 0 0
9 38 1 1 0 0 1 0 0 1
10 68 0 0 0 0 0 1 0 1
11 49 1 0 0 0 0 1 0 0
12 34 1 1 0 0 0 1 0 1
13 28 0 0 0 0 0 0 1 1
14 61 1 1 0 0 0 0 1 1
15 26 0 1 0 0 0 0 1 0

predictedvalues
1 0.74330743
2 0.15053796
3 0.10615461
4 0.64063327
5 0.60103365
6 0.75833308
7 0.32059004
8 0.08677812
9 0.59263184
10 0.69613463
11 0.40773029
12 0.89613509
13 0.23207436
14 0.91405050
15 0.85387513
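
As a side note, the indicator columns above can also be generated automatically by supplying the race categories as a single factor variable; glm() will then dummy-code the factor internally, dropping one level as the reference category. A minimal sketch under that assumption (the variable name Race is illustrative and does not appear in the original data):

# Hypothetical factor encoding of the same fifteen observations #

Race <- factor(rep(c("White", "African_American", "Asian", "Indian", "Native_American"), each = 3))

CancerModelFactor <- glm(Cancer ~ Age + Obese + Smoking + Race, family = binomial)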

Logistic Regression Analysis

# Model Creation #

Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)

Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0)

Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1)

Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0)

CancerModelLog <- glm(Cancer~ Age + Obese + Smoking, family=binomial)

# Build Predictive Structure #

predictdataframe <- data.frame(Age, Obese, Smoking, Cancer)

# Print Predicted Values to Console #

plogis(predict(CancerModelLog, predictdataframe ))

# Organize predicted values in a variable column adjacent to observed values #

# Store Predicted Values in Variable #

predictedvalues <- predict(CancerModelLog, predictdataframe )

# Add Variables to Data Frame #

predictdataframe$Age <- Age

predictdataframe$Obese <- Obese

predictdataframe$Smoking <- Smoking

predictdataframe$Cancer <- Cancer

predictdataframe$predictedvalues <- plogis(predictedvalues)

# View Results #

predictdataframe

# Console Output #


Age Obese Smoking Cancer predictedvalues
1 55 1 1 1 0.8102649
2 45 0 0 0 0.2686795
3 33 0 0 0 0.2043280
4 22 0 1 1 0.7018502
5 34 1 1 0 0.6952985
6 56 1 1 1 0.8148105
7 78 0 0 0 0.4958797
8 47 1 0 0 0.2090126
9 38 1 1 1 0.7199845
10 68 0 0 1 0.4219139
11 49 1 0 0 0.2190519
12 34 1 1 1 0.6952985
13 28 0 0 1 0.1811344
14 61 1 1 1 0.8362786
15 26 0 1 0 0.7262143
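
If a quick sense of classification accuracy is desired, the predicted probabilities above can be converted to predicted outcomes with a cutoff (0.50 is assumed here purely for illustration) and tabulated against the observed values. A minimal sketch:

# Classify each observation with a 0.50 probability cutoff #

predictedclass <- as.numeric(predictdataframe$predictedvalues > 0.5)

# Cross-tabulate observed outcomes against predicted classifications #

table(Observed = Cancer, Predicted = predictedclass)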
Function Functionality

For all of the time-saving capability that the predict() function provides, its use is rather simple. All that is necessary is that the function be called with the fitted model and the independent variable data which will be utilized to generate the predictions.

This concept, illustrated, would resemble the following:

predict(linearmodel, new_data_frame_containing_independent_variables)
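
The function also accepts several optional arguments; two that are used particularly often with linear models are se.fit, which returns standard errors alongside the predictions, and interval, which requests confidence or prediction intervals. A brief sketch using the simple regression from the first example (its new data frame is rebuilt here so that the snippet stands on its own):

# Rebuild the new data frame from the first example #

newdata <- data.frame(x = c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37))

# Standard errors alongside the predictions #

predict(linregress, newdata, se.fit = TRUE)

# 95% confidence and prediction intervals #

predict(linregress, newdata, interval = "confidence", level = 0.95)

predict(linregress, newdata, interval = "prediction", level = 0.95)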

For more information pertaining to this function and its customizable options, please consult the link below:

https://www.rdocumentation.org/packages/raster/versions/2.7-15/topics/predict

That’s all for now. Stay ambitious, Data Heads!

(R) Finding the Best Linear Model w/stepAIC()

In today’s article, we will continue to address reader inquiries. Recently, I was contacted by an analyst who shared a concern pertaining to linear modeling: specifically, what is the optimal manner in which a user may create an efficient linear model when a data frame contains numerous independent variables? The trial-and-error technique isn’t a terrible option absent an abundant number of independent variables. However, when encountering a data frame which contains hundreds of independent variables, a more efficient method is necessary.

Thankfully, for the R user, a tenable solution exists.

Utilizing the “MASS” Package to find the Best Linear Model

As the title suggests, this technique requires that the “MASS” package be downloaded and enabled.

For this example, we will be utilizing a rather lengthy data frame. The sample data frame, “BiTestData.csv”, can be found amongst other files within the site’s corresponding GitHub repository.

Once the .CSV file has been downloaded, it can be loaded into the R platform through the utilization of the following code:

DataFrameA <- read.table("C:\\Users\\UserName\\Desktop\\BiTestData.csv", fill = TRUE, header = TRUE, sep = "," )

The pathway must be altered to reflect the file destination within your working environment.
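
Because the file is a standard comma-separated file with a header row, read.csv() would produce an equivalent result, as its defaults already assume a header row and a comma separator:

# Equivalent import with read.csv() defaults #

DataFrameA <- read.csv("C:\\Users\\UserName\\Desktop\\BiTestData.csv")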

To demonstrate the capability of the “MASS” package, we will first create a logistic regression model within R through the utilization of the glm() function.

bimodel <- glm(Outcome ~., family=binomial, data=DataFrameA)

summary(bimodel)

# Console Output: #

Call:
glm(formula = Outcome ~ ., family = binomial, data = DataFrameA)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.35061 -0.00005 -0.00005 -0.00004 1.77333

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.032e+01 1.980e+03 -0.010 0.992
VarA -6.206e-02 1.269e+04 0.000 1.000
VarB 2.036e+01 1.254e+04 0.002 0.999
VarC -4.461e-01 5.376e-01 -0.830 0.407
VarD -5.893e-01 5.699e-01 -1.034 0.301
VarE 4.928e-01 9.435e-01 0.522 0.601
VarF -2.334e-02 5.032e-02 -0.464 0.643

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 142.301 on 144 degrees of freedom
Residual deviance: 84.197 on 138 degrees of freedom
AIC: 98.197

Number of Fisher Scoring iterations: 19

We will now measure the model’s predictive capacity through the application of the Nagelkerke R-Squared methodology.

# Generate Nagelkerke R Squared #

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(bimodel)


# Console Output #

McFadden Adj.McFadden Cox.Snell
0.40831495 0.29587694 0.33015814
Nagelkerke McKelvey.Zavoina Effron
0.52807741 0.96866777 0.33839985
Count Adj.Count AIC
0.81379310 0.03571429 98.19715620
Corrected.AIC
99.01467445

Notice that the Nagelkerke R-Squared value is .528, which, by most standards, indicates that the model possesses a fairly decent predictive capacity. In prior articles related to Logistic Regression Analysis, we discussed how this statistic is utilized in lieu of the traditional R-Squared figure to measure the strength of predictability in logistic regression models. However, another statistic which is illustrated within this output, the AIC, or Akaike Information Criterion, was not specifically mentioned.

AIC differs from both the Nagelkerke R-Squared value and the traditional R-Squared statistic in that it does not measure how well the current model explains the observed data, but instead seeks to estimate model accuracy as it is applied to new observational data. R-Squared measures training error, while AIC acts as an estimate of the test error, thus accounting for both bias and variance.

As was mentioned in the prior article pertaining to Logistic Regression, when measuring the strength of model predictability, the Nagelkerke R-Squared value is the most easily interpretable.

The reason the Akaike Information Criterion warrants discussion here is that it serves as the mechanism by which the stepAIC() function determines model optimization. As it concerns interpretability, the smaller the AIC value, the better the model is assumed to perform when applied to new observational sets.
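
As a concrete check of the figure reported above: AIC is defined as 2k - 2 * log-likelihood, where k is the number of estimated parameters. Assuming the Outcome variable is the ungrouped binary response shown in the summary, the residual deviance equals -2 * log-likelihood, so the AIC in the summary output can be reproduced by hand from the deviance and the seven estimated coefficients (the intercept plus VarA through VarF). A minimal sketch:

# AIC reproduced from the residual deviance and the parameter count #

84.197 + 2 * 7

# 98.197, matching the summary output above #

# R also reports these quantities directly #

AIC(bimodel)

logLik(bimodel)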

Let us now apply the stepAIC() function to our linear model and observe the results.

# With the “MASS” package downloaded and enabled #

stepAIC(bimodel)


This produces the output:

# Console Output #


Start: AIC=98.2
Outcome ~ VarA + VarB + VarC + VarD + VarE + VarF

Df Deviance AIC
- VarA 1 84.197 96.197
- VarF 1 84.414 96.414
- VarE 1 84.479 96.479
- VarC 1 84.891 96.891
- VarD 1 85.290 97.290
- VarB 1 86.022 98.022
<none> 84.197 98.197

Step: AIC=96.2
Outcome ~ VarB + VarC + VarD + VarE + VarF

Df Deviance AIC
- VarF 1 84.414 94.414
- VarE 1 84.479 94.479
- VarC 1 84.891 94.891
- VarD 1 85.290 95.290
<none> 84.197 96.197
- VarB 1 96.542 106.542

Step: AIC=94.41
Outcome ~ VarB + VarC + VarD + VarE

Df Deviance AIC
- VarE 1 84.677 92.677
- VarC 1 84.999 92.999
- VarD 1 85.586 93.586
<none> 84.414 94.414
- VarB 1 96.757 104.757

Step: AIC=92.68
Outcome ~ VarB + VarC + VarD

Df Deviance AIC
- VarC 1 85.485 91.485
- VarD 1 85.742 91.742
<none> 84.677 92.677
- VarB 1 132.815 138.815

Step: AIC=91.49
Outcome ~ VarB + VarD

Df Deviance AIC
- VarD 1 86.557 90.557
<none> 85.485 91.485
- VarB 1 139.073 143.073

Step: AIC=90.56
Outcome ~ VarB

Df Deviance AIC
<none> 86.557 90.557
- VarB 1 142.301 144.301

Call: glm(formula = Outcome ~ VarB, family = binomial, data = DataFrameA)

Coefficients:
(Intercept) VarB
-20.57 20.34

Degrees of Freedom: 144 Total (i.e. Null); 143 Residual
Null Deviance: 142.3
Residual Deviance: 86.56 AIC: 90.56
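
As a side note, stepAIC() returns the final fitted model object, so the suggested model can also be captured directly rather than re-typed. A minimal sketch (the name bestmodel is arbitrary; trace = FALSE simply suppresses the step-by-step log shown above):

bestmodel <- stepAIC(bimodel, trace = FALSE)

formula(bestmodel)

# Outcome ~ VarB #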

As illustrated, the ideal model that the stepAIC() function suggests is:


bimodel <- glm(Outcome ~ VarB, family=binomial, data=DataFrameA)

summary(bimodel)



# Console Output #

Call:
glm(formula = Outcome ~ VarB, family = binomial, data = DataFrameA)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.08424 -0.00005 -0.00005 -0.00005 1.27352

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -20.57 1957.99 -0.011 0.992
VarB 20.34 1957.99 0.010 0.992

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 142.301 on 144 degrees of freedom
Residual deviance: 86.557 on 143 degrees of freedom
AIC: 90.557

Number of Fisher Scoring iterations: 19

Now let’s measure the model’s predictive capacity.

# Generate Nagelkerke R Squared #

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(bimodel)

# Console Output #


McFadden Adj.McFadden Cox.Snell Nagelkerke McKelvey.Zavoina Effron Count Adj.Count
0.3917303 0.3495661 0.3191667 0.5104969 0.9686596 0.3114910 NA NA
AIC Corrected.AIC
90.5571588 90.6416659


As you can observe from the information presented above, the Nagelkerke value (0.51) has decreased slightly. However, the AIC value (90.56) has fallen by a much more substantial amount. This should be viewed as a positive occurrence: the lower the AIC value, the better the model is expected to account for new observational data. The slight decline in the Nagelkerke value is significantly offset by the large decline in AIC. Therefore, we can conclude that, given the independent variables present within the data set, the model below contains the optimal structure:


bimodel <- glm(Outcome ~ VarB, family=binomial, data=DataFrameA)
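
If both fits are retained under separate names (the names fullmodel and reducedmodel below are illustrative, since the text above re-uses the name bimodel for each), the AIC() function can compare them side by side:

# Compare the full and reduced fits directly #

fullmodel <- glm(Outcome ~ ., family = binomial, data = DataFrameA)

reducedmodel <- glm(Outcome ~ VarB, family = binomial, data = DataFrameA)

AIC(fullmodel, reducedmodel)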

For more information pertaining to the Akaike Information Criterion (AIC):

https://en.wikipedia.org/wiki/Akaike_information_criterion

For more information pertaining to the Akaike Information Criterion and the R-Squared statistic as quantifiable measurements:

https://stats.stackexchange.com/questions/140965/when-aic-and-adjusted-r2-lead-to-different-conclusions

That’s all for now, Data Heads! Stay subscribed for more substantive concepts.

Saturday, October 20, 2018

Analyzing Chi-Square Output (SPSS)

In prior articles, we discussed how to generate output within the SPSS platform as it pertains to the chi-squared methodology. The purpose of this entry is to answer inquiries which I have received related to chi-squared analysis; specifically, how to properly assess Risk Estimate and Cross-Tabulation tables.

To aid in the assessment of these output types, I have created the following charts and corresponding keys. Though the data which was utilized to create these tables is fictional, the fundamental aspects of the charts remain unaffected.

Cross-Tabulation Chart and Key


a = Individuals who smoked and received a cancer diagnosis.

b = Individuals who smoked and did not receive a cancer diagnosis.

c = Total number of individuals who were smokers.

d = Percentage of individuals who were smokers and received a cancer diagnosis.

e = Percentage of individuals who were smokers and did not receive a cancer diagnosis.

f = Total percentage of individual smokers.

g = Individuals who did not smoke and received a cancer diagnosis.

h = Individuals who did not smoke and did not receive a cancer diagnosis.

i = Total number of individuals who were not smokers.

j = Percentage of individuals who were not smokers and received a cancer diagnosis.

k = Percentage of individuals who were not smokers and did not receive a cancer diagnosis.

l = Total percentage of individual non-smokers.

m = Total number of individuals diagnosed with cancer.

n = Total number of individuals not diagnosed with cancer.

o = Total number of individuals surveyed.

p = Percentage of total surveyed individuals who were diagnosed with cancer.

q = Percentage of total surveyed individuals who were not diagnosed with cancer.

r = Total percentage of surveyed individuals.


Risk Estimate Chart and Key


a = The odds ratio indicates that the odds of finding cancer within an individual who smokes are 9.333 times the odds for an individual who does not smoke.

b = The risk ratio indicates that the outcome of this event (a cancer diagnosis) was 2.667 times as likely to occur within the smoker group as within the non-smoker group.

c = The number of total individuals surveyed.


Calculating the relative change in outcome: | 1 – risk estimate value | * 100

| 1 – 2.667 | * 100 ≈ 167

The risk ratio indicates that the risk of the outcome variable (cancer), within the category of smokers, increased by approximately 167% relative to the group of non-smokers.

(R) Importing Strange Data Formats

Today’s entry will discuss additional aspects pertaining to the R data importing process.

Importing an Excel File into the R Platform

To import a Microsoft Excel file into the R platform, you must first download the R package: “readxl”. It is important to note, prior to proceeding, that files read into R from Excel still maintain the escape characters that were present within the original format (\r, \t, etc.).

# With the package: ‘readxl’ downloaded and enabled #

# Import a single workbook sheet #

ExcelFile <- read_excel("A:\\FilePath\\File.xlsx")

# Import a single workbook sheet by specifying a specific sheet (3) for import #

ExcelFile <- read_excel("A:\\FilePath\\File.xlsx",  sheet = 3)
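
If the workbook’s sheet names are not known in advance, the same package provides excel_sheets(), which lists them:

# List the sheets contained within the workbook #

excel_sheets("A:\\FilePath\\File.xlsx")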


Export a Data Frame with a Specified Delimiter

There may be instances in which another party requests that you utilize a specific character to act as a data delimiter. In the case of our example, we will be utilizing the “|” (pipe character) to demonstrate functionality.

# Export Pipe-Delineated Data #

write.table(PipeExport, file = "A:\\FilePath\\PipeExport.txt", sep = "|", col.names = NA, row.names = TRUE)

Import a Data Frame with a Specified Delimiter

There will also be instances in which another party may provide data which utilizes a specific character to act as a data delimiter. Again, for our example, we will be utilizing the “|” (pipe character) to demonstrate functionality.

PipeImport <- read.delim("A:\\FilePath\\PipeImport.txt", fill = TRUE, header = TRUE, sep = "|")

That’s all for now. Stay subscribed, Data Heads!

Wednesday, October 17, 2018

Syntax – Pt. (II) (SPSS)

In a previous article, we discussed how to create SPSS syntax. Since that article appeared on this website, I have received numerous inquiries, both online and offline, pertaining to syntax functionality. In this article, I hope to demonstrate additional aspects of SPSS syntax which will increase proficiency within the subject.

Creating a New Variable to act as a Variable Flag

If an SPSS data frame contained six variables, and from such, you wished to create an additional variable to act as a flag identifying the instances in which any one of the six variables is equal to the number “1”, the code to accomplish this task would resemble the following:

if(Var1 = 1 OR Var2 = 1 OR Var3 = 1 OR Var4 = 1 OR Var5 = 1 OR Var6 = 1) VarFlag = 1.
exe.


In the case of our example above, a new variable, “VarFlag”, would be created and populated with the value of “1” whenever any of the aforementioned variables equaled “1”.

If instead, you wished to only identify instances in which “Var1” and “Var2” are equal to “1”, the code below could be utilized:

if (
Var1 = 1 AND Var2 = 1
) VarFlag = 1.
EXECUTE.


If you wished to flag instances in which “Var1” contained a missing value, the following code could be utilized to achieve such:

if ( MISSING(Var1) ) VarFlag = 1.
exe.


Finally, if you desired to only identify instances in which a text field contains a value, the following code could be utilized:

if (Var1 NE "") VarFlag = 1.
exe.


That’s all for now, Data Heads. I heavily encourage you to research this topic to better build your arsenal of analytic tools.