Reflections of a Data Scientist: (R) General Linear Mixed Models

A General Linear Mixed Model is a methodology of modeling utilized to create linear models which account for categorical variables. This is achieved through the synthesis of a model which contains dynamic intercept values as an aspect of its design. The categorical variables which are included as elements of the model design can be singular or numerous. However, if multiple variable of this nature are included, they will be assessed in categorical combinations.

Example:

We will be utilizing the following data vectors:

w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)
v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)
z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

These vectors can be combined to create a single data frame:

model <- data.frame(v, w, x, y, z)

If we were to create a simple linear model, the necessary code could be applied:

lmmodeltest <- lm(x ~ y + z, data=model)

summary(lmmodeltest)

Which produces the output:

Call:
lm(formula = x ~ y + z, data = model)

Residuals:
Min 1Q Median 3Q Max
-4.027 -2.882 -1.643 2.668 5.623

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.587008 53.572275 0.048 0.963
y 0.034052 0.099425 0.342 0.742
z -0.002543 0.285784 -0.009 0.993

Residual standard error: 4.267 on 7 degrees of freedom
Multiple R-squared: 0.01653, Adjusted R-squared: -0.2645
F-statistic: 0.05883 on 2 and 7 DF, p-value: 0.9433

Now, we will produce a general linear mixed model which accounts for the variable “w”.

(This variable, the variable which is accounted for, but is not as an independent variable within the model, is referred to as a “random variable”.)

# You will need to download and enable the package: “nlme” #

mmodeltest1 <- lme(x ~ y + z, data=model, random=~1|w)

summary(mmodeltest1)

This produces the output:

Linear mixed-effects model fit by REML
Data: model
AIC BIC logLik
65.41286 65.14241 -27.70643

Random effects:
Formula: ~1 | w

(Intercept) Residual

StdDev: 3.995423 1.498283

Fixed effects: x ~ y + z
Value Std.Error DF t-value p-value
(Intercept) 2.5870077 53.57227 7 0.0482900 0.9628
y 0.0340519 0.09942 7 0.3424887 0.7420
z -0.0025432 0.28578 7 -0.0088991 0.9931
Correlation:
(Intr) y
y -0.066
z -0.989 -0.081

Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-0.3313798 -0.2371629 -0.1352219 0.2195812 0.4627224

Number of Observations: 10
Number of Groups: 10

Generally speaking, the models look identical. What differs between the two models are the coefficient values. It is these values that determine the value of the intercept.

For example, the values for the intercept of the original linear model are:

coef(lmmodeltest)

Output:

(Intercept) y z
2.587007686 0.034051914 -0.002543224

However, the values for the general linear mixed model are:

coef(mmodeltest1)

Output:

(Intercept) y z
311 4.95003237 0.03405191 -0.002543224
331 7.48857278 0.03405191 -0.002543224
332 1.33355478 0.03405191 -0.002543224
351 -0.66296480 0.03405191 -0.002543224
352 7.51706478 0.03405191 -0.002543224
356 0.95902860 0.03405191 -0.002543224
357 0.43833139 0.03405191 -0.002543224
366 4.85601183 0.03405191 -0.002543224
383 -0.06589015 0.03405191 -0.002543224
399 -0.94366472 0.03405191 -0.002543224

As you can observe, the intercept coefficient values, which are sorted categorically, differ. However, the coefficient values of the dependent variables remain static. This phenomenon allows for dynamic model creation.

For example, if we were to test the simple linear model with prior data set variables, the outcome for the first observation would be:

yvalue <- 97

zvalue <- 188

interceptvalue <- 2.5870077

xvalue <- (yvalue * 0.034052) + (zvalue * -0.002543) + interceptvalue

xvalue

Which would produce the output:

[1] 5.41192

The actual value of x in the initial set was:

8

If we were to test the generalized mixed linear model with prior data set variables, the outcome for the first observation would be:

yvalue <- 97

zvalue <- 188

interceptvalue <- 4.85601183

xvalue <- (yvalue * .0340519) + (zvalue * -0.0025432) + interceptvalue

xvalue

Which would produce the output:

[1] 7.680925

Which is closer to the actual value of x:

8

To produce a general linear mixed model which accounts for both the variables “w” and “v”, the following code can be utilized:

mmodeltest2 <- lme(x ~ y + z, data=model, random=~1|w/v)

summary(mmodeltest2)

This code produces the output:

Linear mixed-effects model fit by REML
Data: model
AIC BIC logLik

67.41286 67.08832 -27.70643

Random effects:
Formula: ~1 | w
(Intercept)
StdDev: 2.916513

Formula: ~1 | v %in% w
(Intercept) Residual
StdDev: 2.916513 1.093692

Fixed effects: x ~ y + z
Value Std.Error DF t-value p-value
(Intercept) 2.5870077 53.57227 7 0.0482900 0.9628
y 0.0340519 0.09942 7 0.3424887 0.7420
z -0.0025432 0.28578 7 -0.0088991 0.9931

Correlation:
(Intr) y
y -0.066
z -0.989 -0.081

Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-0.24189513 -0.17312025 -0.09870703 0.16028629 0.33777045

Number of Observations: 10
Number of Groups:
w v %in% w
10 10

With the model output produced, we must again generate coefficient values. This can be achieved through the utilization of the following code:

coef(mmodeltest2)

Which produces the output:

(Intercept) y z
311/19 5.1052676 0.03405191 -0.002543224
331/13 7.8105734 0.03405191 -0.002543224
332/6 1.2512112 0.03405191 -0.002543224
351/17 -0.8764666 0.03405191 -0.002543224
352/12 7.8409371 0.03405191 -0.002543224
356/11 0.8520811 0.03405191 -0.002543224
357/12 0.2971775 0.03405191 -0.002543224
366/6 5.0050705 0.03405191 -0.002543224
383/10 -0.2401681 0.03405191 -0.002543224
399/18 -1.1756067 0.03405191 -0.002543224

Each combination of categorical variables, along with their coinciding intercept values, had been output in the above data stream. If the model were to be utilized to assess new observational values generated from the same system of observation, cross tabulated categorical values, would determine the value of the utilized intercept.

The type of categorical data which would satisfy the “random” element of model synthesis, typically would refer to variables such as: gender, race, ethnicity, zipcode, etc.

That’s all for now, Data Heads! Thanks for visiting! I appreciate your patronage.

Reflections of a Data Scientist

Monday, April 9, 2018

(R) General Linear Mixed Models

No comments:

Post a Comment