Reflections of a Data Scientist: (R) Partial Least Squares Regression

Partial least squares (PLS) regression shares numerous similarities with dimension reduction techniques. Both rely on matrix operations to create their output, and both seek to explain variance through a smaller set of derived components.

The output of a partial least squares regression provides both an estimate of the variance explained by each component and a working model with which to estimate values of the dependent variable. How this is achieved is illustrated in the example below.

Example:

(This example requires that the R packages “CCA” and “pls” be downloaded and enabled. R package names are case-sensitive, and the plsr() function used below is provided by “pls”.)

We will begin by defining our vector values:

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)

y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)

v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)

Next, we will combine these vectors into a single data frame.

testset <- data.frame(x, y, z, w, v)

With the packages “CCA” and “pls” enabled, we can create the following model:

plstestset <- plsr(w ~ x + y + z + v, data = testset, ncomp=4, validation="CV")

ncomp – Specifies the number of components to include within the model. Typically, we should first set this value to the number of independent variables contained within the model. After producing the initial output, we can then modify our model to include a specified number of components.

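validation – This indicates the type of validation method that will be utilized within our model. In the example, cross-validation (“CV”) was specified. With only ten observations, this results in ten leave-one-out segments, as the output below reports.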
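As a minimal sketch of the validation argument, assuming the “pls” package is installed, the same model can be fitted with leave-one-out validation requested explicitly through validation = "LOO". The object name plsloo is introduced here only for illustration and is not part of the original example.

```r
# Load the package that provides plsr()
library(pls)

# Rebuild the example data so this block runs on its own
testset <- data.frame(
  x = c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2),
  y = c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69),
  z = c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191),
  w = c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357),
  v = c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)
)

# Same formula as the example, with leave-one-out validation
# requested explicitly (plsloo is a hypothetical name)
plsloo <- plsr(w ~ x + y + z + v, data = testset,
               ncomp = 4, validation = "LOO")

summary(plsloo)
```

Setting validation = "none" instead would skip validation altogether, in which case the summary reports only the training variance explained.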

To view the output provided, we will need to run the code below:

summary(plstestset)

Which produces the following:

Data:   X dimension: 10 4
        Y dimension: 10 1
Fit method: kernelpls
Number of components considered: 4

VALIDATION: RMSEP
Cross-validated using 10 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV           26.98    28.47    38.84    40.81    40.85
adjCV        26.98    28.38    37.67    39.66    39.71

TRAINING: % variance explained
   1 comps  2 comps  3 comps  4 comps
X    76.45    82.76    91.55   100.00
w    12.29    39.46    39.69    39.72

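From this output, we can see that the percentage of variance in “w” explained by the model rises from 12.29 with one component to 39.46 with two, and then barely changes (39.69 and 39.72) as the third and fourth components are added. On that basis, two components should suffice for the creation of our model. Note, however, that the cross-validated RMSEP is lowest for the intercept-only model (26.98) and increases with each added component, which suggests that, with only ten observations, the model's accuracy on new data is limited. If we include extraneous components, we risk overfitting: the model follows the training data more closely while predicting new data less accurately.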
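The figures used to choose the number of components can also be pulled out of the model directly. The following is a minimal sketch, assuming the “pls” package is installed. The object names fit4 and cv are hypothetical, and leave-one-out validation is requested explicitly so that the segments match those reported in the output above.

```r
library(pls)

# Rebuild the example data so this block runs on its own
testset <- data.frame(
  x = c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2),
  y = c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69),
  z = c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191),
  w = c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357),
  v = c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)
)

# Four-component model (fit4 is a hypothetical name)
fit4 <- plsr(w ~ x + y + z + v, data = testset,
             ncomp = 4, validation = "LOO")

# Cross-validated RMSEP for 0 (intercept only) through 4 components
cv <- RMSEP(fit4, estimate = "CV")
print(cv)

# Percentage of X variance captured by each individual component
print(explvar(fit4))
```

Running validationplot(fit4, val.type = "RMSEP") afterwards draws the same RMSEP values as a curve, which can make the comparison across component counts easier to read.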

With the decision made to create a new model containing two components, we will refit the model with the following code:

plstestset <- plsr(w ~ x + y + z + v, data = testset, ncomp=2, validation="CV")

summary(plstestset)

This produces the output:

> summary(plstestset)
Data:   X dimension: 10 4
        Y dimension: 10 1
Fit method: kernelpls
Number of components considered: 2

VALIDATION: RMSEP
Cross-validated using 10 leave-one-out segments.
       (Intercept)  1 comps  2 comps
CV           26.98    28.47    38.84
adjCV        26.98    28.38    37.67

TRAINING: % variance explained
   1 comps  2 comps
X    76.45    82.76
w    12.29    39.46

Now that we have our model created, we can apply it to any data which contains the same independent variables. For our example, we will apply the model to the data contained within the original data set.

This can be achieved with the code below:

testset$fit <- predict(plstestset, newdata = testset)[,,2]

testset$fit – This creates a new column within the data frame which is specifically designated to contain the predicted values of the dependent variable.

plstestset – This is the model which will be utilized to create the predicted values of the dependent variable.

newdata – This value indicates the data to which the model will be applied.

[,,2] – predict() returns a three-dimensional array (observations by response variables by number of components). This index selects the predictions made with two components.

If we instead decided that we only wanted to utilize one component, our code would resemble:

testset$fit <- predict(plstestset, newdata = testset)[,,1]
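As an alternative to indexing the array, predict() also accepts an ncomp argument. The following is a minimal sketch, assuming the “pls” package is installed. The names allfits and fit2 are hypothetical and used only for illustration.

```r
library(pls)

# Rebuild the example data so this block runs on its own
testset <- data.frame(
  x = c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2),
  y = c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69),
  z = c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191),
  w = c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357),
  v = c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)
)

plstestset <- plsr(w ~ x + y + z + v, data = testset,
                   ncomp = 2, validation = "CV")

# Without ncomp, predict() returns a 3-D array:
# observations x response variables x number of components
allfits <- predict(plstestset, newdata = testset)
print(dim(allfits))

# Request the two-component predictions directly, then drop
# the singleton dimensions to obtain a plain vector
fit2 <- drop(predict(plstestset, newdata = testset, ncomp = 2))

# Both routes should give the same values
print(all.equal(unname(fit2), unname(allfits[, , 2])))
```

Passing ncomp keeps the number of components visible in the call itself, which may read more clearly than a bare array index.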

Running the following lines of code:

testset$fit <- predict(plstestset, newdata = testset)[,,2]

testset

Provides us with the following output:

    x  y   z   w  v      fit
1   8 97 188 366  6 339.7623
2   1 56 184 383 10 377.5278
3   4 97 181 332  6 351.2976
4  10 68 194 331 13 341.2700
5   8 94 190 311 19 331.5256
6  10 66 180 352 12 335.3762
7   3 81 192 356 11 363.3685
8   1 76 184 351 17 363.8741
9   1 86 192 399 18 363.3546
10  2 69 191 357 12 370.6433

The final column contains the model’s predicted values for the dependent variable (“w”).
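To get a rough sense of how closely the fitted values track the observed ones, we can compare the two columns directly. The sketch below uses only base R and the values printed in the table above. The names res, rmse and r2 are hypothetical.

```r
# Observed values of w and the two-component fitted values,
# copied from the output table above
w   <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)
fit <- c(339.7623, 377.5278, 351.2976, 341.2700, 331.5256,
         335.3762, 363.3685, 363.8741, 363.3546, 370.6433)

# Residuals: observed minus predicted
res <- w - fit

# Root mean squared error on the training data
rmse <- sqrt(mean(res^2))

# Share of the variance in w accounted for by the fitted values
r2 <- 1 - sum(res^2) / sum((w - mean(w))^2)

print(rmse)
print(round(100 * r2, 2))
```

The percentage printed last should agree closely with the 39.46 reported for “w” in the training summary. Because these figures are computed on the same data used to fit the model, they are more optimistic than the cross-validated RMSEP.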

Monday, April 16, 2018