Reflections of a Data Scientist: (R) TURF Analysis (SPSS)

Unless you possess a background in advertising, you probably have never heard of TURF Analysis. TURF is an acronym for “Total Unduplicated Reach and Frequency”. In contrast to many of the methods previously featured on this site, TURF Analysis is rather intuitive and easily derived.

This method seeks to generate a frequency output which identifies the best possible scenario: the combination of items that positively impacts the greatest number of individuals.

The initial data input is a data frame of response data gathered from a questionnaire. Surveyed individuals were asked to provide a binary response (Yes/No) indicating their preference for each of a series of items or scenarios. For example, you could imagine restaurant patrons being surveyed about which menu items they would consider ordering.
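To make the idea concrete before moving to the package, here is a minimal sketch in base R that scores every pair of items by hand. The data frame "responses", its four menu items, and the helper "reach_and_frequency" are all invented for illustration. Reach is taken here to be the share of respondents who said yes to at least one item in the set, and frequency to be the average number of yes answers per respondent within the set.

```r
# Toy survey: 6 respondents, 4 hypothetical menu items (1 = would order, 0 = would not)
responses <- data.frame(
  item_1 = c(1, 0, 0, 1, 0, 0),
  item_2 = c(0, 1, 0, 0, 0, 0),
  item_3 = c(0, 0, 1, 1, 0, 0),
  item_4 = c(1, 1, 0, 0, 0, 1)
)

# Reach: share of respondents with at least one "yes" within the item set
# Frequency: average number of "yes" answers per respondent within the item set
reach_and_frequency <- function(data, items) {
  yes_counts <- rowSums(data[, items, drop = FALSE])
  c(reach = mean(yes_counts > 0), frequency = mean(yes_counts))
}

# Score every possible pair of items
pairs <- combn(names(responses), 2, simplify = FALSE)
scores <- t(sapply(pairs, function(p) reach_and_frequency(responses, p)))
rownames(scores) <- sapply(pairs, paste, collapse = " + ")

# Sort by reach, then by frequency, best combination first
scores <- scores[order(-scores[, "reach"], -scores[, "frequency"]), ]
print(scores)
```

In this toy data, the pair item_3 + item_4 comes out on top because it leaves only one of the six respondents without something they would order.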

Example (R):

We will utilize the sample data frame “turf_ex_data”, which is included within the R package “turfR”.

# With the R package “turfR” installed, load it along with the sample data #

library(turfR)

data(turf_ex_data)

Initially, the data set contains the following columns:

respid – This column contains the ID of each individual respondent.

wgt – This is the assumed weight of each respondent. We will be ignoring this data for our example scenario.

Item_1 – Item_10 – This series of variables contains each individual’s binary response indicating their preference for the represented item.

Since weight is required for the TURF function included within the “turfR” package, we will modify the data frame so that every respondent is weighted equally for our example purposes. This is achieved by running the following code, which creates a copy of the initial data frame and then sets every weight value in the copy to “1”.

turftest <- turf_ex_data

turftest$wgt <- 1

How we decide to proceed depends on what we are seeking to achieve with our research. If we wanted to discover which combination of three items, out of the total set of ten, reaches the greatest number of respondents, we would run the following code:

# turf(data frame name, number of item columns, number of items per combination) #

exampleoutput <- turf(turftest, 10, 3)

exampleoutput

In this case, the following output would be produced:

> exampleoutput <- turf(turftest, 10, 3)
3 of 10: 0.02860188 sec
total time elapsed: 0.02960205 sec
> exampleoutput
$`turf`
$`turf`[[1]]
   combo      rchX      frqX 1 2 3 4 5 6 7 8 9 10
1    120 0.9944444 2.4277778 0 0 0 0 0 0 0 1 1  1
2    119 0.9944444 2.3888889 0 0 0 0 0 0 1 0 1  1
3    110 0.9944444 2.1666667 0 0 0 0 1 0 0 0 1  1
4    116 0.9888889 2.3333333 0 0 0 0 0 1 0 0 1  1
5    115 0.9888889 2.2611111 0 0 0 0 0 1 0 1 0  1
6    117 0.9888889 2.2333333 0 0 0 0 0 0 1 1 1  0
7    113 0.9888889 2.2222222 0 0 0 0 0 1 1 0 0  1
8    109 0.9888889 2.0944444 0 0 0 0 1 0 0 1 0  1
9     85 0.9888889 2.0500000 0 0 1 0 0 0 0 0 1  1
10    99 0.9888889 1.9944444 0 0 0 1 0 0 0 1 0  1

This output illustrates that there is a tie among the three-item combinations with the greatest reach. This is evident in combos 120, 119 and 110, all of which reach the same percentage of the sample (99.44%). If you were a restaurant owner who conducted this study to guide the process of reducing your menu to only three items, your next step would be to consider which of these three combinations provides the greatest net income to your establishment.
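One way to break such a tie can be sketched as follows. The three tied rows are retyped by hand from the output above (only the items that appear in them), and the per-item net income figures are entirely hypothetical, invented for illustration. Summing income per item is also a simplifying assumption, since it ignores how often each item is actually ordered.

```r
# The three tied combinations from the output above, retyped by hand
top_combos <- data.frame(
  combo   = c(120, 119, 110),
  rchX    = c(0.9944444, 0.9944444, 0.9944444),
  item_5  = c(0, 0, 1),
  item_7  = c(0, 1, 0),
  item_8  = c(1, 0, 0),
  item_9  = c(1, 1, 1),
  item_10 = c(1, 1, 1)
)

# HYPOTHETICAL net income per item; these figures are invented for illustration
net_income <- c(item_5 = 4.50, item_7 = 3.25, item_8 = 2.75,
                item_9 = 5.00, item_10 = 3.50)

# Total the income of the items included in each combination
item_cols <- names(net_income)
top_combos$income <- as.vector(as.matrix(top_combos[, item_cols]) %*% net_income)

# Keep the tied combination with the highest total income
best <- top_combos[which.max(top_combos$income), ]
print(best)
```

With these made-up figures, combo 110 (items 5, 9 and 10) would be the one to keep. Real income data could of course point to a different combination.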

In a different scenario, we may want to discover the best combination of three selections amongst variables 1-5. To achieve this, we would run the following code, which first drops the columns for items 6-10 (columns 8 through 12 of the data frame):

turftest <- turf_ex_data[, -c(8:12)]
turftest$wgt <- 1
exampleoutput <- turf(turftest, 5, 3)
exampleoutput

This produces the following output:

> exampleoutput <- turf(turftest, 5, 3)
3 of 5: 0 sec
total time elapsed: 0.01659989 sec
> exampleoutput
$`turf`
$`turf`[[1]]
   combo      rchX      frqX 1 2 3 4 5
1     10 0.8333333 1.2000000 0 0 1 1 1
2      9 0.7611111 1.0222222 0 1 0 1 1
3      8 0.7388889 1.0055556 0 1 1 0 1
4      6 0.7166667 0.9388889 1 0 0 1 1
5      5 0.7000000 0.9222222 1 0 1 0 1
6      7 0.7000000 0.9055556 0 1 1 1 0
7      4 0.6555556 0.8222222 1 0 1 1 0
8      3 0.6166667 0.7444444 1 1 0 0 1
9      2 0.5388889 0.6444444 1 1 0 1 0
10     1 0.5166667 0.6277778 1 1 1 0 0

$call
turf(data = turftest, n = 5, k = 3)
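The “combo” numbers in both outputs make more sense once you count the possible item sets. The short base R sketch below does not use “turfR” at all. It shows how many three-item sets exist for ten and for five items, and how the combo numbers displayed above line up with the order in which combn() lists those sets. That correspondence is an observation from the output shown here, not a documented guarantee of the package.

```r
# Number of distinct three-item sets
n_sets_of_10 <- choose(10, 3)
n_sets_of_5  <- choose(5, 3)
print(c(n_sets_of_10, n_sets_of_5))

# combn() lists every three-item set drawn from items 1-5, one set per column
all_sets <- combn(5, 3)

# Column 10 holds items 3, 4, 5, matching combo 10 in the output above
print(all_sets[, 10])

# Column 9 holds items 2, 4, 5, matching combo 9 in the output above
print(all_sets[, 9])
```

This is also why the first example reports a combo numbered 120: with ten items there are 120 possible three-item sets, and the set of items 8, 9 and 10 is the last one listed.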

For the “turf” function to work correctly, the data frame must be structured in the following manner:

id | weight | item(s)

It is also important to note that the function will return an error and will not proceed if the number of item columns specified in the function call does not exactly equal the number of item columns within the data frame.
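A quick check before calling the function can save some frustration. The sketch below uses a small invented data frame named “survey” and assumes the id | weight | item(s) layout described above, so the item count is simply the number of columns minus two.

```r
# Hypothetical data frame in the id | weight | item(s) layout
survey <- data.frame(
  respid = 1:4,
  wgt    = 1,
  item_1 = c(1, 0, 1, 0),
  item_2 = c(0, 0, 1, 1),
  item_3 = c(1, 1, 0, 0)
)

# Everything after the first two columns is treated as an item column
n_items <- ncol(survey) - 2

# Item columns should contain only 0/1 responses
items_are_binary <- all(as.matrix(survey[, -(1:2)]) %in% c(0, 1))

print(n_items)
print(items_are_binary)
```

The value of n_items is what would be supplied as the second argument of the turf function for this data frame.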

Example (SPSS):

To perform this example, you will need to export the “turf_ex_data” data frame as a .csv file. This can be achieved by modifying the file path in the code below:

write.table(turf_ex_data, file = "C:/Users/Desktop/turf_ex_data.csv", sep = ",", col.names = NA, row.names = TRUE)
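Note that the call above also writes the row names as an extra, unnamed first column. If you would rather not carry that column into SPSS, a variation along the following lines may help. This is only a sketch: it uses a small invented data frame in place of “turf_ex_data” and a temporary file in place of a real desktop path.

```r
# Hypothetical stand-in for turf_ex_data, so the sketch runs without turfR
toy <- data.frame(
  respid = 1:3,
  wgt    = 1,
  item_1 = c(1, 0, 1),
  item_2 = c(0, 0, 1)
)

# tempfile() is used here instead of a real desktop path
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the unnamed row-number column
write.csv(toy, file = csv_path, row.names = FALSE)

# Read the file back to confirm which columns were written
check <- read.csv(csv_path)
print(names(check))
```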

Once this file has been exported to the desktop, you can proceed with importing it into the SPSS platform.

To begin, you must first select “Analyze” from the upper left drop down menu, then select “Descriptive Statistics”, followed by “TURF Analysis”.

This should present the following interface:

Utilizing the top center arrow, designate all of the item variables as the “Variables to Analyze”. Next, set “Maximum Variable Combinations” to equal “10”. After this is complete, designate “Number of Combinations to Display” to equal “3”. Finally, remove the check mark adjacent to the box labeled, “Reach and frequency plot”. Once all of this is complete, click “OK”.

This should generate the following output:

This is the output that was initially requested. However, SPSS also provides additional output that is worth examining.

(Each item individually)

(Additional combinations)

You may also notice that SPSS provides a different output cell structure as compared to the unstructured and simplistic output of the R console.

Reach – Refers to the number of individuals satisfied, or reached*, within the current combination.

Pct of Cases – Percent of individuals within the reach category as compared to the entire sample size (x/180).

Frequency – The sum of the positive responses measured for each item within the categorical set**.

Pct of Response – The total frequency of the response (number of times individuals answered “1” to the item), divided by the total frequency measuring the positive responses pertaining to all items (number of times individuals answered “1” to any item including the response item).

*- Individuals who chose “yes” as it relates to an item variable within the combination.

**- Example: 132 positive responses for item 8, 145 positive responses for item 9, 160 positive responses for item 10. 132 + 145 + 160 = 437 (Frequency).
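The figures in the second footnote can be tied back to the R output with a few lines of arithmetic. The sketch below uses only numbers that appear in this article: the three item counts, the sample size of 180, and the rchX value reported for combo 120.

```r
sample_size <- 180

# Positive responses for items 8, 9 and 10, as given in the footnote
item_yes_counts <- c(item_8 = 132, item_9 = 145, item_10 = 160)

# SPSS "Frequency": total positive responses across the combination
frequency <- sum(item_yes_counts)

# Dividing by the sample size gives the frqX value that R reported for combo 120
frqX <- frequency / sample_size

# R's rchX is a proportion; multiplying by the sample size recovers the reach count
reach <- round(0.9944444 * sample_size)

# SPSS "Pct of Cases": reach as a share of the sample
pct_of_cases <- reach / sample_size

print(c(frequency = frequency, frqX = frqX, reach = reach, pct_of_cases = pct_of_cases))
```

In other words, the two programs report the same quantities. R expresses reach and frequency relative to the sample size, while SPSS shows the raw counts alongside the percentages.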

That’s all for now! Stay enthused, Data Heads!

Wednesday, May 30, 2018
