tag:blogger.com,1999:blog-16087687369139309262023-01-13T20:26:21.438-05:00Reflections of a Data ScientistData Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comBlogger173125tag:blogger.com,1999:blog-1608768736913930926.post-6919489766985958022022-12-13T18:51:00.009-05:002022-12-13T19:02:10.148-05:00(R) Stein’s Paradox / The James-Stein Estimator<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1nooGqz4h0LWGJ0Gz31oPyGyfgb1jy1jY8TvGFKAU-5h-yKGOyAWvVKmkiU9C7vJYdBLXA4qmaRY8wJevhFrnqbQvg4h3WvZClcwBWHvnnaEWhonpxtD6g1OGFmDhNC4fx9VESsWOb0K2z_VPHTZbwpOiJDYubSz_85ATytweMIRwQYqwQ4vc-BTc/s1280/Okabe.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="1280" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1nooGqz4h0LWGJ0Gz31oPyGyfgb1jy1jY8TvGFKAU-5h-yKGOyAWvVKmkiU9C7vJYdBLXA4qmaRY8wJevhFrnqbQvg4h3WvZClcwBWHvnnaEWhonpxtD6g1OGFmDhNC4fx9VESsWOb0K2z_VPHTZbwpOiJDYubSz_85ATytweMIRwQYqwQ4vc-BTc/w400-h225/Okabe.jpg" width="400" /></a></div><div><br /></div>Imagine a situation in which you were provided with data samples from numerous independent populations. Now, what if I told you that combining all of the samples into a single equation is the best methodology for estimating the mean of each population?<br /><br />Hold on. <br /><br />Hold on. <br /><br />Wait. <br /><br />You’re telling me that pooling independently sampled data from independent sources can inform estimates pertaining to the source of each individual sample? <br /><br />Yes! <br /><br />And this methodology provides a better estimator than other available conventional methods? <br /><br />Yes again. <br /><br />This was the conversation which divided the math world in 1956. 
<br /><br />Here is an article from Scientific American detailing the phenomenon and the findings of Charles Stein (.PDF Warning): <br /><br /><a href="https://efron.ckirby.su.domains//other/Article1977.pdf">https://efron.ckirby.su.domains//other/Article1977.pdf</a> <br /><br />Since we have computers, let’s give the James-Stein Estimator a little test-er-roo. In the digital era, we are no longer forced to accept hearsay proofs. <br /><br />(The code below is a heavily modified and simplified version of code which was originally queried from: <a href="https://bookdown.org/content/922/james-stein.html">https://bookdown.org/content/922/james-stein.html</a>)<div><br /><b>################################################################################## <br /><br />### Stein’s Paradox / The James-Stein Estimator ### <br /><br />## We begin by creating 4 independent samples generated from normally distributed data sources ## <br /><br />## Each sample is composed of random numbers ## <br /><br /># 100 Random Numbers, Mean = 500, Standard Deviation = 155 # <br /><br />Ran_A <- rnorm(100, mean=500, sd=155) <br /><br /># 100 Random Numbers, Mean = 50, Standard Deviation = 22 # <br /><br />Ran_B <- rnorm(100, mean=50, sd=22) <br /><br /># 100 Random Numbers, Mean = 1, Standard Deviation = 2 # <br /><br />Ran_C <- rnorm(100, mean=1, sd=2) <br /><br /># 100 Random Numbers, Mean = 1000, Standard Deviation = 400 # <br /><br />Ran_D <- rnorm(100, mean=1000, sd=400) <br /><br /># I went ahead and sampled a few of the elements from each series which were generated by my system # <br /><br />testA <- c(482.154, 488.831, 687.691, 404.691, 604.8, 639.283, 315.656) <br /><br />testB <- c(53.342841, 63.167245, 47.223326, 44.532218, 53.527203, 40.459877, 83.823073) <br /><br />testC <- c(-1.4257942504, 2.2265732374, -0.6124066829, -1.7529138598, -0.0156957983, -0.6018709735) <br /><br />testD <- c(1064.62403, 1372.42996, 976.02130, 1019.49588, 570.84984, 82.81143, 517.11726, 
1045.64377) <br /><br /># We must now create a series which contains all of the sample elements # <br /><br />testall <- c(testA, testB, testC, testD) <br /><br /># Then we will take the mean measurement of each sampled series # <br /><br />MLEA <- mean(testA) <br /><br />MLEB <- mean(testB) <br /><br />MLEC <- mean(testC) <br /><br />MLED <- mean(testD) <br /><br /># Next, we will derive the mean of the combined sample elements # <br /><br />p_ <- mean(testall) <br /><br /># We must assign to ‘N’ the number of sets which we are assessing # <br /><br />N <- 4 <br /><br /># We must also derive the median of the combined sample elements # <br /><br />medianden <- median(testall) <br /><br /># sigma2 = mean(testall) * (1 - mean(testall)) / medianden # <br /><br />sigma2 <- p_ * (1-p_) / medianden <br /><br /># Now we’re prepared to calculate the assumed population mean of each sample series # <br /><br />c_A <- p_+(1-((N-3)*sigma2/(sum((MLEA-p_)^2))))*(MLEA-p_) <br /><br />c_B <- p_+(1-((N-3)*sigma2/(sum((MLEB-p_)^2))))*(MLEB-p_) <br /><br />c_C <- p_+(1-((N-3)*sigma2/(sum((MLEC-p_)^2))))*(MLEC-p_) <br /><br />c_D <- p_+(1-((N-3)*sigma2/(sum((MLED-p_)^2))))*(MLED-p_) <br /><br />################################################################################## <br /><br /># Predictive Squared Error of the James-Stein estimates # <br /><br />PSE1 <- (c_A - 500) ^ 2 + (c_B - 50) ^ 2 + (c_C - 1) ^ 2 + (c_D - 1000) ^ 2 <br /><br />######################## <br /><br /># Predictive Squared Error of the unadjusted sample means # <br /><br />PSE2 <- (MLEA - 500) ^ 2 + (MLEB - 50) ^ 2 + (MLEC - 1) ^ 2 + (MLED - 1000) ^ 2 <br /><br />######################## <br /><br /># Relative improvement: 1 - (PSE1 / PSE2) # <br /><br />1 - 28521.5 / 28856.74 <br /><br />################################################################################## <br /></b><br /><i>1 - 28521.5 / 28856.74 = 0.01161739</i></div><div><br /></div><div>So, utilizing squared prediction error as an accuracy assessment technique, we can conclude that Stein’s methodology (AKA the James-Stein Estimator) provided a 1.16% 
better estimation of the population mean for each series, as compared to the mean of each sample series assessed independently. <br /><br />Charles Stein truly was a pioneer in the field of statistics, as he discovered one of the first instances of shrinkage estimation. <br /><br />If we consider our example data sources below: </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIF_Ce4N-2T_MoQ6SN0SfDp4VeSR6Juh7DtA4AG31Uhjo0wbTR0ptMxrwCnRrA2QYxQogIrjeAZODYo1sydbIvI1cmRFhU719NLCxR-30O1gEbx78u_x_KqAv4ZY7-CtWmLCbCfPKUaA8MQO3F1G-UXYxnggR5PoteD4CNJaE9b9kw0OcFHBwtZdVf/s975/Stein_0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="435" data-original-width="975" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIF_Ce4N-2T_MoQ6SN0SfDp4VeSR6Juh7DtA4AG31Uhjo0wbTR0ptMxrwCnRrA2QYxQogIrjeAZODYo1sydbIvI1cmRFhU719NLCxR-30O1gEbx78u_x_KqAv4ZY7-CtWmLCbCfPKUaA8MQO3F1G-UXYxnggR5PoteD4CNJaE9b9kw0OcFHBwtZdVf/w640-h285/Stein_0.png" width="640" /></a></div><div><div class="separator" style="clear: both; text-align: center;"><br /></div>Applying the James-Stein Estimator to the data samples from each series’ source reduces the innate distance which exists between each sample. In simpler terms, this essentially equates to all elements within each sample being shifted towards a central point. 
<br /><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs6XMHkiUpqP-Q5Lw6dD1dswWEbBbV4hvwKxDCujpwNSenkTFuGVylO90ZDwk8ViS3qptLjavdge-IMZrOKuzDk8AfIvw5DvDqMHA-h8ID1RD34NLuGocMZZmLpcOIEcdCPEUbjFhRUdO1o-f-nj3aoVEpfcIdgJ5Q0FcpkoH95eM4chnfxVPRofbd/s674/Stein_1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="401" data-original-width="674" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs6XMHkiUpqP-Q5Lw6dD1dswWEbBbV4hvwKxDCujpwNSenkTFuGVylO90ZDwk8ViS3qptLjavdge-IMZrOKuzDk8AfIvw5DvDqMHA-h8ID1RD34NLuGocMZZmLpcOIEcdCPEUbjFhRUdO1o-f-nj3aoVEpfcIdgJ5Q0FcpkoH95eM4chnfxVPRofbd/w400-h238/Stein_1.png" width="400" /></a></div></blockquote><div><br />Series elements which were already in close proximity to the mean, now move slightly closer to the mean. Series elements which were originally far from the mean, move much closer to the mean. These outside elements still maintain their order, but they are brought closer to their fellow series peers. This shifting of the more extreme elements within a series, is what makes the James-Stein Estimator so novel in design, and potent in application. <br /><br />This one really blew my noggin when I first discovered and applied it. 
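The shrinkage described above can also be expressed directly in code. Below is a minimal Python sketch of the classical James-Stein formula in its Efron-Morris form, which shrinks each observed sample mean toward the grand mean under an assumed known, common variance. The function name, the example values, and the variance figure are all hypothetical illustrations of the general technique, not a reproduction of the R code above.

```python
import numpy as np

def james_stein(means, sigma2):
    """Shrink observed group means toward their grand mean.

    Efron-Morris form of the James-Stein estimator, which shrinks
    toward the grand mean and therefore requires at least 4 groups
    (hence the k - 3 term). `sigma2` is the assumed known, common
    variance of each observed mean.
    """
    means = np.asarray(means, dtype=float)
    k = means.size
    grand = means.mean()
    squared_distance = np.sum((means - grand) ** 2)
    shrink = 1.0 - (k - 3) * sigma2 / squared_distance
    return grand + shrink * (means - grand)

# Hypothetical example: four observed sample means
observed = [500.0, 50.0, 1.0, 1000.0]
shrunk = james_stein(observed, sigma2=25.0)

# Every estimate moves toward the grand mean (387.75), with the
# extreme values moving the most in absolute terms, while the
# relative ordering of the estimates is preserved.
```

Note that the "positive-part" variant of the estimator clips the shrinkage coefficient at zero, which guards against the coefficient turning negative when the observed means all sit very close to the grand mean.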
<br /><br />For more information on this noggin-blowing technique, please check out: <br /><br /><a href="https://www.youtube.com/watch?v=cUqoHQDinCM">https://www.youtube.com/watch?v=cUqoHQDinCM</a><br /><br /></div><div><b>~ and ~ </b><br /><a href="https://www.statisticshowto.com/james-stein-estimator/"><br />https://www.statisticshowto.com/james-stein-estimator/</a></div><div><br /></div><div>That's all for today.</div><div><br /></div><div>Come back again soon for more perspective-altering articles.</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-50122796639483528512022-10-28T18:54:00.005-04:002022-10-28T18:59:22.542-04:00(Python) The Number: 13 (Happy, Lucky, Primes) - A Spooky Special!This Halloween, we’ll be covering a very spooky topic. <div><br />I feel that the number “<b>13</b>” has, for far too long, been unfairly maligned. <br /><br />Today, it will have its redemption. <br /><br />Did you know that the number “<b>13</b>”, by some definitions, is both a happy and a lucky number? Let’s delve deeper into each definition, and together discover why this number deserves much more respect than it currently receives.<div><br /></div><div><b><u>Happy Numbers</u></b><br /> <br />In number theory, a happy number is a number which eventually reaches 1 when repeatedly replaced by the sum of the squares of its digits. 
* </div><div><br /></div><div><b>Example:</b> <br /><br />For instance, <b>13 </b>is a happy number because: <br /><br />(1 * 1) + (3 * 3) = 10 <br /><br />(1 * 1) + (0 * 0) = 1 <br /><br />and the number <b>19</b> is also a happy number because: <br /><br />(1 * 1) + (9 * 9) = 82 <br /><br />(8 * 8) + (2 * 2) = 68 <br /><br />(6 * 6) + (8 * 8) = 100 <br /><br />(1 * 1) + (0 * 0) + (0 * 0) = 1 <br /><br /><i>*- <a href="https://en.wikipedia.org/wiki/Happy_number">https://en.wikipedia.org/wiki/Happy_number</a></i></div><div><br /></div><div><b><u>Lucky Numbers</u></b></div><br />In number theory, a lucky number is a natural number in a set which is generated by a certain "sieve". *<br /><br />In the case of our (lucky) number generation process, we will be utilizing "the sieve of Josephus Flavius". <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3G6zZf-mIs_UuT09qoWD24JXkL3tzBPR__AmhxgfVCGfY581EmnILZwP9tIwly20xw_qkeCLCS9zAMkyDQMEyZabIKdnyTtmSwFqk2ryK-U3wrIxT1-vQhMaHl9I_rwHdiI8AAZNKuEXd8AwJ3RB2VKaw1HF7mul11xVfqGpPn8x5RMmqc1L35Owi/s279/Flavius.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="279" data-original-width="219" height="279" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3G6zZf-mIs_UuT09qoWD24JXkL3tzBPR__AmhxgfVCGfY581EmnILZwP9tIwly20xw_qkeCLCS9zAMkyDQMEyZabIKdnyTtmSwFqk2ryK-U3wrIxT1-vQhMaHl9I_rwHdiI8AAZNKuEXd8AwJ3RB2VKaw1HF7mul11xVfqGpPn8x5RMmqc1L35Owi/s1600/Flavius.png" width="219" /></a></div><br /><b>Example:</b><br /><div><br /></div><div>Beginning with a list of integers from 1 – 20:<br /><br />{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20} <br /><br />We will remove all even numbers: <br /><div><br /></div><div>1, <b>2</b>, 3, <b>4</b>, 5, <b>6</b>, 7, <b>8</b>, 9, <b>10</b>, 11, <b>12</b>, 13, <b>14</b>, 15, <b>16</b>, 17, <b>18</b>, 19, <b>20</b><br /><br />Leaving: <br
/><br />1, 3, 5, 7, 9, 11, 13, 15, 17, 19 <br /><br />The first remaining number after the number “<b>1</b>” is the number “<b>3</b>”. Therefore, every third number within the list must be removed: <br /><br />1, 3, <b>5</b>, 7, 9, <b>11</b>, 13, 15, <b>17</b>, 19 <br /><br />Leaving: <br /><br />1, 3, 7, 9, 13, 15, 19 <br /><br />Next, we will remove each seventh entry within the remaining list, as the number “<b>7</b>” is the value which occurs subsequent to “<b>3</b>”: <br /><br />1, 3, 7, 9, 13, 15, <b>19 <br /></b><br />Leaving: <br /><br />1, 3, 7, 9, 13, 15 <br /><br />If we were to continue with this process, each ninth entry would then also be removed from the remaining list, as the number “<b>9</b>” is the number which occurs subsequent to “<b>7</b>”. Since only 6 elements remain from our initial set, the process ends here. <br /><br />We can then conclude that the following numbers are indeed lucky: <br /><br /> 1, 3, 7, 9, 13, 15<div><br /></div><div><i>*- <a href="https://en.wikipedia.org/wiki/Lucky_number">https://en.wikipedia.org/wiki/Lucky_number</a> </i><br /><br /><b style="text-decoration: underline;">Prime Numbers</b> <br /><br />A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. 
* <br /><br />13 fits this categorization, as it can only be factored down to a product of 13 and 1.<br /><br /><i>*- <a href="https://en.wikipedia.org/wiki/Prime_number">https://en.wikipedia.org/wiki/Prime_number</a> <br /></i><br /><b><u>(Python) Automating the Process </u></b><br /><br />Now that I have hopefully explained each concept in an understandable way, let’s automate some of these processes.</div></div></div><div><br /><b><u>Happy Numbers </u></b><br /><br /><b># Create a list of happy numbers between 1 and 100 # </b><br /><br /><b># https://en.wikipedia.org/wiki/Happy_number # <br /><br /># This code is a modified variation of the code found at: # <br /><br /># https://www.phptpoint.com/python-program-to-print-all-happy-numbers-between-1-and-100/ # <br /><br /><br /></b></div><div><b># Python program to print all happy numbers between 1 and 100 # </b><br /><br /><br /><b># isHappyNumber() will determine whether a number is happy or not #</b><br /><br /><b>def isHappyNumber(num): <br /></b></div><div><b><br /> <span> </span>rem = sum = 0; </b><br /><br /><br /><b># Calculates the sum of squares of digits #<br /><br /> <span> </span>while(num > 0): <br /><br /> <span> <span> </span></span>rem = num%10; <br /><br /> <span> <span> </span></span>sum = sum + (rem*rem); <br /><br /> <span> <span> </span></span>num = num//10; <br /><br /> <span> </span>return sum; </b><br /><br /><br /><b># Displays all happy numbers between 1 and 100 # <br /><br />print("List of happy numbers between 1 and 100: \n 1"); <br /></b><br /><br /><br /><b># for i in range(1, 101):, always utilize n+1 as it pertains to the number of element entries within the set # <br /><br /># Therefore, for our 100 elements, we will utilize 101 as the range variable entry # <br /></b><br /><br /><b>for i in range(1, 101): <br /><br /> <span> </span>result = i; <br /><br /><br /></b></div><div><b><span> </span>while(result != 1 and result != 4): <br /><br /> <span> <span> </span></span>result = 
isHappyNumber(result); <br /><br /> <span> <span> </span></span>if(result == 1): <br /><br /> <span> <span> <span> </span></span></span>print(i); </b><br /><br /><br /><u>Console Output:</u> <br /><br /><i>List of happy numbers between 1 and 100: <br />1 <br />7 <br />10 <br />13 <br />19 <br />23 <br />28 <br />31 <br />32 <br />44 <br />49 <br />68 <br />70 <br />79 <br />82 <br />86 <br />91 <br />94 <br />97 <br />100</i><br /><br /><b># Code which verifies whether a number is a happy number # <br /><br /># Code Source: # https://en.wikipedia.org/wiki/Happy_number # <br /><br /># This process is unfortunately two steps # </b></div><div><b><br /></b></div><div><br /><b>def pdi_function(number, base: int = 10): <br /><br /> <span> </span>"""Perfect digital invariant function.""" <br /><br /> <span> </span>total = 0 <br /><br /> <span> </span>while number > 0: <br /><br /> <span> <span> </span></span>total += pow(number % base, 2) <br /><br /> <span> <span> </span></span>number = number // base <br /><br /> return total <br /><br /><br />def is_happy(number: int) -> bool: <br /><br /> <span> </span>"""Determine if the specified number is a happy number.""" <br /><br /> <span> </span>seen_numbers = set() <br /><br /> <span> </span>while number > 1 and number not in seen_numbers: <br /><br /> <span> <span> </span></span>seen_numbers.add(number) <br /><br /> <span> <span> </span></span>number = pdi_function(number) <br /><br /> <span> </span>return number == 1 </b><br /><br /><br /><b># First, we must run the initial function on the number in question # <br /><br /># This function will calculate the number’s perfect digital invariant value # <br /><br /># Example, for 13 # </b><br /><br /><b>pdi_function(13)</b><br /><br /><u>Console Output:</u> <br /><br /><i>10 </i><br /><br /><b># The output value of the first function must then be input into the subsequent function, in order to determine whether or not the tested value (ex. 
13) can appropriately be deemed “happy”. #<br /><br />is_happy(10) </b><br /><br /><u>Console Output:</u> <br /><br /><i>True </i><br /><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /><b><u>Lucky Numbers</u></b> <br /><br /><b># https://en.wikipedia.org/wiki/Lucky_number # </b><br /><br /><b># The code below will determine whether or not a number is "lucky", as defined by the above definition of the term #</b><br /><br /><b># The variable ‘number check’, must be set equal to the number which we wish to assess # <br /><br />number_check = 99 </b><br /><br /><br /><b># Python code to convert list of # <br /><br /># string into sorted list of integer # <br /><br /># https://www.geeksforgeeks.org/python-program-to-convert-list-of-integer-to-list-of-string/ # </b><br /><br /><br /><b># List initialization <br /><br />list_int = list(range(1,(number_check + 1),1)) </b><br /><br /><br /><b># mapping <br /><br />list_string = map(str, list_int) </b><br /><br /><br /><b># Printing sorted list of integers <br /><br />numbers = (list(list_string)) <br /></b><br /><br /><b># https://stackoverflow.com/questions/64956140/lucky-numbers-in-python # </b><br /><br /><br /><b>def lucky_numbers(numbers): <br /><br /> <span> </span>index = 1 <br /><br /> <span> </span>next_freq = int(numbers[index]) <br /><br /> <span> </span>while int(next_freq) < len(numbers): <br /><br /> <span> <span> </span></span>del numbers[next_freq-1::next_freq] <br /><br /> <span> <span> </span></span>print(numbers) <br /><br /> <span> <span> </span></span>if str(next_freq) in numbers: <br /><br /> <span> <span> <span> </span></span></span>index += 1 <br /><br /> <span> <span> <span> </span></span></span>next_freq = int(numbers[index]) <br /><br /> else: <br /><br /> <span> </span>next_freq = int(numbers[index]) <br /><br /> return <br /><br /><br />lucky_numbers(numbers) <br /></b><br /><br /><u>Console Output:</u> <br /><br />['1', '3', '5', '7', '9', '11', '13', '15', '17', '19', 
'21', '23', '25', '27', '29', '31', '33', '35', '37', '39', '41', '43', '45', '47', '49', '51', '53', '55', '57', '59', '61', '63', '65', '67', '69', '71', '73', '75', '77', '79', '81', '83', '85', '87', '89', '91', '93', '95', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '19', '21', '25', '27', '31', '33', '37', '39', '43', '45', '49', '51', '55', '57', '61', '63', '67', '69', '73', '75', '79', '81', '85', '87', '91', '93', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '27', '31', '33', '37', '43', '45', '49', '51', '55', '57', '63', '67', '69', '73', '75', '79', '85', '87', '91', '93', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '45', '49', '51', '55', '63', '67', '69', '73', '75', '79', '85', '87', '93', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '49', '51', '55', '63', '67', '69', '73', '75', '79', '85', '87', '93', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '49', '51', '63', '67', '69', '73', '75', '79', '85', '87', '93', '99'] </p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">The output of this function returns a series of numbers up to and including the number which is being assessed. 
Therefore, from this function's application, we can conclude that the following numbers are "lucky":</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><i>['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '49', '51', '63', '67', '69', '73', '75', '79', '85', '87', '93', '99'] </i><br /><br />(Only consider the final output as valid, as all other outputs are intermediate results generated throughout the reduction process)<br /><br /><u><b>Prime Numbers</b></u> <br /><br /><b># The code below is rather self-explanatory #</b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><br /></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b># It is utilized to generate a list of prime numbers included within the range() function # <br /></b><br /><b># Source: https://stackoverflow.com/questions/52821002/trying-to-get-all-prime-numbers-in-an-array-in-python # <br /><br />checkMe = range(1, 100) <br /><br />primes = [] <br /><br />for y in checkMe[1:]: <br /><br /> <span> </span>x = y <br /><br /> <span> </span>dividers = [] <br /><br /> <span> </span>for x in range(2, x): <br /><br /> <span> <span> </span></span>if (y/x).is_integer(): <br /><br /> <span> <span> <span> </span></span></span>dividers.append(x) <br /><br /> if len(dividers) < 1: <br /><span> </span><br /> <span> </span>primes.append(y) <br /><br />print("\n"+str(checkMe)+" has "+str(len(primes))+" primes") <br /><br />print(primes) <br /></b><br /><u>Console Output:</u> <br /><br /><i>range(1, 100) has 25 primes <br />[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]</i><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><i><br /></i></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><u>Conclusion</u></b></p><p style="font-stretch: normal; line-height: normal; margin: 
<b><u><br /></u></b></p>">
0px;"><b><u><br /></u></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">Having performed all of the previously mentioned tests and functions, I hope that you have been provided with adequate information to reconsider <b>13</b>'s unlucky status. Based upon my number theory research, I feel that enough evidence exists to at least relegate the number <b>13</b> to the status of "misunderstood".</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b>13 </b>isn't to be feared or avoided. It actually shares unique company amongst other "<b>Happy Primes</b>":</p><br />7, 13, 19, 23, 31, 79, 97, 103, 109, 139, 167, 193, 239, 263, 293, 313, 331, 367, 379, 383, 397, 409, 487<p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><u><br /></u></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">Even intermingling with the company of "<b>Lucky Primes</b>":</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">3, 7, 13, 31, 37, 43, 67, 73, 79, 127, 151, 163, 193, 211, 223, 241, 283, 307, 331, 349, 367, 409, 421, 433, 463, 487</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">And being a member of the very exclusive group, the "<b>Happy Lucky Primes</b>":</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">7, 13, 31, 79, 193, 331, 367, 409, 487</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><u><br /></u></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">----------------------------------------------------------------------------------------------------------------------------- <br /><br />I 
wish you all a very happy and safe holiday.</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">I'll see you next week with more (non-spooky) content!</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br />-RD</p></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-72483959986085384412022-10-24T19:09:00.000-04:002022-10-24T19:09:37.732-04:00(Python) Machine Learning – Keras – Pt. VThroughout the previous articles, we thoroughly explored the various machine learning techniques which employ tree methodologies as their primary mechanism. We also discussed the evolution of machine learning techniques, and how gradient boosting eventually came to overtake the various forest models as the preferred standard. However, the gradient boosted model was soon replaced by the Keras model. The latter still remains the primary method of prediction at this present time. <br /><br />Keras differs from all of the other models in that it does not utilize the tree or forest methodologies as its primary mechanism of prediction. Instead, Keras employs something similar to a binary categorical method, in that, an observation is fed through the model, and at each subsequent layer prior to the output, Keras decides what the observation is, and what the observation is not. This may sound somewhat complicated, and in all manners, it truly is. However, what I am attempting to illustrate will become less opaque as you continue along with the exercise. <br /><br />A final note prior to delving any further, Keras is a member of a machine learning family known as deep learning. Deep learning can essentially be defined as an algorithmic analysis of data which can evaluate non-linear relationships. This analysis also provides dynamic model re-calibration throughout the modeling process. 
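Before illustrating the model's structure, it may help to see where the non-linearity actually enters. The sketch below shows, in plain numpy rather than Keras, what a single layer of such a network computes: a weighted sum of its inputs passed through a non-linear activation function. All of the weight, bias, and input values below are hypothetical.

```python
import numpy as np

# ReLU: a common non-linear activation function
def relu(x):
    return np.maximum(0.0, x)

# One observation with three input features (hypothetical values)
observation = np.array([5.1, 3.5, 1.4])

# Hypothetical weights and biases for a hidden layer with two neurons
weights = np.array([[0.2, -0.5],
                    [0.4,  0.1],
                    [-0.3, 0.8]])
biases = np.array([0.1, -0.2])

# Each neuron computes a weighted sum of the inputs, adds a bias,
# and then applies the activation function to the result
hidden = relu(observation @ weights + biases)
print(hidden)
```

Stacking several such layers, with each layer feeding its activations forward as the next layer's inputs, is what allows the network to capture the non-linear relationships described above.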
<br /><br /><b><u>Keras Illustrated</u></b> <br /><br />Below is a sample illustration of a Keras model which possesses a continuous dependent variable. The series of rows on the left represents the observational data which will be sent through the model so that it may “learn”. Each circle represents what is known as “neuron”, and each column of circles represents what is known as a “layer”. The sample illustration has 4 layers. The leftmost layer is known as the input layer, the middle two layers are known as the hidden layers, and the rightmost layer is referred to as the output layer.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSHf4HSgRjy6KbE6qG_BODTXGg58ZQBUjrZ1fTCo3bgtZEI9IraIMeeuUgfCJkRFJFyM28Gpf1tZH4Awf-43VvtqRrZ4ZZDbGfMuAi4dlMwaKnSqGxDqlaaUsof0k_jCPXz18X414ZzOOKdQFym2eMhFSsRVT1iVrG-oyP1TrK5-idd0KFykZDpaFS/s900/Keras_Nue.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="553" data-original-width="900" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSHf4HSgRjy6KbE6qG_BODTXGg58ZQBUjrZ1fTCo3bgtZEI9IraIMeeuUgfCJkRFJFyM28Gpf1tZH4Awf-43VvtqRrZ4ZZDbGfMuAi4dlMwaKnSqGxDqlaaUsof0k_jCPXz18X414ZzOOKdQFym2eMhFSsRVT1iVrG-oyP1TrK5-idd0KFykZDpaFS/w400-h246/Keras_Nue.png" width="400" /></a></div><div><br /></div>Due to the model’s continuous dependent variable classification, it will only possess a single output layer. 
If the dependent variable were categorical, it would have an appearance similar to the graphic below:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyAfhrdF8jSOeEmldo8-lN3NlRIuz3BfaQorYzsd5pUBKLEHI6s0t8rYn1mGXLMdD9nuWXe032oeeGeKWjiQbYeh2BmDPzpAIxMqRwLthF1Cz9XAs8uzzQ5RCXvUrITlP__9wZGslIRlJL1E8z52Y8wvTpIIaMfStwo7AIxvwK1IM1EG9HHNjSe1K9/s900/Keras_Nue2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="553" data-original-width="900" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyAfhrdF8jSOeEmldo8-lN3NlRIuz3BfaQorYzsd5pUBKLEHI6s0t8rYn1mGXLMdD9nuWXe032oeeGeKWjiQbYeh2BmDPzpAIxMqRwLthF1Cz9XAs8uzzQ5RCXvUrITlP__9wZGslIRlJL1E8z52Y8wvTpIIaMfStwo7AIxvwK1IM1EG9HHNjSe1K9/w400-h246/Keras_Nue2.png" width="400" /></a></div><div><br /></div><div><b><u>How the Model Works</u></b> <br /><br />Without getting overly specific (as many other resources exist which provide detailed explanations as it pertains to the model’s innermost mechanisms), the training of the model occurs across two steps: the first step being <b>“Forward Propagation”</b>, and the second step being <b>“Backward Propagation”</b>. Each node which exists beyond the input layer, sans the output layer, measures for a potential interaction amongst variables. <br /><br />Each node is initially assigned a value. Those values shift as training data is processed through the model from the left to the right (forward propagation), and are then adjusted further as error information is passed back through the model from the right to the left (backward propagation). The entire training data set is not processed in a simultaneous manner; instead, for the sake of allocating computing resources, the data is split into smaller sets known as batches. Batch size impacts learning significantly. 
With a smaller batch size, gradient estimates become noisier, which can hinder a model’s predictive capacity. However, there are certain scenarios in which a lower batch size is advantageous, as this gradient noise can act as a form of regularization. The default size for a batch is 32 observations. <br /><br />In many ways, the method in which the model functions is analogous to the way in which a clock operates. Each training observation shifts a certain aspect of a neuron’s value, with the neuron’s final value being representative of all of the prior shifts. <br /><br />A few other terms which are also worth mentioning, as the selection of such is integral to model creation, are: <br /><br /><b>Optimizer</b> – This specifies the algorithm which will be utilized for correcting the model as errors occur. <br /><br /><b>Epoch</b> – This indicates the number of times in which the full set of observational data will be passed through a model during the training process. <br /><br /><b>Loss Function</b> – This indicates the algorithm which will be utilized to determine how errors are penalized within the model. <br /><br /><b>Metric</b> – A metric is a function which is utilized to assess the performance of a model. However, unlike the Loss Function, it does not impact model training, and is only utilized to perform post-hoc analysis. <br /><br /><b><u>Model Application</u></b> <br /><br />As with any auxiliary Python library, Keras must first be downloaded and imported prior to its utilization. 
To achieve this within the Jupyter Notebook platform, we will employ the following lines of code: <br /><br /><b># Import ‘pip’ in order to install auxiliary packages # <br /><br />import pip <br /><br /># Install ‘TensorFlow’ to act as the underlying mechanism of the Keras UI # <br /><br />pip.main(['install', 'TensorFlow']) <br /><br /># Import pandas to enable data frame utilization # <br /><br />import pandas <br /><br /># Import numpy to enable numpy array utilization # <br /><br />import numpy <br /><br /># Import the general Keras library # <br /><br />import keras <br /><br /># Import tensorflow to act as the ‘backend’ # <br /><br />import tensorflow <br /><br /># Enable option for categorical analysis # <br /><br />from keras.utils import to_categorical <br /><br /># Enable the sequential model type and the layer constructors # <br /><br />from keras.models import Sequential <br /><br />from keras.layers import Activation, Dense <br /><br /># Enable confusion matrix creation with the ‘sklearn.metrics’ library # <br /><br />from sklearn.metrics import confusion_matrix <br /><br /># Enable the ability to save and load models with the ‘load_model’ option # <br /><br />from keras.models import load_model </b><br /><br />With all of the appropriate libraries downloaded and enabled, we can begin building our sample model. <br /><br /><b><u>Categorical Dependent Variable Model</u></b> <br /><br />For the following examples, we will be utilizing a familiar data set, the <b>“iris”</b> data set, which is available within the R platform. <br /><br /><b># Import the data set (in .csv format) as a pandas data frame # <br /><br />filepath = "C:\\Users\\Username\\Desktop\\iris.csv" <br /><br />iris = pandas.read_csv(filepath) </b><br /><br />First, we will randomize the observations within the data set. Observational data should always be randomized prior to model creation. 
<br /><br /><b># Shuffle the data frame # <br /><br />iris = iris.sample(frac=1).reset_index(drop=True) </b><br /><br />Next, we will remove the dependent variable entries from the data frame and modify the structure of the new data frame to consist only of independent variables. <br /><br /><b>predictors = iris.drop(['Species'], axis = 1).to_numpy() <br /></b><br />Once this has been achieved, we must modify the variables contained within the original data set so that the categorical outcomes are designated by integer values. <br /><br />This can be achieved through the utilization of the following code: <br /><br /><b># Modify the dependent variable so that each entry is replaced with a corresponding integer # <br /><br />from pandasql import * <br /><br />pysqldf = lambda q: sqldf(q, globals()) <br /><br />q = """ <br /><br />SELECT *, <br /><br />CASE <br /><br />WHEN (Species = 'setosa') THEN '0' <br /><br />WHEN (Species = 'versicolor') THEN '1' <br /><br />WHEN (Species = 'virginica') THEN '2' <br /><br />ELSE 'UNKNOWN' END AS SpeciesNum <br /><br />from iris; <br /><br />""" <br /><br />df = pysqldf(q) <br /><br />print(df) <br /><br />iris0 = df </b><br /><br />Next, we must make a few further modifications. <br /><br />First, we must modify the dependent variable type to integer. <br /><br />After such, we will identify this variable as being representative of a categorical outcome. <br /><br /><b># Modify the dependent variable type from string to integer # <br /><br />iris0['SpeciesNum'] = iris0['SpeciesNum'].astype('int') <br /><br /># Modify the variable type to categorical # <br /><br />target = to_categorical(iris0.SpeciesNum) </b><br /><br />We are now ready to build our model! <br /><br /><b># We must first specify the model type # <br /><br />model = Sequential() <br /><br /># Next, we will determine the number of predictor columns, which defines the input dimension of the network. 
# <br /><br />n_cols = predictors.shape[1] <br /><br /># This next line specifies the traits of the input layer # <br /><br />model.add(Dense(100, activation = 'relu', input_shape = (n_cols, ))) <br /><br /># This line specifies the traits of the hidden layer # <br /><br />model.add(Dense(100, activation = 'relu')) <br /><br /># This line specifies the traits of the output layer # <br /><br />model.add(Dense(3, activation = 'softmax')) <br /><br /># Compile the model by adding the optimizer, the loss function type, and the metric type # <br /><br /># If the model’s dependent variable is binary, utilize the ‘binary_crossentropy' loss function # <br /><br />model.compile(optimizer = 'adam', loss='categorical_crossentropy', <br /><br /> metrics = ['accuracy']) </b><br /><br />With our model created, we can now go about training it with the necessary information. <br /><br />As was the case with prior machine learning techniques, only a portion of the original data frame will be utilized to train the model. <br /><br /><b>model.fit(predictors[1:100,], target[1:100,], shuffle=True, batch_size= 50, epochs=100) <br /></b><br />With the model created, we can now test its effectiveness by applying it to the remaining data observations. 
<br /><br /><b># Create a data frame to store the un-utilized observational data # <br /><br />iristestdata = iris0[101:150] <br /><br /># Create a data frame to store the model predictions for the un-utilized observational data # <br /><br />predictions = model.predict_classes(predictors[101:150]) <br /><br /># Create a confusion matrix to assess the model’s predictive capacity # <br /><br />cm = confusion_matrix(iristestdata['SpeciesNum'], predictions) <br /><br /># Print the confusion matrix results to the console output window # <br /><br />print(cm) </b><br /><br />Console Output: <br /><br /><i>[[16 0 0] <br /> [ 0 17 2] <br /> [ 0 0 14]] </i></div><div><br /><b><u>Continuous Dependent Variable Model</u></b> <br /><br />The appropriate model type is dictated by the scenario at hand. As was the case with previous machine learning methodologies, the Keras package also contains functionality which allows for continuous dependent variable types. <br /><br />The steps for applying this model methodology are as follows: <br /><br /><b># Import the 'iris' data frame # <br /><br />filepath = "C:\\Users\\Username\\Desktop\\iris.csv" <br /><br />iris = pandas.read_csv(filepath) <br /><br /># Shuffle the data frame # <br /><br />iris = iris.sample(frac=1).reset_index(drop=True) </b><br /><br />In the subsequent lines of code, we will first identify the model’s dependent variable <b>‘Sepal.Length’</b>. This variable, and its corresponding observations, will be held within the new variable <b>‘target’</b>. Next, we will create the variable,<b> ‘predictors’</b>. This variable will be comprised of all of the variables contained within the<b> ‘iris0’</b> data frame, with the exception of the<b> ‘Sepal.Length’</b> variable. The new data frame will be stored within a matrix format. Finally, we will again define the <b>‘n_cols’</b> variable. 
<br /><br /><b>target = iris['Sepal.Length'] <br /><br /># Drop Species Name # <br /><br />iris0 = iris.drop(columns=['Species']) <br /><br /># Drop the dependent variable ‘Sepal.Length’ from the predictor matrix # <br /><br />predictors = iris0.drop(['Sepal.Length'], axis = 1).to_numpy() <br /><br />n_cols = predictors.shape[1] </b><br /><br />We are now ready to build our model! <br /><br />#<b> We must first specify the model type # <br /><br />modela = Sequential() <br /><br /># Next, we will determine the number of predictor columns, which defines the input dimension of the network # <br /><br />n_cols = predictors.shape[1] <br /><br /># This next line specifies the traits of the input layer # <br /><br />modela.add(Dense(100, activation = 'relu', input_shape=(n_cols,))) <br /><br /># This line specifies the traits of the hidden layer # <br /><br />modela.add(Dense(100, activation = 'relu')) <br /><br /># This line specifies the traits of the output layer # <br /><br />modela.add(Dense(1)) <br /><br /># Compile the model by adding the optimizer and the loss function type # <br /><br />modela.compile(optimizer = 'adam', loss='mean_squared_error') </b><br /><br />With the model created, we must now train the model with the following code: <br /><br /><b>modela.fit(predictors[1:100,], target[1:100,], shuffle=True, epochs=100) </b><br /><br />As was the case with the prior examples, we will only be utilizing a sample of the original data frame for the purposes of model training. 
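The row slices utilized throughout these examples (e.g. predictors[1:100,]) follow Python’s zero-based, half-open indexing conventions, which is worth sketching explicitly. The array below is a hypothetical stand-in for the ‘predictors’ matrix:

```python
import numpy as np

# A hypothetical stand-in for the 150-row, 4-column 'predictors' matrix
X = np.zeros((150, 4))

# Python slices are half-open and zero-based: [1:100] selects rows 1
# through 99, silently skipping row 0 and yielding 99 rows, not 100
X_train = X[1:100]

# [101:150] selects rows 101 through 149, yielding 49 hold-out rows
X_test = X[101:150]
```

Note that slicing with [0:100] and [100:150] would instead partition all 150 rows without skipping any observations.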
<br /><br />With the model created and trained, we can now test its effectiveness by applying it to the remaining data observations.<br /><br /><b>from sklearn.metrics import mean_squared_error <br /><br />from math import sqrt <br /><br />predictions = modela.predict(predictors[101:150]) <br /><br />rms = sqrt(mean_squared_error(target[101:150], predictions)) <br /><br />print(rms) <br /></b><br /><b><u>Model Functionality</u> </b><br /><br />In some ways, the Keras modeling methodology shares similarities with the hierarchical cluster model. The main differentiating factors, in addition to the underlying mechanism, are the dynamic aspects of the Keras model. <br /><br />Each Keras neuron represents a relationship between independent data variables within the training set. These relationships exhibit macro phenomena which may not be immediately observable within the context of the initial data. When finally providing an output, the model considers which macro phenomena illustrated the strongest indication of identification. The Keras model still relies on generalities to make predictions, therefore, certain factors which are exhibited within the observational relationships are held in higher regard. This phenomenon is known as weighting, as each neuron is assigned a weight which is adjusted as the training process occurs. <br /><br />The logistic regression methodology functions in a similar manner as it pertains to assessing variable significance. Again however, we must consider the many differentiating attributes of each model. In addition to weighting latent variable phenomena, the Keras model is able to assess for non-linear relationships. Both attributes are absent within the aforementioned model, as logistic regression only assesses for linear relationships and can only provide values for variables explicitly found within the initial data set. 
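The ‘relu’ and ‘softmax’ activations referenced throughout the build process can be expressed in a few lines of numpy. The sketch below follows the standard textbook definitions; it is not Keras’s internal implementation:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: negative inputs are zeroed, positive inputs pass through
    return np.maximum(0.0, x)

def softmax(x):
    # Subtracting the max prior to exponentiation preserves numerical stability
    e = np.exp(x - np.max(x))
    # The result is a valid probability distribution which sums to 1
    return e / e.sum()

# Hypothetical raw scores produced by the 3-neuron output layer
class_scores = np.array([2.0, 1.0, -1.0])
class_probabilities = softmax(class_scores)
```

This is why the categorical model’s output layer utilizes ‘softmax’ across 3 neurons: the network’s raw scores are converted into one probability per iris species, and the largest probability determines the predicted class.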
<br /><br />The <b>sequential()</b> model type, which was specified within the build process, is one of the many model options available within the Keras package. The sequential option differs from the other model types in that it creates a network in which each neuron within each layer is connected to each neuron within each subsequent layer. </div><div><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSxE2K1J7gUrxSSzXx-_jD_oFN7aAnwkdEKgHOUTVVCY1X48Ynxumq0swD6oe-sAYxNKtfOIiaVDOca4iGqHIqG0huQq8DDghtwPsFK4-zpNplcmtYnFHTA_glSCClf4Yo4dalF3IljPgasYbEPr10ApzeOj_2EKSBZPXKvgo_xDSmf3I6oCJl5D9V/s505/Keras_Nue3.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="505" data-original-width="311" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSxE2K1J7gUrxSSzXx-_jD_oFN7aAnwkdEKgHOUTVVCY1X48Ynxumq0swD6oe-sAYxNKtfOIiaVDOca4iGqHIqG0huQq8DDghtwPsFK4-zpNplcmtYnFHTA_glSCClf4Yo4dalF3IljPgasYbEPr10ApzeOj_2EKSBZPXKvgo_xDSmf3I6oCJl5D9V/w246-h400/Keras_Nue3.png" width="246" /></a></div></blockquote></blockquote></blockquote></blockquote><p> <b><u>Other Characteristics of the Keras Model</u></b></p><div>Depending on the size of the data set which acted as the training data for the model, significant time may be required to re-generate a model after a session is terminated. To avoid this unnecessary re-generation process, functions exist which enable the saving and reloading of model information. 
<br /><br /><b># Saving and Loading Model Data Requires # <br /><br />from keras.models import load_model <br /><br /># To save a model # <br /><br />modelname.save("C:\\Users\\filename.h5") <br /><br /># To load a model # <br /><br />modelname = load_model("C:\\Users\\filename.h5") <br /></b><br />It should be mentioned that as it pertains to Keras models, you do possess the ability to train existing models with additional data should the need arise. <br /><br />For instance, if we wished to train our categorical iris model (“model”) with additional iris data, we could utilize the following code: <br /><br /><b> model.fit(newpredictors[100:150,], newtargets[100:150,], shuffle=True, batch_size= 50, epochs=100) </b><br /><br />At the time of this article’s creation, there exist unresolved errors pertaining to learning rate fluctuation within re-loaded Keras models. Currently, a provisional fix has been suggested*, in which the "<b>adam"</b> optimizer is re-configured for re-loaded models. This re-configuring, while keeping all of the "<b>adam"</b> optimizer default configurations, significantly lowers the optimizer’s default learning rate. The purpose of this shift is to account for the differentiation in learning rates which occurs in established models. 
<br /><br /><b># Specifying Optimizer Traits Requires # <br /><br />from keras import optimizers <br /><br /># Re-configure Optimizer # <br /><br />liladam = optimizers.adam(lr=0.00001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False) <br /><br /># Utilize Custom Optimizer # <br /><br />model.compile(optimizer = liladam, loss='categorical_crossentropy', <br /><br /> metrics = ['accuracy']) </b><br /><br /><i>*Source - <a href="https://github.com/keras-team/keras/issues/2378" target="_blank">https://github.com/keras-team/keras/issues/2378 </a></i><br /><br /><b><u>Graphing Models</u></b><br /><br />As we cannot easily determine the innermost workings of a Keras model, the best method of visualization can be achieved by graphing the learning output. <br /><br />Prior to training the model, we will modify the typical fitting function to resemble something similar to the lines of code below: <br /><br /><b>history = model.fit(predictors[1:100,], target[1:100,], shuffle=True, epochs= 110, batch_size = 100, validation_data =(predictors[100:150,] , target[100:150,])) </b><br /><br />What this code enables is the creation of the data variable<b> “history”</b>, in which data pertaining to the model training process will be stored.<b> “validation_data” </b>is instructing the python library to assess the specified data within the context of the model after each epoch. This does not impact the learning process. The way in which this assessment will be analyzed is determined by the selection of the <b>“metrics”</b> option specified within the <b>model.compile()</b> function. <br /><br />If the above code is initiated, the model will be trained. To view the categories in which the model training history was organized upon being saved within the <b>“history” </b>variable, you may utilize the following lines of code. 
<br /><br /><b>history_dict = history.history <br /><br />history_dict.keys() </b><br /><br />This produces the console output: <br /><br /><i>dict_keys(['val_loss', 'val_acc', 'loss', 'acc']) <br /></i><br />To set the appropriate axis lengths for our soon to be produced graph, we will initiate the following line: <br /><br /><b>epochs = range(1, len(history.history['acc']) + 1) <br /></b><br />If we are utilizing Jupyter Notebook, we should also modify the graphic output size: <br /><br /><b># Enable the ‘matplotlib’ plotting library # <br /><br />import matplotlib.pyplot as plt <br /><br />plt.rcParams["figure.figsize"] = [16,9] </b><br /><br />We are now prepared to create our outputs. The first graphic can be prepared with the following code: <br /><br /><b># Plot training & validation accuracy values # <br /><br /># (This graphic cannot be utilized to track the validation process of continuous data models) # <br /><br />plt.plot(epochs, history.history['acc'], 'b', label = 'Training acc') <br /><br />plt.plot(epochs, history.history['val_acc'], 'bo', color = 'orange', label = 'Validation acc') <br /><br />plt.title('Training and validation accuracy') <br /><br />plt.ylabel('Accuracy') <br /><br />plt.xlabel('Epoch') <br /><br />plt.legend(loc='upper left') <br /><br />plt.show() </b><br /><br />This produces the output:<br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTF8ZOYGVDzjjVzBfbPezztVIQ7yVuoBNlJ4mYQ1tFyR3Z9chcRE1lF-BFk0Qv7yCG3tvPVCzikUnVymLXnEyqDoPYxc6WkCCKSKmqAcPKNzSHM4NNPCYmlV-hDF2ISO0gzSAmEq20iL1nO69f3DFFe3pB8CKErWaSmKDHLYE_8wQ8zFWdn9OXHpPX/s947/Keras_graphic1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="550" data-original-width="947" height="233" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTF8ZOYGVDzjjVzBfbPezztVIQ7yVuoBNlJ4mYQ1tFyR3Z9chcRE1lF-BFk0Qv7yCG3tvPVCzikUnVymLXnEyqDoPYxc6WkCCKSKmqAcPKNzSHM4NNPCYmlV-hDF2ISO0gzSAmEq20iL1nO69f3DFFe3pB8CKErWaSmKDHLYE_8wQ8zFWdn9OXHpPX/w400-h233/Keras_graphic1.png" width="400" /></a></div><div><br /></div><b><u>Interpreting this Graphic</u></b> <br /><br />What this graphic illustrates is the level of accuracy with which the model predicts results. The solid blue line represents the data which was utilized to train the model, and the orange dotted line represents the data which is being utilized to test the model’s predictability. It should be evident that throughout the training process, the predictive capacity of the model improves as it pertains to both training and validation data. If a large gap emerged, similar to the gap which is observed from epoch # 20 to epoch # 40, we would assume that this divergence is indicative of <b>“overfitting”</b>. This term is utilized to describe a model which can predict training results accurately, but struggles to predict outcomes when applied to new data. 
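The divergence between training and validation accuracy can also be assessed numerically. In the sketch below, the accuracy histories are fabricated stand-ins for history.history['acc'] and history.history['val_acc'], and the 0.10 threshold is an arbitrary illustration rather than an established cutoff:

```python
import numpy as np

# Fabricated per-epoch accuracy values, for illustration only
train_acc = np.array([0.60, 0.75, 0.85, 0.93, 0.97])
val_acc = np.array([0.58, 0.72, 0.80, 0.81, 0.80])

# A widening gap between training and validation accuracy is one
# rough numerical signal of overfitting
gap = train_acc - val_acc
overfitting_suspected = bool(gap[-1] > 0.10)
```

Under these fabricated values, training accuracy continues to climb while validation accuracy stalls, so the gap at the final epoch exceeds the threshold and overfitting would be suspected.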
<br /><br />The second graphic can be prepared with the following code: <br /><br /><b># Plot training & validation loss values <br /><br />plt.plot(epochs, history.history['loss'], 'b', label = 'Training loss') <br /><br />plt.plot(epochs, history.history['val_loss'], 'bo', label = 'Validation loss', color = 'orange') <br /><br />plt.title('Training and validation loss') <br /><br />plt.ylabel('Loss') <br /><br />plt.xlabel('Epoch') <br /><br />plt.legend(loc='upper left') <br /><br />plt.show()</b><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXIHoRjrsOazxs_yuELKIw_pREdGnVXq-MM1XGFWbkPUR1TidRR9ZYB6NKOxKQ1an1E3NDCJGPLiTmFC3MFIGecIZb83RmZ8Qg_zU4GZd1m7MFS2Zvodd0WiH3k8Oos6iUYSrvLZ3Fdv0ZDMHoZeseGEXDanXoLe-NOAHtm1pi1HkP09NMAcNcVUy4/s947/Keras_graphic2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="550" data-original-width="947" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXIHoRjrsOazxs_yuELKIw_pREdGnVXq-MM1XGFWbkPUR1TidRR9ZYB6NKOxKQ1an1E3NDCJGPLiTmFC3MFIGecIZb83RmZ8Qg_zU4GZd1m7MFS2Zvodd0WiH3k8Oos6iUYSrvLZ3Fdv0ZDMHoZeseGEXDanXoLe-NOAHtm1pi1HkP09NMAcNcVUy4/w400-h233/Keras_graphic2.png" width="400" /></a></div><div><br /></div><b><u>Interpreting this Graphic</u></b> <br /><br />This graphic illustrates the improvement of the model over time. The solid blue line represents the data which was utilized to train the model, and the orange dotted line represents the data which is being utilized to test the model’s predictability. If a gap does not emerge between the lines throughout the training process, it is advisable to set the number of epochs to a figure which, after subsequent graphing occurs, demonstrates a flat plateauing of both lines. 
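The plateau described above can likewise be identified programmatically. The loss values below are fabricated stand-ins for history.history['loss'], and the 0.01 tolerance is an arbitrary choice for illustration:

```python
import numpy as np

# Fabricated per-epoch loss values, for illustration only
loss = np.array([1.20, 0.80, 0.55, 0.41, 0.36, 0.355, 0.353, 0.352])

# Epoch-to-epoch improvement (a declining loss yields positive values)
improvement = -np.diff(loss)

# Epochs (1-indexed) at which improvement falls below the tolerance;
# index i of 'improvement' describes the transition into epoch i + 2
flat_epochs = (np.where(improvement < 0.01)[0] + 2).tolist()
```

Under these fabricated values, the curve flattens from epoch 6 onward, which would suggest that training beyond that point yields diminishing returns.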
<br /><br /><b><u>Reproducing Model Training Results</u></b> <br /><br />If Keras is being utilized, and TensorFlow is the methodology selected to act as a “backend”, then the following lines of code must be utilized to guarantee reproducibility of results. <br /><br /><b># Any number could be utilized within each function # <br /><br /># Initiate the RNG of the general python library # <br /><br />import random <br /><br /># Initiate the RNG of the numpy package # <br /><br />import numpy.random <br /><br /># Set a random seed as it pertains to the general python library # <br /><br />random.seed(777) <br /><br /># Set a random seed as it pertains to the numpy library # <br /><br />numpy.random.seed(777) <br /><br /># Initiate the RNG of the tensorflow package # <br /><br />from tensorflow import set_random_seed <br /><br /># Set a random seed as it pertains to the tensorflow library # <br /><br />set_random_seed(777) <br /><br /><u>Missing Values in Keras </u></b><br /><br />Much like the previous models discussed, Keras has difficulties as it relates to variables which contain missing observational values. If a Keras model is trained on data which contains missing variable values, the training process will occur without interruption, however, the missing values will be analyzed under the assumption that they are representative of a measurement. Meaning that the library will<b> NOT</b> automatically assume that the value is a missing value, and from such, estimate a place holder value based on other variable observations within the set. <br /><br />To make assumptions for the missing values based on the process described above, we must utilize the <b>Imputer()</b> function from the python library: <b>“sklearn”</b>. 
Sample code which can be utilized for this purpose can be found below: <br /><b><br />from sklearn.preprocessing import Imputer <br /><br />imputer = Imputer() <br /><br />transformed_values = imputer.fit_transform(predictors) </b><br /><br />Additional details pertaining to this function, its utilization, and its underlying methodology, can be found within the previous article: <b>“(R) Machine Learning - The Random Forest Model – Pt. III”</b>. <br /><br />Having tested this method of value generation on sets which I purposely modified, I can attest that its capability is excellent. After generating fictitious placeholder values and then subsequently utilizing the Keras package to create a model, comparatively speaking, I saw no differentiation between the predicted results related to each individual set. <br /><br /><b style="text-decoration: underline;">Early Stopping</b> <br /><br />There may be instances which necessitate the creation of a model that will be applicable to a very large data set. This essentially, in most cases, guarantees a very long training time. To help assist in shortening this process, we can utilize an <b>“early stopping monitor”</b>. <br /><br />First, we must import the package related to this feature: <br /><br /><b>from keras import callbacks </b><br /><br />Next we will create and define the parameters pertaining to the feature: <br /><br /><b># If model improvement stagnates after 2 epochs, the fitting process will cease # <br /><br />early_stopping_monitor = keras.callbacks.EarlyStopping(monitor='loss', patience = 2, min_delta=0, verbose=0, mode='auto', baseline=None, restore_best_weights=True) <br /></b><br />Many of the options present within the code above are defaults. However, there are a few worth mentioning. <br /><br /><b>monitor = ‘loss’</b> - This option is specifically instructing the function to monitor the loss value during each training epoch. 
<br /><br /><b>patience = 2</b> – This option is instructing the function to cease training if the loss value ceases to decline after 2 epochs. <br /><br /><b>restore_best_weights=True</b> – This option is indicating to the function that the model weights from the best-performing epoch, rather than those from the final epoch, should be the values retained once training ceases. The subsequent training values will be discarded. <br /><br />With the early stopping feature defined, we can add it to the training function below: <br /><br /><b>history = model.fit(predictors[1:100,], target[1:100,], shuffle=True, epochs=100, callbacks =[early_stopping_monitor], validation_data =(predictors[100:150,] , target[100:150,])) </b><br /><br /><b><u>Final Thoughts on Keras</u></b> <br /><br />In my final thoughts pertaining to the Keras model, I would like to discuss the pros and cons of the methodology. Keras is, without doubt, the machine learning model type which possesses the greatest predictive capacity. Keras can also be utilized to identify images, which is a feature that is lacking within most other predictive models. However, despite these accolades, Keras does fall short in a few categories. <br /><br />For one, the mathematics which act as the mechanism for the model’s predictive capacity are incredibly complex. As a result of such, model creation can only occur within a digital medium. With this complexity comes an inability to easily verify or reproduce results. Additionally, creating the optimal model configuration as it pertains to the number of neurons, layers, epochs, etc., becomes almost a matter of personal taste. This sort of heuristic approach is negative for the field of machine learning, statistics, and science in general. <br /><br />Another potential flaw relates to the package documentation. The website for the package is poorly organized. 
The videos created by researchers who attempt to provide instruction are also poorly organized, riddled with heuristic approaches, and scuttled by a severe lack of awareness. It would seem that no single individual truly understands how to appropriately utilize all of the features of the Keras package. In my attempts to properly understand and learn the Keras package, I purchased the book, Deep Learning with Python, written by the package’s creator, Francois Chollet. This book was also poorly organized, and suffered from the assumption that the reader could inherently understand the writer’s thoughts. <br /><br />This being said, I do believe that the future of statistics and predictive analytics lies parallel with the innovations demonstrated within the Keras package. However, the package is so relatively new that not a single individual, including the creator, has had the opportunity to fully utilize and document its potential. In this lies latent opportunity for the patient individual to prosper by pioneering the sparse landscape. <br /><br />It is my opinion that at this current time, the application of the Keras model should be paired with other traditional statistical and machine learning methodologies. This pairing of multiple models will enable the user and potential outside researchers to gain a greater understanding as to what may be motivating the Keras model’s predictive outputs. </div><br /><br /><b><u>(R) Machine Learning - Gradient Boosted Algorithms – Pt. IV</u></b> (2022-10-21) <br /><br />Of all the tree-based models which have been discussed thus far, the most complicated and the most effective are those which belong to a primary sub-group known as <b>“gradient boosted algorithms”</b>. 
<br /><br />Gradient boosted models are similar to the random forest model; the primary difference between the two is that gradient boosted models synthesize their individual trees differently. Whereas random forests seek to minimize errors through a randomization process, gradient boosted models address the errors of each tree as it is created. Meaning, each tree is re-assessed after its creation occurs, and the subsequent tree is optimized based on acknowledgement of the prior tree’s errors. <br /><br /><b><u>Model Creation Options</u></b> <br /><br />As the gradient boosted algorithm possesses components of all of the previously discussed model methodologies, its internal mechanism is complex by design. This evolved capacity, which synthesizes various foundational elements of the prior methodologies, ultimately produces a model with a greater number of options. These options can remain at their default assignments, in which case they will assume predetermined values in accordance with the surrounding circumstances. However, if you would like to customize the model’s synthesis, the following options are available: <br /><br /><b><u>distribution</u></b> – This option refers to the distribution type which the model will assume when analyzing the data utilized within the model design process. The following distribution types are available within the <b>“gbm”</b> package: <b>“gaussian”</b>, <b>“laplace”</b>, <b>“tdist”</b>, <b>“bernoulli”</b>, <b>“huberized”</b>, <b>“adaboost”</b>, <b>“poisson”,</b> <b>“coxph”</b>, <b>“quantile” </b>and <b>“pairwise”</b>. If this option is not explicitly indicated by the user, the system will automatically decide between <b>“gaussian”</b> and <b>“bernoulli”</b>, as to which distribution type best suits the model data. 
<br /><br /><b><u>n.minobsinnode</u></b> – This option indicates the integer specifying the minimum number of observations in the terminal nodes of the trees. <br /><br /><b><u>n.trees</u></b> – The number of trees which will be utilized to create the final model. <br /><br /><b><u>interaction.depth</u></b> - Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1. <br /><br /><b><u>cv.folds</u></b> – Specifies the number of “cross-validation” folds to perform. This option essentially provides additional model output in the form of additional testing results. Similar output is generated by default within the random forest model package. <br /><br /><b><u>shrinkage</u></b> - A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually works, but a smaller learning rate typically requires more trees. Default is 0.1. <br /><br /><b><u>Optimizing a Model with the “CARET” Package</u></b> <br /><br />For the everyday analyst, being confronted with the task of appropriately assigning values to the aforementioned fields can be disconcerting. This task is also undertaken with the understanding that by incorrectly assigning a variable field, that an individual can vastly compromise the validity of a model’s results. Thankfully, the <b>“CARET”</b> package exists to assist us with our model optimization needs. <br /><br /><b>“CARET”</b> is an auxiliary package with numerous uses, primarily among them, is a function which can be utilized to assess model optimization prior to synthesis. 
In the case of our example, we will be utilizing the following packages to demonstrate this capability: <br /><br /><b># With the “CARET” package downloaded and enabled # <br /><br /># With the “e1071” package downloaded and enabled # </b><br /><br />With the above packages downloaded and enabled, we can run the following <b>“CARET”</b> function to generate console output pertaining to the various model types which <b>“CARET”</b> can be utilized to optimize: <br /><br /><b># List different models which train() function can optimize # <br /><br />names(getModelInfo()) </b><br /><br />The console output is too voluminous to present in its entirety within this article. However, a few notable options which warrant mentioning as they pertain to previously discussed methodologies are: <br /><br />rf – Which refers to the random forest model. <br /><br />treebag – Which refers to the bootstrap aggregation model. <br /><br />glm – Which refers to the general linear model. <br /><br />(and) <br /><br />gbm – Which refers to the gradient boosted model. <br /><br />Let’s start by regenerating the random sets which are comprised of random observations from our favorite <b>“iris”</b> set. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. 
# <br /><br />train(Species~.,data=raniris[1:100,], method = "gbm") </b><br /><br />This produces a voluminous amount of console output, however, the primary portion of the output which we will focus upon is the bottommost section. <br /><br />This output should resemble something similar to: <br /><br /><i>Tuning parameter 'shrinkage' was held constant at a value of 0.1 <br />Tuning parameter 'n.minobsinnode' was held constant at a value of 10 <br />Accuracy was used to select the optimal model using the largest value. <br />The final values used for the model were n.trees = 50, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10. </i><br /><br />From this information, we discover the optimal parameters with which to establish a gradient boosted model. <br /><br />In this particular case: <br /><br />n.trees = 50 <br /><br />interaction.depth = 2 <br /><br />shrinkage = 0.1 <br /><br />n.minobsinnode = 10 <br /><br /><b style="text-decoration: underline;">A Real Application Demonstration (Classification)</b> <br /><br />With the optimal parameters discerned, we may continue with the model building process. The model created for this example is of the classification type. Typically for a classification model type, the “multinomial” option should be specified. 
<br /><br /><b># With the “gbm” package downloaded and enabled # <br /><br /># Create Model # <br /><br />model <- gbm(Species ~., data = raniris[1:100,], distribution = 'multinomial', n.trees = 50, interaction.depth = 2, shrinkage = 0.1, n.minobsinnode = 10) <br /><br /># Test Model # <br /><br />modelprediction <- predict(model, n.trees = 50, newdata = raniris[101:150,] , type = 'response') <br /><br /># View Results # <br /><br />modelprediction0 <- apply(modelprediction, 1, which.max) <br /><br /># View Results in a readable format # <br /><br />modelprediction0 <- colnames(modelprediction)[modelprediction0] <br /><br /># Create Confusion Matrix # <br /><br />table(raniris[101:150,]$Species, predicted = modelprediction0) <br /></b><br /><u>Console Output:</u> <br /><br /> predicted <br /> setosa versicolor virginica <br />setosa 19 0 0 <br /> versicolor 0 13 2 <br /> virginica 0 2 14 <br /><br /><b style="text-decoration: underline;">A Real Application Demonstration (Continuous Dependent Variable</b><b><span style="text-decoration: underline;">)</span> </b><br /><br />As was the case with the previous example, we will again be utilizing the <b>train()</b> function within the <b>“CARET”</b> package to determine model optimization. As it pertains to continuous dependent variables, the <b>“gaussian”</b> option should be specified if the data is normally distributed, and the <b>“tdist” </b>option should be specified if the data deviates from normality. <br /><br /><b># Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. 
# <br /><br />model <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", method = "gbm") <br /><br />model </b><br /><br /><u>Console Output:</u> <br /><br />Stochastic Gradient Boosting <br /><br />100 samples <br /> 3 predictor <br /><br />No pre-processing <br />Resampling: Bootstrapped (25 reps) <br />Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... <br />Resampling results across tuning parameters: <br /><br /><i> interaction.depth n.trees RMSE Rsquared MAE <br /> 1 50 0.4256570 0.7506086 0.3316030 <br /> 1 100 0.4083072 0.7623251 0.3258838 <br /> 1 150 0.4067113 0.7607363 0.3270202 <br /> 2 50 0.4241599 0.7471639 0.3347628 <br /> 2 100 0.4184793 0.7466858 0.3335772 <br /> 2 150 0.4212821 0.7427328 0.3369379 <br /> 3 50 0.4248178 0.7433384 0.3345428 <br /> 3 100 0.4260524 0.7391382 0.3385778 <br /> 3 150 0.4278416 0.7345970 0.3398392 </i><br /><br /><i>Tuning parameter 'shrinkage' was held constant at a value of 0.1 <br />Tuning parameter 'n.minobsinnode' was held constant at a value of 10 <br />RMSE was used to select the optimal model using the smallest value. <br />The final values used for the model were n.trees = 150, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10. 
</i><br /><br /><b># Optimal Model Parameters # <br /><br /># n.trees = 150 # <br /><br /># interaction.depth = 1 # <br /><br /># shrinkage = 0.1 # <br /><br /># n.minobsinnode = 10 # <br /><br /># Create Model # <br /><br />tmodel <- gbm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", n.trees = 150, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10) <br /><br /># Test Model # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response') <br /><br /># Compute the Root Mean Squared Error (RMSE) of model testing data # <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[101:150,]$Sepal.Length, tmodelprediction) <br /><br /># Compute the Root Mean Squared Error (RMSE) of model training data # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response') <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[1:100,]$Sepal.Length, tmodelprediction) </b><br /><br /><u>Console Output:</u> <br /><br /><i>[1] 0.4060854 <br /><br />[1] 0.3144518 </i><br /><br /><b># Mean Absolute Error # <br /><br /># Create MAE function # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br /><br /># Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA # <br /><br /># Utilize MAE function on model testing data # <br /><br /># Regenerate Predictions # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response') <br /><br /># Generate Output # <br /><br />MAE(raniris[101:150,]$Sepal.Length, tmodelprediction) <br /><br /># Utilize MAE function on model training data # <br /><br /># Regenerate Predictions # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response') <br /><br /># Generate Output # <br /><br
/>MAE(raniris[1:100,]$Sepal.Length, tmodelprediction) </b><br /><br /><u>Console Output:</u> <br /><br /><i>[1] 0.3320722 <br /><br />[1] 0.2563723 </i><br /><br /><b><u>Graphing and Interpreting Output</u></b> <br /><br />The following method creates an output which quantifies the importance of each variable within the model. The type of analysis which determines the variable importance depends on the model type specified within the initial function. In the case of each model, the code samples below produce the subsequent outputs: <br /><br /><b># Multinomial Model # <br /><br />summary(model)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMfFlkmkL6zrZVyE1sPSBvhNz7SzLuDlOuMzDHpGHzfBR8QjbvVXwPVvq827y9dWSm6Z_vLLq8ES9AOr0DrEXSVAgKGUv7LFRME79Oj9YIFOHW74f7Zvqd2WJxqRgsc_tSoEFhhUBjrnj5B_CKxWzkjg4FnJcX__g93VuTb8feWl3j-KKilsHC5dsY/s583/GBM1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="493" data-original-width="583" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMfFlkmkL6zrZVyE1sPSBvhNz7SzLuDlOuMzDHpGHzfBR8QjbvVXwPVvq827y9dWSm6Z_vLLq8ES9AOr0DrEXSVAgKGUv7LFRME79Oj9YIFOHW74f7Zvqd2WJxqRgsc_tSoEFhhUBjrnj5B_CKxWzkjg4FnJcX__g93VuTb8feWl3j-KKilsHC5dsY/w400-h338/GBM1.png" width="400" /></a></div><div><br /></div><u>Console Output:</u> <br /><br /><i> var rel.inf <br />Petal.Length Petal.Length 59.0666833 <br />Petal.Width Petal.Width 38.6911265 <br />Sepal.Width Sepal.Width 2.1148704 <br />Sepal.Length Sepal.Length 0.1273199 </i><br /><br /><br /><b>####################################### <br /><br /># T-Distribution Model # <br /><br />summary (tmodel)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzVbBmvn_C8vtmIQz77frRAXUlu_yTONs424YVpXBQGSEj54iSWjx5sacBiCn7bD4FNrKabQAQDbDwqA8TgfRbF7cFNed4f5uUkE7-FFiDwRnV_aC9K4_VYx3Dob_rG2dsBhaRc3IAx8UcjpR-XH1Gte47WdhBiJnk5-QSxGMvtAn838vpr0pqR_XV/s588/GBM2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="506" data-original-width="588" height="341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzVbBmvn_C8vtmIQz77frRAXUlu_yTONs424YVpXBQGSEj54iSWjx5sacBiCn7bD4FNrKabQAQDbDwqA8TgfRbF7cFNed4f5uUkE7-FFiDwRnV_aC9K4_VYx3Dob_rG2dsBhaRc3IAx8UcjpR-XH1Gte47WdhBiJnk5-QSxGMvtAn838vpr0pqR_XV/w400-h341/GBM2.png" width="400" /></a></div><div><br /><u>Console Output:</u> <br /><i><br /> var rel.inf <br />Petal.Length Petal.Length 74.11473 <br />Sepal.Width Sepal.Width 14.18743 <br />Petal.Width Petal.Width 11.69784</i></div><div><i><br /></i></div><div>That's all for now.</div><div><br /></div><div>I'll see you next time, Data Heads!</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-15162184907209897942022-10-13T10:30:00.005-04:002022-10-13T22:34:28.397-04:00(R) Machine Learning - The Random Forest Model – Pt. IIIWhile unsupervised machine learning methodologies were enduring their initial genesis, the Random Forest Model ruled the machine learning landscape as the best predictive model type available. In this article, we will review the Random Forest Model. If you haven’t done so already, I would highly recommend reading the prior articles pertaining to Bagging and Tree Modeling, as these articles illustrate many of the internal aspects which together converge into the Random Forest model methodology.<br /><br /><b><u>What is a Random Forest and How is it Different?</u></b><br /><br />The random forest method of model creation contains certain elements of both the bagging and standard tree methodologies. 
The random forest sampling step is similar to that of the bagging model. Also, in a similar manner, the random forest model is comprised of numerous individual trees, with the output figure being the majority consensus reached as data is passed through each individual tree model. The only real differentiating factor which is present within the random forest model, is the initial nodal split designation, which occurs proceeding the model’s root pathway.<br /><br />For example, if the following data frame was structured and prepared to serve as a random forest model’s foundation, the first step which would occur during the initial algorithmic process, would be random sampling. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-eqKZNfAcK3FCfKlGq170pcash-VDAqVUKI47CQJaoBDKyKvtEuQyOEU1-rwI32xzPGFdgbAEim3D5-tmb4OT1WaV93II-G3CeKR9T7IzoUGY62gOU-WSmQNMyki8VqYm9l3AztSmKI6gYiQxGPEeSCNddMTwGwJZhveSGOd5B96v2uEmtQf2mriI/s505/RF0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="253" data-original-width="505" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-eqKZNfAcK3FCfKlGq170pcash-VDAqVUKI47CQJaoBDKyKvtEuQyOEU1-rwI32xzPGFdgbAEim3D5-tmb4OT1WaV93II-G3CeKR9T7IzoUGY62gOU-WSmQNMyki8VqYm9l3AztSmKI6gYiQxGPEeSCNddMTwGwJZhveSGOd5B96v2uEmtQf2mriI/s16000/RF0.png" /></a></div><div><br /></div><div>Like the bagging model’s sampling process, the performance of this step might also resemble:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9y6v-kyBeRlDdckAzIfAjVZqhGtb3OVoZHtc2hmDW_q84K-SmL_QZLErHJKuV-8LGOiLRKvM2yMQrxZklldMbMmTUf4AOvG2HXD1_UnH4rdoaGrzmIlqJaSL91aOBuBM-EiO_dktHXqJluWbzSsdvR8fo9dnK9uTDksklZJuKMiJAml2lo7G1n_D9/s817/RF_01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="289" 
data-original-width="817" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9y6v-kyBeRlDdckAzIfAjVZqhGtb3OVoZHtc2hmDW_q84K-SmL_QZLErHJKuV-8LGOiLRKvM2yMQrxZklldMbMmTUf4AOvG2HXD1_UnH4rdoaGrzmIlqJaSL91aOBuBM-EiO_dktHXqJluWbzSsdvR8fo9dnK9uTDksklZJuKMiJAml2lo7G1n_D9/w640-h226/RF_01.png" width="640" /></a></div><div><br /></div>As was previously mentioned, the main differentiating factor which separates the random forest model from the other models whose parts it incorporates is the manner in which the initial nodal split is designated. In the bagging model, numerous individual trees are created, and each tree is created from the same algorithmic equation as it is applied to each individual data set. In this manner, the optimization pattern is static, while the data for each set is dynamic. <br /><br />As it pertains to the random forest model, after each individual set has been created, a pre-selected number of independent variable categories are designated at random from each set; this selection is then assessed by the algorithm, with the most optimal pathway ultimately being selected from amongst the pre-determined variables. <br /><br />For example, we’ll assume that the number of pre-designated variables which will be selected prior to the creation of each individual tree is 3. If this were the case, each tree within the model would have its initial nodal designation decided by whichever one of the three variables is optimal as it pertains to performing the initial filtering process. The other two variables which are not selected are then considered for additional nodal splits, along with all of the other variables which the model finds particularly worthy. 
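As a quick aside, the bootstrap sampling step depicted above can be reproduced in a few lines of base R. This is purely illustrative — randomForest performs the equivalent operation internally for every tree — but it makes the mechanics concrete:

```r
set.seed(454)

# Draw row indices with replacement; the bootstrapped set contains as many rows as the original #
boot_rows <- sample(nrow(iris), size = nrow(iris), replace = TRUE)
boot_set <- iris[boot_rows, ]

# The bootstrapped set matches the original dimensions #
nrow(boot_set)   # 150

# However, as sampling occurred with replacement, some rows appear multiple #
# times within the set, while others were never drawn at all #
length(unique(boot_rows)) < nrow(iris)   # TRUE
```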
<br /><br />With this in mind, a set of variables which would consist of three randomly selected independent variables, might resemble the following as it relates to the initial nodal split:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc7AHZahkp-9_WECfPmt_BYEof5-Q-3lhc0yEL5yQcb8wwaP-6MAGFQ2ijbZYn7XKnkSzdmQMFPLRdtNOAAaKONIu41I5tDmVsAKEGyGOemvq8eK6dgXKTdZCTVu9OdxmvWC4tZ62a3Fu5Bu_k-zSEWtlkv5eunbk_adSHNIFwYKNNxfRtM9hHL7q6/s599/RForest.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="244" data-original-width="599" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc7AHZahkp-9_WECfPmt_BYEof5-Q-3lhc0yEL5yQcb8wwaP-6MAGFQ2ijbZYn7XKnkSzdmQMFPLRdtNOAAaKONIu41I5tDmVsAKEGyGOemvq8eK6dgXKTdZCTVu9OdxmvWC4tZ62a3Fu5Bu_k-zSEWtlkv5eunbk_adSHNIFwYKNNxfRtM9hHL7q6/w640-h260/RForest.png" width="640" /></a></div><div><br /></div>In this case, the blank node’s logical discretion would be selected from the optimal selection of a single variable from the set: {Sepal.Length, Sepal.Width, Petal.Length}. <br /><br />One variable would be selected from the set, with the other two variables then being returned to the larger set of all other variables from the initial data frame. From this larger set, all additional nodes would be established based on the optimal placement values determined by the underlying algorithm. 
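The random designation of candidate variables for the initial split can be sketched in the same illustrative fashion. With the iris data, three of the four available independent variables would be drawn at random for consideration (again, the model performs this internally for each tree):

```r
set.seed(454)

# All candidate independent variables within the iris data frame #
predictors <- setdiff(names(iris), "Species")

# Randomly designate 3 of them as candidates for the initial nodal split #
split_candidates <- sample(predictors, 3)
split_candidates

# The unselected variable remains available for subsequent nodal splits #
setdiff(predictors, split_candidates)
```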
<br /><br /><b><u>The Decision-Making Process</u></b> <br /><br />In a manner which exactly resembles the bagging-bootstrap aggregation method described within the prior article, the predictive output figure consists of the majority consensus reached as data is passed through each individual tree model.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-8qYwS70xlvOQNcPMEwg_1irPLA1nDN7ViYp0642oAsHHHZXPZ9Y8ngWcTIw5uwfMCa_KjD31H8uVy7xastMZY9AT5-K7yq2049jxhN4S24gAjZBohiJh53aE4OJuJ8VpcK9Ue9WWpzjLNkWL5Vq1zgxu82vfdy9YrGgO3FAzK_0P2hr3Rdepb6Pm/s625/RF_3x.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="245" data-original-width="625" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-8qYwS70xlvOQNcPMEwg_1irPLA1nDN7ViYp0642oAsHHHZXPZ9Y8ngWcTIw5uwfMCa_KjD31H8uVy7xastMZY9AT5-K7yq2049jxhN4S24gAjZBohiJh53aE4OJuJ8VpcK9Ue9WWpzjLNkWL5Vq1zgxu82vfdy9YrGgO3FAzK_0P2hr3Rdepb6Pm/w640-h250/RF_3x.png" width="640" /></a></div><div><br /></div>The above graphical representation illustrates observation 8 being passed through the model. The model, being comprised of three separate decision trees which were synthesized from three separate data sets, produces three different internal outcomes. The majority consensus of these outcomes is what is eventually returned to the user as the ultimate product of the model. <br /><br /><b><u>A Real Application Demonstration (Classification)</u></b> <br /><br />Again, we will utilize the "iris" data set which comes embedded within the R data platform. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. 
# <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "randomForest" downloaded and enabled # <br /><br /># Create the model # <br /><br />mod <- randomForest(Species ~., data= raniris[1:100,], type = "class") <br /><br /># View the model summary # <br /><br />mod</b><br /><br /><u>Console Output:</u> <br /><br /><i>Call: <br /> randomForest(formula = Species ~ ., data = raniris[1:100, ], type = "class") <br /> Type of random forest: classification <br /> Number of trees: 500 <br />No. of variables tried at each split: 2 <br /><br /> OOB estimate of error rate: 4% <br />Confusion matrix: <br /> setosa versicolor virginica class.error <br />setosa 31 0 0 0.00000000 <br />versicolor 0 34 1 0.02857143 <br />virginica 0 3 31 0.08823529 </i><br /><br /><b><u>Deciphering the Output</u> </b><br /><br /><b>Call:</b> - The formula which initially generated the console output. <br /><br /><b>Type of random forest: Classification </b>– The model type applied to the data frame passed through the “randomForest()” function. <br /><br /><b>Number of trees: 500 </b>– The number of individual trees of which the model is comprised. <br /><br /><b>No. of variables tried at each split: 2</b> – The number of randomly selected variables considered as candidates for the initial nodal split criteria. <br /><br /><b>OOB estimate of error rate: 4%</b> - The rate of erroneous predictions which were discovered as OOB (out of bag) data was passed through the completed model. <br /><br /><b>Class.error</b> – The percentage which appears within the rightmost column represents the number of incorrectly categorized observations within the row divided by the total number of observations within the row. <br /><br /><b><u>OOB and the Confusion Matrix</u></b> <br /><br />OOB is an abbreviation for “Out of Bag”. 
As it pertains to the random forest model, as each individual tree is being established, a portion of the observations from the original data set (roughly one third, on average) will, as a consequence of bootstrap sampling, not be selected for inclusion within that tree’s subset. To generate both the OOB estimate of the error rate and the confusion matrix within the object summary, each withheld observation is passed through the trees which were created without it. Through an internal tallying and consensus methodology, the confusion matrix therefore presents a prediction for every observation which existed within the initial data set; however, each observation was only assessed by the subset of trees which did not train upon it. The consensus is that this test of prediction specificity is superior to testing the complete model against the data upon which it was trained. However, due to the level of complexity which is innate within the methodology, which makes explaining findings to others rather difficult, I will often also run the standard prediction function as well. 
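Before doing so, the size of the withheld portion is worth quantifying. A bootstrap sample of n rows omits any given row with probability (1 - 1/n)^n, which approaches 1/e, or roughly 36.8%, as n grows. The base-R simulation below (illustrative only, not part of the randomForest workflow) confirms this:

```r
set.seed(454)

n <- 100

# For each of 1000 simulated bootstrap samples, record the fraction of rows left "out of bag" #
oob_fraction <- replicate(1000, {
  drawn <- sample(n, size = n, replace = TRUE)
  1 - length(unique(drawn)) / n
})

# The average out-of-bag fraction hovers near 1/e (approximately 0.368) #
mean(oob_fraction)
```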
<br /><br /><b># View model classification results with training data # <br /><br />prediction <- predict(mod, raniris[1:100,], type="class") <br /><br />table(raniris[1:100,]$Species, predicted = prediction ) <br /><br /> # View model classification results with test data # <br /><br />prediction <- predict(mod, raniris[101:150,], type="class") <br /><br />table(raniris[101:150,]$Species, predicted = prediction ) </b><br /><br /><u>Console Output (1):</u> <br /><br /><i> predicted <br /> setosa versicolor virginica <br /> setosa 31 0 0 <br /> versicolor 0 35 0 <br /> virginica 0 0 34 </i><br /><br /><u>Console Output (2):</u> <br /><br /><i> predicted <br /> setosa versicolor virginica <br /> setosa 19 0 0 <br /> versicolor 0 13 2 <br /> virginica 0 2 14 </i><br /><br />As you probably already noticed, the “Console Output (1)” values differ from those produced within the object’s Confusion Matrix. This is a result of the phenomenon which was just previously discussed. <br /><br />To further illustrate this concept, if I were to change the number of trees to be created to: 2, thus, overriding the package default, the Confusion Matrix will lack enough observations to reflect the total number of observations within the initial set. The result would be the following: <br /><br /><b># With the package "randomForest" downloaded and enabled # <br /><br /># Create the model # <br /><br />mod <- randomForest(Species ~., data= raniris[1:100,], ntree= 2, type = "class") <br /><br /># View the model summary # <br /><br />mod </b><br /><br /><i>Call: <br /> randomForest(formula = Species ~ ., data = raniris[1:100, ], ntree = 2, type = "class") <br /> Type of random forest: classification <br /> Number of trees: 2 <br />No. 
of variables tried at each split: 2 </i><br /><br /><i> OOB estimate of error rate: 3.57% <br />Confusion matrix: <br /> setosa versicolor virginica class.error <br />setosa 15 0 0 0.00000000 <br />versicolor 0 19 0 0.00000000 <br />virginica 0 2 20 0.09090909 </i><br /><br /><b><u>Peculiar Aspects of randomForest</u></b> <br /><br />There are a few particular aspects of the randomForest package which differ from the previously discussed packages. One of which is how randomForest() assesses variables within a data frame. Specifically, the package function requires that variables which will be analyzed must have their types specifically assigned. <br /><br />To address this, we must first view the data type to which each variable is assigned. <br /><br />This can be accomplished with the following code: <br /><br /><b>str(raniris) <br /></b><br />Which produces the output: <br /><br /><i>'data.frame': 150 obs. of 5 variables: <br /> $ Sepal.Length: num 5 5.6 4.6 6.4 5.7 7.7 6 5.8 6.7 5.6 ... <br /> $ Sepal.Width : num 3.4 2.5 3.6 3.1 2.5 3.8 3 2.7 3.1 3 ... <br /> $ Petal.Length: num 1.5 3.9 1 5.5 5 6.7 4.8 5.1 4.4 4.5 ... <br /> $ Petal.Width : num 0.2 1.1 0.2 1.8 2 2.2 1.8 1.9 1.4 1.5 ... <br /> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 1 3 3 3 3 3 2 2 ... </i><br /><br />While this data frame does not require additional modification, if there were a need to change or assign variable types, this can be achieved through the following lines of code: <br /><br /><b># Change variable type to continuous # <br /><br />dataframe$contvar <- as.numeric(dataframe$contvar) <br /><br /># Change variable type to categorical # <br /><br />dataframe$catvar <- as.factor(dataframe$catvar) </b><br /><br />Another unique differentiation which applies to the randomForest() function is the way in which it handles missing observational variable entries. 
You may recall from when we were previously building tree models within the <b>“rpart”</b> package, that the model methodology included within such contained an internal algorithm which assessed missing variable observational values, and assigned those values <b>“surrogate values”</b> based on other similar variable observations. <br /><br />Unfortunately, the randomForest() function requires that the user take a more manual approach as it pertains to working around, and otherwise including, these observational values within the eventual model. <br /><br />First, be sure that all variables within the model are appropriately assigned to the correct corresponding data types. <br /><br />Next, you will need to impute the data. To achieve this, you will need to utilize the following code for each variable column which is missing data. <br /><br /><b># Impute missing variable values # <br /><br />rfImpute(variablename ~., data=dataframename, iter = 500) </b><br /><br />This function instructs the randomForest package library to create new entries for the specified variable by considering similar entries contained within the other variable columns. “iter = “ specifies the number of iterations to utilize when accomplishing this task, as this method of value generation requires the creation of numerous tree models. A maximum of 6 iterations is typically enough to accomplish this task; however, I err on the side of extreme caution. Even if your data frame is colossal, 6 iterations should suffice. 
<br /><br />Though it’s unnecessary, let’s apply this function to each variable within our “iris” data frame: <br /><br /><b>raniris[1:100,]$Sepal.Length <- rfImpute(Sepal.Length ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Sepal.Width <- rfImpute(Sepal.Width ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Petal.Length <- rfImpute(Petal.Length ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Petal.Width <- rfImpute(Petal.Width ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Species <- rfImpute(Species ~., data=raniris[1:100,], iter = 500) </b><br /><br />You will receive the error message: <br /><br /><i>Error in rfImpute.default(m, y, ...) : No NAs found in m </i><br /><br />This correctly indicates that there were no NA values to be found in the initial set. <br /><br /><b><u>Variables to Consider for Initial Nodal Split</u></b> <br /><br />The randomForest package has embedded within its namesake function a default value for the number of variables which are considered for each initial nodal split. This value can be modified by the user for optimal utilization of the model’s capabilities. The functional option to specify this modification is <b>“mtry”</b>. <br /><br />How would a researcher decide what the optimal value of this option ought to be? Thankfully, a YouTube user named <b>StatQuest with Josh Starmer</b> has created the following code to assist us with this decision. 
<br /><br /><b># Optimal mtry assessment # <br /><br /># vector(length = ) must equal the number of independent variables within the function # <br /><br /># for(i in 1: ) must have a value which equals the number of independent variables within the function # <br /><br />oob.values <- vector(length = 4) <br /><br />for(i in 1:4) { <br /><br /> temp.model <- randomForest(Species ~., data=raniris[1:100,], mtry=i, ntree=1000) <br /><br /> oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1] <br /><br />} <br /><br /># View the object # <br /><br />oob.values </b><br /><br /><u>Console Output</u> <br /><br /><i>[1] 0.04 0.04 0.04 0.04</i> <br /><br />The values produced are the OOB error rates which are associated with each number of variable inclusions. <br /><br />Therefore, the leftmost value would be the OOB error rate with one variable included within the model. The rightmost value would be the OOB error rate with four variables included with the model. <br /><br />In the case of our model, as there is no change in OOB error as it pertains to the number of variables utilized for initial nodal split consideration, the option “mtry” can remain unaltered. However, if for whatever reason, we wished to consider a set of 3 random variables for each initial split within our model, we would utilize the following code: <br /><br /><b>mod <- randomForest(Species ~., data= raniris[1:100,], mtry= 3, type = "class") </b><br /><br /><b><u>Graphing Output</u></b> <br /><br />There are numerous ways to graphically represent the inner aspects of a random forest model as its aspects work in tandem to generate a predictive analysis. In this section, we will review two of the simplest methods for generating illustrative output. <br /><br />The first method creates a general error plot of the model. 
This can be achieved through the utilization of the following code: <br /><br /><b># Plot model # <br /><br />plot(mod) <br /><br /># Include legend # <br /><br />layout(matrix(c(1,2),nrow=1), <br /><br /> width=c(4,1)) <br /><br />par(mar=c(5,4,4,0)) # No margin on the right side <br /><br />plot(mod, log="y") <br /><br />par(mar=c(5,0,4,2)) # No margin on the left side <br /><br />plot(c(0,1),type="n", axes=F, xlab="", ylab="") <br /><br /># “col=” and “fill=” must both be set to the number of columns within mod$err.rate: one (the overall OOB series), plus one series per class of the dependent variable # <br /><br />legend("topleft", colnames(mod$err.rate),col=1:4,cex=0.8,fill=1:4) <br /><br /># Source of Inspiration: <a href="https://stackoverflow.com/questions/20328452/legend-for-random-forest-plot-in-r">https://stackoverflow.com/questions/20328452/legend-for-random-forest-plot-in-r</a> # </b><br /><br />This produces the following output: <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr2xW_Q8x3Xlwm9Zh8sMDSaQd-Nco3DniDC7XQ8JsyoO4A1DlKNQ0kE3_h0Rgc3ZPWsqN4QmnaOQB8_1n5rl7qFKAfanwsRMMsu1JE506diGujimbQwps9L5hcwPtJaW_dcd7sFXSYQQDKCQtEs_0GlRvr4SgKg_x_S0F7ZxVxxAUfiMbOV6i0Qsvc/s868/RForest2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="868" data-original-width="718" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr2xW_Q8x3Xlwm9Zh8sMDSaQd-Nco3DniDC7XQ8JsyoO4A1DlKNQ0kE3_h0Rgc3ZPWsqN4QmnaOQB8_1n5rl7qFKAfanwsRMMsu1JE506diGujimbQwps9L5hcwPtJaW_dcd7sFXSYQQDKCQtEs_0GlRvr4SgKg_x_S0F7ZxVxxAUfiMbOV6i0Qsvc/w529-h640/RForest2.png" width="529" /></a></div><div><br /></div>This next method creates an output which quantifies the importance of each variable within the model. The type of analysis which determines the variable importance depends on the model type specified within the initial function. 
In the case of our classification model, the following graphical output is produced from the line of code below: <br /><br /><b>varImpPlot(mod)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7tFWDmq_PUyOuXrX1RtA-5YADudkaoq1CHbph35EEhUcOQoOjCgSUtdRpefA76oz8ggrsRYiY3sNeJNtCUM9nWWywzXv9YoYgUBUPTtybDN1KTGe-GJEdQxuetovXjk6N_PKPIUF_fP8Q5pvFDLSWXmIQih5nlDSqJW3AnOt0ihpg7OvWLAeAVqVy/s557/RForest3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="294" data-original-width="557" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7tFWDmq_PUyOuXrX1RtA-5YADudkaoq1CHbph35EEhUcOQoOjCgSUtdRpefA76oz8ggrsRYiY3sNeJNtCUM9nWWywzXv9YoYgUBUPTtybDN1KTGe-GJEdQxuetovXjk6N_PKPIUF_fP8Q5pvFDLSWXmIQih5nlDSqJW3AnOt0ihpg7OvWLAeAVqVy/s16000/RForest3.png" /></a></div><div><br /><b><u>A Real Application Demonstration (ANOVA)</u></b> <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "randomForest" downloaded and enabled # <br /><br /># Create the model # <br /><br />anmod <- randomForest(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova") <br /></b><br />Like the previously discussed methodologies, you also have the option of utilizing Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE) to analyze the model’s predictive capacity. 
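Both error measures are also simple to compute directly in base R, should the package providing the rmse() function be unavailable. The sketch below states the two definitions; the sample vectors are hypothetical values for demonstration:

```r
# Root Mean Squared Error: the square root of the average squared difference #
RMSE <- function(actual, predicted) {sqrt(mean((actual - predicted)^2))}

# Mean Absolute Error: the average absolute difference #
MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Hypothetical actual and predicted values #
actual <- c(5.0, 6.0, 7.0)
predicted <- c(5.5, 5.5, 7.0)

RMSE(actual, predicted)   # 0.4082483
MAE(actual, predicted)    # 0.3333333
```

Because squaring magnifies large differences, RMSE penalizes large individual errors more heavily than MAE does.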
<br /><br /><b># Compute the Root Mean Squared Error (RMSE) of model training data # <br /><br />prediction <- predict(anmod, raniris[1:100,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[1:100,]$Sepal.Length, prediction ) <br /><br /># Compute the Root Mean Squared Error (RMSE) of model testing data # <br /><br />prediction <- predict(anmod, raniris[101:150,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[101:150,]$Sepal.Length, prediction ) <br /><br /># Mean Absolute Error # <br /><br /># Create MAE function # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br /><br /># Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA # <br /><br /># Regenerate Predictions # <br /><br />anprediction <- predict(anmod , raniris[1:100,]) <br /><br /># Utilize MAE function on model training data # <br /><br />MAE(raniris[1:100,]$Sepal.Length, anprediction) <br /><br /># Mean Absolute Error # <br /><br />anprediction <- predict(anmod , raniris[101:150,]) <br /><br /># Utilize MAE function on model testing data # <br /><br />MAE(raniris[101:150,]$Sepal.Length, anprediction) </b><br /><br /><u>Console Output (RMSE)</u> <br /><br /><i>[1] 0.2044091 <br /><br />[1] 0.3709858 </i><br /><br /><u>Console Output (MAE)</u> <br /><br /><i>[1] 0.2215909 <br /><br />[1] 0.2632491 </i><br /><br /> Just like the classification variation of the random forest model, graphical outputs can also be created to illustrate the internal aspects of the ANOVA version of the model. 
<br /><br /><b># Plot model # <br /><br />plot(anmod)</b></div><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu6a1NezRetETof1Tm_U30uidFLjCq_zyBbPoVL0YYG9QoR5whDXt5rYQdqdJjKwAi0K1UZRhYsvixsjj-L0uC3lEsIqNMSBrS4ULynLPHPc2xdxQW1xkBh3NRezT8JqAOLTg4KmfhabppEVWW8KZzq3khyukrroFGPjXYHL2H5lNY6rf4K5UxIGya/s517/RForest4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="293" data-original-width="517" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu6a1NezRetETof1Tm_U30uidFLjCq_zyBbPoVL0YYG9QoR5whDXt5rYQdqdJjKwAi0K1UZRhYsvixsjj-L0uC3lEsIqNMSBrS4ULynLPHPc2xdxQW1xkBh3NRezT8JqAOLTg4KmfhabppEVWW8KZzq3khyukrroFGPjXYHL2H5lNY6rf4K5UxIGya/w640-h362/RForest4.png" width="640" /></a></div><div><br /></div><b># Measure variable significance # <br /><br />varImpPlot(anmod)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQmXeLw-ISzJZPUTG-Z4xn6t3iHY5kjBq-C5wDPvbLYo2DuLBVmxRqwuyHAVFrDg5dDShg3gXqtdRT-kmzf1RoTOpjskQ22xgirtD9wBkUEYUXP0n4pNwzEQn0EuZugBowmIFzxoRMuuJelfIpufzW2kX3L8sse6HxLKlhe-RUVWVEEQylPZ_Hh8g4/s707/RForest5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="293" data-original-width="707" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQmXeLw-ISzJZPUTG-Z4xn6t3iHY5kjBq-C5wDPvbLYo2DuLBVmxRqwuyHAVFrDg5dDShg3gXqtdRT-kmzf1RoTOpjskQ22xgirtD9wBkUEYUXP0n4pNwzEQn0EuZugBowmIFzxoRMuuJelfIpufzW2kX3L8sse6HxLKlhe-RUVWVEEQylPZ_Hh8g4/w640-h266/RForest5.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">That's all for this entry, Data Heads.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div 
class="separator" style="clear: both; text-align: left;">We'll continue on the topic of machine learning next week.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Until then, stay studious!</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">-RD</div><div><br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-68109324435135815272022-10-08T17:45:00.003-04:002022-10-08T22:37:16.944-04:00(R) Machine Learning - Bagging, Boosting and Bootstrap Aggregation – Pt. IINow that you have a fundamental understanding of tree-based modeling, we can begin to discuss the concept of <b>"Bootstrap Aggregation"</b>. Both of the previously mentioned concepts will come to serve as compositional aspects of a separate model known as <b>"The Random Forest"</b>. This methodology will be discussed in the subsequent article.<br /><br />All three of these concepts classify as <b>"Machine Learning"</b>, specifically, supervised machine learning. <br /><br /><b>"Bagging"</b> is a portmanteau which serves as a short abbreviation for <b>"Boot</b>strap <b>Agg</b>regation<b>"</b>. Bootstrap aggregation is a term which describes a methodology in which multiple randomized observations are drawn from a sample data set. <b>"Boosting" </b>refers to the algorithm which analyzes the numerous sample sets composed as a result of the previous process. Ultimately, from these sets, numerous decision trees are created, into which test data is eventually passed. Each observation within the test data set is analyzed as it passes through the numerous nodes of each individual tree. The final predictive output is the consensus reached by a majority of the individual internal models. 
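To make this consensus step concrete, here is a minimal base-R sketch of how a majority vote across three internal trees could be tallied for a single test observation. The vote values below are hypothetical, not the output of an actual fitted model; the ensemble functions discussed later perform this tallying internally.

```r
# Hypothetical class votes for one test observation,
# one vote from each of three internal decision trees
votes <- c("versicolor", "virginica", "versicolor")

# Tally the votes and return the most frequent class label
consensus <- names(which.max(table(votes)))

consensus
# [1] "versicolor"
```

The sketch above only illustrates the logic of the final step; nothing within it is required for the demonstrations which follow.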
<br /><br /><b><u>How Bagging is Utilized<br /></u></b><br />As previously discussed, <b>"Bagging" </b>is a data sampling methodology. For demonstrative purposes, let's consider how it applies to a randomized version of the <b>"iris" </b>data frame. Here is a portion of the data frame as it currently exists within the "R" platform.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJyJYn8AIpR7HWH9BtRYP3zozxC1u9CmhjykcNKrgxyAPpJsOWciLGiyQUtmLZRCzjVk8d6jpLq7IsJq-tjNBcerp_oXfDUoSQ2QM1CzszvHyZlzwHqhe7PTGNSkPkOkhwuNn0WwpYboXqp5gAvMw0HBusBSQASls-FsdOmj-XV_a5qjNGnDtLi42V/s505/bag1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="253" data-original-width="505" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJyJYn8AIpR7HWH9BtRYP3zozxC1u9CmhjykcNKrgxyAPpJsOWciLGiyQUtmLZRCzjVk8d6jpLq7IsJq-tjNBcerp_oXfDUoSQ2QM1CzszvHyZlzwHqhe7PTGNSkPkOkhwuNn0WwpYboXqp5gAvMw0HBusBSQASls-FsdOmj-XV_a5qjNGnDtLi42V/s16000/bag1.png" /></a></div><div><br /></div>From this data frame, we could utilize the<b> "bagging" </b>methodology to create numerous subsets which contain aspects of the observations contained therein. This methodology will sample from the data frame a pre-determined number of times until it has created a single data sub-set. Once this task has been completed, the process is repeated until a pre-determined number of subsets have been created. Observations from the initial data frame can be sampled multiple times in order to build each individual subset. Therefore, each subset may contain multiple instances of the same observation. 
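The sampling just described can be sketched in a few lines of base R with the sample() function. This is only an illustration of the resampling logic, not the internal code of any bagging package, and the seed value is arbitrary.

```r
# Draw one bootstrap subset from iris: sample row indices
# with replacement, keeping the subset the same size as the original
set.seed(454)
idx <- sample(nrow(iris), replace = TRUE)
boot1 <- iris[idx, ]

# The subset has as many rows as the original data frame
nrow(boot1)

# ...but some rows of the original were drawn more than once
sum(duplicated(idx))

# ...and, correspondingly, some original rows were never drawn at all
mean(seq_len(nrow(iris)) %in% idx)
```

Repeating this draw a pre-determined number of times yields the collection of subsets described above.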
<br /><br />A graphical representation of this process is illustrated below:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEVwnEEMBJuq6g3-LTk4fdsOJ0yuQ097Am3VD-hxgYn_OsbUX45vxaAmRwv9q4zF9VoaaVgXYAfLIrirB3H6KIGPpnaQ-1sJcywRkBVQA1TZpRy845gqFhw11m-Mim296NHJgtH5XFBl0REEw64kf6HgEKtGtb6vkH1P-bCVYOEECiKVNYKVQE3eLY/s817/bag2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="289" data-original-width="817" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEVwnEEMBJuq6g3-LTk4fdsOJ0yuQ097Am3VD-hxgYn_OsbUX45vxaAmRwv9q4zF9VoaaVgXYAfLIrirB3H6KIGPpnaQ-1sJcywRkBVQA1TZpRy845gqFhw11m-Mim296NHJgtH5XFBl0REEw64kf6HgEKtGtb6vkH1P-bCVYOEECiKVNYKVQE3eLY/w640-h226/bag2.png" width="640" /></a></div><div><br /></div>In the case of our illustrated example, three new data samples were created. Each new sample contains a similar number of observations, however, observations from the original data frame are not exclusive in each set. Also, as demonstrated in the above graphic, data observations can repeat within the same sample. <br /><br /><b><u>Boosting Described</u></b> <br /><br />Once new data samples have been created, the <b>"boosting"</b> process, which is the portion of the algorithm which is initiated following the <b>"bagging"</b> methodology's application, begins to create individualized decision trees for each newly created set. Once each decision tree has been created, the model’s creation process is complete. <br /><br /><b><u>The Decision Making Process</u></b> <br /><br />With the model created, the process of predicting dependent variable values can be initiated. 
<br /><br />Remember that each decision tree was created from the data observations of which its corresponding subset was comprised.<div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQyQ6_E29WDC-6z3qH9n8TKZZ_QmuRQRr5y1G9sJNGf3_d5YtvOqKC62KnWw043yVQGSALVE6LUMcNvkQm2XmBlNVtPsuR2XxjW23kMoH5jBgiY90dVW2LZuTrDwHfpwZSkG_TeFkI1YD_fkc86bFyoTrWPBPoDsgpOgp8E3g51eHUkXItK5CKBFz/s625/bag3x.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="245" data-original-width="625" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQyQ6_E29WDC-6z3qH9n8TKZZ_QmuRQRr5y1G9sJNGf3_d5YtvOqKC62KnWw043yVQGSALVE6LUMcNvkQm2XmBlNVtPsuR2XxjW23kMoH5jBgiY90dVW2LZuTrDwHfpwZSkG_TeFkI1YD_fkc86bFyoTrWPBPoDsgpOgp8E3g51eHUkXItK5CKBFz/s16000/bag3x.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>The above graphical representation illustrates observation 8 being passed through the model. The model, being comprised of three separate decision trees, which were synthesized from three separate data subsets, produces three different internal outcomes. The average of these outcomes is what is eventually returned to the user as the ultimate product of the model. <br /><br /><u style="font-weight: bold;">A Real Application Demonstration (Classification)</u> <br /><br />Again, we will utilize the <b>"iris"</b> data set which comes embedded within the R data platform. <br /><br />A short note on the standard notation utilized for this model type: <br /><br /><b>D = The training data set. <br /><br />n = The number of observations within the training data set. <br /><br />n^1 = "n prime". The number of observations within each data subset. <br /><br />m = The number of subsets. </b><br /><br />In this example we will allow the bagging() function to perform its default behavior without specifying any additional options. 
If n^1 = n, then each subset created from the training data set is expected to contain, on average, approximately (1 - 1/e) (≈63.2%) of the unique observations contained within the training data set, with the remaining draws being repeats. Sampling subsets of this size, with replacement, is the default behavior of the <b>bagging() </b>function. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "ipred" downloaded and enabled # <br /><br /># Create the model # <br /><br />mod <- bagging(Species ~., data= raniris[1:100,], type = "class") <br /><br /># View model classification results with training data # <br /><br />prediction <- predict(mod, raniris[1:100,], type="class") <br /><br />table(raniris[1:100,]$Species, predicted = prediction ) <br /><br /># View model classification results with test data # <br /><br />prediction <- predict(mod, raniris[101:150,], type="class") <br /><br />table(raniris[101:150,]$Species, predicted = prediction ) </b><br /><br /><u>Console Output (1):</u> <br /><br /><i> predicted <br /><br /> setosa versicolor virginica <br /><br /> setosa 31 0 0 <br /><br /> versicolor 0 35 0 <br /><br /> virginica 0 0 34 </i><br /><br /><u>Console Output (2):</u> <br /><br /><i> predicted <br /><br /> setosa versicolor virginica <br /><br />setosa 19 0 0 <br /><br />versicolor 0 13 2 <br /><br />virginica 0 2 14 </i><br /><br /><b><u>A Real Application Demonstration (ANOVA)</u></b> <br /><br />In this second example demonstration, all of the notational aspects of the model and the restrictions of the 
function still apply. However, in this case, the dependent variable is continuous, not categorical. To test the predictability of the model, the Root Mean Squared Error and the Mean Absolute Error values are calculated. For more information as it pertains to the calculation and interpretation of these measurements of predictability, please consult the prior article. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "ipred" downloaded and enabled # <br /><br /># Create the model # <br /><br />anmod <- bagging(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova") <br /><br /># Compute the Root Mean Squared Error (RMSE) of model training data # <br /><br />prediction <- predict(anmod, raniris[1:100,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[1:100,]$Sepal.Length, prediction ) <br /><br /># Compute the Root Mean Squared Error (RMSE) of model test data # <br /><br />prediction <- predict(anmod, raniris[101:150,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[101:150,]$Sepal.Length, prediction ) </b><br /><br /><u>Console Output (1) - Training Data:</u> <br /><br /><i>[1] 0.3032058 </i><br /><br /><u>Console Output (2) - Test Data:</u> <br /><br /><i>[1] 0.3427076 </i><br /><br /><b># Create a function to calculate Mean Absolute Error # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br
/># Compute the Mean Absolute Error (MAE) of model training data # <br /><br />anprediction <- predict(anmod, raniris[1:100,]) <br /><br />MAE(raniris[1:100,]$Sepal.Length, anprediction) <br /><br /># Compute the Mean Absolute Error (MAE) of model test data # <br /><br />anprediction <- predict(anmod, raniris[101:150,]) <br /><br />MAE(raniris[101:150,]$Sepal.Length, anprediction) </b><br /><br /><u>Console Output (1) - Training Data:</u> <br /><br /><i>[1] 0.2289299 <br /></i><br /><u>Console Output (2) - Test Data:</u> <br /><br /><i>[1] 0.2706003 <br /></i><br /><b><u>Conclusions</u></b> <br /><br />The method from which the <b>bagging()</b> function was derived was initially postulated by Leo Breiman, one of the same statisticians who developed the tree model methodology. You will likely never be inclined to use this methodology as a standalone method of analysis. As was previously mentioned within this article, the justification for this topic’s discussion pertains solely to its applicability as an aspect of the random forest model. Therefore, from a pragmatic standpoint, if tree models are the model type which you wish to utilize when performing data analysis, you would either be inclined to select the basic tree model for its simplicity, or the random forest model for its enhanced ability. <br /></div><div><br /></div><div>That's all for today.</div><div><br /></div><div>I'll see you next week,</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-23605355899129620292022-09-25T14:56:00.004-04:002022-09-25T14:56:58.287-04:00(R) Machine Learning - Trees - Pt. I<p>This article will serve as the first of many articles which will be discussing the topic of Machine Learning. 
Throughout the series of subsequent articles published on this site, we will discuss Machine Learning as a topic, and the theories and algorithms which ultimately serve as the subject’s foundation. <br /><br />While I do not personally consider the equations embedded within the<b> “rpart” </b>package to be machine learning in the literal sense, those who act as authorities on such matters define it otherwise. By the definition postulated by the greater community, tree-based models represent an aspect of machine learning known as "supervised learning". What this essentially implies is that the computer software implements a statistical solution to an evidence-based question posed by the user, after which the user has the opportunity to review the solution and the rationale, and make model edits where necessary. <br /><br />The functionality implemented within tree-based models is often drawn from an abstract or white paper written by mathematicians. Therefore, in many cases, the algorithms which ultimately animate the decision-making process are too difficult, or too cumbersome, for a human being to apply by hand. This does not mean that such undertakings are impossible; however, given the time commitment, which depends on the size of the data frame to be analyzed, the more pragmatic approach is to leave the process entirely to the machines which are designed to perform such functions. </p><b><u>Introducing Tree-Based Models with "rpart" </u></b><br /><br />Like the K-Means Cluster, <b>"rpart" </b>is reliant on an underlying algorithm which, due to its complexity, produces results which are difficult to verify. Meaning that, unlike a process such as categorical regression, much occurs outside of the user's observation from a mathematical standpoint. Due to the nature of the analysis, no equation is output for the user to check, only the model itself. 
Without this proof of concept, the user can only assume that the analysis was appropriately performed, and the model produced was the optimal variation necessary for future application. <br /><br />For the examples included within this article, we will be using the R data set <b>"iris"</b>. <div><br /><b><u>Preparing for Analysis </u></b><br /><br />Before we begin, you will need to download two separate auxiliary packages from the CRAN repository, those being: <br /><br /><b>"rpart" </b><br /><br />and <br /><br /><b>"rpart.plot" </b><br /><br />Once you have completed this task, we will move forward by reviewing the data set prior to analysis. <br /><br />This can be achieved by initiating the following functions: <br /><br /><b>summary(iris) <br /><br />head(iris) </b><br /><br />Since the data frame is initially sorted and organized by <b>"Species"</b>, prior to performing the analysis, we must take steps to randomize the data contained within the data frame.</div><div><br /><b><u>Justification for Randomization </u></b><br /><br />Presenting data to a machine which performs analysis through the utilization of an algorithm is somewhat analogous to teaching a young child. To better illustrate this concept, I will present a demonstrative scenario. <br /><br />Let's imagine that, for some particular reason, you were attempting to instruct a very young child on the topic of dogs, and to accomplish such, you presented the child with a series of pictures which consisted of only golden Labradors. As you might imagine, the child would walk away from the exercise with the notion that dogs, as an object, always consisted of the features associated with the Labradors of the golden variety. Instead of believing that a dog is a generalized descriptor which encompasses numerous minute and discretely defined features, the child will believe that all dogs are golden Labradors, and that golden Labradors are the only type of dog. 
<br /><br />Machines learn* in a similar manner. Each algorithm provides a distinct and uniquely applicable methodology as it pertains to the overall outcome of the analysis; however, the typical algorithm possesses a bias, much as humans do, based solely on the data as it is initially presented. This is why randomization of data, which instead presents a diverse and robust summary of the data source, is so integral to the process. <br /><br />This method of randomization was inspired by the YouTube user: Jalayer Academy. A link to the video which describes this randomization technique can be found below. <br /><br /><i>* - or the algorithm that is associated with the application which creates the appearance of such. </i><br /><br /><b># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ]</b></div><div><br /></div><div><i>Jalayer Academy: <a href="https://www.youtube.com/watch?v=XLNsl1Da5MA">https://www.youtube.com/watch?v=XLNsl1Da5MA</a></i></div><div><br /><b><u>Training Data and The "rpart" Algorithm </u></b><br /><br />Before we apply the algorithm within the <b>"rpart" </b>package, there are two separate topics which I wish to discuss. <br /><br />The<b> "rpart"</b> algorithm, as was previously mentioned, is one of many machine learning methodologies which can be utilized to analyze data. The differentiating factor which separates methodologies is typically based on the underlying algorithm which is applied to the initial data frame. 
In the case of <b>"rpart"</b>, the methodology utilized was initially postulated by Breiman, Friedman, Olshen and Stone.</div><div><br /><b><u>Classification and Regression Trees </u></b><br /><br />On the topic of training data, let us again return to our previous child training example. When teaching a child, if utilizing the flash card method that was discussed prior, you may be inclined to set a few of the cards which you have designed aside. The reason for such is that these cards could be utilized after the initial training, in order to test the child's comprehension of the subject matter. <br /><br />Most machines are trained in a similar manner. A portion of the initial data frame is typically set aside in order to test the overall strength of the model after the model's synthesis is complete. After passing the additional data through the model, a rough conclusion can be drawn as it pertains to the overall effectiveness of the model's design. </div><div><br /><b><u>Method of Application (categorical variable) </u></b><br /><br />As is the case as it pertains to linear regression, we must designate the dependent variable that we wish to predict. If the variable is a categorical variable, we will specify the <b>rpart() </b>function to include a method option of <b>"class"</b>. If the variable is a continuous variable, we will specify the <b>rpart()</b> function to include a method option of <b>"anova"</b>. <br /><br />In this first case, we will attempt to create a model which can be utilized to, through the assessment of independent variables, properly predict the species variable. <br /><br />The structure of the <b>rpart()</b> function is incredibly similar to the linear model function which is native within R. 
<br /><br /><b>model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="class")</b><br /><br /><b><u>Let's break this structure down: </u></b><br /><br /><b>Species</b> - Is the model's dependent variable. <br /><br /><b>Sepal.Length + Sepal.Width + Petal.Length + Petal.Width</b> - Are the model's independent variables. <br /><br /><b>data = raniris[1:100,] </b>- This option is specifying the data which will be included within the analysis. As we discussed previously, for the purposes of our model, only the first 100 row entries of the initial data frame will be included as the foundational aspects in which to structure the model. <br /><br /><b>method = "class"</b> - This option indicates to the computer that the dependent variable is categorical and not continuous. <br /><br />After running the above function, we are left with the newly created variable:<b> "model"</b>.</div><div><br /><b style="text-decoration: underline;">Conclusions</b> <br /><br />From this variable we can draw various conclusions. <br /><br />Running the variable: <b>"model"</b> within the terminal should produce the following console output: <br /><br /><i>n= 100 <br /><br />node), split, n, loss, yval, (yprob) <br /><br /> * denotes terminal node <br /><br />1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000) <br /><br />2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) * <br /><br />3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362) <br /><br />6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * <br /><br />7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) * </i><br /></div><div><br /><b><u>Let's break this structure down:</u></b><br /><br /><b>Structure Summary </b><br /><br /><b>n = 100</b> - This is the initial number of observations passed into the model. 
<br /><br /><b>Logic of the nodal split</b> – Example: Petal.Length>=2.45 <br /><br /><b>Total Observations Included within node</b> - Example: 69 <br /><br /><b>Observations which were incorrectly designated</b> - Example: 34 <br /><br /><b>Nodal Designation </b>– Example: versicolor <br /><br /><b>Percentage of categorical observations occupying each category </b>– Example: <i>(0.00000000 0.50724638 0.49275362) <br /></i><br /><b>The Structure Itself</b></div><div><br /><b><i>root 100 65 versicolor (0.31000000 0.35000000 0.34000000)</i> </b>- Root is the initial number of observations which are fed through the tree model, hence the term root. The numbers which are found within the parentheses are the percentage breakdowns of the observations by category. <br /><br /><b><i>Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *</i> </b>- The first split which filters model data between two branches. The first branch sorts data to the left leaf, in which 31 of the observations are setosa (100%). The condition which determines the discrimination of data is the Petal.Length (<2.45) variable value of the observation. The (*) symbol indicates that the node is a terminal node. This means that this node leads to a leaf. <br /><br /><i><b>Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)</b> </i>- This branch indicates a split based on the right sided alternative to the prior condition. The initial number within the first set of numbers indicates the number of cases which remain prior to further sorting, and the subsequent number indicates the number of cases which are virginica (and not versicolor). The next set of numbers indicates the percentage of the remaining 69 cases which are versicolor (50%), and the percentage of the remaining 69 cases which are virginica (49%). <br /><br /><b><i>Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * </i></b>- This branch indicates a left split. 
The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, 35 of the observations are versicolor (95%) and 2 of the observations are virginica (5%). <br /><br /><b><i>Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *</i></b> - This branch indicates a right split alternative. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, 32 of the observations are virginica (100%), and 0 of the observations are versicolor (0%).</div><div><br />Further information, for inference, can be generated by running the following code within the terminal: <br /><b><br />summary(model) <br /></b><br />This produces the following console output: <br /><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px;"><br />(I have created annotations beneath each relevant portion of output)<br /><br /><i>Call: </i><br /><br /><i>rpart(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + </i><br /><br /><i> Petal.Width, data = raniris[1:100, ], method = "class") </i><br /><br /><i> n= 100 </i><br /><br /><i> CP nsplit rel error xerror xstd </i><br /><br /><b><i>1 0.4846154 0 1.00000000 1.26153846 0.05910576 </i><br /><br /><i>2 0.0100000 2 0.03076923 0.04615385 0.02624419 </i><br /><br />This portion of the output will be useful as we explore the process of "pruning" later in the article. 
</b><br /><br /><i>Variable importance <br /><br /> Petal.Width Petal.Length Sepal.Length Sepal.Width <br /><br /> 35 31 20 14 </i><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px;"><b><i>Node number 1: 100 observations, complexity param=0.4846154 <br /><br /> predicted class=versicolor expected loss=0.65 P(node) =1 <br /><br /> class counts: 31 35 34 <br /><br /> probabilities: 0.310 0.350 0.340 <br /><br /> left son=2 (31 obs) right son=3 (69 obs) <br /><br /> Primary splits: <br /><br /> Petal.Length < 2.45 to the left, improve=32.08725, (0 missing) <br /><br /> Petal.Width < 0.8 to the left, improve=32.08725, (0 missing) <br /><br /> Sepal.Length < 5.55 to the left, improve=18.52595, (0 missing) <br /><br /> Sepal.Width < 3.05 to the right, improve=12.67416, (0 missing) <br /><br /> Surrogate splits: <br /><br /> Petal.Width < 0.8 to the left, agree=1.00, adj=1.000, (0 split) <br /><br /> Sepal.Length < 5.45 to the left, agree=0.89, adj=0.645, (0 split) <br /><br /> Sepal.Width < 3.35 to the right, agree=0.83, adj=0.452, (0 split) <br /></i></b><br /><br /><br /><b>The initial split from the root. </b><br /><br /><br /><br /><b><i>Node number 2: 31 observations <br /><br /> predicted class=setosa expected loss=0 P(node) =0.31 <br /><br /> class counts: 31 0 0 <br /><br /> probabilities: 1.000 0.000 0.000 </i></b><br /><br /><br /> <br /><br /><b>Filtered results which exist within the "setosa" leaf. </b><br /><br /><br /> <br /><br /><b><i>Node number 3: 69 observations, complexity param=0.4846154 <br /><br /> predicted class=versicolor expected loss=0.4927536 P(node) =0.69 <br /><br /> class counts: 0 35 34 <br /><br /> probabilities: 0.000 0.507 0.493 <br /><br /> left son=6 (37 obs) right son=7 (32 obs) </i></b><br /><br /><br /> <br /><br /><b>The results of the aforementioned split prior to being filtered through the petal width conditional. 
</b><br /><br /><br /> <br /><br /><i> Primary splits: <br /><br /> Petal.Width < 1.65 to the left, improve=30.708970, (0 missing) <br /><br /> Petal.Length < 4.75 to the left, improve=25.420120, (0 missing) <br /><br /> Sepal.Length < 6.35 to the left, improve= 7.401845, (0 missing) <br /><br /> Sepal.Width < 2.95 to the left, improve= 3.878961, (0 missing) <br /><br /> Surrogate splits: <br /><br /> Petal.Length < 4.75 to the left, agree=0.899, adj=0.781, (0 split) <br /><br /> Sepal.Length < 6.15 to the left, agree=0.754, adj=0.469, (0 split) <br /><br /> Sepal.Width < 2.95 to the left, agree=0.696, adj=0.344, (0 split) </i><br /><br /><b><i>Node number 6: 37 observations <br /><br /> predicted class=versicolor expected loss=0.05405405 P(node) =0.37 <br /><br /> class counts: 0 35 2 <br /><br /> probabilities: 0.000 0.946 0.054 </i></b><br /><br /><br /> <br /><br /><b>Filtered results which exist within the "versicolor" leaf. </b><br /><br /><br /> <br /><br /><b>Node number 7: 32 observations <br /><br /> predicted class=virginica expected loss=0 P(node) =0.32 <br /><br /> class counts: 0 0 32 <br /><br /> probabilities: 0.000 0.000 1.000 </b><br /><br /><br /> <br /><br /><b>Filtered results which exist within the "virginica" leaf. </b><br /><br /><br /><br /><br /><b><u>Visualizing Output with a Well Needed Illustration</u> </b><br /><br />If you got lost somewhere along the way during the prior section, don't be ashamed, it is understandable. I am not in any way operating under the pretense that any of this is innate or easily grasped. <br /><br />However, much of what I attempted to explain in the preceding paragraphs can be best summarized through the utilization of the <b>"rpart.plot" </b>package. 
<br /><br /><b># Model Illustration Code # <br /><br />rpart.plot(model, type = 3, extra = 101) </b><br /><br />Console Output:</p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px;"><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s671/rpart1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="405" data-original-width="671" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s16000/rpart1.png" /></a></div><div><br /></div>What is being illustrated in the graphic are the decision branches, and the leaves which ultimately serve as the destinations for the final categorical filtering process. <br /><br />The leaf <b>"setosa"</b> contains 31 observations which were correctly identified as <b>"setosa"</b> observations. The total number of observations accounts for 31% of the total observational rows which were passed through the model. <br /><br />The leaf <b>"versicolor" </b>contains 35 observations which were correctly identified as <b>"versicolor"</b>, and 2 observations which were misidentified. The misidentified observations would instead belong within the <b>“virginica”</b> categorical leaf. The total number of observations contained within the <b>"versicolor" </b>leaf, both correct and incorrect, accounts for a total of 37% of the observational rows which were passed through the model. <br /><br />The leaf <b>"virginica" </b>contains 32 observations which were correctly identified as <b>"virginica"</b>. 
The total number of observations accounts for 32% of the total observational rows which were passed through the model.</div><div><br /><b><u>Testing the Model</u></b> <br /><br />Now that our decision tree model has been built, let's test its predictive ability with the data which we left absent from our initial analysis. <br /><br /><b># Create "confusion matrix" to test model accuracy # <br /><br />prediction <- predict(model, raniris[101:150,], type="class")</b><br /><div><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px; min-height: 13px;"><br /><b>table(raniris[101:150,]$Species, predicted = prediction ) </b><br /><br />A variable named <b>"prediction"</b> is created through the utilization of the <b>predict() </b>function. Passed to this function as options are: the model variable, the remaining rows of the randomized <b>"iris" </b>data frame, and the model type. <br /><br />Next, a table is created which compares each prediction against the value which was actually observed. The option <b>"predicted = " </b>will always equal your prediction variable. The numbers within the brackets <b>[101:150, ] </b>specify the rows of the randomized data frame which will act as test observations for the model. <b><i>“raniris” </i></b>is the data frame from which these observations will be drawn, and <b>“$Species”</b> specifies the data frame variable which will be assessed. <br /><br />Initiating the above lines of code produces the following console output: <br /><br /><i> predicted <br /> setosa versicolor virginica <br /> setosa 19 0 0 <br /> versicolor 0 13 2 <br /> virginica 0 2 14 </i><br /><br />This output table is known as a <b>“confusion matrix”</b>. Its purpose is to present the results in a readable format which illustrates the number of correctly predicted outcomes, and the number of incorrectly predicted outcomes, within each category.
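A confusion matrix is commonly condensed into a single accuracy figure: the sum of its diagonal (the correct predictions) divided by the total number of test observations. For the table above, that is (19 + 13 + 14) / 50. Sketched in Python:

```python
# Confusion matrix from the console output above: rows = actual, columns = predicted
matrix = [
    [19, 0, 0],   # setosa
    [0, 13, 2],   # versicolor
    [0, 2, 14],   # virginica
]

correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
total = sum(sum(row) for row in matrix)
accuracy = correct / total

print(correct, total, accuracy)  # 46 50 0.92
```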
In this particular case, all setosa observations were correctly predicted. 13 versicolor observations were correctly predicted, with 2 observations misattributed as virginica. 14 virginica observations were correctly attributed, with 2 observations misattributed as versicolor. <br /><br /><b><u>Method of Application (continuous variable)</u></b> <br /><br />Now that we’ve successfully analyzed categorical data, we will progress within our study by demonstrating rpart’s capacity as it pertains to the analysis of continuous data. <br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">Again, we will be utilizing the <b>“iris”</b> data set. However, in this scenario, we will omit <b>“Species”</b> from our model, and instead of attempting to identify the species of the iris in question, we will attempt to predict the sepal length of an iris plant based on its other attributes. Therefore, in this example, our dependent variable will be <b>“Sepal.Length”</b>. <br /><br />The main differentiation between the continuous data model and the categorical data model within the <b>“rpart” </b>package is the option which specifies the analytical methodology. Instead of specifying (method=”class”), we will instruct the package function to utilize (method=”anova”).
Therefore, the function which will lead to creation of the model will resemble: <br /><br /><b>anmodel <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova") </b><br /><br />Once the model is built, let’s take a look at the summary of its internal aspects: <br /><br /><b>summary(anmodel) </b><br /><br />This produces the output: <br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br /><i>Call: <br />rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, <br /> data = raniris[1:100, ], method = "anova") <br /> n= 100 <br /><br /> CP nsplit rel error xerror xstd <br />1 0.57720991 0 1.0000000 1.0240753 0.12908984 <br />2 0.12187301 1 0.4227901 0.4792432 0.07380297 <br />3 0.06212228 2 0.3009171 0.3499328 0.04643313 <br />4 0.03392768 3 0.2387948 0.2920761 0.04577809 <br />5 0.01783361 4 0.2048671 0.2920798 0.04349656 <br />6 0.01614077 5 0.1870335 0.2838212 0.04639387 <br />7 0.01092541 6 0.1708927 0.2792003 0.04602130 <br />8 0.01000000 7 0.1599673 0.2849910 0.04586765 <br /><br />Variable importance <br />Petal.Length Petal.Width Sepal.Width <br /> 46 37 17 <br /><br />Node number 1: 100 observations, complexity param=0.5772099 <br /> mean=5.834, MSE=0.614244 <br /> left son=2 (49 obs) right son=3 (51 obs) <br /> Primary splits: <br /> Petal.Length < 4.25 to the left, improve=0.57720990, (0 missing) <br /> Petal.Width < 1.15 to the left, improve=0.53758000, (0 missing) <br /> Sepal.Width < 3.35 to the right, improve=0.02830809, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 1.35 to the left, agree=0.96, adj=0.918, (0 split) <br /> Sepal.Width < 3.35 to the right, agree=0.65, adj=0.286, (0 split) <br /><br />Node number 2: 49 observations, complexity param=0.06212228 <br /> mean=5.226531, MSE=0.1786839 <br /> left son=4 (34 obs) right son=5 (15 obs) <br /> Primary splits: <br /> Petal.Length < 3.45 to the left, improve=0.4358197, (0 missing) <br /> 
Petal.Width < 0.35 to the left, improve=0.3640792, (0 missing) <br /> Sepal.Width < 2.95 to the right, improve=0.1686580, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 0.8 to the left, agree=0.939, adj=0.8, (0 split) <br /> Sepal.Width < 2.95 to the right, agree=0.878, adj=0.6, (0 split) <br /><br />Node number 3: 51 observations, complexity param=0.121873 <br /> mean=6.417647, MSE=0.3375317 <br /> left son=6 (39 obs) right son=7 (12 obs) <br /> Primary splits: <br /> Petal.Length < 5.65 to the left, improve=0.4348743, (0 missing) <br /> Sepal.Width < 3.05 to the left, improve=0.1970339, (0 missing) <br /> Petal.Width < 1.95 to the left, improve=0.1805629, (0 missing) <br /> Surrogate splits: <br /> Sepal.Width < 3.15 to the left, agree=0.843, adj=0.333, (0 split) <br /> Petal.Width < 2.15 to the left, agree=0.824, adj=0.250, (0 split) <br /><br />Node number 4: 34 observations, complexity param=0.03392768 <br /> mean=5.041176, MSE=0.1288927 <br /> left son=8 (26 obs) right son=9 (8 obs) <br /> Primary splits: <br /> Sepal.Width < 3.65 to the left, improve=0.47554080, (0 missing) <br /> Petal.Length < 1.35 to the left, improve=0.07911083, (0 missing) <br /> Petal.Width < 0.25 to the left, improve=0.06421307, (0 missing) <br /><br />Node number 5: 15 observations <br /> mean=5.646667, MSE=0.03715556 <br /><br />Node number 6: 39 observations, complexity param=0.01783361 <br /> mean=6.205128, MSE=0.1799737 <br /> left son=12 (30 obs) right son=13 (9 obs) <br /> Primary splits: <br /> Sepal.Width < 3.05 to the left, improve=0.1560654, (0 missing) <br /> Petal.Width < 2.05 to the left, improve=0.1506123, (0 missing) <br /> Petal.Length < 4.55 to the left, improve=0.1334125, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 2.25 to the left, agree=0.846, adj=0.333, (0 split) <br /><br />Node number 7: 12 observations <br /> mean=7.108333, MSE=0.2257639 <br /><br />Node number 8: 26 observations <br /> mean=4.903846, MSE=0.07344675 <br /><br />Node 
number 9: 8 observations <br /> mean=5.4875, MSE=0.04859375 <br /><br />Node number 12: 30 observations, complexity param=0.01614077 <br /> mean=6.113333, MSE=0.1658222 <br /> left son=24 (23 obs) right son=25 (7 obs) <br /> Primary splits: <br /> Petal.Length < 5.15 to the left, improve=0.19929710, (0 missing) <br /> Petal.Width < 1.45 to the right, improve=0.07411631, (0 missing) <br /> Sepal.Width < 2.75 to the left, improve=0.06794425, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 2.05 to the left, agree=0.867, adj=0.429, (0 split) <br /><br />Node number 13: 9 observations <br /> mean=6.511111, MSE=0.1054321 <br /><br /></i></p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><i>Node number 24: 23 observations, complexity param=0.01092541 <br /> mean=6.013043, MSE=0.1620038 <br /> left son=48 (9 obs) right son=49 (14 obs) <br /> Primary splits: <br /> Petal.Width < 1.65 to the right, improve=0.18010500, (0 missing) <br /> Petal.Length < 4.55 to the left, improve=0.12257150, (0 missing) <br /> Sepal.Width < 2.75 to the left, improve=0.03274482, (0 missing) <br /> Surrogate splits: <br /> Petal.Length < 4.75 to the right, agree=0.783, adj=0.444, (0 split) <br /><br />Node number 25: 7 observations <br /> mean=6.442857, MSE=0.03673469 <br /><br />Node number 48: 9 observations <br /> mean=5.8, MSE=0.1466667 <br /><br />Node number 49: 14 observations <br /> mean=6.15, MSE=0.1239286 </i><br /><br />The largest distinguishing factor between outputs is that instead of categorical sorting, <b>“rpart”</b> has organized the data by mean value and sorted in this manner. <b>“MSE”</b> is an abbreviation for <b>“Mean Squared Error”</b>, which measures the level of differentiation of other values in regards to the mean. The larger this value is, the greater the spatial differential between the set’s data points. </p><br />As always, the phenomenon which is demonstrated within the raw output will look better in graphical form. 
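Before moving on to the illustration, it may help to pin down the node statistic itself: the MSE which rpart reports for a node is the average squared deviation of that node's dependent-variable values from the node mean. A minimal sketch (Python, with hypothetical sepal-length values, since the actual node contents live in the randomized data frame):

```python
def node_mse(values):
    # Mean squared deviation from the node mean -- the "MSE" reported per node
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical sepal lengths within one node: the wider the spread of the
# values around the node mean, the larger the reported MSE
print(node_mse([5.0, 5.2, 5.4]))  # ≈ 0.0267 (tight cluster)
print(node_mse([4.0, 5.2, 6.4]))  # ≈ 0.96 (wide spread)
```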
To create an illustration of the model, utilize the code below: <br /><br /><b># Note: rpart.plot will not round off the numerical figures within an ANOVA model’s output graphic # <br /><br /># For this reason, I have explicitly disabled the “roundint” option # <br /><br />rpart.plot(anmodel, extra = 101, type = 3, roundint = FALSE) </b><br /><br />This creates the following output:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR6HiFlrce51NqgzY7C0k5HTw1VxHRAXcnSMilWJkIF8T6chChx9G4LEb5ZHwnkfuyuXO83PLQPbDeY0R-2KI4GOfayzya01n5RWdrNL5lYVBojIiugSjOerxNAtKAnRuF-5n1mXkF4j7oOQ49bYPYCb01syOISp0VSEwZjFSYlAlKnRf57Jf5iSFK/s638/rpartan.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="430" data-original-width="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR6HiFlrce51NqgzY7C0k5HTw1VxHRAXcnSMilWJkIF8T6chChx9G4LEb5ZHwnkfuyuXO83PLQPbDeY0R-2KI4GOfayzya01n5RWdrNL5lYVBojIiugSjOerxNAtKAnRuF-5n1mXkF4j7oOQ49bYPYCb01syOISp0VSEwZjFSYlAlKnRf57Jf5iSFK/s16000/rpartan.png" /></a></div><div><br /></div>In the leaves at the bottom of the graphic, the topmost value represents the mean value, the n value represents the number of observations which occupy that assigned filtered category, and the percentage value represents the number of observations within the leaf divided by the number of observations within the entire set. <br /><br /><b><u>Testing the Model</u> </b><br /><br />Now that our decision tree model has been built, let's test its predictive ability with the data which was left absent from our initial analysis. <br /><br />When assessing non-categorical models for their predictive capacity, there are numerous methodologies which can be employed. In this article, we will be discussing two specifically.
<br /><br /><b><u>Mean Absolute Error</u> </b><br /><br />The first measurement of predictive capacity that we will be discussing is known as the Mean Absolute Error. The Mean Absolute Error is the mean of the absolute differences between each predicted value and its corresponding observed value. <br /><br /><a href="https://en.wikipedia.org/wiki/Mean_absolute_error">https://en.wikipedia.org/wiki/Mean_absolute_error</a> <br /><br />Within the R platform, deriving this value can be achieved through the utilization of the following code: <br /><br /><b># Generate predictions from the model # <br /><br />anprediction <- predict(anmodel , raniris[101:150,]) <br /><br /># Create MAE function # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br /><br /># Function Source: <a href="https://www.youtube.com/watch?v=XLNsl1Da5MA">https://www.youtube.com/watch?v=XLNsl1Da5MA</a> # <br /><br /># Utilize MAE function # <br /><br />MAE(raniris[101:150,]$Sepal.Length, anprediction)</b></div><div><div><br /><u>Console Output:</u> <br /><br /><i>[1] 0.2976927 </i><br /><br />The above output indicates that there is, on average, a difference of 0.298 centimeters (the unit of measurement within the iris data set) between the predicted value of sepal length and the actual value of sepal length. <br /><br /><b><u>Root Mean Squared Error </u></b><br /><br />The Root Mean Squared Error is another measurement of the predictive capacity of models. Like the Mean Absolute Error, this formula is applied to the observational values as they appear within the initial data frame, and the predicted observational values which are generated by the predictive model. <br /><br />However, the manner in which the output value is synthesized is less straightforward.
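The arithmetic behind both error measurements in this section is small enough to reproduce by hand, and doing so makes their difference concrete: a single large miss inflates the Root Mean Squared Error far more than the Mean Absolute Error. A language-agnostic sketch (Python, with hypothetical actual/predicted values):

```python
def mae(actual, predicted):
    # Mean of the absolute prediction errors
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Square root of the mean of the squared prediction errors
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

actual    = [5.0, 6.0, 7.0, 5.0]
predicted = [6.0, 5.0, 6.0, 10.0]  # three misses of 1.0, one large miss of 5.0

print(mae(actual, predicted))   # 2.0
print(rmse(actual, predicted))  # ≈ 2.6458 -- the large miss is penalized more heavily
```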
The value itself is generated by solving for the square root of the average of the squared differences between the predicted observational values and the original observational values. As a result, the final output value of the Root Mean Squared Error is more difficult to interpret than its Mean Absolute Error counterpart. <br /><br />The Root Mean Squared Error is more sensitive to large differences between predicted and observed values. With the Mean Absolute Error, given enough observations, a few large misses are averaged away, which can provide the appearance of less distance between individual values than is actually the case. The Root Mean Squared Error, because each difference is squared before being averaged, continues to penalize large misses regardless of the size of the set. <br /><br /><a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">https://en.wikipedia.org/wiki/Root-mean-square_deviation</a> <br /><br />Within the R platform, deriving this value can be achieved through the utilization of the following code:</div><br /><b># Generate predictions from the model # <br /><br />anprediction <- predict(anmodel , raniris[101:150,]) <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br /># Compute the Root Mean Squared Error (RMSE) of the model test data # <br /><br />rmse(raniris[101:150,]$Sepal.Length, anprediction) <br /></b><br /><u>Console Output:</u> <br /><br /><i>[1] 1.128444 </i><br /><br /><u><b>Decision Tree Nomenclature</b></u> <br /><br />As much of the terminology within the field of “machine learning” is synonymously applied regardless of model type, it is important to understand the basic descriptive terms in order to familiarize oneself with the contextual aspects of the subject matter.
<br /><br />In generating the initial graphic with the code: <br /><br /><b>rpart.plot(model, type = 3, extra = 101) </b><br /><br />We were presented with the illustration below:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s671/rpart1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="405" data-original-width="671" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s16000/rpart1.png" /></a></div></div><br />The <b>“rpart” </b>package, as it pertains to the model output provided, identifies each aspect of the model in the following manner: <br /><br /><b># Generate model output with the following code # <br /><br />model </b><br /><br /><i>> model <br />n= 100 </i><br /><br /><i>node), split, n, loss, yval, (yprob) <br />* denotes terminal node <br /><br />1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000) <br /> 2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) * <br /> 3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362) <br /> 6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * <br /> 7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) * <br /></i><br />If this identification was provided within a graphical representation of the model, the illustration would resemble the graphic below:<div><br /><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-BcXBnonkkeGlwIxr66l4Uh6MVLqrd_fyxL8wWlagLVodGz8shmbVfmunzkz5xgghs3Jz9aNCtHgOoiOg_U3lPGRE-rOwHIcAEXluhOafyWQeYObhP1ue3DR2gQtnB6fOCnp06OGoHm8d2AtDN-jZwkdbdNYxDy0nY0AGnd1GVn8ZOGbvpFAGmEaB/s681/rootnode2.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="394" data-original-width="681" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-BcXBnonkkeGlwIxr66l4Uh6MVLqrd_fyxL8wWlagLVodGz8shmbVfmunzkz5xgghs3Jz9aNCtHgOoiOg_U3lPGRE-rOwHIcAEXluhOafyWQeYObhP1ue3DR2gQtnB6fOCnp06OGoHm8d2AtDN-jZwkdbdNYxDy0nY0AGnd1GVn8ZOGbvpFAGmEaB/s16000/rootnode2.png" /></a></div><div><br /><div>However, universally, the following graphic is a better representation of what each term is utilized to describe within the context of the field of study. <br /><br /><b># Illustrate the model # <br /><br />rpart.plot(model)</b></div><div><b><br /></b></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWAiyd3qPIE7UnIcrh4ndTUup4ssn6VXCu0jtLx1aQXMzeoQLoBoGz_ZGhHbNaEkXP4H9GV3LMaV8etEMee5a5QkBwRCDAT4Eg3iDi_TUoe0ZtppWzFGQJ0yamcZhVM0TDs-iGuyqf5i2kYh6-twEOK_CJoUmJavN-aGRZ9OgnxxmQYuj2GzbtkbYY/s599/rootnode.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="487" data-original-width="599" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWAiyd3qPIE7UnIcrh4ndTUup4ssn6VXCu0jtLx1aQXMzeoQLoBoGz_ZGhHbNaEkXP4H9GV3LMaV8etEMee5a5QkBwRCDAT4Eg3iDi_TUoe0ZtppWzFGQJ0yamcZhVM0TDs-iGuyqf5i2kYh6-twEOK_CJoUmJavN-aGRZ9OgnxxmQYuj2GzbtkbYY/s16000/rootnode.png" /></a></div><br />The first graphic provides a much more pragmatic representation of the model, a representation which is perfectly in accordance with the manner in which the <b>rpart()</b> function surmises the data. 
The latter graphic, illustrates the technique which is traditionally synonymous with the way in which a model of this type would be represented. <br /><br />Therefore, if an individual were discussing this model with an outside researcher, he would refer to the model as possessing 3 leaves and 2 nodes. The tree being in possession of 1 root is essentially inherent. The term<b> “branches”</b> is the descriptor utilized to describe the black line which connect the various other aspects of the model. However, like the root of the tree, the branches themselves do not warrant mention. In summary, when referring to a tree model, it is a common practice to define it generally by the number of nodes and leaves it possesses.</div><div><br /><b style="text-decoration: underline;">Pruning with prune()</b> <br /><br />There will be instances in which you may wish to simplify a model by removing some of its extraneous nodes. The motivation for accomplishing such can be motivated by either a desire to simplify the model, or, as an attempt to optimize the model’s predictive capacity. <br /><br />We will apply the pruning function to the second example model that we previously created. <br /><br />First, we must find the CP value of the model that we wish to prune. 
This can be achieved through the utilization of the code: <br /><br /><b>printcp(anmodel) </b><br /><br />This presents the following console output: <br /><br /><i>> printcp(anmodel) </i><br /><br /><i>Regression tree: <br />rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, <br /> data = raniris[1:100, ], method = "anova") <br /><br />Variables actually used in tree construction: <br />[1] Petal.Length Petal.Width Sepal.Width <br /><br />Root node error: 61.424/100 = 0.61424 <br /><br />n= 100 <br /><br /> CP nsplit rel error xerror xstd <br />1 0.577210 0 1.00000 1.04319 0.133636 <br />2 0.121873 1 0.42279 0.52552 0.081797 <br />3 0.062122 2 0.30092 0.39343 0.051912 <br />4 0.033928 3 0.23879 0.32049 0.050067 <br />5 0.017834 4 0.20487 0.32167 0.050154 <br />6 0.016141 5 0.18703 0.29403 0.047955 <br />7 0.010925 6 0.17089 0.29242 0.048231 <br />8 0.010000 7 0.15997 0.29256 0.048205 </i><br /><br />Each row in the table represents a candidate tree size, with the first row (1) representing the un-split root. The typical course of action for pruning an <b>“rpart”</b> tree is to first identify the row with the lowest cross-validation error (<b>xerror</b>). Once this row has been identified, we must make note of its corresponding CP score (<b>0.010925</b>). It is this value which will be utilized within our pruning function to modify the model. <br /><br />With the above information ascertained, we can move forward in the pruning process by initiating the following code within the R console. <br /><br /><b>prunedmodel <- prune(anmodel, 0.010925) </b><br /><br />In the case of our example, due to the small CP value, no modifications were made to the original model. However, this is not always the case. I encourage you to experiment with this function as it pertains to your own <b>rpart</b> models; the best way to learn is through repetition.
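The selection rule described above — find the row of the CP table with the smallest cross-validated error, then pass that row's CP value to prune() — is mechanical enough to sketch. Below, the CP and xerror columns are transcribed from the printcp() output shown earlier (Python):

```python
# CP and xerror columns transcribed from the printcp() console output above
cp     = [0.577210, 0.121873, 0.062122, 0.033928, 0.017834, 0.016141, 0.010925, 0.010000]
xerror = [1.04319, 0.52552, 0.39343, 0.32049, 0.32167, 0.29403, 0.29242, 0.29256]

best = min(range(len(xerror)), key=xerror.__getitem__)  # index of the lowest xerror

print(best + 1, cp[best])  # 7 0.010925 -- the CP value passed to prune()
```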
<br /><br /><b><u>Dealing with Missing Values</u></b> <br /><br />Typically when analyzing real world data sets, there will be instances in which certain variable observation values are absent. You should not let this all too common occurrence hinder your model ambitions. Thankfully, within the rpart function, there exists a mechanism for dealing with missing values. However, this mechanism only applies to missing independent variable values; observations which are missing their dependent variable entry should be removed prior to analysis. <br /><br />After testing the functionality of the method with data sets from which I had previously removed portions of data, there appeared to be very little impact on model creation or prediction capacity. The algorithms which animate the package also allow incomplete data sets to be passed through the model to generate predictions. <br /><br />The mechanism which rpart employs is the <b>“surrogate split”</b>, which appeared in the summary outputs earlier in this article: at each node, the package stores substitute splitting rules, based on the other independent variables, which best mimic the primary split. When an observation is missing the value of the primary splitting variable, the best available surrogate rule is used to route it down the tree instead. <br /><br /><b><u>Conclusion</u></b> <br /><br />The basic tree model, as it is discussed within the contents of this article, is often passed over in favor of the random forest model. However, as you will observe in future articles, the basic tree model is not without merit, as due to its singular nature, it is the easier model to explain and conceptually visualize. Both of the latter qualities are extremely valuable as they relate to data presentation and research publication. In the next article we will be discussing <b>“Bagging”</b>. Until then, stay subscribed, Data Heads.
<br /></div></div><br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-39095513535894063572022-09-21T17:44:00.002-04:002022-09-21T17:44:26.502-04:00(Python) Enabling the Nvidia GPU - Tensor Flow and Keras UtilizationThough this website does not typically feature articles related to Tensor Flow or Keras (machine learning libraries), due to reader requests, in this entry, I will illustrate how to enable Nvidia GPU utilization as it pertains to the aforementioned packages. <br /><br />I will be operating under the following assumptions: <br /><br /><div>1. You are in possession of a Windows PC which contains a Nvidia GPU. </div><div><br />2. You are relatively familiar with the Tensor Flow and Keras libraries. </div><div><br /></div><div>3. You are utilizing the Anaconda Python distribution. <br /><br />If the above assumptions are correct, then I have designated 4 steps towards the completion of this process: <br /><br /></div><div>1. Installation of the most recent Nvidia GPU drivers. </div><div><br /></div><div>2. Installation of the<b> “tensorflow-gpu” </b>package library. <br /><br /></div><div>3. Installation of the CUDA toolkit. <br /><br /></div><div>4. Trouble-shooting. <br /><br /><b><u>Installing the most recent Nvidia drivers: </u></b><br /><br />This step is relatively self explanatory. If you are in possession of a computer which contains a Nvidia GPU, you should have the following program located on your hard drive: <b>“GeForce Experience”</b>. Depending on the type of Nvidia GPU which you possess, the name of the program may vary. To locate this program, or a similar program which achieves the same result, search: <b>“Nvidia”</b>, from the desktop start bar. <br /><br />After you have launched the Nvidia desktop interface, you will be asked to create a Nvidia Account. To achieve this, enter the appropriate information within the coinciding menu prompts. 
Once this has been completed, follow the link within the conformation e-mail to finalize the creation of your user account. <br /><br />With your new Nvidia account created, you will possess the ability to access the latest driver updates within the Nvidia console interface. Be sure to update all of the drivers which are listed, prior to proceeding to the next step.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8xgW1vCVy3aoTXh24ZgS702hIhKmwJ6ErHZ0S5LluUrFcd4Gfu98owbzhA0yf3I57KgVfH5maP5SpP2kocP4bXl1QuTdL4G-fuUTz2P95MESp7jA_hQK4MF9EZDrOM_qaFl7Vu5M2RNkd9e2ACo6zQcpofB-jxLMtAZdHuSdln59VLVRyk_2RI8rk/s789/CUDA_0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="216" data-original-width="789" height="110" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8xgW1vCVy3aoTXh24ZgS702hIhKmwJ6ErHZ0S5LluUrFcd4Gfu98owbzhA0yf3I57KgVfH5maP5SpP2kocP4bXl1QuTdL4G-fuUTz2P95MESp7jA_hQK4MF9EZDrOM_qaFl7Vu5M2RNkd9e2ACo6zQcpofB-jxLMtAZdHuSdln59VLVRyk_2RI8rk/w400-h110/CUDA_0.png" width="400" /></a></div><br /><b><u>Installing the TensorFlow GPU Package Library </u></b><br /><br />Completing this simple pre-requisite can be achieved by either: <br /><br />A. Running the following code within the Jupyter Notebook programming environment: <br /><br /><b>import pip <br /><br />pip.main(['install', 'tensorflow-gpu']) <br /></b><br />B. Running the following code within the <b>“Anaconda Prompt”</b>: <br /><br /><b>conda install tensorflow-gpu </b><br /><br /><i>To reach the <b>“Anaconda Prompt”</b> terminal, type “Anaconda Prompt” into the Windows desktop search bar. <br /><br /></i><b><u>Installing the CUDA toolkit </u></b><br /><br />You are now prepared to complete the final pre-requisite, which is the most complicated of all of the required steps. 
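One caveat regarding option A of the package-installation step above: the pip.main() entry point was removed from pip's public interface in pip 10, so that snippet fails on newer distributions. The supported programmatic route is to invoke pip as a module of the running interpreter, sketched below (the helper names are my own):

```python
import subprocess
import sys

def pip_install_cmd(package):
    # Build the argv for "python -m pip install <package>", using the
    # interpreter that is currently running so the correct environment is hit
    return [sys.executable, "-m", "pip", "install", package]

def pip_install(package):
    # Run the install; raises CalledProcessError on failure
    subprocess.run(pip_install_cmd(package), check=True)

# Usage: pip_install("tensorflow-gpu")
print(pip_install_cmd("tensorflow-gpu"))
```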
<br /><br />First, you must click the link below: <br /><br /><a href="https://developer.nvidia.com/cuda-downloads">https://developer.nvidia.com/cuda-downloads</a> <br /><br />The address above will direct you to the Nvidia webpage. <br /><br />Select the appropriate options which pertain to your operating system from the list of selections. Doing such, will present a download link to the version of the CUDA software which is best suited for your PC. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGB5ujA4PGEOY5TOBpdeFk1hfDPrJsOns3vLEQQdBg70lQX4l-OZvIyY9dMiHHwcgah7qeSB3q_Xwx_94rVJ2wlS79gyg5rWOSC1UAAtd3zKnElT0BmbZdRyljbmRCUIuu5lyEGwGAVKW1T0hquKqTj7P7tp3YEictiI-QBoz8mFdBuOBsTYD2nNjy/s937/CUDA_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="937" data-original-width="925" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGB5ujA4PGEOY5TOBpdeFk1hfDPrJsOns3vLEQQdBg70lQX4l-OZvIyY9dMiHHwcgah7qeSB3q_Xwx_94rVJ2wlS79gyg5rWOSC1UAAtd3zKnElT0BmbZdRyljbmRCUIuu5lyEGwGAVKW1T0hquKqTj7P7tp3YEictiI-QBoz8mFdBuOBsTYD2nNjy/w395-h400/CUDA_1.png" width="395" /></a></div><div><br /></div><div>The rest of the installation process is relatively straight-forward. 
<br /><br />The product of the download will produce the file below: </div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg72VQHg21sU_WVeTV3nEoFkln0Q95uoOuFiXNRtJtkvGmBK7IdibQ_hHrjHsn6WCdDUzmkM24IRCZ7Ne8HPpsRumHzHCRhLtz7Qxj9JrjR50VVNov981TZ75-bMX12oou6gs7iOWmOdR7EckoEOptJZsaE083Ewq2s0ampEPfdZob4kbLMJOyOuJZp/s60/CUDA_2A.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="55" data-original-width="60" height="55" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg72VQHg21sU_WVeTV3nEoFkln0Q95uoOuFiXNRtJtkvGmBK7IdibQ_hHrjHsn6WCdDUzmkM24IRCZ7Ne8HPpsRumHzHCRhLtz7Qxj9JrjR50VVNov981TZ75-bMX12oou6gs7iOWmOdR7EckoEOptJZsaE083Ewq2s0ampEPfdZob4kbLMJOyOuJZp/s1600/CUDA_2A.png" width="60" /></a></div><br /><i>(File name will vary based on operating system and version selection) </i><br /><br />Double clicking this file icon will begin the installation process.</div><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_XOLevAEm6TkK3sdK1HX85LfGPrRT2n--xIoQc3cmQAK7M0W1Lgk5nDDR7YsrEb13Wb4et8OqTMqFUQeSdnpwwRSn1_gCSZ9opl3u_3iZdVgDABVkG59RDzf2BsRivLBMrOYLMJce0P6uFPrTL5LaQYGkR6BljHmNYt0eILS-toamvKV-Ed8NJ6YY/s416/CUDA_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="163" data-original-width="416" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_XOLevAEm6TkK3sdK1HX85LfGPrRT2n--xIoQc3cmQAK7M0W1Lgk5nDDR7YsrEb13Wb4et8OqTMqFUQeSdnpwwRSn1_gCSZ9opl3u_3iZdVgDABVkG59RDzf2BsRivLBMrOYLMJce0P6uFPrTL5LaQYGkR6BljHmNYt0eILS-toamvKV-Ed8NJ6YY/w400-h156/CUDA_3.png" width="400" /></a></div></div><div><br /></div>After clicking through the associated options, the following screen should appear:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj1upE9sHury2sfLY1Njzb_ub0TPDkst-dK9j6W6fCYiltFi3ZPRCTtxJiZWPUzig291-TLmmn19ZnzUTsY4dg2_SuMkwl-xNsSF839fVa_lrLcfulqjqWinWiXNCErLBS13Ek3Hu3xsjwC5BDtBeJhyr5aiGTF749_2V5Ig67BKghd6wxGAmhreWh/s594/CUDA_4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="444" data-original-width="594" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj1upE9sHury2sfLY1Njzb_ub0TPDkst-dK9j6W6fCYiltFi3ZPRCTtxJiZWPUzig291-TLmmn19ZnzUTsY4dg2_SuMkwl-xNsSF839fVa_lrLcfulqjqWinWiXNCErLBS13Ek3Hu3xsjwC5BDtBeJhyr5aiGTF749_2V5Ig67BKghd6wxGAmhreWh/w400-h297/CUDA_4.png" width="400" /></a></div><div><br /></div>If you do not have Microsoft Visual Studios installed on your PC, you will be presented with an installation error. However, if you do not intend to utilize the CUDA software for development related to such, you can continue the installation process without further hesitation. <br /><br />Once the process is fully completed, the following shortcut icon should appear on your PC’s desktop:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg62bb1Wbs9wI-fddI5KGt2F-wfVWvhX9CiQkqzqumBHszM7XJ2VQ9FHmwhH5rLOvRcmzArSWWPyZXZUL_qx4XyXJRNX4y7d80iwZVtG6msbv8D-hzxA9X8expNofh7OJ4xrmmphBSXwd6HdKDAQk_EESw0yEnT3gp-eKmaZ60Zf6uV8lVTF-n4ZkLU/s96/CUDA_5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="77" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg62bb1Wbs9wI-fddI5KGt2F-wfVWvhX9CiQkqzqumBHszM7XJ2VQ9FHmwhH5rLOvRcmzArSWWPyZXZUL_qx4XyXJRNX4y7d80iwZVtG6msbv8D-hzxA9X8expNofh7OJ4xrmmphBSXwd6HdKDAQk_EESw0yEnT3gp-eKmaZ60Zf6uV8lVTF-n4ZkLU/s1600/CUDA_5.png" width="77" /></a></div><div><br /></div>I would now advise that you re-start your PC prior to implementing GPU utilization within your machine learning 
projects. <br /><br /><b><u>Troubleshooting </u></b><br /><br />I’ve found that GPU-enabled TensorFlow projects, at least from my experience, tend to be more error-prone from session to session. However, I accept this shortcoming, due to the significant speed increase enabled by GPU utilization. <br /><br />Utilizing the newly installed GPU implementation is relatively simple, as GPU usage is automatically assumed within the model structure. Meaning, the alteration of pre-existing machine learning project code is unnecessary as it pertains to GPU optimization. If you run code which was previously created to utilize the Keras and TensorFlow libraries, then the computer will automatically perform the analysis through the GPU hardware architecture. <br /><br />To ensure that GPU functionality is enabled, you may run the following lines of code within the Anaconda coding platform: <br /><br /><b>from tensorflow.python.client import device_lib <br /><br />from keras import backend as K <br /><br />print(device_lib.list_local_devices()) <br /><br />K.tensorflow_backend._get_available_gpus() <br /></b><br />This should produce output which includes the term: <b>‘GPU’</b>. If this is the case, then GPU utilization has been successfully enabled. (Note: the <b>K.tensorflow_backend</b> call applies to older standalone Keras installations; on TensorFlow 2.x, <b>tf.config.list_physical_devices('GPU')</b> serves the same purpose.) <br /><br />If, for whatever reason, errors occur relating to keras or tensorflow following the installation of the prior programs, try completing any of the following steps to remedy this occurrence. <br /><br />1. Restart the Anaconda platform and Jupyter Notebook. <br /><br />2. Uninstall and re-install both the tensorflow and tensorflow-gpu libraries from the Anaconda Prompt (command line). 
This can be achieved by utilizing the code below: <br /><br /><b>conda uninstall tensorflow <br /><br />conda uninstall tensorflow-gpu <br /><br />conda install tensorflow <br /><br />conda install tensorflow-gpu </b><br /><br />Assuming that these remedies solved the previously present issue (assuming that an issue existed at all), you should be prepared to experience the blazing speed enabled by Nvidia GPU utilization. <br /><br />That’s all for this entry. <br /><br />Stay busy, Data Heads!Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-20378421144402363462021-07-18T11:40:00.010-04:002021-07-18T11:45:57.839-04:00Pivot Tables (MS-Excel)You didn’t honestly believe that I would continue to write articles without mentioning every analyst’s favorite Excel technique, did you? <br /><br /><b><u>Example / Demonstration</u>: </b><br /><br />For this demonstration, we are going to be utilizing the <b>“Removing Duplicate Entries (MS-Excel).csv”</b> data file. This file can be found within the GitHub data repo (data uploaded: July 12, 2018). 
If you are too lazy to navigate over to the repo site, the raw .csv data can be found down below:<br /><br /><i>VARA,VARB,VARC,VARD<br />Mike,1,Red,Spade<br />Mike,2,Blue,Club<br />Mike,1,Red,Spade<br />Troy,2,Green,Diamond<br />Troy,1,Red,Heart<br />Archie,2,Orange,Heart<br />Archie,2,Yellow,Diamond<br />Archie,2,Orange,Heart<br />Archie,1,Red,Spade<br />Archie,1,Blue,Spade<br />Archie,2,Red,Club<br />Archie,2,Red,Club<br />Jack,1,Red,Diamond<br />Jack,2,Blue,Diamond<br />Jack,2,Blue,Diamond<br />Rob,1,Green,Club<br />Rob,2,Orange,Spade<br />Brad,1,Red,Heart<br />Susan,2,Blue,Heart<br />Susan,2,Yellow,Club<br />Susan,1,Pink,Heart<br />Seth,2,Grey,Heart<br />Seth,1,Green,Club<br />Joanna,2,Pink,Club<br />Joanna,1,Green,Spade<br />Joanna,1,Green,Spade<br />Bertha,2,Grey,Diamond<br />Bertha,1,Grey,Diamond<br />Liz,1,Green,Spade</i><br /><br />Let’s get started! <br /><br />First, we’ll take a nice look at the data as it exists within MS-Excel:<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-eXRAaqVNGbI/YPREvq4JfeI/AAAAAAAABXw/eYi5NeUSynk3grAlE_TMbY0lUya_O4dIQCLcBGAsYHQ/s628/Pivot_0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="628" data-original-width="291" height="400" src="https://1.bp.blogspot.com/-eXRAaqVNGbI/YPREvq4JfeI/AAAAAAAABXw/eYi5NeUSynk3grAlE_TMbY0lUya_O4dIQCLcBGAsYHQ/w185-h400/Pivot_0.png" width="185" /></a></div><div><br /></div>Now we’ll pivot to excellence! 
<br /><br />The easiest way to start building pivot tables, is to utilize the <b>“Recommended PivotTables” </b>option button located within the <b>“Insert”</b> menu, listed within Excel’s ribbon menu.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-AeTxI-0CD-c/YPRE8Dfo_QI/AAAAAAAABX0/8w7CB-7bTSswjstGd--7LfVM8NtANYhZACLcBGAsYHQ/s294/Pivot_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="285" data-original-width="294" src="https://1.bp.blogspot.com/-AeTxI-0CD-c/YPRE8Dfo_QI/AAAAAAAABX0/8w7CB-7bTSswjstGd--7LfVM8NtANYhZACLcBGAsYHQ/s0/Pivot_1.png" /></a></div><div><br />This should bring up the menu below:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-hZD0z_FL58s/YPRFHOcIk0I/AAAAAAAABX4/W4FiFpRnt1gjECTIfWngbi25xyqQqvagwCLcBGAsYHQ/s702/Pivot_2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="634" data-original-width="702" height="361" src="https://1.bp.blogspot.com/-hZD0z_FL58s/YPRFHOcIk0I/AAAAAAAABX4/W4FiFpRnt1gjECTIfWngbi25xyqQqvagwCLcBGAsYHQ/w400-h361/Pivot_2.png" width="400" /></a></div><div><br /></div>Go ahead and select all row entries, across all variable columns. <br /><br />Once this has been completed, click <b>“OK”</b>. <br /><br />This should generate the following menu:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-xQ94Y2Jw5QM/YPRFOLMJb7I/AAAAAAAABYA/k38frTPutiM1-OILmfIQdnHqkQjfHN1ngCLcBGAsYHQ/s541/Pivot_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="510" data-original-width="541" height="378" src="https://1.bp.blogspot.com/-xQ94Y2Jw5QM/YPRFOLMJb7I/AAAAAAAABYA/k38frTPutiM1-OILmfIQdnHqkQjfHN1ngCLcBGAsYHQ/w400-h378/Pivot_3.png" width="400" /></a></div><div><br /></div>Let’s break down each recommendation. 
<br /><br /><b>“Sum of VARB by VARD” </b>– This table sums the numerical values contained within <b>VARB</b>, as they correspond with each <b>VARD </b>entry. <br /><br /><b>“Count of VARA by VARD”</b> – This table counts the number of <b>VARA</b> entries which correspond with each category of variable column <b>VARD</b>. <br /><br /><b>“Sum of VARB by VARC”</b> – This table sums the numerical values contained within <b>VARB</b>, as they correspond with each <b>VARC</b> entry.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Ywr03FZOT00/YPRFTW-IjZI/AAAAAAAABYE/m31u0lXpzFctjRZfLUM-4tlJR8_hLZDegCLcBGAsYHQ/s548/Pivot_4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="515" data-original-width="548" height="376" src="https://1.bp.blogspot.com/-Ywr03FZOT00/YPRFTW-IjZI/AAAAAAAABYE/m31u0lXpzFctjRZfLUM-4tlJR8_hLZDegCLcBGAsYHQ/w400-h376/Pivot_4.png" width="400" /></a></div><div><br /></div><b>“Count of VARA by VARC”</b> – This table counts the number of <b>VARA</b> entries which correspond with each category of variable column <b>VARC</b>. <br /><br /><b>“Sum of VARB by VARA” </b>– This table sums the numerical values contained within <b>VARB</b>, as they correspond with each <b>VARA</b> entry. <br /><br />Now, there may come a time in which none of the above options match exactly what you are looking for. 
In this case, you will want to utilize the<b> “PivotTable”</b> option button, located within the<b> “Insert”</b> menu, listed within Excel’s ribbon menu.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-eWEvgpsnfJc/YPRFdDn8LMI/AAAAAAAABYI/Qbr64P9t9j4H-llmJPBXKJ9vOofkhIOhACLcBGAsYHQ/s306/Pivot_5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="306" data-original-width="227" src="https://1.bp.blogspot.com/-eWEvgpsnfJc/YPRFdDn8LMI/AAAAAAAABYI/Qbr64P9t9j4H-llmJPBXKJ9vOofkhIOhACLcBGAsYHQ/s0/Pivot_5.png" /></a></div><div><br />Go ahead and select all row entries, across all variable columns. <br /><br />Change the option button to <b>“New Worksheet”</b>, instead of <b>“Existing Worksheet”</b>. <br /><br />Once this has been completed, click <b>“OK”</b>. <br /><br />You’ll then be graced with a new menu, on a new Excel sheet (same workbook).</div><div><br /></div><div>I won’t go into every single output option that you have available, but I will list a few you may want to try yourself. Each output variation can be created by dragging and dropping the variables listed within the topmost box, in varying order, into the boxes below: </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8uLR8QY1JLk/YPRMpnEXb_I/AAAAAAAABZM/SLxHGfBaFc4Jbf9c6dlSm11KK2A7CeSZgCLcBGAsYHQ/s748/Pivot_7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="748" data-original-width="347" height="400" src="https://1.bp.blogspot.com/-8uLR8QY1JLk/YPRMpnEXb_I/AAAAAAAABZM/SLxHGfBaFc4Jbf9c6dlSm11KK2A7CeSZgCLcBGAsYHQ/w185-h400/Pivot_7.png" width="185" /></a></div><div><br />If <b>VARA</b> and <b>VARC</b> are both added to Rows, you will view the categorical occurrences of variable entries from <b>VARC</b>, with <b>VARA</b> acting as the unique ID. 
<br /><br />Order matters in each pivot table variable designation box. <br /><br />So, if we reverse the position of <b>VARA</b> and <b>VARC</b>, and instead list <b>VARC</b> first, followed by <b>VARA</b>, then we will see a table which lists the categorical occurrences of <b>VARA</b>, with <b>VARC</b> acting as a unique ID. <br /><br />If we include <b>VARA</b> and <b>VARC</b> as rows (in that order), and set the values variable to Sum of <b>VARB</b>, then the output should more closely resemble an accounting sheet, in which the numerical values (<b>VARB</b>) corresponding with each <b>VARA</b> entry, categorized by <b>VARC</b>, are summed. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-iPUl8afBfwc/YPRG5qZjGmI/AAAAAAAABYg/Ztr8XebQVrMopxkSOPZcYn7EpK4XoS2DQCLcBGAsYHQ/s646/Pivot_11.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="646" data-original-width="172" height="400" src="https://1.bp.blogspot.com/-iPUl8afBfwc/YPRG5qZjGmI/AAAAAAAABYg/Ztr8XebQVrMopxkSOPZcYn7EpK4XoS2DQCLcBGAsYHQ/w106-h400/Pivot_11.png" width="106" /></a></div><div style="text-align: center;"><br /></div>If we instead wanted the count, as opposed to the sum, we could click on the drop-down arrow located next to <b>“Count of VARB”</b>, which presents the following options:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-tZBQ7k6ElPw/YPRHQYnX7kI/AAAAAAAABYs/iIcdKvDcYpohWGLSZ_EVQ2aFQ16yqmKcQCLcBGAsYHQ/s401/Pivot_8.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="401" data-original-width="347" height="320" src="https://1.bp.blogspot.com/-tZBQ7k6ElPw/YPRHQYnX7kI/AAAAAAAABYs/iIcdKvDcYpohWGLSZ_EVQ2aFQ16yqmKcQCLcBGAsYHQ/s320/Pivot_8.png" /></a></div><div style="text-align: center;"><br /></div>From the options listed, we will select <b>“Value Field Settings”</b>.<br /><br />This presents the 
following menu, from which we will select <b>“Count”</b>.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-C1T4Rmy-T7E/YPRHo2YmSHI/AAAAAAAABY0/YEI7gBM69dcVsuVSN3LctU8YOrJah3JoACLcBGAsYHQ/s390/Pivot_9.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="333" data-original-width="390" src="https://1.bp.blogspot.com/-C1T4Rmy-T7E/YPRHo2YmSHI/AAAAAAAABY0/YEI7gBM69dcVsuVSN3LctU8YOrJah3JoACLcBGAsYHQ/s320/Pivot_9.png" width="320" /></a></div><div><br /></div>The result of following the previously listed steps is illustrated below:<br /><br /><div style="text-align: center;"><a href="https://1.bp.blogspot.com/-Oufu6b3hi2Y/YPRGra32p2I/AAAAAAAABYc/IoDiZ6ARsq4Z17YqKT6O7PBN5BaKKdS-gCLcBGAsYHQ/s639/Pivot_10.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="639" data-original-width="177" height="400" src="https://1.bp.blogspot.com/-Oufu6b3hi2Y/YPRGra32p2I/AAAAAAAABYc/IoDiZ6ARsq4Z17YqKT6O7PBN5BaKKdS-gCLcBGAsYHQ/w111-h400/Pivot_10.png" width="111" /></a></div><div style="text-align: center;"><br /></div><div style="text-align: left;">The Pivot Table creation menu also allows for further customization through the addition of column variables.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">In the case of our example, we will make the following modifications to our table output:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8gHSCiLKLMk/YPRIRelQgcI/AAAAAAAABY8/8kja4AehEeMqbHaY1S83Y5L-uOlot62igCLcBGAsYHQ/s354/Pivot_x1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="354" data-original-width="345" height="320" src="https://1.bp.blogspot.com/-8gHSCiLKLMk/YPRIRelQgcI/AAAAAAAABY8/8kja4AehEeMqbHaY1S83Y5L-uOlot62igCLcBGAsYHQ/s320/Pivot_x1.png" 
/></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><b>VARC</b> will now be designated as a column variable, <b>VARA</b> will be a row variable, and the count of <b>VARB </b>will be our values variable. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The result of these modifications is shown below:</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-NXkYi3Nfvro/YPRIye-F1lI/AAAAAAAABZE/MK6yzHZ3UTYBmtearq0QWWSpiDM_aotAQCLcBGAsYHQ/s550/Pivot_x2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="292" data-original-width="550" height="211" src="https://1.bp.blogspot.com/-NXkYi3Nfvro/YPRIye-F1lI/AAAAAAAABZE/MK6yzHZ3UTYBmtearq0QWWSpiDM_aotAQCLcBGAsYHQ/w400-h211/Pivot_x2.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Our output now contains a table which displays the count of each occurrence of each color (<b>VARC</b>), as each color corresponds with each individual (<b>VARA</b>) listed within the original data set. 
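For readers who would like to verify the spreadsheet result outside of Excel, the same person-by-color count table can be reproduced in R with the base table() function. This is only an illustrative sketch which re-keys the VARA and VARC columns from the raw .csv shown earlier; it is not part of the Excel workflow itself.

```r
# Re-key the VARA (person) and VARC (color) columns from the example .csv #
VARA <- c("Mike","Mike","Mike","Troy","Troy","Archie","Archie","Archie",
          "Archie","Archie","Archie","Archie","Jack","Jack","Jack","Rob",
          "Rob","Brad","Susan","Susan","Susan","Seth","Seth","Joanna",
          "Joanna","Joanna","Bertha","Bertha","Liz")
VARC <- c("Red","Blue","Red","Green","Red","Orange","Yellow","Orange",
          "Red","Blue","Red","Red","Red","Blue","Blue","Green","Orange",
          "Red","Blue","Yellow","Pink","Grey","Green","Pink","Green",
          "Green","Grey","Grey","Green")

# Cross-tabulate: rows = person, columns = color, cells = occurrence counts #
table(VARA, VARC)
```

The rows of the resulting cross-tabulation correspond to the pivot table's row labels, and each cell count should match the Count of VARB figures produced above.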
</div><br /><div class="separator" style="clear: both; text-align: left;">In conclusion, the pivot table option within MS-Excel offers a variety of output formats which can be utilized to display summary statistics.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The most important skill to develop as it pertains to this feature is the ability to ascertain when a pivot table is necessary for your data project needs.</div><div class="separator" style="clear: both; text-align: left;"><br /></div>So with that, we will end this article.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">I will see you next time, Data Head.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">-RD<br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-90336341986793670742021-07-14T23:35:00.003-04:002021-07-14T23:44:08.013-04:00Getting to Know the GreeksIn today’s article, we are going to go a bit off the beaten path and discuss The Greek Alphabet!<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-i1a5JO7k-NI/YO-mMqhCllI/AAAAAAAABXk/IzXL2FF_F-si6Os2eOtFBNjcoBlFyQ8YgCLcBGAsYHQ/s334/Plato.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="229" data-original-width="334" src="https://1.bp.blogspot.com/-i1a5JO7k-NI/YO-mMqhCllI/AAAAAAAABXk/IzXL2FF_F-si6Os2eOtFBNjcoBlFyQ8YgCLcBGAsYHQ/s320/Plato.png" width="320" /></a></div><div><br />You might be wondering, why the sudden change of subject content…?<br /><br />In order to truly master the craft of data science, you will be required to stretch your mind in creative ways. 
The Greek Alphabet is utilized throughout the fields of statistics, mathematics, finance, computer science, astronomy, and other western intellectual pursuits. For this reason, it really ought to be taught in elementary schools. However, to my knowledge, in most cases, it is not. <br /><br />The Romans borrowed heavily from Greek Civilization, and contemporary western civilization borrowed heavily from the Romans. Therefore, to truly be a person of culture, you should learn the Greek Alphabet, and really, as much as you possibly can about Ancient Greek Culture. This includes the legends, heroes, and philosophers. We might be getting more into this in other articles, but for today, we will be sticking to the alphabet. <br /><br /><b><u>The Greek Alphabet </u></b><br /><br />The best way to learn the Greek alphabet is to be Greek (j/k, but not really). In all other cases, application is probably the best way to commit various Greek letters, as symbols, to memory. <br /><br />I would recommend drawing each letter in order, uppercase, and lowercase, and saying the name of the letter as it is written. <br /><br />Let’s try this together! <br /><br /><b>Α α (Alpha) (Pronounced: AL-FUH)</b> - Utilized in statistics as the symbol which denotes the significance level. In finance, it is the percentage return of an investment above or below a predetermined index. <br /><br /><b>Β β (Beta) (Pronounced: BAY-TUH)</b> - In statistics, this symbol is utilized to represent the probability of a Type II error. In finance, it is utilized to determine asset volatility. <br /><br /><b>Γ γ (Gamma) (Pronounced: GAM-UH)</b> - In physics, this symbol is utilized to represent particle decay (Gamma Decay). There also exists Alpha Decay, and Beta Decay. The type of decay situationally differs depending on the circumstances. <br /><br /><b>Δ δ (Delta) (Pronounced: DEL-TUH)</b> - This is currently the most common strain of the novel coronavirus (7/2021). 
In the field of chemistry, uppercase Delta is utilized to symbolize heat being added to a reaction. <br /><br /><b>Ε ε (Epsilon) (Pronounced: EP-SIL-ON)</b> - “Machine Epsilon” is utilized in computer science as a way of dealing with floating point values and their assessment within logical statements. <br /><br /><b>Ζ ζ (Zeta) (Pronounced: ZAY-TUH)</b> - The most common utilization assignment which I have witnessed for this letter is its designation as the variable which represents the Riemann Zeta Function (number theory). <br /><br /><b>Η η (Eta) (Pronounced: EE-TUH)</b> - I’ve mostly seen this letter designated as the variable for the Dedekind eta function (number theory). <br /><br /><b>Θ θ (Theta) (Pronounced: THAY-TUH)</b> - Theta is utilized as the symbol to represent a pentaquark, a transient subatomic particle. <br /><br /><b>Ι ι (Iota) (Pronounced: EYE-OH-TUH)</b> - I’ve never seen this symbol utilized for anything outside of astronomical designations. Maybe if you make it big in science, you could give Iota the love that it so deserves. <br /><br /><b>Κ κ (Kappa) (Pronounced: CAP-UH) </b>- Kappa is the chosen variable designation for Einstein’s gravitational constant. <br /><br /><b>Λ λ (Lambda) (Pronounced: LAMB-DUH)</b> - A potential emergent novel coronavirus variant (7/2021). Lowercase Lambda is also utilized throughout the Poisson Distribution function. <br /><br /><b>Μ μ (Mu) (Pronounced: MEW)</b> - Lowercase Mu is utilized to symbolize the mean of a population (statistics). In particle physics, it can also be applied to represent the elementary particle: Muon. <br /><br /><b>Ν ν (Nu) (Pronounced: NEW)</b> - As a symbol, this letter represents degrees of freedom (statistics). <br /><br /><b>Ξ ξ (Xi) (Pronounced: SEE) </b>- In mathematics, uppercase Xi can be utilized to represent the Riemann Xi Function. 
<br /><br /><b>Ο ο (Omicron) (Pronounced: OM-IH-CRON)</b> - A symbol which does not get very much love, or use, unlike its subsequent neighbor… <br /><br /><b>Π π (Pi) (Pronounced: PIE) </b>- In mathematics, lowercase Pi often represents the mathematical real transcendental constant ≈ 3.1415…etc. <br /><br /><b>Ρ ρ (Rho) (Pronounced: ROW) </b>- In the Black-Scholes model, Rho represents the rate of change of a portfolio with respect to interest rates. <br /><br /><b>Σ σ (Sigma) (Pronounced: SIG-MA) </b>- Lower case Sigma represents the standard deviation of a population (statistics). Upper case Sigma represents a sum function (mathematics). <br /><br /><b>Τ τ (Tau) (Pronounced: TAW)</b> - Lower case Tau represents an elementary particle within the field of particle physics. <br /><br /><b>Υ υ (Upsilon) (Pronounced: UP-SIL-ON)</b> - Does not really get very much use… <br /><br /><b>Φ φ (Phi) (Pronounced: FAI) </b>- Lowercase Phi is utilized to represent the Golden Ratio. <br /><br /><b>Χ χ (Chi) (Pronounced: KAI) </b>- Lower case Chi is utilized as a variable throughout the Chi-Square distribution function. <br /><br /><b>Ψ ψ (Psi) (Pronounced: PSY)</b> - Lower case Psi is used to represent the (generalized) state of a qubit within a quantum computer.</div><div><br /><b>Ω ω (Omega) (Pronounced: OH-MEG-UH)</b> - Utilized for just about everything.</div><br />Αυτά για τώρα. Θα σε δω την επόμενη φορά! <i>(That’s all for now. I’ll see you next time!)</i><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-76559890037969361732021-06-25T11:30:00.002-04:002021-06-25T11:34:13.663-04:00(R) The Levene's TestIn today’s article we will be discussing a technique which is not especially interesting or pragmatically applicable. Still, for the sake of true data science proficiency, today we will be discussing, <b>THE LEVENE'S TEST! 
</b><br /><br />The Levene's Test is utilized to compare the variances of two separate data sets. <br /><br />So naturally, our hypothesis would be: <br /><br /><b>Null Hypothesis:</b> The variance measurements of the two data sets do not significantly differ. <br /><br /><b>Alternative Hypothesis:</b> The variance measurements of the two data sets do significantly differ. <br /><br /><b><u>The Levene's Test Example</u>:</b><br /><br /><b># The leveneTest() Function is included within the “car” package # <br /><br />library(car) <br /><br />N1 <- c(70, 74, 76, 72, 75, 74, 71, 71) <br /><br />N2 <- c(74, 75, 73, 76, 74, 77, 78, 75) <br /><br />N_LEV <- c(N1, N2) <br /><br />group <- as.factor(c(rep(1, length(N1)), rep(2, length(N2)))) <br /><br />leveneTest(N_LEV, group) <br /><br /># The above code is a modification of code provided by StackExchange user: ocram. # <br /><br /># Source https://stats.stackexchange.com/questions/15722/how-to-use-levene-test-function-in-r # <br /></b><div><br />This produces the output: <br /><br /><i>Levene's Test for Homogeneity of Variance (center = median) <br /> Df F value Pr(>F) <br />group 1 1.7677 0.2049 <br /> 14 <br /></i><br />Since the p-value of the output exceeds .05, we will not reject the null hypothesis (alpha = .05). <br /><br /><b><u>Conclusions</u>: </b><br /><br />The Levene’s Test for Equality of Variances did not indicate a significant differentiation in the variance measurement of Sample N1, as compared to the variance measurement of Sample N2, F(1,14) = 1.78, p = .21. <br /><br />So, what is the overall purpose of this test? Meaning, when would its application be appropriate? The Levene’s Test is typically utilized as a pre-test prior to the application of the standard T-Test. However, it is uncommon to structure a research experiment in this manner. Therefore, the Levene’s Test is more so something which is witnessed within the classroom, and not within the field. 
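To make the classroom workflow concrete, below is a small R sketch (re-using the N1 and N2 samples from above) in which the Levene result decides between the standard and Welch variants of the t-test. The .05 threshold is my own assumed cutoff, not something mandated by the original example.

```r
# The leveneTest() function is included within the "car" package #
library(car)

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)
N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

N_LEV <- c(N1, N2)
group <- as.factor(c(rep(1, length(N1)), rep(2, length(N2))))

# Extract the Levene p-value from the first row of the output table #
lev_p <- leveneTest(N_LEV, group)[1, "Pr(>F)"]

# Equal variances not rejected -> standard t-test; rejected -> Welch's t-test #
t.test(N1, N2, var.equal = (lev_p > .05))
```

With the sample data above, the Levene p-value (.2049, per the output shown earlier) exceeds .05, so the standard t-test is applied.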
<br /><br />Still, if you find yourself in circumstances in which this test is requested, know that it is often required to determine whether a standard T-Test is applicable. If variances are found to be unequal, a Welch’s T-Test is typically preferred as an alternative to the standard T-Test. <br /><br />----------------------------------------------------------------------------------------------------------------------------- <br /><br />I promise that my next article will be more exciting. <br /><br />Until next time. <br /><br />-RD<br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-75809943870570482482021-06-18T10:50:00.004-04:002021-06-18T10:59:03.863-04:00(R) Imputing Missing Data with the MICE() Package<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-bMkdAJMAANI/YMJsQxkfdWI/AAAAAAAABWI/mg8J6fDAYIIYEtWsywiW1hBZllYxKEKfwCLcBGAsYHQ/s220/HouseMouse.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="106" data-original-width="220" src="https://1.bp.blogspot.com/-bMkdAJMAANI/YMJsQxkfdWI/AAAAAAAABWI/mg8J6fDAYIIYEtWsywiW1hBZllYxKEKfwCLcBGAsYHQ/s0/HouseMouse.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>In today’s article we are going to discuss basic utilization of the MICE package. <br /><br />The MICE package (Multivariate Imputation by Chained Equations) assists with performing analysis on shoddily assembled data frames. <br /><br />In the world of data science, the real world, not the YouTube world or the classroom world, data often comes down in a less than optimal state. More often than not, this is the reality of the matter. 
<br /><br />Now, it would be easy to throw up your hands and say, “I CAN’T PERFORM ANY SORT OF ANALYSIS WITH ALL OF THESE MISSING VARIABLES”, <br /><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br /><b>~OR~</b><br /></p><div class="separator" style="clear: both;"><div style="text-align: center;"><a href="https://1.bp.blogspot.com/-PGQflZAapDU/YMJsZnHOxKI/AAAAAAAABWM/vbf2_M6z4IQmJGuweB_s1260QLExHRcXwCLcBGAsYHQ/s464/DeleteAll.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="236" data-original-width="464" src="https://1.bp.blogspot.com/-PGQflZAapDU/YMJsZnHOxKI/AAAAAAAABWM/vbf2_M6z4IQmJGuweB_s1260QLExHRcXwCLcBGAsYHQ/s320/DeleteAll.png" width="320" /></a></div><div style="text-align: center;"><i style="text-align: left;">(Don’t succumb to temptation!) </i></div></div><br />Unfortunately, for you, the data scientist, whoever passed you this data expects a product and not your excuses. <br /><br />Fortunately, for all of us, there is a way forward. 
<br /><br /><b><u>Example</u></b>:<br /><br />Let’s say that you were given this small data set for analysis:<div><br /><div style="text-align: left;"> <a href="https://1.bp.blogspot.com/-5jA3b1zAxO4/YMJskB8oL5I/AAAAAAAABWU/9naKWUrXs6Q_c99qYTR_ZxnRKaxapDwCACLcBGAsYHQ/s453/DataFrameB.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="453" data-original-width="359" height="320" src="https://1.bp.blogspot.com/-5jA3b1zAxO4/YMJskB8oL5I/AAAAAAAABWU/9naKWUrXs6Q_c99qYTR_ZxnRKaxapDwCACLcBGAsYHQ/s320/DataFrameB.png" /></a></div><br />The data is provided in a .xls format, because why wouldn’t it be?<br /><br />For the sake of not having you download an example data file, I have re-coded this data into the R format.<br /><br /><b># Create Data Frame: "SheetB" #<br /><br />VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA , 1, NA, 0, 0, 0, 0)<br /><br />VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)<br /><br />VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)<br /><br />VarD <- c(70, 80, NA, 87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)<br /><br />VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)</b><br /><b><br />SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)</b><br /><br />If you would like to see a version of the initial example file without the missing values, the code to create this data frame is below:<br /><br /><b># Create Data Frame: "SheetA" #<br /><br />VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)<br /><br />VarB <- c(20, 16, 20, 4, 8, 17, 13, 6, 2, 18, 12, 17, 13, 9, 14, 18, 6, 13, 5, 2)<br /><br />VarC <- c(2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 4, 4, 1, 4, 3, 1, 2, 3, 1)<br /><br />VarD <- c(70, 80, 90, 87, 79, 60, 61, 75, 92, 67, 62, 93, 74, 80, 91, 51, 64, 33, 77, 50)<br /><br />VarE <- c(980, 800, 983, 925, 821, 978, 881, 912, 987, 889, 870, 
918, 923, 833, 839, 919, 905, 859, 819, 966)<br /><br />SheetA <- data.frame(VarA, VarB, VarC, VarD, VarE)</b><br /><br />In our example, we’ll assume that the sheet which contains all values is unavailable to you (<b>“SheetA”</b>). Therefore, to perform any sort of meaningful analysis, you will need to either delete all observations which contain missing data variables (DON’T DO IT!), or run an imputation function.<br /><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-1BtIb3hYvqg/YMJs4YQ_BQI/AAAAAAAABWg/3SwZx3zKMyM2sYJgRkt6VpZP49Odsa1SQCLcBGAsYHQ/s250/LogMouse.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="139" data-original-width="250" src="https://1.bp.blogspot.com/-1BtIb3hYvqg/YMJs4YQ_BQI/AAAAAAAABWg/3SwZx3zKMyM2sYJgRkt6VpZP49Odsa1SQCLcBGAsYHQ/s0/LogMouse.png" /></a></div><br /><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;">We will opt to do the latter, and the function which we will utilize is the <b>mice() </b>function. <br /><br />First, we will initialize the appropriate library: <br /><b><br /># Initialize Library # <br /><br />library(mice) <br /></b><br />Next, we will perform the imputation function contained within the library. <br /><br /><b># Perform Imputation # </b><br /><br /><b>SheetB_Imputed <- mice(SheetB, m=1, maxit = 50, method = 'pmm', seed = 500)</b><br /><br /><b>SheetB</b>: The data frame which is being called by the function. <br /><br /><b>m = 1</b>: This is the number of data frame imputation variations which will be generated as a result of the mice function. One is all that is necessary. <br /><br /><b>maxit</b>: The maximum number of iterations which will occur as the mice function calculates what it determines to be the optimal value of each missing variable cell. <br /><br /><b>method</b>: The imputation method which will be utilized; 'pmm' stands for predictive mean matching. 
<br /><br /><b>seed</b>: The mice() function partially relies on randomness to generate missing variable values. The seed value can be whatever value you determine to be appropriate. <br /><br />After performing the above function, you should be greeted with the output below: <br /><i><br />iter imp variable <br /> 1 1 VarA VarB VarC VarD VarE <br /> 2 1 VarA VarB VarC VarD VarE <br /> 3 1 VarA VarB VarC VarD VarE <br /> 4 1 VarA VarB VarC VarD VarE <br /> 5 1 VarA VarB VarC VarD VarE <br /> 6 1 VarA VarB VarC VarD VarE<br /> 7 1 VarA VarB VarC VarD VarE <br /> 8 1 VarA VarB VarC VarD VarE <br /> 9 1 VarA VarB VarC VarD VarE <br /> 10 1 VarA VarB VarC VarD VarE <br /> 11 1 VarA VarB VarC VarD VarE <br /> 12 1 VarA VarB VarC VarD VarE <br /> 13 1 VarA VarB VarC VarD VarE <br /> 14 1 VarA VarB VarC VarD VarE <br /> 15 1 VarA VarB VarC VarD VarE <br /> 16 1 VarA VarB VarC VarD VarE <br /> 17 1 VarA VarB VarC VarD VarE <br /> 18 1 VarA VarB VarC VarD VarE <br /> 19 1 VarA VarB VarC VarD VarE <br /> 20 1 VarA VarB VarC VarD VarE <br /> 21 1 VarA VarB VarC VarD VarE <br /> 22 1 VarA VarB VarC VarD VarE <br /> 23 1 VarA VarB VarC VarD VarE <br /> 24 1 VarA VarB VarC VarD VarE <br /> 25 1 VarA VarB VarC VarD VarE <br /> 26 1 VarA VarB VarC VarD VarE <br /> 27 1 VarA VarB VarC VarD VarE <br /> 28 1 VarA VarB VarC VarD VarE <br /> 29 1 VarA VarB VarC VarD VarE <br /> 30 1 VarA VarB VarC VarD VarE <br /> 31 1 VarA VarB VarC VarD VarE <br /> 32 1 VarA VarB VarC VarD VarE <br /> 33 1 VarA VarB VarC VarD VarE <br /> 34 1 VarA VarB VarC VarD VarE <br /> 35 1 VarA VarB VarC VarD VarE <br /> 36 1 VarA VarB VarC VarD VarE <br /> 37 1 VarA VarB VarC VarD VarE <br /> 38 1 VarA VarB VarC VarD VarE <br /> 39 1 VarA VarB VarC VarD VarE <br /> 40 1 VarA VarB VarC VarD VarE <br /> 41 1 VarA VarB VarC VarD VarE <br /> 42 1 VarA VarB VarC VarD VarE <br /> 43 1 VarA VarB VarC VarD VarE <br /> 44 1 VarA VarB VarC VarD VarE <br /> 45 1 VarA VarB VarC VarD VarE <br /> 46 1 VarA VarB VarC VarD 
VarE <br /> 47 1 VarA VarB VarC VarD VarE <br /> 48 1 VarA VarB VarC VarD VarE <br /> 49 1 VarA VarB VarC VarD VarE <br /> 50 1 VarA VarB VarC VarD VarE <br /></i></p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br />The output indicates that 50 iterations were performed to generate a single imputed data set.<br /><br />The code below returns the original values alongside the newly estimated values, which now occupy the variable cells that were previously blank. <br /><br /><b># Assign Original Values with Imputations to Data Frame # <br /><br />SheetB_Imputed_Complete <- complete(SheetB_Imputed) </b><br /><br />The outcome should resemble something like:<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-D0hsed6_Z1g/YMJtDdwHW2I/AAAAAAAABWk/udrP9oEtcfYVCg_rjTrzgf6tUa2qmVvDgCLcBGAsYHQ/s486/DataFrameBImputations.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="486" data-original-width="332" height="320" src="https://1.bp.blogspot.com/-D0hsed6_Z1g/YMJtDdwHW2I/AAAAAAAABWk/udrP9oEtcfYVCg_rjTrzgf6tUa2qmVvDgCLcBGAsYHQ/s320/DataFrameBImputations.png" /></a></div><div style="text-align: center;"><i>(Beautiful!)</i></div><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br />A quick warning, the <b>mice()</b> function cannot be utilized on data frames which contain unencoded categorical variable entries. 
<br /><br />An example of this:<br /><br /> <a href="https://1.bp.blogspot.com/-y6WUjCmuJ1k/YMJtMnYiXuI/AAAAAAAABWo/U9bLPf02Mnk8-EXyCjpkfE4E8LKYIjungCLcBGAsYHQ/s453/DataFrameC.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="453" data-original-width="360" height="320" src="https://1.bp.blogspot.com/-y6WUjCmuJ1k/YMJtMnYiXuI/AAAAAAAABWo/U9bLPf02Mnk8-EXyCjpkfE4E8LKYIjungCLcBGAsYHQ/s320/DataFrameC.png" /></a><br /><br />To get <b>mice() </b>to work correctly on this data set, you must recode "<b>VARC"</b> prior to proceeding. You could do this by changing each instance of "<b>Spade"</b> to 1, "<b>Club"</b> to 2, <b>“Diamond" </b>to 3, and "<b>Heart" </b>to 4. <br /><br />For more information as it relates to this function, please check out this <a href="https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html" target="_blank">link</a>. <br /><br />That’s all for now, internet.<br /><br />-RD<br /></p></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-66659182036579343022021-06-12T10:48:00.004-04:002021-06-12T10:50:47.930-04:00(R) 2-Sample Test for Equality of ProportionsIn today’s article we are going to revisit, in greater detail, a topic which was reviewed in a prior article. <br /><br />What the 2-Sample Test for Equality of Proportions seeks to achieve, is an assessment of whether one survey group’s response significantly differs from another’s. <br /><br />To illustrate the application of this methodology, I will utilize a prior example which was previously published to this site (10/15/2017). <br /><br /><b><u>Example</u>: </b><br /><br /><i>A pollster took a survey of 1300 individuals, the results of such indicated that 600 were in favor of candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor of candidate A. 
At a 10% significance level, is there evidence that the candidate's popularity has decreased? </i><br /><br /><b># Model Hypothesis # <br /><br /># H0: p1 - p2 = 0 #</b><div><b><br /> # (The proportions are the same) # </b><div><b><br /># Ha: p1 - p2 > 0 # <br /><br /># (Proportion 1 is greater than proportion 2) # <br /><br /># Disable Scientific Notation in R Output #<br /> <br /> options(scipen = 999) <br /><br /># Model Application # <br /><br />prop.test(x = c(600,500), n=c(1300,1500), conf.level = .95, correct = FALSE) </b><br /><br />Which produces the output: <br /><br /><i>2-sample test for equality of proportions without continuity correction <br /><br />data: c(600, 500) out of c(1300, 1500) <br />X-squared = 47.991, df = 1, p-value = 0.000000000004281 <br />alternative hypothesis: two.sided <br />95 percent confidence interval: <br /> 0.09210145 0.16430881 <br />sample estimates: <br /> prop 1 prop 2 <br />0.4615385 0.3333333 <br /><br /></i>We are now prepared to state the details of our model’s application, and the subsequent findings and analysis which occurred as a result of such. <br /><br /><b><u>Conclusions</u>: </b><br /><br />A 2-Sample Test for Equality of Proportions without Continuity Correction was performed to analyze whether the poll survey results for Candidate A significantly differed from subsequent poll survey results gathered weeks later. A 90% confidence interval was assumed for significance. 
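As a sanity check on the output above, the X-squared statistic can be reproduced by hand from the pooled proportion. The sketch below uses base R only; the variable names are my own:

```r
# Two-proportion z-test computed manually; z^2 equals prop.test's X-squared
x1 <- 600; n1 <- 1300   # first poll
x2 <- 500; n2 <- 1500   # second poll

p1 <- x1 / n1
p2 <- x2 / n2
p_pooled <- (x1 + x2) / (n1 + n2)   # 1100 / 2800

se <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z  <- (p1 - p2) / se

z^2   # approximately 47.991, matching the X-squared value above
```

Squaring the pooled z-statistic recovers the chi-square statistic exactly, which is why prop.test() reports X-squared rather than z.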
<br /><br />There was a significant difference in Candidate A’s favorability score from the initial poll findings: 46% (600/1300), as compared to Candidate A’s favorability score from the subsequent poll findings: 33% (500/1500); χ2 (1, N = 2800) = 47.99, p < .001.<br /></div></div><div><br /></div><div>-----------------------------------------------------------------------------------------------------------------------------</div><div><br /></div><div>That's all for now.</div><div><br /></div><div>I'll see you next time, Data Heads.</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-51364545410228856962021-06-05T21:10:00.003-04:002021-06-05T21:20:44.148-04:00(R) Pearson’s Chi-Square Test Residuals and Post Hoc AnalysisIn today’s article, we are going to discuss Pearson Residuals. A Pearson Residual is a product of post hoc analysis. These values can be utilized to further assess Pearson’s Chi-Square Test results. <br /><br />If you are unfamiliar with The Pearson’s Chi-Square Test, or what post hoc analysis typically entails, I would encourage you to do further research prior to proceeding. 
<br /><br /><b><u>Example</u>:</b><br /><br />To demonstrate this post hoc technique, we will utilize a prior article’s example: <br /><br /><b>The "Smoking : Obesity" Pearson’s Chi-Squared Test Demonstration.</b><br /><br /><b># To test for independence # <br /><br /> Model <- matrix(c(5, 1, 2, 2), <br /><br /> nrow = 2,<br /> <br /> dimnames = list("Smoker" = c("Yes", "No"),<br /> <br /> "Obese" = c("Yes", "No")))<br /> <br /> # To run the chi-square test #<br /> <br /> # 'correct = FALSE' disables the Yates’ continuity correction #<br /> <br /> chisq.test(Model, correct = FALSE)</b><br /><br />This produces the output:<br /> <br /><i> Pearson's Chi-squared test<br /> <br /> data: Model<br /> X-squared = 1.2698, df = 1, p-value = 0.2598</i><div><br />From the output provided, we can easily conclude that our results were not significant. <br /><br />However, let’s delve a bit deeper into our findings. <br /><br />First, let’s take a look at the matrix of the model. <br /><br /><b>Model</b><br /><br /><i> Obese <br />Smoker Yes No <br /> Yes 5 2 <br /> No 1 2 </i><br /><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br />Now, let’s take a look at the expected model values. <br /><br /><b>chi.result <- chisq.test(Model, correct = FALSE) <br /><br />chi.result$expected </b><br /><br /><i> Obese <br />Smoker Yes No <br /> Yes 4.2 2.8 <br /> No 1.8 1.2 </i><br /></p><br />What does this mean? <br /><br />The values above represent the counts which we would expect to observe if the two categorical variables measured were perfectly independent of one another. 
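These expected counts can also be derived directly from the table margins: each expected cell is (row total × column total) / grand total. A quick base-R sketch:

```r
# Expected cell counts under independence: (row total * column total) / n
Model <- matrix(c(5, 1, 2, 2),
                nrow = 2,
                dimnames = list("Smoker" = c("Yes", "No"),
                                "Obese" = c("Yes", "No")))

expected <- outer(rowSums(Model), colSums(Model)) / sum(Model)
expected   # 4.2 and 2.8 in the first row; 1.8 and 1.2 in the second
```

The result matches chi.result$expected cell for cell.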
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-o6LCL1Kvv3U/YLwbgpj0itI/AAAAAAAABV4/_3OJmgIONfsqkYtE2mLoLxGwt3AGPyqUwCLcBGAsYHQ/s288/Karl_Pearson.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="219" src="https://1.bp.blogspot.com/-o6LCL1Kvv3U/YLwbgpj0itI/AAAAAAAABV4/_3OJmgIONfsqkYtE2mLoLxGwt3AGPyqUwCLcBGAsYHQ/s0/Karl_Pearson.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><i>(Karl Pearson)</i></div><br />From the previously derived values, we can derived the Pearson Residual Values.<br /><div><br /><b>print(chi.result$residuals) </b><br /><br /><i> Obese <br />Smoker Yes No <br /> Yes 0.3903600 -0.4780914 <br /> No -0.5962848 0.7302967 <br /></i><br />What we are specifically looking for, as it pertains to the residual output, are values which are greater than +2, or less than -2. If these findings were present in any of the above matrix entries, it would indicate that the model was inappropriately applied given the circumstances of the collected observational data. <br /><br />The matrix values themselves, in the residual matrix, are the observed categorical values minus the expected values, divided by the square root of the expected values. 
</div><div><br />Thus: <b>Standard Residual = (Observed Values – Expected Value) / Square Root of Expected Value</b><br /><br /><b><u>Observed Values </u></b><br /><br /> Obese <br />Smoker Yes No <br /> Yes 5 2 <br /> No 1 2 <br /><br /><b><u>Expected Values </u></b><br /><br /> Obese <br />Smoker Yes No <br /> Yes 4.2 2.8 <br /> No 1.8 1.2</div><br />(5 – 4.2) / √ 4.2 = 0.3903600 <div><br />(1 – 1.8) / √ 1.8 = -0.5962848 <br /><br /></div><div>(2 – 2.8) / √ 2.8 = -0.4780914 <br /><br /></div><div>(2 – 1.2) / √ 1.2 = 0.7302967 <br /><br /><b><u>~ OR ~ </u></b><br /><br /><b>(5 - 4.2) / sqrt(4.2) <br /><br />(1 - 1.8) / sqrt(1.8) <br /><br />(2 - 2.8) / sqrt(2.8) <br /><br />(2 - 1.2) / sqrt(1.2) </b><br /><br /><i>[1] 0.39036 <br />[1] -0.5962848 <br />[1] -0.4780914 <br />[1] 0.7302967</i></div><br />The Pearson Residual Values (0.39036, etc.) are the raw residuals re-expressed on an approximate standard deviation scale. It is for this reason, that any value greater than +2, or less than -2, would indicate a misapplication of the model. Or, at the very least, indicate that more observational values ought to be collected prior to the model being applied again.<br /><br /><b><u>The Fisher’s Exact Test as a Post Hoc Analysis for The Pearson's Chi-Square Test </u></b><br /><br />Let’s take our example one step further by applying The Fisher’s Exact Test as a method of post hoc analysis. <br /><br />Why would we do this? <br /><br />Assuming that our Chi-Square Test findings were significant, we may want to consider a Fisher’s Exact Test as a method to further prove evidence of significance. <br /><br />A Fisher’s Exact Test is more conservative in application than the Chi-Square Test. For this reason, the Fisher’s Exact Test will generally yield a higher p-value than its Chi-Square counterpart. 
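Under the hood, the "exact" in Fisher's Exact Test refers to the hypergeometric distribution: with all table margins held fixed, the two-sided p-value is the summed probability of every table at least as extreme as the one observed. A base-R sketch for our matrix (margins: 7 smokers, 3 non-smokers, 6 obese, 4 non-obese):

```r
# Two-sided Fisher p-value from first principles via dhyper().
# x = possible counts for the Smoker-Yes / Obese-Yes cell (3 through 6 here)
x_vals <- 3:6
probs  <- dhyper(x_vals, m = 6, n = 4, k = 7)  # m = obese, n = non-obese, k = smokers
p_obs  <- dhyper(5, m = 6, n = 4, k = 7)       # probability of the observed table

# Sum the probabilities of all tables no more likely than the observed one
p_two_sided <- sum(probs[probs <= p_obs + 1e-7])
p_two_sided   # 0.5
```

The small tolerance added to p_obs guards against floating-point ties, mirroring how fisher.test() handles the comparison.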
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-fk-35lHxDt0/YLwdtI0MRMI/AAAAAAAABWA/BBmgv1hHmqAqwHKuAjHJKK0mCpFWWdamwCLcBGAsYHQ/s378/R_Fisher.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="378" data-original-width="277" height="320" src="https://1.bp.blogspot.com/-fk-35lHxDt0/YLwdtI0MRMI/AAAAAAAABWA/BBmgv1hHmqAqwHKuAjHJKK0mCpFWWdamwCLcBGAsYHQ/s320/R_Fisher.png" /></a></div><div style="text-align: center;"><i>(Sir Ronald Fisher</i><i>)</i></div><br /><b>fisher.result <- fisher.test(Model) <br /><br />print(fisher.result$p.value) </b><br /><br /><i>[1] 0.5 <br /></i><br /><div><Yikes!></div><div><br /><u><b>Conclusions </b></u><br /><br />Now that we have considered our analysis every which way, we can state our findings in APA Format. <br /><br />This would resemble the following: <br /><br />A chi-square test of proportions was performed to examine the relation of smoking and obesity. The relation between these variables was not found to be significant χ2 (1, N = 10) = 1.27, p > .05. </div><div><p style="font-stretch: normal; line-height: normal; margin: 0px; min-height: 12px;"><br />In investigating the Pearson Residuals produced from the model application, no value was found to be greater than +2, or less than -2. These findings indicate that the model was appropriate given the circumstances of the experimental data. <br /><br />In order to further confirm our experimental findings, a Fisher’s Exact Test was also performed for post hoc analysis. The results of such indicated a<b> non-significant</b> relationship as it pertains to obesity as determined by individual smoker status: 71% (5/7), compared to individual non-smoker status: 33% (1/3); (p > .05). 
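The residual arithmetic walked through above can also be verified in a single vectorized step, a sketch reusing the Model matrix from earlier:

```r
# Pearson residuals: (observed - expected) / sqrt(expected), all cells at once
Model <- matrix(c(5, 1, 2, 2),
                nrow = 2,
                dimnames = list("Smoker" = c("Yes", "No"),
                                "Obese" = c("Yes", "No")))

chi <- suppressWarnings(chisq.test(Model, correct = FALSE))
manual <- (Model - chi$expected) / sqrt(chi$expected)

all.equal(as.vector(manual), as.vector(chi$residuals))   # TRUE
```

The chisq.test() return value already carries these residuals in its residuals component, so the hand computation is only needed for illustration.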
</p><div><br /></div><div>-----------------------------------------------------------------------------------------------------------------------------</div><br />I hope that you found all of this helpful and entertaining. <br /><br />Until next time, <br /><br />-RD<div><div><div><div><div><br /></div></div></div></div></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-23976497720373503432020-11-09T20:54:00.010-05:002020-11-09T21:00:08.517-05:00(R) Cohen’s d In today’s entry, we are going to discuss Cohen’s d, what it is, and when to utilize it. We will also discuss how to appropriately apply the methodology needed to derive this value, through the utilization of the R software package. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uDmaBoW3_vw/X6nvSwdS0jI/AAAAAAAABUA/sehzljcss1sMej3PEz0hFrBDyT0VwTvwwCLcBGAsYHQ/s572/Drake.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="513" data-original-width="572" height="358" src="https://1.bp.blogspot.com/-uDmaBoW3_vw/X6nvSwdS0jI/AAAAAAAABUA/sehzljcss1sMej3PEz0hFrBDyT0VwTvwwCLcBGAsYHQ/w400-h358/Drake.png" width="400" /></a></div><br /><div style="text-align: center;"><i>(SPSS does not contain the innate functionality necessary to perform this calculation)</i></div><br /><b><u>Cohen’s d - (What it is)</u>:</b><br /><br />Cohen’s d is utilized as a method to assess the magnitude of impact as it relates to two sample groups which are subject to differing conditions. For example, if a two sample t-test was being implemented to test a single group which received a drug, against another group which did not receive the drug, then the p-value of this test would determine whether or not the findings were significant.<br /><br /><i>Cohen’s d would measure the magnitude of the potential impact</i><i>. 
</i><br /><br /><b><u>Cohen’s d - (When to use it)</u>: </b><br /><br />In your statistics class. <br /><br />You could also utilize this test to perform post-hoc analysis as it relates to the ANOVA model and the Student’s T-Test. However, I have never witnessed the utilization of this test outside of an academic setting. <br /><br /><b><u>Cohen’s d – (How to interpret it)</u>: </b><br /><br />General Interpretation Guidelines: <br /><br />Greater than or equal to 0.2 = small <br />Greater than or equal to 0.5 = medium <br />Greater than or equal to 0.8 = large <br /><br /><b><u>Cohen’s d – (How to state your findings)</u>: </b><br /><br />The effect size for this analysis (d = x.xx) was found to exceed Cohen’s convention for a [small, medium, large] effect (d = .xx). <br /><br /><b><u>Cohen’s d – (How to derive it)</u>:</b><br /><br /><b># Within the R-Programming Code Space # <br /><br />################################## <br /><br /># length of sample 1 (x) # <br />lenx <- <br /># length of sample 2 (y) # <br />leny <- <br /># mean of sample 1 (x) # <br />meanx <- <br /># mean of sample 2 (y)# <br />meany <- <br /># SD of sample 1 (x) # <br />sdx <- <br /># SD of sample 2 (y) # <br />sdy <- <br /><br />varx <- sdx^2 <br />vary <- sdy^2 <br />lx <- lenx - 1 <br />ly <- leny - 1 <br />md <- abs(meanx - meany) ## mean difference (numerator) <br />csd <- lx * varx + ly * vary <br />csd <- csd/(lx + ly) <br />csd <- sqrt(csd) ## common sd computation <br />cd <- md/csd ## cohen's d <br /><br />cd <br /><br />################################## </b><br /><br /><b># The above code is a modified version of the code found at: # <br /><br /># https://stackoverflow.com/questions/15436702/estimate-cohens-d-for-effect-size #</b><br /><br /><b><u>Cohen’s d – (Example)</u></b>: <br /><br /><b>FIRST WE MUST RUN A TEST IN WHICH COHEN’S d CAN BE APPLIED AS AN APPROPRIATE POST-HOC TEST METHODOLOGY.</b><div><b> <br />Two Sample T-Test</b><br /> <br /> This test is utilized if you 
randomly sample different sets of items from two separate control groups. <br /><br /><b> Example:</b><br /> <br /> A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:</div><div><br /> 70, 74, 76, 72, 75, 74, 71, 71<br /> <br /> He then measures the temperature in samples to which the chemical was not applied.<br /> <br /> 74, 75, 73, 76, 74, 77, 78, 75<br /> <br /> Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?<br /> <br /> For this, we will use the code:<div><br /><b> N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)<br /> <br /> N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)<br /> <br /> t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)</b><br /> <br /> Which produces the output:</div><div><br /><i>Two Sample t-test <br /><br />data: N2 and N1 <br />t = 2.4558, df = 14, p-value = 0.02773 <br />alternative hypothesis: true difference in means is not equal to 0 <br />95 percent confidence interval: <br /> 0.3007929 4.4492071 <br />sample estimates: <br />mean of x mean of y <br /> 75.250 72.875 </i><br /><br /><b> # Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #</b><br /> <br /><b> # An additional option is available when running a two sample t-test, The Welch Two Sample T-Test. To utilize this option while performing a t-test, the "var.equal = TRUE" must be changed to "var.equal = FALSE". The output produced from a Welch Two Sample t-test is slightly more robust and accounts for differing sample sizes. 
#<br /></b> <br /> From this output we can conclude:<br /> <br /> With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.</div></div><br /><b><u>Application of Cohen’s d</u></b> <br /><br /><b>length(N1) # 8 # <br />length(N2) # 8 # <br /><br />mean(N1) # 72.875 # <br />mean(N2) # 75.25 # <br /><br />sd(N1) # 2.167124 # <br />sd(N2) # 1.669046 # <br /><br /># length of sample 1 (x) # <br />lenx <- 8 <br /># length of sample 2 (y) # <br />leny <- 8 <br /># mean of sample 1 (x) # <br />meanx <- 72.875 <br /># mean of sample 2 (y)# <br />meany <- 75.25 <br /># SD of sample 1 (x) # <br />sdx <- 2.167124 <br /># SD of sample 2 (y) # <br />sdy <- 1.669046 <br /><br />varx <- sdx^2 <br />vary <- sdy^2 <br />lx <- lenx - 1 <br />ly <- leny - 1 <br />md <- abs(meanx - meany) ## mean difference (numerator) <br />csd <- lx * varx + ly * vary <br />csd <- csd/(lx + ly) <br />csd <- sqrt(csd) ## common sd computation <br />cd <- md/csd ## cohen's d <br /><br />cd </b><br /><br />Which produces the output: <br /><br /><i>[1] 1.227908</i><div><i><br /></i></div><div><b>################################## </b><i><br /></i><br />From this output we can conclude: <br /><br />The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80). <br /><br /><b>Combining both conclusions, our final written product would resemble: </b><br /><br /> With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water. <br /><br />The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80). <br /><br /></div><div>And that is it for this article. 
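One final aside: the step-by-step arithmetic above can be wrapped into a reusable helper. The function below is a sketch of my own (the name "cohens_d" is not from any package); it reproduces the pooled-standard-deviation computation exactly:

```r
# Cohen's d via the pooled standard deviation, as computed step-by-step above
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Weight each sample's variance by its degrees of freedom, then pool
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  abs(mean(x) - mean(y)) / sqrt(pooled_var)
}

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)
N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

cohens_d(N2, N1)   # approximately 1.227908
```

Because var() is computed from the raw vectors rather than rounded standard deviations, this version avoids the small rounding introduced by transcribing sd values by hand.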
<br /><br />Until next time, <br /><br />-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-25937816034542672552020-10-16T23:22:00.007-04:002020-10-16T23:27:19.962-04:00(R) Fisher’s Exact Test In today’s entry, we are going to briefly review <b>Fisher’s Exact Test</b>, and its appropriate application within the R programming language. <br /><br />Like the F-Test and the F-Distribution (pictured below), Fisher’s Exact Test was devised by Sir Ronald Fisher. Despite the shared lineage, the exact test does not rely on the F-Distribution; its p-value is computed directly from the hypergeometric distribution. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-wANIfKQjV8I/X4pgr0H_UqI/AAAAAAAABTg/u_3HU2q8t9Itu_4OQ6RS0jVjqvtTkqqtwCLcBGAsYHQ/s308/Fisher1016.JPG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="308" data-original-width="220" src="https://1.bp.blogspot.com/-wANIfKQjV8I/X4pgr0H_UqI/AAAAAAAABTg/u_3HU2q8t9Itu_4OQ6RS0jVjqvtTkqqtwCLcBGAsYHQ/s0/Fisher1016.JPG" /></a></div><div style="text-align: center;"><i>(The Man)</i></div><div style="text-align: center;"><i><br /></i></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-AeP7Tv1jxoE/X4pg71qIRoI/AAAAAAAABTo/R-xTULHNAFA2scHBJwvStW2XWsDj9vySwCLcBGAsYHQ/s1024/Fisher1016.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="300" src="https://1.bp.blogspot.com/-AeP7Tv1jxoE/X4pg71qIRoI/AAAAAAAABTo/R-xTULHNAFA2scHBJwvStW2XWsDj9vySwCLcBGAsYHQ/w400-h300/Fisher1016.png" width="400" /></a></div><div style="text-align: center;"><i>(The Distribution)</i></div><div style="text-align: center;"><br /></div>The Fisher’s Exact Test is very similar to The Chi-Squared Test. Both tests are utilized to assess categorical data classifications. 
The Fisher’s Exact Test was designed specifically for 2x2 contingency sorted data, though, more rows could theoretically be added if necessary. A general rule for application as it relates to selecting the appropriate test for the given circumstances (Fisher’s Exact vs. Chi-Squared), pertains directly to the sample size. If any expected cell count within the contingency table would be less than 5, a Fisher’s Exact Test would be more appropriate. <br /><br />The test itself was created for the purpose of studying small observational samples. For this reason, the test is considered to be “conservative”, as compared to The Chi-Squared Test. Or, in layman terms, you are less likely to reject the null hypothesis when utilizing a Fisher’s Exact Test, as the test errs on the side of caution. As previously mentioned, the test was designed for smaller observational series, therefore, its conservative nature is a feature, not an error. <br /><br />Let’s give it a try in today’s…</div><br /><b><u>Example: </u></b><br /><br />A professor instructs two classes on the subject of Remedial Calculus. He believes, based on a book that he recently completed, that students who consume avocados prior to taking an exam, will generally perform better than students who do not consume avocados prior to taking an exam. To test this hypothesis, the professor has one of the classes consume avocados prior to a very difficult pass/fail examination. The other class does not consume avocados, and also completes the same examination. He collects the results of his experiment, which are as follows: <br /><br />Class 1 (Avocado Consumers) <br /><br />Pass: 15 <br /><br />Fail: 5 <br /><br />Class 2 (Avocado Abstainers) <br /><br />Pass: 10 <br /><br />Fail: 15 <br /><br />It is also worth mentioning that the professor will be assuming an alpha value of .05. 
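Before running the test, it can be useful to note the raw odds ratio implied by these counts. A quick base-R check (note that fisher.test() will report a conditional maximum-likelihood estimate of the odds ratio, not this raw cross-product ratio):

```r
# Raw (sample) odds ratio implied by the professor's counts.
# fisher.test() instead reports a conditional MLE (about 4.34 here),
# so the two figures will differ slightly.
pass_avocado <- 15; fail_avocado <- 5
pass_control <- 10; fail_control <- 15

raw_or <- (pass_avocado * fail_control) / (fail_avocado * pass_control)
raw_or   # 4.5
```

A raw odds ratio well above 1 already hints that the avocado group fared better; the formal test below tells us whether that difference is significant.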
<br /><br /><b># The data must first be entered into a matrix # <br /><br />Model <- matrix(c(15, 10, 5, 15), nrow = 2, ncol=2) <br /><br /># Let’s examine the matrix to make sure everything was entered correctly # <br /><br />Model</b><br /><br /><u>Console Output: </u><br /><div><i><br /></i></div><div><i> [,1] [,2] <br />[1,] 15 5 <br />[2,] 10 15 </i><br /><br /><b># Now to apply Fisher’s Exact Test # <br /><br />fisher.test(Model) <br /></b><br /><u>Console Output: <br /></u><br /> <i><span> <span> </span></span>Fisher's Exact Test for Count Data <br /><br />data: Model <br />p-value = 0.03373 <br />alternative hypothesis: true odds ratio is not equal to 1 <br />95 percent confidence interval: <br /> 1.063497 20.550173 <br />sample estimates: <br />odds ratio <br /> 4.341278</i><br /><br /><u><b>Findings: </b></u><br /><br />Fisher’s Exact Test was applied to our experimental findings for analysis. The results of such indicated a significant relationship as it pertains to avocado consumption and examination success: 75% (15/20), as compared to non-consumption and examination success: 40% (10/25); (p = .03). <br /><br />If we were to apply the Chi-Squared Test to the same data matrix, we would receive the following output: <br /><br /><b># Application of Chi-Squared Test to prior experimental observations # </b><br /><br /><b>chisq.test(Model, correct = FALSE)</b><br /><br /><u>Console Output: </u><br /><br /><i> Pearson's Chi-squared test <br /><br />data: Model <br />X-squared = 5.5125, df = 1, p-value = 0.01888</i><br /><br /><b><u>Findings:</u> </b></div><div><br /></div><div>As you might have expected, the application of the Chi-Squared Test yielded an even smaller p-value! If we were to utilize this test in lieu of The Fisher’s Exact Test, our results would also demonstrate significance. <br /><br />That is all for this entry. <br /><br />Thank you for your patronage. <br /><br />I hope to see you again soon. 
<br /><br />-RD<br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-4012950854500466502020-10-14T17:58:00.002-04:002020-10-14T18:00:42.658-04:00Why Isn’t My Excel Function Working?! (MS-Excel)Even an old data scientist can learn a new trick every once in a while. <br /><br />Today was such a day.<br /><br />Imagine my shock, as I spent about two and a half hours trying to get the most basic MS-Excel Functions to correctly execute. <br /><br />This brings us to today’s example.<br /><br />I’m not sure if this is now a default option within the latest version of Excel, or why this option would even exist, however, I feel that it is my duty to warn you of its existence.<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-b9styC0PrRA/X4dx7I4nrPI/AAAAAAAABS4/ayrX50GIQRYHMvfqfVdXKjTakyyl0C-BACLcBGAsYHQ/s306/1012A.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="115" data-original-width="306" src="https://1.bp.blogspot.com/-b9styC0PrRA/X4dx7I4nrPI/AAAAAAAABS4/ayrX50GIQRYHMvfqfVdXKjTakyyl0C-BACLcBGAsYHQ/s16000/1012A.png" /></a></div><div><br /></div>For the sake of this demonstration, we’ll hypothetically assume that you are attempting to write a <b>=COUNTIF </b>function within cell: <b>C2</b>, in order to assess the value contained within cell: <b>A2</b>. If we were to drag this formula to the cells beneath: <b>C2</b>, in order to apply the function to cells: <b>C3 </b>and <b>C4</b>, a mis-application occurs, as the value <b>“Car”</b> is not contained within <b>A3</b> or <b>A4</b>, and yet, the value <b>1 </b>is returned. <br /><br />If this “error” arises, it is likely due to the option <b>“Manual”</b> being pre-selected within the <b>“Calculation Options”</b> drop-down menu, which itself, is contained within the<b> “Formulas”</b> ribbon menu. 
To remedy this situation, change the selection to <b>“Automatic”</b> within the <b>“Calculation Options”</b> drop down. <br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-jFISEQOLqf8/X4dy8djCwUI/AAAAAAAABTE/m4FiEnJWDbYmeZlY3YGeeghfCeKgEpg0gCLcBGAsYHQ/s1056/1012B.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="207" data-original-width="1056" height="78" src="https://1.bp.blogspot.com/-jFISEQOLqf8/X4dy8djCwUI/AAAAAAAABTE/m4FiEnJWDbYmeZlY3YGeeghfCeKgEpg0gCLcBGAsYHQ/w400-h78/1012B.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><i>(Click on image to enlarge)</i></div><div><br /></div>The result should be the previously expected outcome:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-MNyB5D4K_0o/X4dzVa9Nn9I/AAAAAAAABTM/AQbtrqYHl48lwLEiKzrUJuKQNBJc7tU1ACLcBGAsYHQ/s245/1012C.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="110" data-original-width="245" src="https://1.bp.blogspot.com/-MNyB5D4K_0o/X4dzVa9Nn9I/AAAAAAAABTM/AQbtrqYHl48lwLEiKzrUJuKQNBJc7tU1ACLcBGAsYHQ/s16000/1012C.png" /></a></div><div><br /></div>Instead of accidentally and unknowingly encountering this error/feature in a way which is detrimental to your research, I would always recommend checking that <b>“Calculation Options”</b> is set to <b>“Automatic”</b>,<b> </b>prior to beginning your work within the MS-Excel platform.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-KPcQq0L3JhI/X4dzqbw7Q8I/AAAAAAAABTU/PxARehn-RQU9pAe1GfUgNoORAz2AHHUrACLcBGAsYHQ/s259/1012D.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="180" data-original-width="259" 
src="https://1.bp.blogspot.com/-KPcQq0L3JhI/X4dzqbw7Q8I/AAAAAAAABTU/PxARehn-RQU9pAe1GfUgNoORAz2AHHUrACLcBGAsYHQ/s0/1012D.png" /></a></div><br />I hope that you found this article useful. <br /><br />I’ll see you in the next entry. <br /><br />-RD <br />Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-57997942540476579562020-10-06T23:28:00.005-04:002020-10-06T23:31:54.761-04:00Averaging Across Variable Columns (SPSS)There may be a more efficient way to perform this function, as simpler functionality exists within other programming languages. However, I have not been able to discover a non <b>“ad-hoc” </b>method for performing this task within SPSS. <br /><br />We will assume that we are operating within the following data set:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-apvBDdzNEX4/X30zjhdRW0I/AAAAAAAABSM/C-V69eU_OLovvfBgE0SuMxWFv2jw3FaygCLcBGAsYHQ/s344/A10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="257" data-original-width="344" src="https://1.bp.blogspot.com/-apvBDdzNEX4/X30zjhdRW0I/AAAAAAAABSM/C-V69eU_OLovvfBgE0SuMxWFv2jw3FaygCLcBGAsYHQ/s16000/A10.6.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Which possesses the following data labels:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-dRstvQfcyr4/X30z2247HrI/AAAAAAAABSU/MqVH5Zb9B1wjSuz4OThvxhCACA88b0KkgCLcBGAsYHQ/s350/B10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="258" data-original-width="350" src="https://1.bp.blogspot.com/-dRstvQfcyr4/X30z2247HrI/AAAAAAAABSU/MqVH5Zb9B1wjSuz4OThvxhCACA88b0KkgCLcBGAsYHQ/s16000/B10.6.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Assuming that all variables are on a similar 
scale, we could create a new variable by utilizing the code below: <br /><br /><b>COMPUTE CatSum=MEAN(VarA, <br />VarB, <br />VarC). <br />EXECUTE. <br /></b><br />This new variable will be named <b>“CatSum”</b>. This variable will be comprised of the row-wise mean of the corresponding observational values of: (<b>“VarA”</b>, <b>“VarB”</b>, <b>“VarC”</b>). <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-1KVLXcGlTXs/X300R55rkuI/AAAAAAAABSc/9LS64FaKsesHMYq6rcN_tIpXOYv9D590QCLcBGAsYHQ/s430/C10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="259" data-original-width="430" src="https://1.bp.blogspot.com/-1KVLXcGlTXs/X300R55rkuI/AAAAAAAABSc/9LS64FaKsesHMYq6rcN_tIpXOYv9D590QCLcBGAsYHQ/s16000/C10.6.png" /></a></div><div><br /></div> To generate the mean value of our newly created <b>“CatSum” </b>variable, we would execute the following code: <br /><br /><b>DESCRIPTIVES VARIABLES=CatSum <br /> /STATISTICS=MEAN STDDEV. </b><div><br />This produces the output:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-GMeDNaTlGhI/X300j_B8QkI/AAAAAAAABSk/I4KtlkVu1VkSjdRYeL7TTnLXV8xGjw8-ACLcBGAsYHQ/s426/D10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="426" src="https://1.bp.blogspot.com/-GMeDNaTlGhI/X300j_B8QkI/AAAAAAAABSk/I4KtlkVu1VkSjdRYeL7TTnLXV8xGjw8-ACLcBGAsYHQ/s16000/D10.6.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br />To reiterate what we are accomplishing by performing this task, we are simply generating the overall mean value across the variables: <b>“VarA”</b>, <b>“VarB”</b>, <b>“VarC”</b>. 
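For readers following along in R rather than SPSS, the same two steps can be sketched with rowMeans(). The values below are made up, since the actual sheet is only shown as a screenshot above:

```r
# R equivalent of COMPUTE CatSum=MEAN(VarA, VarB, VarC): a row-wise mean,
# followed by the overall mean of the new variable (hypothetical values)
df <- data.frame(VarA = c(2, 4, 6),
                 VarB = c(1, 3, 5),
                 VarC = c(3, 5, 7))

df$CatSum <- rowMeans(df[, c("VarA", "VarB", "VarC")])
mean(df$CatSum)   # overall mean of the new variable
```

Note that SPSS's MEAN() skips missing values by default; rowMeans() only does so when called with na.rm = TRUE.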
<br /><br />Another way to conceptually envision this process, is to imagine that we are placing all of the variables together into a single column:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-30_c_QnCkOc/X3003ZZOk_I/AAAAAAAABSs/lRmctvQXC-c5NVirOKHljCXm1iQsIqo5ACLcBGAsYHQ/s741/E10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="741" data-original-width="87" height="640" src="https://1.bp.blogspot.com/-30_c_QnCkOc/X3003ZZOk_I/AAAAAAAABSs/lRmctvQXC-c5NVirOKHljCXm1iQsIqo5ACLcBGAsYHQ/w74-h640/E10.6.png" width="74" /></a></div><br />After which, we are generating the mean value of the column which contains all of the combined variable observational values. <br /><br />And that, is that!<br /><br />At least, for this article. <br /><br />Stay studious in the interim, Data Heads! <br /><br />- RD<div><div><div><br /></div></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-1417840554896830862020-09-08T19:44:00.006-04:002020-09-08T19:47:46.538-04:00How to Beautify your (SPSS) Outputs with MODIFY <div><b>** (Clicking on the any of the images displayed below will enlarge their contents) **</b></div><div><br /></div>First, we will address the steps necessary to suppress unnecessary and unwanted columns within the SPSS Frequency tables. <br /><br />The process to enable the <b>MODIFY</b> functionality is rather complicated. However, if you follow the steps below, you too will be able to have beautiful outputs without having to endeavor upon a lengthy manual cleanup process. <br /><br /><b><u>Steps Necessary to Enable the MODIFY Command </u></b><br /><br />1. Un-install SPSS. <br /><br />2. Install the latest version of Python Programming Language (3.x). The executable installer can be found here: <a href="http://www.python.org">www.python.org</a>. 
<br /><br /><b>(NOTE: THIS STEP MUST STILL BE ADHERED TO, EVEN IF ANACONDA PYTHON HAS ALREADY BEEN PREVIOUSLY INSTALLED.) </b><br /><br />3. Re-install SPSS. During the installation process, be sure to make all of the appropriate selections necessary to install the SPSS Python Libraries. <br /><br />4. From the top menu within SPSS’s data view, select the menu title<b> “Extensions”</b>, then select the option <b>“Extension Hub”</b>.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-A_bQz39u7eM/X1gVy0_gnUI/AAAAAAAABRU/_WGYGslt5zgI2NvZ0CppdXWFr6XpEQxuwCLcBGAsYHQ/s727/EHub.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="153" data-original-width="727" height="102" src="https://1.bp.blogspot.com/-A_bQz39u7eM/X1gVy0_gnUI/AAAAAAAABRU/_WGYGslt5zgI2NvZ0CppdXWFr6XpEQxuwCLcBGAsYHQ/w500-h102/EHub.png" width="500" /></a></div><br /><div><br /></div>5. Within the <b>“Explore”</b> tab of the <b>“Extension Hub”</b> menu, search for <b>“SPSSINC MODIFY TABLES” </b>within the left search bar.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-rOVNRaXS93E/X1gWA3XN2dI/AAAAAAAABRY/UHfMPPebBbkhAfXrxOiiwMgUH01oswFZgCLcBGAsYHQ/s967/Frame1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="967" height="370" src="https://1.bp.blogspot.com/-rOVNRaXS93E/X1gWA3XN2dI/AAAAAAAABRY/UHfMPPebBbkhAfXrxOiiwMgUH01oswFZgCLcBGAsYHQ/w500-h370/Frame1.png" width="500" /></a></div><div><br /></div>6. Check the box <b>“Get extension” </b>to the right of<b> “SPSSINC_MODIFY_TABLES”</b>, then click <b>“OK”</b>. <br /><br />7. The next screen should confirm that the installation of the extension has occurred. 
<br /><br /><b><u>Steps Necessary to Utilize the MODIFY Command </u></b><br /><br />We are now prepared to obliterate all of those pesky<b> ‘Percent’ </b>and <b>‘Cumulative Percent’ </b>tables from existence! In order to achieve this as it applies to all tables within the output section, create and run the following lines of syntax subsequent to frequency table creation.<br /><br /><b>SPSSINC MODIFY TABLES subtype="Frequencies" <br /><br />SELECT='Cumulative Percent' 'Percent' <br /><br />DIMENSION= COLUMNS <br /><br />PROCESS = ALL HIDE=TRUE <br /><br />/STYLES APPLYTO=DATACELLS. </b><br /><br /><b><u>Steps Necessary to Remove the top Frequency Rows Which Accompany Frequency Table Output</u></b><div><b><u><br /></u></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-MldU_t1Beek/X1gWOc4kZUI/AAAAAAAABRg/uq2qcuYHgj88648VR2-KTSb5tpaSqpp5gCLcBGAsYHQ/s541/Freq.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="226" data-original-width="541" height="210" src="https://1.bp.blogspot.com/-MldU_t1Beek/X1gWOc4kZUI/AAAAAAAABRg/uq2qcuYHgj88648VR2-KTSb5tpaSqpp5gCLcBGAsYHQ/w500-h210/Freq.png" width="500" /></a></div><div><br /></div>In order to suppress the creation of the type of table depicted above, you must modify your initial frequency syntax. <br /><br />Instead of utilizing syntax such as: <br /><br /><b>FREQUENCIES VARIABLES=Q1 Q2 Q3 <br /><br /> /ORDER=ANALYSIS. <br /></b><br />You are instead forced to utilize a more verbose syntax: <br /><br /><b>OMS SELECT ALL /EXCEPTIF SUBTYPES='Frequencies' <br /><br />/DESTINATION VIEWER=NO. <br /><br />FREQUENCIES VARIABLES= Q1 Q2 Q3 <br /><br /> /ORDER=ANALYSIS. <br /><br />OMSEND. </b><br /><br />Doing such adds lines of code. However, it is worth the effort. At least, in my opinion. As the offset to the trade is peace of mind. 
<br /><br /><b><u>How to Suppress Syntax from Printing within the SPSS Output </u></b><br /><br />In order to suppress syntax from printing within the SPSS Output window, prior to creating output, follow the steps below. <br /><br />1. From the top menu within SPSS’s data view, select the menu title <b>“Edit”</b>, then select the option <b>“Options”</b>. <br /><br />2. Within the subsequent menu, select the tab <b>“Viewer”</b>. Then, remove the check mark located to the left of <b>“Display commands in the log”</b>. Next, click <b>“Apply”</b>. <br /><br />You are now prepared to create SPSS session output devoid of syntax. <br /><br /><b><u>How to Modify the Visual Style of SPSS Table Output </u></b><br /><br />If you’d prefer a different, perhaps more readable SPSS table output, the following steps allow for the modification of such. <br /><br />1. Create a table within SPSS which complies with the system default output style. <br /><br />2. Right click on the table within the output, and select the options <b>“Edit Content”</b>, <b>“In Separate Window”</b> within the drop down menu.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-JMPcJWBz8XE/X1gWc_wrY4I/AAAAAAAABRo/JJcXSk-2iNcDH5bOUuS42MCCC4wpYzp2QCLcBGAsYHQ/s512/dropdown.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="423" data-original-width="512" height="413" src="https://1.bp.blogspot.com/-JMPcJWBz8XE/X1gWc_wrY4I/AAAAAAAABRo/JJcXSk-2iNcDH5bOUuS42MCCC4wpYzp2QCLcBGAsYHQ/w500-h413/dropdown.png" width="500" /></a></div><div><br /></div>3. Selecting<b> “Format”</b>, followed by <b>“Table Looks” </b>from the top menu, presents a new pop-up menu which allows for general table alterations. <br /><br />As an example, select <b>“ClassicLook” </b>from the <b>“TableLook Files:” </b>menu. <br /><br />Next, click the right <b>“Edit Look”</b> button, then click the tab <b>“Cell Formats”</b>. 
Within this submenu, the general background of table cells can be modified. Be sure to click <b>“Apply”</b> before clicking<b> “OK”</b>. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-C_UtSpsjQ0U/X1gWo7U1j0I/AAAAAAAABRw/_OqL8Lzus0wtmOXM7yKBvKB4O4i9ZLw6wCLcBGAsYHQ/s744/TableProper.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="515" data-original-width="744" height="347" src="https://1.bp.blogspot.com/-C_UtSpsjQ0U/X1gWo7U1j0I/AAAAAAAABRw/_OqL8Lzus0wtmOXM7yKBvKB4O4i9ZLw6wCLcBGAsYHQ/w500-h347/TableProper.png" width="500" /></a></div><div><br />4. To save a custom <b>“Look”</b>, again select <b>“TableLooks”</b> from the <b>“Format”</b> menu. Select <b>“Save Look”</b>, with <b>“<As Displayed>”</b> selected within the right <b>“TableLook Files”</b> menu. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-_ICR0iam4yg/X1gW0T0caQI/AAAAAAAABR4/99hg1ts2iUYjNwVR5HRKDFqU0Lv6ncXAQCLcBGAsYHQ/s636/LookResults.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="493" data-original-width="636" height="388" src="https://1.bp.blogspot.com/-_ICR0iam4yg/X1gW0T0caQI/AAAAAAAABR4/99hg1ts2iUYjNwVR5HRKDFqU0Lv6ncXAQCLcBGAsYHQ/w500-h388/LookResults.png" width="500" /></a></div><div><br /></div>5. To load this look so that it is applied to all future outputs, select <b>“Edit”</b> from the top main SPSS Data View menu. Then select <b>“Options”</b> from the drop down menu followed by the tab <b>“Pivot Tables”</b>. Select the <b>“Browse”</b> button from beneath the <b>“Table View”</b> menu, then select the new look which you created. <br /><br />6. Clicking <b>“Apply”</b>, followed by <b>“OK”</b>, will apply this look to all future tables created during the duration of the SPSS session. 
<br /><br />If you ever want to revert back to the default look, follow the previous steps, and select <b>“<System Default>”</b> from the leftmost <b>“TableLook” </b>menu. <br />Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-8432181850897965532020-08-31T10:52:00.002-04:002020-08-31T10:57:50.031-04:00(R) Markov ChainsPer Wikipedia, “A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”. <br /><br />Explained in a less broad manner, a Markov chain models a system which moves between a set of states, in which the probability of each transition depends only upon the state which the system currently occupies. <br /><br />For example, in the case of weather systems, a day which is cloudy may subsequently be followed by a day which is also cloudy, a day without clouds, or a rainy day. However, the probability of each subsequent event will undoubtedly be impacted by the composition of the current state. <br /><br />Another example of the applied methodology is assessment of market share. If company A offers a product which retains 60% of its current consumers annually, while losing the remaining 40% of that consumer base to company B, and company B retains only 20% of its current consumer base annually, while losing the remaining 80% of that consumer base to company A, what is the impact of the phenomenon described on an annual basis? <br /><br />Let’s explore both examples: <br /><br />First, we’ll create a model which can predict weather. 
<br /><br />We’ll assume that the following probabilities appropriately describe the autumn forecasts for weather in Winnipeg.<br /><br /> Cloudy Clear Snowy Rainy <br /><br />Cloudy 33% 17% 25% 25% <br /><br />Clear 25% 50% 12% 13% <br /><br />Snowy 19% 15% 33% 33% <br /><br />Rainy 20% 20% 10% 50% <br /><br />To further understand this probability matrix, assume that currently the day’s forecast in Winnipeg is <b>“Cloudy”</b>. This would typically indicate that the following day would have weather which is either “<b>Cloudy” </b>(33%),<b> “Clear”</b> (17%), <b>“Snowy” </b>(25%), or <b>“Rainy”</b> (25%). <br /><br />Now, we’ll run the information through the R-Studio platform:<br /><br /><b><u>EXAMPLE A – Weather Model </u></b><br /><br /><b># With the libraries ‘markovchain’ and ‘diagram’ downloaded and enabled # <br /><br /># Create a Transition Matrix # <br /><br />trans_mat <- matrix(c(.33, .17, .25, .25, .25, .50, .12, .13, .19, .15, .33, .33, .20, .20, .10, .50),nrow = 4, byrow = TRUE) <br /><br />stateNames <- c("Cloudy","Clear", "Snowy", "Rainy") <br /><br />row.names(trans_mat) <- stateNames <br /><br />colnames(trans_mat) <- stateNames <br /><br /># Check input # <br /><br />trans_mat <br /><br /># Console Output #</b><i><br /><br /> Cloudy Clear Snowy Rainy <br />Cloudy 0.33 0.17 0.25 0.25 <br />Clear 0.25 0.50 0.12 0.13 <br />Snowy 0.19 0.15 0.33 0.33 <br />Rainy 0.20 0.20 0.10 0.50</i><br /><b><br /># Create a Discrete Time Markov Chain # <br /><br />disc_trans <- new("markovchain",transitionMatrix=trans_mat, states=c("Cloudy","Clear", "Snowy", "Rainy"), name="Weather") <br /><br /># Check input # <br /><br />disc_trans <br /><br /># Console Output #</b><br /><br /><i>Weather <br /> A 4 - dimensional discrete Markov Chain defined by the following states: <br /> Cloudy, Clear, Snowy, Rainy <br /> The transition matrix (by rows) is defined as follows: <br /> Cloudy Clear Snowy Rainy <br />Cloudy 0.33 0.17 0.25 0.25 <br />Clear 0.25 0.50 0.12 0.13 <br 
/>Snowy 0.19 0.15 0.33 0.33 <br />Rainy 0.20 0.20 0.10 0.50</i><br /><br /><b># Illustrate the Matrix Transitions # <br /><br />plotmat(trans_mat,pos = NULL, <br /><br /> lwd = 1, box.lwd = 2, <br /><br /> cex.txt = 0.8, <br /><br /> box.size = 0.1, <br /><br /> box.type = "circle", <br /><br /> box.prop = 0.5, <br /><br /> box.col = "light yellow", <br /><br /> arr.length=.1, <br /><br /> arr.width=.1, <br /><br /> self.cex = .4, <br /><br /> self.shifty = -.01, <br /><br /> self.shiftx = .13, <br /><br /> main = "")</b><br /><br />This produces the output graphic:<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8xzK-2zG-_Y/X00PRRTme9I/AAAAAAAABQ0/1N4qtTmFkkchRJsZcAvbeCT95U5ICiVcQCLcBGAsYHQ/s570/MarkovChain1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="570" data-original-width="401" src="https://1.bp.blogspot.com/-8xzK-2zG-_Y/X00PRRTme9I/AAAAAAAABQ0/1N4qtTmFkkchRJsZcAvbeCT95U5ICiVcQCLcBGAsYHQ/s0/MarkovChain1.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><i>(As it pertains to the graphic- something important to note is the direction of the arrows. The arrow direction in the graphic is inverted. Therefore, I would only use the graphic as an auxiliary for personal reference.) 
</i><br /><br /><b># We will assume that the current forecast is cloudy by creating the vector below # </b><br /><br /><b>Current_state<-c(1, 0, 0, 0) </b><br /><br /><b># Now we will utilize the following code to predict the weather for tomorrow # </b><br /><br /><b>steps<-1 </b><br /><br /><b>finalState<-Current_state*disc_trans^steps </b><br /><br /><b>finalState </b><br /><br /><b># Console Output # </b><br /><br /><i> Cloudy Clear Snowy Rainy </i><br /><i>[1,] 0.33 0.17 0.25 0.25 <br /></i><br />This output indicates that tomorrow will have a 33% chance of being cloudy, a 17% chance of being clear, a 25% chance of being snowy, and a 25% chance of being rainy. <br /><br /><b># Let’s predict the weather for the following day # <br /><br />steps<-2 <br /><br />finalState<-Current_state*disc_trans^steps <br /><br />finalState <br /><br /># Console Output # </b><br /><br /><i> Cloudy Clear Snowy Rainy <br />[1,] 0.2428372 0.2621651 0.1839856 0.311012 <br /></i><br />With this information, we can assume that the day after tomorrow will have a 24% chance of being cloudy, a 26% chance of being clear, an 18% chance of being snowy, and a 31% chance of being rainy. (Due to rounding, the displayed figures do not sum exactly to 1.) <br /><br /><b><u>EXAMPLE B – Market Share </u></b><br /><br />Let’s re-visit our market share example: <br /><br />Company A offers a product which retains 60% of its current consumers annually, while losing the remaining 40% of that consumer base to company B, and company B retains only 20% of its current consumer base annually, while losing the remaining 80% of that consumer base to company A. What is the impact of the phenomenon described on an annual basis? <br /><br />Let’s make a few assumptions. <br /><br />First, we will assume that the projection given above is accurate. 
<br /><br />Next, we’ll assume that the total customer base as it pertains to the product is 60,000,000. <br /><br />Finally, we’ll assume that the Company A possesses 20% of this market, and Company B possesses 80% of this market. 12,000,000 individuals and 48,000,000 respectively. <br /><b><br /># With the libraries ‘markovchain’ and ‘diagram’ downloaded and enabled # <br /><br /># Create a Transition Matrix # <br /><br />trans_mat <- matrix(c(0.6,0.4,0.8,0.2),nrow = 2, byrow = TRUE) <br /><br />stateNames <- c("Company A","Company B") <br /><br />row.names(trans_mat) <- stateNames <br /><br />colnames(trans_mat) <- stateNames <br /><br /># Check input # <br /><br />trans_mat <br /><br /># Console Output #</b><br /><div><br /><i> Company A Company B <br />Company A 0.6 0.4 <br />Company B 0.8 0.2 </i><br /><br /><b># Create a Discrete Time Markov Chain # </b></div><div><b><br />disc_trans <- new("markovchain",transitionMatrix=trans_mat, states=c("Company A","Company B"), name="Market Share") <br /><br />disc_trans <br /><br /># Check input # <br /><br />disc_trans <br /><br /># Console Output # </b><br /><i><br />Market Share <br /> A 2 - dimensional discrete Markov Chain defined by the following states: <br /> Company A, Company B</i></div><div><i> The transition matrix (by rows) is defined as follows: <br /> Company A Company B <br />Company A 0.6 0.4 <br />Company B 0.8 0.2</i><br /></div><br /><b># Illustrate the Matrix Transitions # <br /><br />plotmat(trans_mat,pos = NULL, <br /><br /> lwd = 1, box.lwd = 2, <br /><br /> cex.txt = 0.8, <br /><br /> box.size = 0.1, <br /><br /> box.type = "circle", <br /><br /> box.prop = 0.5, <br /><br /> box.col = "light yellow", <br /><br /> arr.length=.1, <br /><br /> arr.width=.1, <br /><br /> self.cex = .4, <br /><br /> self.shifty = -.01, <br /><br /> self.shiftx = .13, <br /><br /> main = "")</b><br /><br />This produces the output graphic:<div><br /></div><div class="separator" style="clear: both; text-align: 
center;"><a href="https://1.bp.blogspot.com/-cCWp2j58Qv8/X00PX29y_aI/AAAAAAAABQ4/wm7zUwml8egheTmzmuVx6XdiZwQ1hbxYACLcBGAsYHQ/s439/MarkovChain2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="439" data-original-width="144" src="https://1.bp.blogspot.com/-cCWp2j58Qv8/X00PX29y_aI/AAAAAAAABQ4/wm7zUwml8egheTmzmuVx6XdiZwQ1hbxYACLcBGAsYHQ/s0/MarkovChain2.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><i>(Again, as it pertains to the graphic- something important to note is the direction of the arrows. The arrow direction in the graphic is inverted. Therefore, I would only use the graphic as an auxiliary for personal reference.) <br /></i><br /><b># We will assume that the market share is as follows # <br /><br /># This reflects the information provided in the example description above # <br /><br />Current_state<- c(0.20,0.80) <br /><br /># Now we will utilize the following code to predict the market share for the next year # <br /><br />steps<-1 <br /><br />finalState<-Current_state*disc_trans^steps <br /><br />finalState <br /><br /># Console Output #</b><br /><br /><i> Company A Company B <br />[1,] 0.76 0.24 </i><br /><br />As illustrated, one year out, Company A now controls 76% of the market share (45,600,000)*, and Company B controls 24% of the market share (14,400,000). <br /><br />* Assuming that original market share does not increase or decline in overall individuals. The calculation for the figures is: 60,000,000 * .76 and 60,000,000 * .24. 
<br /><br />Similar to our previous example, we can also project the current trend across multiple consecutive time periods.<br /><br /><b># The following code predicts the market share for the following two years # <br /><br />steps<-2 <br /><br />finalState<-Current_state*disc_trans^steps <br /><br />finalState <br /><br /># Console Output #</b><br /><br /><i> Company A Company B <br />[1,] 0.648 0.352 </i><br /><br />The steady state, in the case of this example, predicts the potential equilibrium which will be reached if the trends continue ad infinitum. <br /><br /><b># Steady state Matrix # <br /><br />steadyStates(disc_trans) <br /><br /># Console Output # </b><br /><br /><i> Company A Company B <br />[1,] 0.6666667 0.3333333</i><br /><br />At equilibrium, Company A controls approximately 66.67% of the market share, and Company B controls approximately 33.33% of the market share.Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-35561549373697002132020-08-25T19:56:00.004-04:002020-08-25T20:00:52.941-04:00(R) Exotic Analysis – Distance Correlation T-TestIn prior articles, I explained the various tests of correlation which are available within the R programming language. One of those methods, which was described but is rarely utilized outside of the textbook, is the Distance Correlation T-Test methodology. <br /><br />In this entry, I will briefly explain when it is appropriate to utilize the distance correlation, and how to appropriately apply the methodology within the R framework. 
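Before proceeding, it may help to see what the underlying statistic actually computes. Below is a from-scratch Python sketch of the plain (uncorrected) sample distance correlation; the example later in this entry utilizes the "energy" package's bias-corrected t-test variant, so its figures will differ.

```python
import numpy as np

def distance_correlation(x, y):
    """Plain sample distance correlation between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise Euclidean distance matrices.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()    # squared distance covariance
    dvar_x = (A * A).mean()   # squared distance variances
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

# The data vectors from the example within this entry.
x = [8, 1, 4, 10, 8, 10, 3, 1, 1, 2]
y = [97, 56, 97, 68, 94, 66, 81, 76, 86, 69]
r = distance_correlation(x, y)
```

The statistic ranges from 0 (no dependence) to 1 (perfect dependence), and a variable is always perfectly dependent upon itself.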
<br /><br />Now I must begin by stating that what I am about to describe is uncommon, and should only be utilized in situations which absolutely warrant application.<br /><br />The distance correlation as described within the context of this blog is:<div><br /><b><u>Distance Correlation</u></b> – A method which tests model variables for correlation through the utilization of a Euclidean distance formula.<br /><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br />So when would I apply the <b>Distance Correlation T-Test</b>? To answer this question, only in situations in which other correlation methods are inapplicable. In the case which I am about to demonstrate, an example of the inapplicability of other methods would be situations in which one variable is continuous, and the other is categorical.<br /><br /><b><u>Example:</u></b><br /><br />(This example requires that the R package:<b> “energy”</b>, be downloaded and enabled.)</p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br /><b># Data Vectors #<br /> <br /> x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)<br /> y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)<br /><br />dcor.ttest(x, y)<br /><br />mean(x)<br /><br />sd(x)<br /><br />mean(y)<br /><br />sd(y)</b><br /><br />This produces the output:<br /> <br /><i> dcor t-test of independence<br /> <br /> data: x and y<br /> T = -0.1138, df = 34, p-value = 0.545<br /> sample estimates:<br /> Bias corrected dcor <br /> -0.01951283<br /><br />> mean(x)<br />[1] 4.8<br />> sd(x)<br />[1] 3.794733<br />> mean(y)<br />[1] 79<br />> sd(y)<br />[1] 14.3527</i><br /><br /><b>Conclusion:</b><br /><br />The distance correlation t-test indicated no significant association between GROUP X (M = 4.80, SD = 3.79) and GROUP Y (M = 79.00, SD = 14.35), t(34) = -0.11, p = .55.<br /><br />However, you may be wondering, what is the difference between the Distance Correlation T-Test, the Distance Correlation Method, and the Pearson Test of Correlation?<br /><br 
/><b><u>Distance Correlation T-Test</u></b> – Utilized to test for significance in situations in which one variable is continuous, and the other is categorical. This method can also be utilized in other situations, however, if both variables are continuous, then the<b> Pearson Test of Correlation </b>is most appropriate. <br /><br /><b><u>Distance Correlation Method</u></b> – Utilized to test for correlation between two variables when assessed through the application of the Euclidean Distance Formula. This model output value is similar to coefficient of determination, in that, it can range from 0 (no correlation), to 1 (perfect correlation). <br /><br /><b><u>The Pearson Test of Correlation</u></b> – Utilized to determine if values are correlated. This method should typically be utilized above all other tests of correlation. However, it is only appropriate to utilize this method when both variables are continuous.</p></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-11785824963802933832019-08-04T15:36:00.004-04:002021-04-15T18:05:52.553-04:00Model and Method UtilizationThere are many model types, methods and techniques demonstrated on this website. In this entry, I will categorize each of the aforementioned concepts, and provide a brief description as it pertains to the scenario which would warrant appropriate utilization. <br /><br /><div><b><i>(Tests of Normality)</i></b><br /><br /><b><u>Q-Q Plot</u></b> – A graph which is utilized to assess data for normality. <br /><br /><b><u>P-P Plot</u></b> – A graph which is utilized to assess data for normality. <br /><br /><b><u>Shapiro-Wilk Normality Test</u></b> – A test which is utilized to test data for normality. 
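As a brief, concrete illustration of the Shapiro-Wilk test in practice (sketched in Python via scipy with simulated data, though this blog's own walkthroughs use R and SPSS):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# A sample drawn from a normal distribution will typically NOT reject
# normality, while a heavily skewed sample will.
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

stat_n, p_n = stats.shapiro(normal_sample)
stat_s, p_s = stats.shapiro(skewed_sample)
```

A small p-value (e.g., below .05) indicates that the hypothesis of normality should be rejected.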
<br /><br /><b><i>(Tests Related to Parametric Model Variable Correlation) </i></b><br /><br /><b><u>Variance Inflation Factor</u> </b>– A method which tests model independent variables for multicollinearity.<br /><br /><b><u>(Pearson) Coefficient of Correlation</u></b> – A method which tests variables for correlation.<br /><br /><b><u>Partial Correlation</u></b> - A method which is utilized to measure the correlation between two variables, while also controlling for a third variable.<br /><br /><b><u>Distance Correlation</u></b> – A method which tests model variables for correlation through the utilization of a Euclidean distance formula. <br /><br /><b><u>Canonical Correlation</u></b> – A method which assesses model variables for correlation through the combination of model variables into independent groups.<br /><br /><b><i>(Tests Related to Non-Parametric Model Variable Correlation)</i></b><br /><b><i><br /></i></b><b><u>Spearman’s Rank Correlation</u> </b>- A non-parametric alternative to the Pearson correlation. This method is utilized in circumstances when either data samples are non-linear, or the data type contained within those samples is ordinal. An example of ordinal data – “survey response data which asked the respondent to rank a particular item on a scale of 1-10”.<br /><br /><b><u> Kendall Rank Correlation Coefficient</u></b> - Like Spearman’s rho, Kendall’s Tau is also utilized in circumstances when either data samples are non-linear, or the data type contained within the samples is ordinal.<br /><br /><b><i> (Tests of Significance Amongst Groups)</i></b><br /><br /><b><u>One Sample T-Test </u></b>- This test is utilized to compare a sample mean to a specific value; it is used when the dependent variable is measured at the interval or ratio level.<br /><br /><b><u>Two Sample T-Test</u></b> - This test functions in the same manner as the above test. 
However, in the case of this model, data is randomly sampled from different sets of items from two separate control groups. <br /><br /><b><u>The Welch Two Sample T-Test</u></b> - This test functions in the same manner as the above test. The only difference being, this method is utilized if the two separate control groups possess unequal variances (and, potentially, unequal sizes). <br /><br /><b><u>Paired T-Test</u> </b>– Similar in composition to the Two Sample T-Test, this test is utilized if you are sampling the same set twice, once for each variable.<br /><br /><b><i>(Analysis of Variance “ANOVA”)</i></b><br /><b><i><br /></i></b><b><u>Analysis of Variance</u> </b>– Also known as ANOVA, this method is utilized to test for significance across the variances of multiple sample groups. In many ways, this test is similar to a t-test, however, ANOVA allows for multiple group comparison.<br /><br /><b><u>One Way Analysis of Variance (ANOVA)</u> </b>– An ANOVA model containing a single independent variable.<br /><br /><b><u>Two Way Analysis of Variance (ANOVA)</u></b> - An ANOVA model containing multiple independent variables.<br /><br /><b><u>Repeated-Measures Analysis of Variance (ANOVA)</u></b> – An ANOVA model containing a single independent variable measured multiple times.<br /><br /><b><i> (Exotic Analysis of Variance “ANOVA” Variants)</i></b><br /><br /><b><u>Analysis of Covariance (ANCOVA)</u> </b>– An ANOVA model which also factors for a covariate value which may impact the system as a whole. <br /><br /><a href="https://statistics.laerd.com/spss-tutorials/ancova-using-spss-statistics.php">https://statistics.laerd.com/spss-tutorials/ancova-using-spss-statistics.php</a><br /><br /><b><u>Random Effects Analysis of Variance</u> </b>– An ANOVA model which is synthesized from sampling from a greater population in order to determine inference. 
<br /><br /><a href="https://stat.ethz.ch/education/semesters/as2015/anova/06_Random_Effects.pdf">https://stat.ethz.ch/education/semesters/as2015/anova/06_Random_Effects.pdf</a><br /><br /><b><u>Multivariate Analysis of Variance (MANOVA)</u></b> – An ANOVA model containing multiple dependent variables.<br /><br /><a href="https://statistics.laerd.com/spss-tutorials/one-way-manova-using-spss-statistics.php">https://statistics.laerd.com/spss-tutorials/one-way-manova-using-spss-statistics.php</a><br /><br /><b><u>Multivariate of Covariance (MANCOVA)</u></b> – An ANOVA model containing multiple dependent variables. Also factors for a covariate value which may impact the system as a whole. <br /><br /><a href="https://statistics.laerd.com/spss-tutorials/one-way-mancova-using-spss-statistics.php">https://statistics.laerd.com/spss-tutorials/one-way-mancova-using-spss-statistics.php</a><br /><br /><b><i>(Test of Significance for Nonparametric Data)</i></b><br /><br /><b><u>Friedman Test (One Way Analysis of Variance)</u></b> – The nonparametric alternative to a One Way ANOVA test.<br /><br /><b><u>Wilcox Signed Rank Test (One Sample T-Test, Paired T-Test)</u></b> – The nonparametric alternative to the One Sample T-Test, and the Paired T-Test.<br /><br /><b><u>Mann-Whitney U Test (Two Sample T-Test)</u></b> – A nonparametric alternative to the One Way ANOVA test.<br /><br /><b><i>(Tests of Significance Amongst Groups)</i></b><br /><br /><b><u>Chi-Square</u></b> – A test which measures categorical significance as it pertains to a binary outcome variable. <br /><br /><b><u>McNemar's Test</u> </b>– A test which measures categorical significance, limited to two initial categories, and two categorical outcomes. This test is typically utilized for drug trials. <br /><br /><b><i>(Metric to Assess Rate of Agreement Amongst Two Entitles) </i></b><br /><br /><b><u>Cohen’s Kappa</u> </b>– A test which measures the rate of agreement amongst two entities. 
<br /><br /><b><i>(Tests of Significance Amongst Groups Comprised of Survey Questions)</i></b><br /><br /><b><u>Cronbach’s Alpha</u></b> - Cronbach’s Alpha is primarily utilized to measure the inter-relatedness of response data collected from sociological surveys. Specifically, the potential differentiation of response information related to certain interrelated categorical survey questions. <br /><br /><b><i>(Tests Pertaining to Stationarity and Random Walks)</i></b><br /><br /><b><u>Dickey-Fuller Test</u></b> – A methodology of analysis utilized to test data for stationarity.<br /><br /><b><u>Phillips-Perron Unit Root Test</u></b> – A methodology utilized to test data for random walk potential.<br /><br /><b><i>(Comparison of Outcome Variables)</i></b><br /><br /><b><u>Two Step Cluster</u> </b>– A method which assesses model outcome variables through the utilization of a clustering technique. <br /><br /><b><u>K-Means</u></b> - A method which assesses model outcome variables through the utilization of a clustering technique.<br /><br /><b><u>Hierarchical Cluster</u></b> - A method which assesses model outcome variables through the utilization of a hierarchical technique. <br /><br /><b><u>K-Nearest Neighbor</u></b> – A method which compares similarity of outcome variables as determined by the values of the model’s independent variables. <br /><br /><b><i>(Reduction of Independent Variables through Variable Synthesis)</i></b><br /><br /><b><u>Dimension Reduction</u></b> – A method which creates new variables with values that are determined by the original values of the independent model variables. <br /><br /><b><i>(Impact Assessment)</i></b><br /><br /><b><u>TURF Analysis</u></b> – A method of analysis typically utilized for product and design studies. This technique assesses the most effective way to reach a sample target demographic. 
<br /><br /><b><i>(Survival Analysis) </i></b><br /><br /><b><u>Survival Analysis</u></b> - A statistical methodology which measures the probability of an event occurring within a group over a period of time.<br /><br /><b><i>(Sample Distribution Tests)</i></b><br /><br /><b><u>The Wald Wolfowitz Test</u> </b>- A method for analyzing a single data set in order to determine whether the elements within the data set were sampled independently.<br /><br /><b><u>The Wald Wolfowitz Test (2-Sample)</u></b> - A method for analyzing two separate sets of data in order to determine whether they originate from the same distribution.<br /><br /><b><u>The Kolmogorov-Smirnov Test</u></b> - A method for analyzing a single data set in order to determine whether the data was sampled from a specified reference distribution (most commonly, the normal distribution).<br /><br /><b><u>The Kolmogorov-Smirnov Test (2-Sample)</u></b> - A method for analyzing two separate sets of data in order to determine whether they originate from the same distribution.<br /><br /><b><i>(Outcome Models – Conditions for Utilization)</i></b><br /><br /><b><u>Linear Regression</u></b> – Continuous outcome variable. Continuous independent variable(s). <br /><br /><b><u>General Linear Mixed Models</u></b> – Continuous outcome variable. Any type of independent variable(s). <br /><br /><b><u>Logistic Regression Analysis</u></b> – Binary outcome variable. Categorical or continuous independent variable(s).<br /><br /><b><u>Discriminant Analysis</u></b> – Binary outcome variable. Categorical or continuous independent variable(s).<br /><br /><b><u>Loglinear Analysis</u> </b>- Counts within a contingency table as the outcome. Categorical independent variable(s).<br /><br /><b><u>Partial Least Squares Regression</u></b> – Any type of outcome variable. Any type of independent variable(s).<br /><br /><b><u>Polynomial Regression</u></b> – Continuous outcome variable. 
Continuous independent variable(s).<br /><br /><b><u>Multinomial Logistic Regression</u></b> – Categorical outcome variable. Categorical or continuous input variable(s).<br /><br /><b><u>Ordinal Logistic Regression</u></b> – Ordinal (ordered categorical) outcome variable. Categorical or continuous input variable(s).<br /><br /><b><u>Probit Regression</u></b> – Binary outcome variable. Categorical or continuous input variable(s).<br /><br /><b><u>2-Stage Least Squares Regression</u></b> - Continuous outcome variable. Continuous independent variable(s), with instrumental variables utilized to address endogenous predictors. <br /><div><span style="font-kerning: none;"><br /></span></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-2068629627757745292019-08-04T15:00:00.000-04:002021-04-15T18:05:52.662-04:00APA Format In today’s article, we will discuss the standard methodology which is utilized to report statistical findings. In previous examples featured on this website, model outputs were explained in a simplified manner in order to decrease the level of complexity. However, if the purpose of the overall research endeavor is to produce results for publication, then the APA format should be applied to the experimental findings.<br /><br />“APA” is an abbreviation for The American Psychological Association. Regardless of the type of research that is being conducted, the formatting standards maintained by the APA, as they apply to statistical research, should always be utilized when presenting data in a professional manner.<br /><br /><b><u>Details</u></b><br /><br />All figures which contain decimal values should be rounded to the nearest hundredth. Ex. .105 = .11. P-values are the exception to this rule: in most cases, they should be reported to two decimal places. 
An exception occurs when a greater amount of specificity is required to illustrate the details of the findings.<br /><br />Another rule to keep in mind pertains to leading zeroes. A leading zero prior to a decimal place is only required if the represented figure has the potential to exceed “1”. If the value cannot exceed “1”, then a leading zero is unnecessary. <br /><br />Below are examples which demonstrate the most common applications of the APA format.<br /><br /><u><b>Chi-Square</b></u><br /><u><b><br /></b></u><b>Template:</b><br /><br />A chi-square test of independence was performed to examine the relation between <b>CATEGORY</b> and <b>OUTCOME</b>. The relation between these variables was found to be significant at the p < .05 level, χ2 (<b>DEGREES OF FREEDOM</b>, N = <b>SAMPLE SIZE</b>) = <b>X-Squared Value</b>, p = <b>p - value</b>.<br /><br /><b>- OR -</b><br /><div><b><br /></b>A chi-square test of independence was performed to examine the relation between <b>CATEGORY</b> and <b>OUTCOME</b>. The relation between these variables was not found to be significant at the p < .05 level, χ2 (<b>DEGREES OF FREEDOM</b>, N = <b>SAMPLE SIZE</b>) = <b>X-Squared Value</b>, p = <b>p - value</b>.<br /><br /><b>Example:</b><br /><br />While working as a statistician at a local university, you are tasked with evaluating, based on survey data, the level of job satisfaction that each member of the staff currently has for their occupational role (Assume a 95% Confidence Interval). 
</div><div><br /></div><div>The data that you gather from the surveys is as follows:<br /><br /><u>General Faculty</u><br />130 Satisfied 20 Unsatisfied<br /><br /><u> Professors</u><br />30 Satisfied 20 Unsatisfied<br /><br /><u> Adjunct Professors</u><br />80 Satisfied 20 Unsatisfied<br /><br /><u> Custodians</u><br />20 Satisfied 10 Unsatisfied<br /><br /><b># Code # <br /><br />Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10), nrow = 4, ncol=2) <br /><br />N <- sum(130, 30, 80, 20, 20, 20, 20, 10) <br /><br /> chisq.test(Model) <br /><br />N <br /><br /># Console Output #</b><br /><br /><i> Pearson's Chi-squared test<br /> <br /> data: Model<br /> X-squared = 18.857, df = 3, p-value = 0.0002926 <br /><br />> N <br />[1] 330 </i><br /><br /><b>APA Format: </b><br /><br />A chi-square test of independence was performed to examine the relation between occupational role and job satisfaction. The relation between these variables was found to be significant at the p < .05 level, χ2 (3, N = 330) = 18.86, p < .001.<br /><br /><b><u>Tukey HSD </u></b><br /><br /><b>Template: </b><br /><br />Post hoc comparisons using the Tukey HSD test indicated that the mean score for the <b>CONDITION A </b>(M = <b>Mean1</b>, SD = <b>Standard Deviation1</b>) was significantly different than <b>CONDITION B</b> (M = <b>Mean2,</b> SD = <b>Standard Deviation2</b>), p = <b>p-value</b>. </div><br /><b><u>Analysis of Variance (ANOVA)</u></b><br /><b><br />(One Way) <br /><br />Template: </b><div><br />There was a significant effect of the <b>CATEGORY</b> on the <b>OUTCOME</b> for <b>SCENARIO</b> at the p < .05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(1)</b>, <b>Degrees of Freedom(2)</b>) =<b> F Value</b>, p = <b>p - value</b>). <br /><br /><b>- OR - </b><br /><br />There was not a significant effect of the <b>CATEGORY</b> on the <b>OUTCOME</b> for <b>SCENARIO</b> at the p < .
05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(1)</b>, <b>Degrees of Freedom(2)</b>) =<b> F Value</b>, p = <b>p - value</b>).<br /><br /><b>Example:</b><br /><br />A chef wants to test if patrons prefer a soup which he prepares based on salt content. He prepares a limited experiment in which he creates three types of soup: soup with a low amount of salt, soup with a high amount of salt, and soup with a medium amount of salt. He then serves this soup to his customers and asks them to rate their satisfaction on a scale from 1-8.<br /><br />Low Salt Soup is rated: 4, 1, 8<br />Medium Salt Soup is rated: 4, 5, 3, 5<br />High Salt Soup is rated: 3, 2, 5 <br /><br />(Assume a 95% Confidence Interval) <br /><b><br /># Code # <br /><br />satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)<br /> <br /> salt <- c(rep("low",3), rep("med",4), rep("high",3))<br /> <br /> salttest <- data.frame(satisfaction, salt)<br /> <br /> results <- aov(satisfaction~salt, data=salttest)<br /><br />summary(results) <br /><br /># Console Output #</b><br /><br /> Df Sum Sq Mean Sq F value Pr(>F)<br />salt <b>2 </b> 1.92 0.958 <b>0.209 0.816</b><br />Residuals <b>7 </b>32.08 4.583 <br /><br /><b>APA Format: </b><br /><b><br /></b>There was not a significant effect of the level of salt content on patron satisfaction at the p < .05 level for the three conditions (F(2, 7) = 0.21, p = .82). <br /><br /><b>(Two Way) <br /><br />Template: <br /><br />Hypothesis 1:</b><br /><br />There was a significant effect of the <b>CATEGORY</b> on the <b>OUTCOME</b> for <b>SCENARIO</b> at the p < .05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(1)</b>, <b>Degrees of Freedom(2)</b>) =<b> F Value</b>, p = <b>p - value</b>).<br /><br /><b>- OR - </b><br /><br />There was not a significant effect of the <b>CATEGORY</b> on the <b>OUTCOME</b> for <b>SCENARIO</b> at the p < .
05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(1)</b>, <b>Degrees of Freedom(2)</b>) =<b> F Value</b>, p = <b>p - value</b>).<br /><div><br /><b>Hypothesis 2: </b><br /><br />There was a significant effect of the <b>CATEGORY2</b> on the <b>OUTCOME</b> for <b>SCENARIO</b> at the p < .05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(2)</b>, <b>Degrees of Freedom(4)</b>) =<b> F Value</b>, p = <b>p - value</b>).<br /><br /><b>- OR - </b><br /><br />There was not a significant effect of the <b>CATEGORY2</b> on the <b>OUTCOME</b> for <b>SCENARIO</b> at the p < .05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(2)</b>, <b>Degrees of Freedom(4)</b>) =<b> F Value</b>, p = <b>p - value</b>).<br /><br /><b>Hypothesis 3: </b><br /><br />There was a statistically significant interaction effect of the <b>CATEGORY1</b> on the <b>CATEGORY2</b> at the p < .05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(3)</b>, <b>Degrees of Freedom(4)</b>) = <b>F Value</b>, p = <b>p - value</b>).<br /><br /><b>- OR - </b><br /><br />There was not a statistically significant interaction effect of the <b>CATEGORY1</b> on the <b>CATEGORY2</b> at the p < .05 level for the <b>NUMBER OF CONDITIONS</b> (F(<b>Degrees of Freedom(3)</b>, <b>Degrees of Freedom(4)</b>) = <b>F Value</b>, p = <b>p - value</b>). </div><div><br /></div><div><b>Example:</b><br /><br />Researchers want to test study habits within two schools as they pertain to student life satisfaction. The researchers also believe that the school that each group of students attends may also have an impact on study habits. Students from each school are assigned study material totaling 1 hour, 2 hours, or 3 hours on a daily basis. The satisfaction of each student group is then measured on a scale from 1-10 after a 1 month duration. 
<br /><br />(Assume a 95% Confidence Interval) <br /><br />School A:<br /><br />1 Hour of Study Time: 7, 2, 10, 2, 2<br />2 Hours of Study Time: 9, 10, 3, 10, 8<br />3 Hours of Study Time: 3, 6, 4, 7, 1<br /><br />School B:<br /><br />1 Hour of Study Time: 8, 5, 1, 3, 10<br />2 Hours of Study Time: 7, 5, 6, 4, 10<br />3 Hours of Study Time: 5, 5, 2, 2, 2<br /><b><br />satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)<br /> <br /> studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))<br /> <br /> school <- c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))<br /> <br /> schooltest <- data.frame(satisfaction, studytime, school)<br /> <br /> results <- aov(lm(satisfaction ~ studytime * school, data=schooltest))<br /> <br /> summary(results)</b></div><br />Which produces the output: <br /><br /><i> Df Sum Sq Mean Sq F value Pr(>F) <br /> studytime <b>2 </b> 62.6 31.300 <b>3.809 0.0366 *</b><br /> school <b>1</b> 2.7 2.700 <b>0.329 0.5718 </b><br /> studytime:school <b>2 </b> 7.8 3.900 <b>0.475 0.6278 </b><br /> Residuals <b>24</b> 197.2 8.217 <br /> ---<br /> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</i><br /><div><i><br /></i></div><div><b>APA Format: </b><br /><br />There was a significant effect as it pertains to study time impacting student satisfaction at the p < .05 level for the three conditions (F(2, 24) = 3.81, p = .04). <br /><br />There was not a significant effect as it relates to the school attended impacting student satisfaction at the p < .05 level for the two conditions (F(1, 24) = 0.33, p = .57). 
<br /><br />There was not a statistically significant interaction effect of the school variable on the study time variable at the p < .05 level (F(2, 24) = 0.475, p > .05).</div><br /><b>TukeyHSD(results)</b><br /><b><br /></b><i>> TukeyHSD(results)<br /> Tukey multiple comparisons of means<br /> 95% family-wise confidence level<br /><br />Fit: aov(formula = lm(satisfaction ~ studytime * school, data = schooltest))<br /><br />$studytime<br /> diff lwr upr p adj<br />Three Hours-One Hour -1.3 -4.5013364 1.901336 0.5753377<br />Two Hours-One Hour 2.2 -1.0013364 5.401336 0.2198626<br /><b>Two Hours-Three Hours 3.5 0.2986636 6.701336 0.0302463</b><br /><br />$school<br /> diff lwr upr p adj<br />SchoolB-SchoolA -0.6 -2.760257 1.560257 0.571817<br /><br />$`studytime:school`<br /><br /> diff lwr upr p adj<br />Three Hours:SchoolA-One Hour:SchoolA -0.4 -6.005413 5.2054132 0.9999178<br />Two Hours:SchoolA-One Hour:SchoolA 3.4 -2.205413 9.0054132 0.4401459<br />One Hour:SchoolB-One Hour:SchoolA 0.8 -4.805413 6.4054132 0.9976117<br />Three Hours:SchoolB-One Hour:SchoolA -1.4 -7.005413 4.2054132 0.9696463<br />Two Hours:SchoolB-One Hour:SchoolA 1.8 -3.805413 7.4054132 0.9157375<br />Two Hours:SchoolA-Three Hours:SchoolA 3.8 -1.805413 9.4054132 0.3223867<br />One Hour:SchoolB-Three Hours:SchoolA 1.2 -4.405413 6.8054132 0.9844928<br />Three Hours:SchoolB-Three Hours:SchoolA -1.0 -6.605413 4.6054132 0.9932117<br />Two Hours:SchoolB-Three Hours:SchoolA 2.2 -3.405413 7.8054132 0.8260605<br />One Hour:SchoolB-Two Hours:SchoolA -2.6 -8.205413 3.0054132 0.7067715<br />Three Hours:SchoolB-Two Hours:SchoolA -4.8 -10.405413 0.8054132 0.1240592<br />Two Hours:SchoolB-Two Hours:SchoolA -1.6 -7.205413 4.0054132 0.9470847<br />Three Hours:SchoolB-One Hour:SchoolB -2.2 -7.805413 3.4054132 0.8260605<br />Two Hours:SchoolB-One Hour:SchoolB 1.0 -4.605413 6.6054132 0.9932117<br />Two Hours:SchoolB-Three Hours:SchoolB 3.2 -2.405413 8.8054132 0.5052080</i><br /><br /><b>twohours <- c(9, 10, 3, 
10, 8, 7, 5, 6, 4, 10)<br />threehours <- c(3, 6, 4, 7, 1, 5, 5, 2, 2, 2)<br /><br />mean(twohours)<br />sd(twohours)<br /><br />mean(threehours)<br />sd(threehours)</b><br /><br /><i>> mean(twohours)<br />[1] 7.2<br />> sd(twohours)<br />[1] 2.616189<br />> <br />> mean(threehours)<br />[1] 3.7<br />> sd(threehours)<br />[1] 2.002776</i><br /><br /><b>APA Format:</b><br /><br />Post hoc comparisons using the Tukey HSD test indicated that at the p < .05 level, the mean satisfaction score of the students who studied for Two Hours (M = 7.20, SD = 2.62), was significantly different as compared to the scores of the students who studied for Three Hours (M = 3.70, SD = 2.00), p = .03. <br /><br /><b>(Repeated Measures)</b><br /><br /><b>Example: </b><br /><br />Researchers want to test the impact of reading existential philosophy on a group of 8 individuals. They measure the happiness of the participants three times, once prior to reading, once after reading the materials for one week, and once after reading the materials for two weeks. We will assume an alpha of .05.<br /><br />Before Reading = 1, 8, 2, 4, 4, 10, 2, 9<br />After Reading = 4, 2, 5, 4, 3, 4, 2, 1<br />After Reading (wk. 
2) = 5, 10, 1, 1, 4, 6, 1, 8 <br /><b><br />library(nlme) # You will need to install and enable this package; the lme() function below is provided by nlme #<br /> <br /> happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8 )<br /> <br /> week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))<br /> <br /> id <- c(1,2,3,4,5,6,7, 8)<br /> <br /> survey <- data.frame(id, happiness, week)<br /> <br /> model <- lme(happiness ~ week, random=~1|id, data=survey)<br /> <br /> anova(model)</b> <br /><b><br /></b>This method saves some time by producing the output: <br /><br /><i> numDF denDF F-value p-value<br /> (Intercept) <b> 1</b> <b>14</b> 37.21053 <.0001<br /> week 2 14 <b>1.04624 </b> <b> 0.3772 </b></i><br /><br /><b>APA Format:</b><br /><br />There was not a significant effect of the reading material on participant happiness at the p < .05 level for the three conditions (F(2, 14) = 1.05, p = .38).<br /><div><br /></div><div><b><u>Student’s T-Test </u></b><br /><br /><b>(One Sample T-Test) </b><br /><b><br /></b><b>Template: <br /><br />(Right Tailed) </b><br /><br />There was a significant increase in the <b>GROUP A </b>(M = <b>Mean of GROUP A</b>, SD = <b>Standard Deviation of GROUP A</b>), as compared to the historically assumed mean (M = <b>Historic Mean Value</b>); t(<b>Degrees of Freedom</b>) = <b>t-value</b>, p = <b>p-value</b>. <br /><br /><b>- OR - </b><br /><br />There was not a significant increase in the <b>GROUP A </b>(M = <b>Mean of GROUP A</b>, SD = <b>Standard Deviation of GROUP A</b>), as compared to the historically assumed mean (M = <b>Historic Mean Value</b>); t(<b>Degrees of Freedom</b>) = <b>t-value</b>, p = <b>p-value</b>.<br /><b><br />Example:</b><br /><br />A factory employee believes that the cakes produced within his factory are being manufactured with excess amounts of corn syrup, thus altering the taste. 
10 cakes were sampled from the most recent batch and tested for corn syrup composition. Typically, each cake should be composed of 20% corn syrup. Utilizing a 95% confidence interval, can we assume that the new batch of cakes contains more than a 20% proportion of corn syrup?<br /><br />The levels of the samples were:<br /><br />.27, .31, .27, .34, .40, .29, .37, .14, .30, .20<br /><br /><b>N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)<br /><br /> </b><br /><b>t.test(N, alternative = "greater", mu = .2, conf.level = 0.95)</b><br /><b><br /></b><b># " alternative = " Specifies the type of test that R will perform. "greater" indicates a right tailed test. "less" indicates a left tailed test. "two.sided" indicates a two tailed test. #</b><br /><b><br /></b><i>One Sample t-test<br /> <br /> data: N<br /> t = 3.6713, df = 9, p-value = 0.002572<br /> alternative hypothesis: true mean is greater than 0.2<br /> 95 percent confidence interval:<br /> 0.244562 Inf<br /> sample estimates:<br /> mean of x <br /> 0.289</i><br /><br /><b>mean(N)<br />sd(N)</b><br /><i><br />> mean(N)<br />[1] 0.289<br />> <br />> sd(N)<br />[1] 0.07665942</i><br /><i><br /></i><b>APA Format:</b><br /><br />A one sample t-test was conducted to compare the level of corn syrup in the current sample batch of cakes, to the assumed historical level of corn syrup contained within previously manufactured cakes. 
<br /><br />There was a significant increase in the amount of corn syrup in the recent batch of cakes (M = .29, SD = .08), as compared to the historically assumed mean (M = .20); t(9) = 3.67, p = .003.<br /><br /><div style="font-stretch: normal; line-height: normal;"><span style="font-kerning: none;"><b>(Two Sample T-Test)</b></span></div><br /><b>Template:</b><br /><br /><b>(Two Tailed)</b><br /><b><br /></b>There was a significant difference in the <b>GROUP A</b> (M = <b>Mean of GROUP A</b>, SD = <b>Standard Deviation of GROUP A</b>), as compared to the <b>GROUP B</b> (M = <b>Mean of GROUP B</b>, SD = <b>Standard Deviation of GROUP B</b>), t(<b>Degrees of Freedom</b>) = <b>t-value</b>, p = <b>p-value</b>.<br /><br /><b>-OR-</b><br /><br />There was not a significant difference in the <b>GROUP A</b> (M = <b>Mean of GROUP A</b>, SD = <b>Standard Deviation of GROUP A</b>), as compared to the <b>GROUP B</b> (M = <b>Mean of GROUP B</b>, SD = <b>Standard Deviation of GROUP B</b>), t(<b>Degrees of Freedom</b>) = <b>t-value</b>, p = <b>p-value</b>.<br /><br /><b>Example:</b><br /><br />A scientist creates a chemical which he believes changes the temperature of water. 
He applies this chemical to water and takes the following measurements:<br /><br />70, 74, 76, 72, 75, 74, 71, 71<br /><br />He then measures the temperature of samples to which the chemical was not applied.<br /><br />74, 75, 73, 76, 74, 77, 78, 75<br /><br />Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?<br /><br /><b> N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)<br /> <br /> N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)<br /> <br /> t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)</b><br /><br /><i>Two Sample t-test<br /><br />data: N2 and N1<br />t = 2.4558, df = 14, p-value = 0.02773<br />alternative hypothesis: true difference in means is not equal to 0<br />95 percent confidence interval:<br /> 0.3007929 4.4492071<br />sample estimates:<br />mean of x mean of y <br /> 75.250 72.875 </i><br /><br /><b>mean(N1)<br /><br />sd(N1)<br /><br />mean(N2)<br /><br />sd(N2)</b><br /><i><br />> mean(N1) <br />[1] 72.875 <br />> <br />> sd(N1) <br />[1] 2.167124 <br />> <br />> mean(N2) <br />[1] 75.25 <br />> <br />> sd(N2) <br />[1] 1.669046</i></div><div><i><br /></i></div><div><b>APA Format:</b></div><br />A two sample t-test was conducted to compare the temperature of the water to which the chemical was applied, to the temperature of the water to which the chemical was not applied.<br /><br />There was a significant difference in the temperature of the water to which the chemical was applied (M = 72.88, SD = 2.17), as compared to the temperature of the water to which the chemical was not applied (M = 75.25, SD = 1.67); t(14) = 2.46, p = .03.<br /><br /><b>(Paired T-Test) <br /><br />Template: <br /><br />(Right Tailed) </b><br /><br />There was a significant increase in the <b>GROUP A</b> (M = <b>Mean of GROUP A</b>, SD = <b>Standard Deviation of GROUP A</b>), as compared to the <b>GROUP B</b> (M = <b>Mean of GROUP B</b>, SD = <b>Standard Deviation of 
GROUP B</b>), t(<b>Degrees of Freedom</b>) = <b>t-value</b>, p = <b>p-value.</b> <br /><br /><b>- OR - </b><br /><br />There was not a significant increase in the <b>GROUP A</b> (M = <b>Mean of GROUP A</b>, SD = <b>Standard Deviation of GROUP A</b>), as compared to the <b>GROUP B</b> (M = <b>Mean of GROUP B</b>, SD = <b>Standard Deviation of GROUP B</b>), t(<b>Degrees of Freedom</b>) = <b>t-value</b>, p = <b>p-value.</b><br /><br /><b>Example: </b><br /><br />A watch manufacturer believes that by changing to a new battery supplier, the watches that are shipped with an initial battery included will maintain a longer lifespan. To test this theory, twelve watches are tested for duration of lifespan with the original battery.<br /><br />The same twelve watches are then re-tested for duration with the new battery.<br /><br />Can the watch manufacturer conclude that the new battery increases the duration of lifespan for the manufactured watches? (We will assume an alpha value of .05).<br /><br />For this, we will utilize the code:<br /><br /><b> N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)<br /> N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)<br /> <br /> t.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )</b><br /><i><br />Paired t-test<br /> <br /> data: N2 and N1<br /> t = 2.4581, df = 11, p-value = 0.01589<br /> alternative hypothesis: true difference in means is greater than 0<br /> 95 percent confidence interval:<br /> 12.32551 Inf<br /> sample estimates:<br /> mean of the differences <br /> 45.75 </i><br /><b><br />mean(N1) <br />sd(N1) <br /><br />mean(N2) <br />sd(N2)</b><br /><i><br />> mean(N1) <br />[1] 325.3333 <br />> <br />> sd(N1) <br />[1] 56.84642 <br />> <br />> mean(N2) <br />[1] 371.0833 <br />> <br />> sd(N2) <br />[1] 51.22758</i><br /><div><i><br /></i></div><div><b>APA Format:</b><i><br /></i><br /><br />A paired t-test was conducted to compare the lifespan duration of watches which 
contained the new battery, to the lifespan of watches which contained the initial battery. <br /><br />There was a significant increase in the lifespan duration of watches which contained the new battery (M = 371.08, SD = 51.23), as compared to the lifespan of watches which contained the initial battery (M = 325.33, SD = 56.85); t(11) = 2.46, p = .02.<br /><br /><b><u>Regression Models </u></b><br /><br /><b>Example:</b><br /><br /><b>(Standard Regression Model) <br /><br />x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)<br /> y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)<br /> z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25) <br /><br />multiregress <- lm(y ~ x + z) <br /><br />summary(multiregress) </b><br /><br /><i>Call:<br /> lm(formula = y ~ x + z)<br /> <br /> Residuals:<br /> Min 1Q Median 3Q Max <br /> -6.4016 -5.0054 -1.7536 0.8713 14.0886 <br /> <br /> Coefficients:<br /> Estimate Std. Error t value Pr(>|t|) <br /> (Intercept) 47.1434 12.0381 3.916 0.00578 **<br /> x 0.7808 0.3316 2.355 0.05073 . <br /> z 0.3990 0.4804 0.831 0.43363 <br /> ---<br /> Residual standard error: 7.896 on 7 degrees of freedom<br /> Multiple R-squared: 0.5249, Adjusted R-squared: 0.3891 <br /> F-statistic: 3.866 on 2 and 7 DF, p-value: 0.07394 </i></div><div><br /></div><div><b>APA Format:</b><br /><br />A linear regression model was utilized to test if variables “x” and “z” significantly predicted outcomes within the observations of “y” included within the sample data set. The results indicated that while “x” (B = .78, p = .05) approached significance as a predictor variable, the overall model was not significant (F(2, 7) = 3.87, p = .07), and its predictive capacity was modest (R2 = .52, adjusted R2 = .39). 
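Since every value cited in an APA write-up appears somewhere within R’s model output, it can be safer to extract those values programmatically rather than re-typing them from the console, which avoids rounding and transcription slips. Below is a minimal sketch using the same sample data; the variable names (b_x, p_model, etc.) are my own choices, not part of base R:

```r
# Re-create the sample data and model from the example above #
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

multiregress  <- lm(y ~ x + z)
model_summary <- summary(multiregress)

# Coefficient table: estimates, standard errors, t values, p values #
coef_table <- coef(model_summary)
b_x <- round(coef_table["x", "Estimate"], 2)    # 0.78
p_x <- round(coef_table["x", "Pr(>|t|)"], 3)    # 0.051

# Model-level statistics for the APA write-up #
r2     <- round(model_summary$r.squared, 2)      # 0.52
adj_r2 <- round(model_summary$adj.r.squared, 2)  # 0.39
f_stat <- model_summary$fstatistic               # value, numdf, dendf

# Overall model p-value (not stored directly; derived from the F statistic) #
p_model <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"], lower.tail = FALSE)

cat("B =", b_x, ", p =", p_x, ", R2 =", r2, ", adjusted R2 =", adj_r2,
    ", F(", f_stat["numdf"], ",", f_stat["dendf"], ") =", round(f_stat["value"], 2),
    ", p =", round(p_model, 2), "\n")
```

The rounded values can then be pasted directly into the templates shown throughout this article.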
<br /><b><br />(Non-Standard Regression Model) </b></div><div><b><br /></b></div><div><b>Example:</b><br /><b><br /></b><b># Model Creation # <br /> <br /> Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26) <br /> <br /> Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0) <br /> <br /> Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1) <br /> <br /> Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0) <br /><br /># Summary Creation and Output # <br /><br /> CancerModelLog <- glm(Cancer~ Age + Obese + Smoking, family=binomial) <br /> <br /> summary(CancerModelLog) <br /><br /># Output #</b><br /><div><i><br /> Call: <br /><br />glm(formula = Cancer ~ Age + Obese + Smoking, family = binomial) <br /> <br /> Deviance Residuals: <br /> Min 1Q Median 3Q Max <br /> -1.6096 -0.7471 0.5980 0.8260 1.8485 <br /> <br /> Coefficients: <br /> Estimate Std. Error z value Pr(>|z|) <br /> (Intercept) -2.34431 2.25748 -1.038 0.2991 <br /> Age 0.02984 0.04055 0.736 0.4617 <br /> Obese -0.38924 1.39132 -0.280 0.7797 <br /> Smoking 2.54387 1.53564 1.657 0.0976 . <br /> --- <br /> Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 <br /> <br /> (Dispersion parameter for binomial family taken to be 1) <br /> <br /> Null deviance: 20.728 on 14 degrees of freedom <br /> Residual deviance: 16.807 on 11 degrees of freedom <br /> AIC: 24.807 <br /> Number of Fisher Scoring iterations: 4 </i><br /><b># Generate Nagelkerke R Squared # <br /> <br /> # Download and Enable Package: "BaylorEdPsych" # <br /> <br /> PseudoR2(CancerModelLog) <br /><br /># Console Output # </b><br /><br /><i> McFadden Adj.McFadden Cox.Snell <b>Nagelkerke </b> McKelvey.Zavoina Effron Count<br /> 0.2328838 -0.2495624 0.2751639 <b>0.3674311 </b> 0.3477522 0.3042371 0.8000000 <br /> Adj.Count AIC Corrected.AIC <br /> 0.5714286 23.9005542 27.9005542 </i><br /><br /><b>APA Format:</b><br /><br />A logistic regression model was utilized to test if a model containing the variables “Age”, “Smoking Status”, and “Obesity”, could predict Cancer outcomes as it pertains to the individuals included within the sample data set. The results indicated that the model does not possess a worthwhile predictive capacity (Nagelkerke R-Square = .37). </div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-81495046845101102232019-07-20T12:56:00.000-04:002021-04-15T18:05:52.776-04:00How to Make Beautiful Visuals (MS-Excel)I am aware that this subject matter may be considered to be very basic. However, as a data scientist, it is not entirely uncommon that the end result of many of your research endeavors will, in some way or another, require the creation of a presentation of findings.<br /><br />This, of course, inevitably leads to the utilization of PowerPoint, which will, almost as a prerequisite, require the utilization of Excel.<br /><br />Therefore, in today’s article, we will review instructions as they relate to the creation of visual outputs as enabled by MS-Excel. 
<br /><br />To illustrate this concept, I have created an example worksheet. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-FvEbRn7JFH0/XTNEYfVXo7I/AAAAAAAABLk/CYwnHV3ghwg55pTACXnOEwYFwKTIcwQLwCLcBGAs/s1600/0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="69" data-original-width="519" src="https://1.bp.blogspot.com/-FvEbRn7JFH0/XTNEYfVXo7I/AAAAAAAABLk/CYwnHV3ghwg55pTACXnOEwYFwKTIcwQLwCLcBGAs/s1600/0.png" /></a></div><div><br /></div>This worksheet can be found within this website’s GitHub Repository. <br /><br /><b><u>Basic Column Chart </u></b><br /><br />For our scenario, we’ll assume that your goal is to create an attractive column chart as it relates to the above data. Utilizing the <b>“Insert” </b>ribbon option, after highlighting the data,<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-EbpmLtRcr6A/XTNEdWoUWkI/AAAAAAAABLo/lR_XgzkeQKgUxWBtqorM58XlCTeQLJg5gCLcBGAs/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="94" data-original-width="543" src="https://1.bp.blogspot.com/-EbpmLtRcr6A/XTNEdWoUWkI/AAAAAAAABLo/lR_XgzkeQKgUxWBtqorM58XlCTeQLJg5gCLcBGAs/s1600/1.png" /></a></div><div><br /></div>and subsequently selecting the top leftmost menu selection button,<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-dfJG2DU_d-E/XTNEio-yRRI/AAAAAAAABLs/p5meI4nTLiYXkBFyyJe7KoDEiWYIA1SxgCLcBGAs/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="425" data-original-width="438" height="387" src="https://1.bp.blogspot.com/-dfJG2DU_d-E/XTNEio-yRRI/AAAAAAAABLs/p5meI4nTLiYXkBFyyJe7KoDEiWYIA1SxgCLcBGAs/s400/3.png" width="400" /></a></div><div><br /></div>presents us with a rather uninspiring graphical 
depiction of the underlying data.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-VuMFbXMUz7w/XTNFP0e2SuI/AAAAAAAABMU/vHnBanv7I58npgLBQBN7P1wb36Zv_3EtwCLcBGAs/s1600/4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="289" data-original-width="481" height="240" src="https://1.bp.blogspot.com/-VuMFbXMUz7w/XTNFP0e2SuI/AAAAAAAABMU/vHnBanv7I58npgLBQBN7P1wb36Zv_3EtwCLcBGAs/s400/4.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Let’s make this graphic look a bit better visually. <br /><br />First, we’ll make the columns more attractive by changing their texture. <br /><br />This can be achieved by clicking on the column portion of the graphic.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-htzAmQ1Jvc0/XTNEoI3dn8I/AAAAAAAABMA/3_K4JBkBXvs_NVr7tYOILTnfzk8g1RN_ACEwYBhgL/s1600/5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="309" data-original-width="537" height="230" src="https://1.bp.blogspot.com/-htzAmQ1Jvc0/XTNEoI3dn8I/AAAAAAAABMA/3_K4JBkBXvs_NVr7tYOILTnfzk8g1RN_ACEwYBhgL/s400/5.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Next, click on the<b> “Format” </b>option within the ribbon menu. 
<br /><br />From the many sub-menu selections, click <b>“Shape Effects”</b>, followed by <b>“Bevel”</b>, subsequently followed by <b>“Circle”</b>.<div> <div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-dfJG2DU_d-E/XTNEio-yRRI/AAAAAAAABMI/aUBxp_zIjwIMtbrR4RFU9jsUsKJuG24OgCEwYBhgL/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="425" data-original-width="438" height="310" src="https://1.bp.blogspot.com/-dfJG2DU_d-E/XTNEio-yRRI/AAAAAAAABMI/aUBxp_zIjwIMtbrR4RFU9jsUsKJuG24OgCEwYBhgL/s320/3.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>The result should resemble the following:</div><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-KgGfnpZvvDg/XTNExk3hFCI/AAAAAAAABMM/9FWiIpjs7fs20CCecTPXbmXzOY3zDeLnACEwYBhgL/s1600/7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="318" data-original-width="555" height="227" src="https://1.bp.blogspot.com/-KgGfnpZvvDg/XTNExk3hFCI/AAAAAAAABMM/9FWiIpjs7fs20CCecTPXbmXzOY3zDeLnACEwYBhgL/s400/7.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Next, I would advise adding data labels. To achieve this, right click on any of the columns within the chart. 
<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-2iHuPRpw6l4/XTNE4aVh8yI/AAAAAAAABMQ/RLGNC1UQ4kc9ReFVeNaRPxc35-FTf-mlwCEwYBhgL/s1600/8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="297" data-original-width="370" height="320" src="https://1.bp.blogspot.com/-2iHuPRpw6l4/XTNE4aVh8yI/AAAAAAAABMQ/RLGNC1UQ4kc9ReFVeNaRPxc35-FTf-mlwCEwYBhgL/s400/8.png" width="400" /></a></div><div><br /></div>From the drop down menu, select <b>“Add Data Labels”</b>, followed by <b>“Add Data Labels”</b>. <br /><br />The result is a much more informative graphic. <br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-0EWcdPDZYBE/XTNFsoROe8I/AAAAAAAABMc/DYRRcNkB0mAoEUQB1Vsne2sbtBCD0fgOQCLcBGAs/s1600/9.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="313" data-original-width="540" height="231" src="https://1.bp.blogspot.com/-0EWcdPDZYBE/XTNFsoROe8I/AAAAAAAABMc/DYRRcNkB0mAoEUQB1Vsne2sbtBCD0fgOQCLcBGAs/s400/9.png" width="400" /></a></div><br />However, for the sake of our example, we’ll assume that the axis needs to be modified so that the scale depicted measures from 0.00 – 4.00. <br /><br />Select the graph’s axis by first left clicking the axis portion of the graphic. <br /><br />Next, to modify the axis, right click on the selected axis. From the menu which appears, select <b>“Format Axis”</b>. 
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-45DAzENCRM0/XTNFyGzLiZI/AAAAAAAABMg/SC4ji-Sj52sBgCSsxdADsFS4MgocboHYACLcBGAs/s1600/10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="318" data-original-width="209" src="https://1.bp.blogspot.com/-45DAzENCRM0/XTNFyGzLiZI/AAAAAAAABMg/SC4ji-Sj52sBgCSsxdADsFS4MgocboHYACLcBGAs/s1600/10.png" /></a></div><div><br /></div>From the grey menu which appears on the right side of the screen, enter the axis values which you feel are most appropriate for the graphic.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uWOrNgg4b0U/XTNF2_ezZEI/AAAAAAAABMk/QvElRrW23uUYKJhd9VHjnmekd5wIj-pkACLcBGAs/s1600/11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="218" data-original-width="279" src="https://1.bp.blogspot.com/-uWOrNgg4b0U/XTNF2_ezZEI/AAAAAAAABMk/QvElRrW23uUYKJhd9VHjnmekd5wIj-pkACLcBGAs/s1600/11.png" /></a></div><br />Finally, to make our graph extra eye-catching, we will copy it from the Excel workbook where it is currently located, and paste it into our PowerPoint template. <br /><br />However, when pasting, we will be sure to select, from the options available upon right clicking the slide, <b>“Use Destination Theme & Embed Workbook (H)”</b>. 
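As an aside, everything above, from building the column chart to fixing the axis scale, can also be done in code rather than through the ribbon. Below is a minimal sketch using the third-party openpyxl Python library; openpyxl is an assumption on my part (the post itself is purely GUI-based), and the category values are illustrative rather than the worksheet's actual data.

```python
# Minimal sketch: build a column chart with a fixed 0.00 - 4.00 value axis.
# Assumes openpyxl is installed (pip install openpyxl); data is illustrative.
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

wb = Workbook()
ws = wb.active

# Header row plus a few hypothetical categories and values
for row in [("Category", "Value"), ("A", 1.5), ("B", 2.25), ("C", 3.75)]:
    ws.append(row)

chart = BarChart()
chart.type = "col"  # vertical columns, matching the post's chart

data = Reference(ws, min_col=2, min_row=1, max_row=4)  # values, header included
cats = Reference(ws, min_col=1, min_row=2, max_row=4)  # category labels
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)

# Fix the value axis to the 0.00 - 4.00 range discussed above
chart.y_axis.scaling.min = 0
chart.y_axis.scaling.max = 4

ws.add_chart(chart, "E2")
wb.save("column_chart_example.xlsx")
```

The saved workbook can then be opened in Excel, and the chart copied into PowerPoint exactly as described above.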
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-_HdbK8p6T1s/XTNF70-fmDI/AAAAAAAABMo/JPiHfabrSWMlS-nPNVhkQRA38xN_ezQJgCLcBGAs/s1600/12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="103" data-original-width="276" height="147" src="https://1.bp.blogspot.com/-_HdbK8p6T1s/XTNF70-fmDI/AAAAAAAABMo/JPiHfabrSWMlS-nPNVhkQRA38xN_ezQJgCLcBGAs/s400/12.png" width="400" /></a></div><div><br /></div>In the case of our example, the final product resembles the following:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-BkNXDahz3do/XTNGBPS2r_I/AAAAAAAABMw/kjMEdP-Gjf4oKxM1AqWQWXvyrkEUJxKuACLcBGAs/s1600/13.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="234" data-original-width="384" height="242" src="https://1.bp.blogspot.com/-BkNXDahz3do/XTNGBPS2r_I/AAAAAAAABMw/kjMEdP-Gjf4oKxM1AqWQWXvyrkEUJxKuACLcBGAs/s400/13.png" width="400" /></a></div><div><br /></div><b><u>Basic 2-D Line Chart </u></b><br /><br />To create a 2-D line chart from the same data, we will again highlight the data, click on the <b>"Insert" </b>ribbon, and select the left topmost option.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-OJTpgCULDQU/XTNGHX_l-AI/AAAAAAAABM4/dRiJQPO2tfoACbN36bW9AQj45V7uLN7uQCLcBGAs/s1600/14.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="465" data-original-width="389" height="400" src="https://1.bp.blogspot.com/-OJTpgCULDQU/XTNGHX_l-AI/AAAAAAAABM4/dRiJQPO2tfoACbN36bW9AQj45V7uLN7uQCLcBGAs/s400/14.png" width="332" /></a></div><div><br /></div>This will present a rather uninspiring graphical depiction of the underlying data.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://1.bp.blogspot.com/-DwM4iD897k4/XTNGaUiEPRI/AAAAAAAABNM/1ehY7R6F8BEVoI5eFuvRZS44KobECO47ACLcBGAs/s1600/15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="339" data-original-width="580" height="231" src="https://1.bp.blogspot.com/-DwM4iD897k4/XTNGaUiEPRI/AAAAAAAABNM/1ehY7R6F8BEVoI5eFuvRZS44KobECO47ACLcBGAs/s400/15.png" width="400" /></a></div><div><br /></div>Let’s add some points to our graph to increase its descriptive capacity. This can be achieved by clicking on the line itself, then right clicking to display the following menu. From this menu select <b>“Format Data Series”</b>. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-JbXxgDzlePY/XTNGmhkj_wI/AAAAAAAABNQ/BaGNzJjNJxAzefT_lLPTvweysf_0iYXWwCLcBGAs/s1600/16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="295" data-original-width="282" src="https://1.bp.blogspot.com/-JbXxgDzlePY/XTNGmhkj_wI/AAAAAAAABNQ/BaGNzJjNJxAzefT_lLPTvweysf_0iYXWwCLcBGAs/s1600/16.png" /></a></div><div><br />With the <b>“Marker” </b>option selected, you are granted the ability to select the type of point, and the size of the point, which you would prefer to be implemented. 
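For readers who would rather script this step, the marker settings can also be applied programmatically; here is a small sketch using the third-party openpyxl library (an assumption on my part, as the post works entirely within the Excel interface), with illustrative data.

```python
# Minimal sketch: a 2-D line chart with circular markers on each point.
# Assumes openpyxl is installed (pip install openpyxl); data is illustrative.
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference
from openpyxl.chart.marker import Marker

wb = Workbook()
ws = wb.active
for row in [("Category", "Value"), ("A", 1.5), ("B", 2.25), ("C", 3.75)]:
    ws.append(row)

chart = LineChart()
data = Reference(ws, min_col=2, min_row=1, max_row=4)
chart.add_data(data, titles_from_data=True)
chart.set_categories(Reference(ws, min_col=1, min_row=2, max_row=4))

# Equivalent of choosing a circular point type and size in the "Marker" dialog
chart.series[0].marker = Marker(symbol="circle", size=7)

ws.add_chart(chart, "E2")
wb.save("line_chart_example.xlsx")
```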
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-jBSLaGKafEw/XTNGsD48ZEI/AAAAAAAABNU/YUbhle82ZaoBKIjcF2yo_16npN85aiXqQCLcBGAs/s1600/16a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="334" data-original-width="541" height="246" src="https://1.bp.blogspot.com/-jBSLaGKafEw/XTNGsD48ZEI/AAAAAAAABNU/YUbhle82ZaoBKIjcF2yo_16npN85aiXqQCLcBGAs/s400/16a.png" width="400" /></a></div><div><br /></div>The end result should resemble:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-RXhk833i6mM/XTNGyaSPXZI/AAAAAAAABNY/QxurC1nxbt0ZFRNf9omc_uXzJEkMmQ2rACLcBGAs/s1600/17.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="304" data-original-width="496" height="245" src="https://1.bp.blogspot.com/-RXhk833i6mM/XTNGyaSPXZI/AAAAAAAABNY/QxurC1nxbt0ZFRNf9omc_uXzJEkMmQ2rACLcBGAs/s400/17.png" width="400" /></a></div><div><br /></div>I already adjusted the axis. However, if you would prefer data labels and a templated format, please follow the prior portion of instructions within the previous example. <br /><br />That’s all for now. Stay studious, Data Heads!</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-19637337507331874292019-06-08T15:29:00.005-04:002021-04-15T18:05:52.885-04:00(Python) Joining Distinct Variable Cell Entries with Pandas Hey, Data Heads! I’m back from an extended hiatus with a quick article to demonstrate a very useful Pandas function.<br /><br />To understand this entry, you must first have some prior experience with both the Python programming language, and the Pandas Python library. 
If you are unfamiliar with either of the aforementioned topics, information and demonstrations related to such can be found within previous articles featured on this website.<br /><br />As you may recall from a much earlier article which discussed the SAS programming platform, the SAS language lacks a straightforward way to collapse a data set so that each distinct variable value appears only once, with all of the entries associated with that value combined into a single adjacent cell. In prior articles on this topic, I designed a series of macros to accomplish this; in Python, however, specifically through the utilization of the Pandas library, this task can be achieved through a single line of code.<br /><br /><u style="font-weight: bold;">Example:</u> <br /><br />We will begin by enabling the Pandas library, after which we will import the familiar data set "<b>SetA</b>" into the allocated memory.<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-BdKCwZng1lU/XPwLZftBckI/AAAAAAAABKY/LDIIkm8t41IJpDJzexOqOlezkFZu_17CgCLcBGAs/s1600/SetA.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="228" data-original-width="161" src="https://1.bp.blogspot.com/-BdKCwZng1lU/XPwLZftBckI/AAAAAAAABKY/LDIIkm8t41IJpDJzexOqOlezkFZu_17CgCLcBGAs/s1600/SetA.png" /></a></div><br />Just as a reminder, if you aren’t in the mood to input the .CSV cell entries yourself, this file, and all others, can be found within this website’s associated GitHub repository. 
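Before walking through the full workflow, here is a minimal, self-contained sketch of the technique using a few hypothetical inline rows (standing in for the SetA file). It also demonstrates why the string conversion performed below is necessary: the join function only accepts string elements.

```python
import pandas as pd

# Hypothetical inline data standing in for the SetA file
df = pd.DataFrame({"VARA": ["A", "A", "B", "B", "B"],
                   "DATAVAL": [1, 2, 1, 2, 3]})

# '|'.join only accepts strings, so joining the integer column raises TypeError
try:
    df.groupby("VARA")["DATAVAL"].apply("|".join)
except TypeError:
    print("join failed: DATAVAL must be converted to strings first")

# After converting the column to strings, the one-line groupby join succeeds
df["DATAVAL"] = df["DATAVAL"].astype(str)
result = df.groupby("VARA")["DATAVAL"].apply("|".join).reset_index()
print(result)  # one row per distinct VARA value: A -> "1|2", B -> "1|2|3"
```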
<br /><br /><b># Enable Pandas Package #<br /><br />import pandas<br /><br /># Specify the appropriate file path for import #<br /><br /># Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #<br /><br />filepath = "C:\\Users\\Desktop\\SetA.csv"<br /><br /># Create a variable to store the data #<br /><br />pandadataframe = pandas.read_csv(filepath)<br /><br /># Modify the column variable to the appropriate variable format and type #<br /><br />pandadataframe['VARA'] = pandadataframe['VARA'].astype('str')<br /><br />pandadataframe['DATAVAL'] = pandadataframe['DATAVAL'].astype('str')</b><br /><br />The function below, which serves as the method for generating the desired result, can only be utilized if all related variables referenced are of the "<b>string" </b>type. It is for this reason that the two lines of code above this description perform a variable type modification. This ensures that each variable referenced in the code below is a string type variable.<br /><b><br />pandadataframe = pandadataframe.groupby(['VARA'])['DATAVAL'].apply('|'.join).reset_index()</b><br /><b><br /></b>Once the above function has performed its task, we will then perform the print function in order to display the results of such.<br /><br /><b>print(pandadataframe) </b><br /><b><br /></b>Which displays the following output:<br /><br />VARA DATAVAL <br />0 A 1|2 <br />1 B 1|2|3 <br />2 E 1|2|3|4<br /><br />As we have succeeded with our task, all that remains is saving our newly created data set. 
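It is worth noting that this transformation is reversible. Should you ever need one row per value again, the joined column can be split back apart; below is a small sketch with hypothetical inline data mirroring the output shown above (the explode method assumes Pandas version 0.25 or later).

```python
import pandas as pd

# Hypothetical joined data, mirroring the printed output above
joined = pd.DataFrame({"VARA": ["A", "B"],
                       "DATAVAL": ["1|2", "1|2|3"]})

# Split each joined cell into a list, then expand each list to one row per element
restored = (joined.assign(DATAVAL=joined["DATAVAL"].str.split("|"))
                  .explode("DATAVAL")
                  .reset_index(drop=True))
print(restored)  # five rows: A/1, A/2, B/1, B/2, B/3
```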
Saving can be achieved with the code below: <br /><br /><b># Choose file pathway designation to indicate where data will be saved # <br /><br />pandadataframe.to_csv("C:\\Users\\Desktop\\SetAOutput.csv", sep=',', encoding='utf-8', index = False) </b><br /><br />The data set, when viewed within MS-Excel, will resemble the following image:<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-EYAqlVM3nJo/XPwMjO6EZwI/AAAAAAAABKk/PGjK6Lu0p2wPyrY_EG1XrP_e7PkzVRsqACLcBGAs/s1600/SetA2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="109" data-original-width="162" src="https://1.bp.blogspot.com/-EYAqlVM3nJo/XPwMjO6EZwI/AAAAAAAABKk/PGjK6Lu0p2wPyrY_EG1XrP_e7PkzVRsqACLcBGAs/s1600/SetA2.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>I hope that you found this article helpful. Soon I’ll be back with another entry, but not too soon. Until then, stay inquisitive, Data Heads!Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0