tag:blogger.com,1999:blog-16087687369139309262023-08-28T19:33:01.821-04:00Reflections of a Data ScientistData Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comBlogger180125tag:blogger.com,1999:blog-1608768736913930926.post-48553001097313373632023-08-28T19:32:00.000-04:002023-08-28T19:32:01.254-04:00Are US Indexes Overweight?<p></p><div style="text-align: left;"></div><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRHjUbsdggv7J8gdrdmnIIV8tiXt5RcIRb9XJUNEHtF-8F_ISajRGQ4LnHphyEK16-dm0zhcns8XFHs5QZJSZM-2yqMLp95frA-i3e8XxnefFmWV64KC2oTCBVPBi7Po4DazKrpFoEv4ZmgMO3CCP4P8Bg5evcyPPzR1SL3Cyy3hjfiLRgUPj4jLnJkRA/s1331/Chonk_Cat.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="433" data-original-width="1331" height="129" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRHjUbsdggv7J8gdrdmnIIV8tiXt5RcIRb9XJUNEHtF-8F_ISajRGQ4LnHphyEK16-dm0zhcns8XFHs5QZJSZM-2yqMLp95frA-i3e8XxnefFmWV64KC2oTCBVPBi7Po4DazKrpFoEv4ZmgMO3CCP4P8Bg5evcyPPzR1SL3Cyy3hjfiLRgUPj4jLnJkRA/w400-h129/Chonk_Cat.jpg" width="400" /></a></div><p>Ever since the COVID Pandemic hit the mainstream news cycle in 2020, I’ve noticed that many of the stocks which I follow seem to dwindle, while the major indexes continue to accelerate upward. I decided to do a bit of research on this subject, and the following is what I discovered along the way.</p><p><b><u>The Case of the NASDAQ Composite</u></b><br /><br />The NASDAQ Composite is comprised of 3,279 separate equities, each possessing a differing weight as it pertains to their overall contribution to the index’s price. 
The top 5 equities by weight are:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsIaeO86iCrf96ntJHtTi6rORzbqt39Y5P_fu8XVRlYqiV0EFMX16PaG9GxWSI-Wa0-fCQ_Pjh-x78TsXococXDiaNuDx1fGWMV8SVM4TVi6iGbo0gBrNP19vcVlL173lXaZJYYcZ_OvVST0DiHAzrBP76SAbEcd68PbEFMSXBkxdykRRdEH4g7sWxc_A/s360/NDAQ1.png" style="clear: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="247" data-original-width="360" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsIaeO86iCrf96ntJHtTi6rORzbqt39Y5P_fu8XVRlYqiV0EFMX16PaG9GxWSI-Wa0-fCQ_Pjh-x78TsXococXDiaNuDx1fGWMV8SVM4TVi6iGbo0gBrNP19vcVlL173lXaZJYYcZ_OvVST0DiHAzrBP76SAbEcd68PbEFMSXBkxdykRRdEH4g7sWxc_A/w400-h275/NDAQ1.png" width="400" /></a></div><div class="separator" style="clear: both;"><span style="text-align: left;"><br /></span></div><div class="separator" style="clear: both;"><span style="text-align: left;">In sum, 5 equities, and their corresponding evaluations, contribute to approximately 34.07% of the NASDAQ Composite's overall price. 
</span></div><div class="separator" style="clear: both;"><span style="text-align: left;"><br /></span></div>Over the past five years, the NASDAQ has experienced a hefty appreciation of 67.43%.<br /><br /><div><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzyNr6quFkeGOnXb-NCTQ_rUUW9xpD7Ck_nyyUcf_O_PDgvgFAEqNBOHjGi9jym6UDG4Hw6_y5tYJPv3JlyMBBAOv_BdGdZHWACm84Q5fCXxD31WAWhYZ69DJ6-46PqwFpSPh2Z8rGItAlpiNE8NGI10kaKSK1C3ywfq9WsNSsCEfl84cYxdbu7BcEWj0/s690/Nasdaq_Index_Appreciation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="448" data-original-width="690" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzyNr6quFkeGOnXb-NCTQ_rUUW9xpD7Ck_nyyUcf_O_PDgvgFAEqNBOHjGi9jym6UDG4Hw6_y5tYJPv3JlyMBBAOv_BdGdZHWACm84Q5fCXxD31WAWhYZ69DJ6-46PqwFpSPh2Z8rGItAlpiNE8NGI10kaKSK1C3ywfq9WsNSsCEfl84cYxdbu7BcEWj0/w400-h260/Nasdaq_Index_Appreciation.png" width="400" /></a></div><div style="text-align: center;"><br /></div>However, of this 67.43% evaluation upward, how much of the price shift can be attributed to the top 5 weighted equities from which the index is comprised?<div><br /></div><div><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-mbOdT2N6bp_V4iwwElJzWwaIuRv7xN0q5snDnS9Jxg33iaAZyoWNIaM0IM0GfiNJace7bOOIqsakT_w6cV2FcVXly7EaJuroRvxwZqlr4QOx_TYN17h2lGpyXrzSqNNMUg41iqz9uP71TB6SNkdZ_1m4XVigZ3ZanBGgztSJuartyrnT7ufsLOKaznI/s388/NDAQ2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="184" data-original-width="388" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-mbOdT2N6bp_V4iwwElJzWwaIuRv7xN0q5snDnS9Jxg33iaAZyoWNIaM0IM0GfiNJace7bOOIqsakT_w6cV2FcVXly7EaJuroRvxwZqlr4QOx_TYN17h2lGpyXrzSqNNMUg41iqz9uP71TB6SNkdZ_1m4XVigZ3ZanBGgztSJuartyrnT7ufsLOKaznI/w400-h189/NDAQ2.png" width="400" /></a></div><br />If every other component within the NASDAQ traded 
completely flat throughout the duration of the previous 5 years, the NASDAQ would have increased in value by approximately 25.51%. As the index appreciated by 67.43%, we can conclude that some of the other components from which the index is comprised immensely increased the overall potential aggregate (67.43 > 25.51). <br /><br /><b>Rating: HEFTYCHONK</b></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCsynbz9BXIucBUZ6E2tfmBXsa9KYycq0jC85BHwYp8m9eEQ6A9T2oyJV3NWxkpTJ3Evup7P8NvGywnYIjCxtyfPU65EzcyNiawkHMn2qmpbAZQPty7TpsjBcim6AFKbPSPttzV7d3AxnhNeVC9P3q9A8IqBoctuA0YI2s2REFAAV4PfL2XprdfdzHzH4/s1417/HEFTYCHONK.jpg" style="clear: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1063" data-original-width="1417" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCsynbz9BXIucBUZ6E2tfmBXsa9KYycq0jC85BHwYp8m9eEQ6A9T2oyJV3NWxkpTJ3Evup7P8NvGywnYIjCxtyfPU65EzcyNiawkHMn2qmpbAZQPty7TpsjBcim6AFKbPSPttzV7d3AxnhNeVC9P3q9A8IqBoctuA0YI2s2REFAAV4PfL2XprdfdzHzH4/w400-h300/HEFTYCHONK.jpg" width="400" /></a></div><div><br /></div>If 5 companies make up 34.07% of the index’s weight, and 0.15% of the index has increased by approximately 25.51%, while the index itself has appreciated by 67.43%, we can evaluate this furry boi as a Heftychonk. <br /><br /><b style="text-decoration: underline;">The Case of the Dow Jones Industrial Average</b> <br /><br />The Dow Jones Industrial Average is comprised of 30 separate equities, each possessing a differing weight as it pertains to their overall contribution to the index’s price. 
The top 10 equities by weight are:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSwJ6z2SnYKXmtBaiX9aU-OvTgFs_X7fcOVwJ25EoUxU_L9SGy2Mrs2Y3P_HOMxINgPndq-deFz3iCaaGTkBmAivCt9wUyc07CT7xP-ZTHxc0xIpc-02I6dxutULCgBpLzPRaEKGttO-R7qeapDI6EyjabZRuOHt2bzqZHbzsopCOshd6eVWgqrI5-ZWs/s409/DOW1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="244" data-original-width="409" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSwJ6z2SnYKXmtBaiX9aU-OvTgFs_X7fcOVwJ25EoUxU_L9SGy2Mrs2Y3P_HOMxINgPndq-deFz3iCaaGTkBmAivCt9wUyc07CT7xP-ZTHxc0xIpc-02I6dxutULCgBpLzPRaEKGttO-R7qeapDI6EyjabZRuOHt2bzqZHbzsopCOshd6eVWgqrI5-ZWs/w400-h238/DOW1.png" width="400" /></a></div><div><br /></div>Of these, the top 5 equities, and their corresponding evaluations, contribute approximately 33.38% of the Dow Jones Industrial Average’s overall price. <br /><br />Over the past five years, the Dow Jones Industrial Average has experienced a healthy appreciation of 32.40%.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI8sy5_F7Lo0dWZMB8XPOF9gNw09a3Ew770zKY11Mu7Jolw3blJ1q0aniBHDYqcMN6VhxYMPhYJiWP4U467Qqt-Bi-NSdJoAf-wyHJMb1NNMloIRRkZHsbOMW9cgghsq4WwVnbdqikYwxkVBTrH8UY36__DOIgL0n4-3GgWQBZv_UYzPZhEH0vhl1NR0s/s688/Dow_Index_Appreciation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="446" data-original-width="688" height="259" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI8sy5_F7Lo0dWZMB8XPOF9gNw09a3Ew770zKY11Mu7Jolw3blJ1q0aniBHDYqcMN6VhxYMPhYJiWP4U467Qqt-Bi-NSdJoAf-wyHJMb1NNMloIRRkZHsbOMW9cgghsq4WwVnbdqikYwxkVBTrH8UY36__DOIgL0n4-3GgWQBZv_UYzPZhEH0vhl1NR0s/w400-h259/Dow_Index_Appreciation.png" width="400" /></a></div><div style="text-align: center;"><br /></div>However, of this 32.40% evaluation upward, how much 
of this price shift can be attributed to the top 5 weighted equities?<div><br /></div><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-DjlJ9QIN2k4c29_mdgRmO6BgIrHz5EGX_8Q2JegYtwhsVtb3Dogirb4lL-9Z_J17Tj5xmiSf3n3F7ClccelikbzcFcXudLa3z4VxwRiOKfVyAZ8uAUpxI6o2Eab9t1ksgwaEGXerPyuH-L_LPKCZuqNx1ydl6g29ebsPCR6suGsmz2eMTnNC0ygl_xs/s409/DOW2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="143" data-original-width="409" height="139" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-DjlJ9QIN2k4c29_mdgRmO6BgIrHz5EGX_8Q2JegYtwhsVtb3Dogirb4lL-9Z_J17Tj5xmiSf3n3F7ClccelikbzcFcXudLa3z4VxwRiOKfVyAZ8uAUpxI6o2Eab9t1ksgwaEGXerPyuH-L_LPKCZuqNx1ydl6g29ebsPCR6suGsmz2eMTnNC0ygl_xs/w400-h139/DOW2.png" width="400" /></a></div><br />If every other component within the Dow Jones Industrial Average traded completely flat throughout the duration of the previous 5 years, the Dow Jones Industrial Average would have increased in value by approximately 30.97%. As the index appreciated by 32.40%, we can conclude that some of the components from which the index is comprised, slightly increased the overall potential aggregate (32.40 > 30.97). 
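<br /><br />The flat-components thought experiment above is straightforward to reproduce. The R sketch below (R being the language used elsewhere on this blog) computes the implied index move as a weighted sum of the top components' returns. The weights and five-year returns shown are hypothetical placeholders, not the actual figures from the tables above.

```r
# Thought experiment: if every component outside the top 5 traded flat,
# the index's move equals the weighted sum of the top components' returns.
# The weights and returns below are hypothetical placeholders.
top_weights <- c(0.085, 0.075, 0.065, 0.060, 0.055)  # index weights (sum to 0.34)
top_returns <- c(0.90, 0.70, 0.55, 0.40, 0.35)       # five-year price returns

implied_index_move <- sum(top_weights * top_returns)
round(implied_index_move * 100, 2)  # implied appreciation, in percent: 20.8
```

Swapping in an index's actual top-component weights and five-year returns reproduces the implied-appreciation figures quoted throughout this article.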
<br /><br /><b>Rating: MEGACHONKER</b><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSieMUV9VeT_RU_SJxb2tv_cXXHHiv4KwZwbkSciRTVY_i3HOwVZ0NwTfqSYfZH3tngry4oYVabQuTebLtjtDlbbnl0E6NCL6CW8R5-iGZzYBSrBXq1LbLu8stldTq0kMPvh-oRS0OCO4PHga8IEsp6Lq-ayI7vi9rySFai0_iHl1xizs8nGhgBdPs9lM/s1024/MEGACHONKER.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="498" data-original-width="1024" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSieMUV9VeT_RU_SJxb2tv_cXXHHiv4KwZwbkSciRTVY_i3HOwVZ0NwTfqSYfZH3tngry4oYVabQuTebLtjtDlbbnl0E6NCL6CW8R5-iGZzYBSrBXq1LbLu8stldTq0kMPvh-oRS0OCO4PHga8IEsp6Lq-ayI7vi9rySFai0_iHl1xizs8nGhgBdPs9lM/w400-h195/MEGACHONKER.jpeg" width="400" /></a></div><div><br /></div>If 5 companies make up 33.38% of the index’s weight, and 16.67% of the index has increased by approximately 30.97%, while the index itself has appreciated by 32.40%, we can evaluate this plump feline as being a MeGaChOnKeR. <br /><br /><b><u>The Case of the S&P</u></b> <br /><br />The S&P 500 is comprised of 503 separate equities, each possessing a differing weight as it pertains to their overall contribution to the index’s price. 
The top 10 equities by weight are as follows:<div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfHwCH_mFhnP8fagdPVKeayl4nY01bI8FO8ucVaDDs2WCBZSaVnEEI0TqoUJ9jsKV76GMyweE5z1oIkGpaXAfwVHDm4o9_ruhIleWBeqPDepGmOdT--Swbc9tGe2hTICbpxTuC6mhO77QurhvKAjxBdG1wIZYqiSZCakNGuQKDQI3xrk_XtPL3nwnDcmg/s409/SP1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="246" data-original-width="409" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfHwCH_mFhnP8fagdPVKeayl4nY01bI8FO8ucVaDDs2WCBZSaVnEEI0TqoUJ9jsKV76GMyweE5z1oIkGpaXAfwVHDm4o9_ruhIleWBeqPDepGmOdT--Swbc9tGe2hTICbpxTuC6mhO77QurhvKAjxBdG1wIZYqiSZCakNGuQKDQI3xrk_XtPL3nwnDcmg/w400-h239/SP1.png" width="400" /></a></div><div style="text-align: center;"><br /></div>In sum, these 10 equities, and their corresponding evaluations, contribute approximately 30.29% of the S&P 500’s overall price. <br /><br />Over the past five years, the S&P has experienced a robust appreciation of 52.92%.</div><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><br /></div><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ0A9CVjGdIUWYSDASgHiJ--OnUAx7Snp5h_7XJ_7DiEeMqGzaIS0Aw0wwTuxnIlwCusAXcuH6nFh713UXXgdlRaUHjjJMCCDZP4mN0LR2sR2XQqhXxmR15ocZU63PxqSduwAs89XYAI2XY--AVtdrCUsgC_nKwKh2vug1tg2BabnLH5NSpdYABt-D29w/s688/Dow_Index_Appreciation.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="446" data-original-width="688" height="259" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQ0A9CVjGdIUWYSDASgHiJ--OnUAx7Snp5h_7XJ_7DiEeMqGzaIS0Aw0wwTuxnIlwCusAXcuH6nFh713UXXgdlRaUHjjJMCCDZP4mN0LR2sR2XQqhXxmR15ocZU63PxqSduwAs89XYAI2XY--AVtdrCUsgC_nKwKh2vug1tg2BabnLH5NSpdYABt-D29w/w400-h259/Dow_Index_Appreciation.png" width="400" /></a></div><div><br /></div>However, 
of this 52.92% evaluation upward, how much of the price shift can be attributed to the top 10 weighted equities from which the index is comprised?<div><br /></div><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9G0nzUp-GDWurAyjIwTEFAqLR2rIUPumCTG1T4OaCaRaL3XXSvPXn3_XLBRQhgIhqNoijUyBhEKENaKTGuSFBTRwPieawp217W8NJ0j4nvH6wx8b4JCwe_38vLqSllmtIRJdznHhFloJU9m91dYk6CJFAcItky0IfMHu88-HO6yc-fKx3cNOk6_USWvc/s401/SP2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="305" data-original-width="401" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9G0nzUp-GDWurAyjIwTEFAqLR2rIUPumCTG1T4OaCaRaL3XXSvPXn3_XLBRQhgIhqNoijUyBhEKENaKTGuSFBTRwPieawp217W8NJ0j4nvH6wx8b4JCwe_38vLqSllmtIRJdznHhFloJU9m91dYk6CJFAcItky0IfMHu88-HO6yc-fKx3cNOk6_USWvc/s320/SP2.png" width="320" /></a></div><br />If every other component within the S&P traded completely flat throughout the duration of the previous 5 years, the S&P index would have increased in value by approximately 57.35%. As the index appreciated by 52.92%, we can conclude that some of the components from which the index is comprised actually reduced the overall potential aggregate (52.92 < 57.35). 
<br /><br /><b>Rating: OH LAWD HE COMIN</b><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtiV95SkhWGXWLDNOaen8mTF3_whfICL9TgSBbvVMfzhwfhLNXIgjGAmGZf-eG__KpIzUBTBKinVDgHohJN7sz52WSAXnudhD88qcWNWk0ZXbWjn-ShhaKFp3Dx7O7tAqoNDAW9tTOl0WhzyvKHAr5oYfySExUsMfhvvMRwpM20DkTEwqNrrUTsRJze4s/s1200/OHLAWD.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1200" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtiV95SkhWGXWLDNOaen8mTF3_whfICL9TgSBbvVMfzhwfhLNXIgjGAmGZf-eG__KpIzUBTBKinVDgHohJN7sz52WSAXnudhD88qcWNWk0ZXbWjn-ShhaKFp3Dx7O7tAqoNDAW9tTOl0WhzyvKHAr5oYfySExUsMfhvvMRwpM20DkTEwqNrrUTsRJze4s/w400-h300/OHLAWD.jpg" width="400" /></a></div><div style="text-align: center;"><br /></div>If 10 companies make up 30.29% of the index’s weight, and 1.98% of the index has increased by approximately 57.35%, while the index itself has appreciated by 52.92%, we can evaluate this rotund hunk of a cat with the phrase, “<b>OH LAWD HE COMIN</b>”! <br /><br /><b><u>Conclusion</u></b> <br /><br />While every index has been disproportionately impacted by the performance of a few equities, it would appear that the NASDAQ, the index which is least top-heavy and possesses the greatest number of companies, has appreciated far more in value than its contemporaries. <br /><br />Maybe the NASDAQ, though often more volatile than other indexes, benefits from two conflicting attributes. The top 5 stocks within the index anchor the overall price of the index by the magnitude of their market cap, while the much smaller corporate equities are directionally pressured by component proximity. <br /><br />Like almost every other medium of existence, it would seem that titans emerge after a certain period of growth. 
To succeed in such circumstances, one must either cast their lot with familiar titans or learn to swim amongst them. <br /><br />-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-71401998457801995982023-07-23T12:00:00.000-04:002023-07-23T13:02:08.735-04:00Odd Man Out: The Problem with Serpentine DraftsIn this article, we’re going to continue the trend of discussing topics related to Fantasy Sports. Specifically, we’ll examine the innate problem which I see within the serpentine draft format. I feel that this topic is particularly appropriate for this time of year, as football fans are gearing up for their own fantasy league drafts. <br /><br />If you’re unfamiliar with the serpentine draft format, it is best described as: <br /><br /><i>A serpentine draft, sometimes referred to as a "Snake" draft, is a type in which the draft order is reversed every round (e.g. 1..12, 12..1, 1..12, 12..1, etc.). For example, if you have the first pick in the draft, you will pick first in round one, and then last in round two. </i><br /><br />Source: <a href="https://help.fandraft.com/support/solutions/articles/61000278703-draft-types-and-formats " rel="nofollow" target="_blank">https://help.fandraft.com/support/solutions/articles/61000278703-draft-types-and-formats </a><br /><br />I’ve created a few examples of this draft type below. The innate issue which I see within this draft format pertains to the differentiation between the projected point value of each draft selection, as determined by a team’s draft position. The more teams present within a league, the greater the point disparity between teams. <br /><br />Assuming that each team executed an optimized drafting strategy, we would expect the outcome to resemble something like the illustration below. <br /><br />Each number within a cell represents the best player value available to each team, each round. 
The green cells contain starting player values, and the grey cells contain back-up player values. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLfn1stAT5oros-y_5rJiK_3QdvNz5vhqlNLoI_8r_m1pXlSIPw3TSa0T94NS9-_TDov3LoHvNm6c3lFZaICl-E3nff155oZEgYdQO5fK78mFBIPwbuMm3YvhJq2KEjD9RJNZFwqS6rIFZjQbPBgZv8VEyImA-QSM_n6pXSTuLSi-io1C_V9baDaznebM/s865/Odd_Man_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="378" data-original-width="865" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLfn1stAT5oros-y_5rJiK_3QdvNz5vhqlNLoI_8r_m1pXlSIPw3TSa0T94NS9-_TDov3LoHvNm6c3lFZaICl-E3nff155oZEgYdQO5fK78mFBIPwbuMm3YvhJq2KEjD9RJNZFwqS6rIFZjQbPBgZv8VEyImA-QSM_n6pXSTuLSi-io1C_V9baDaznebM/w400-h175/Odd_Man_3.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div>As you can observe from the summed values below each outcome column, each team possesses a one-point advantage against the team which selected subsequently, and a one-point disadvantage against the team which selected previously. The greatest differentiation occurs between the team which made the first selection within the draft order, as measured against the team which made the last selection within the draft order: <b>11</b> (1026 – 1015). <br /><br />As previously mentioned, the fewer the teams within a league, the fewer the selections which occur within each round. As a result, there is less of a disparity between the teams which pick earlier within the order, as compared to teams which pick later within the order. 
<br /><br />Below is the optimal outcome of a league comprised of ten teams.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC8wfe0xJfCGtdoEmsRNCbpUW4E9XissEXRDS9gH-rb3z4P4EGFmHGc6JXs6RH4WjtZc1hC_7HiTGDX2N1zfge2nGTFBgdXDCVPWTJJAe7pKX00gDKg3C-HFL4tRFZDiJGAq4evo2i9kUPI6cxDn6h2P8wxmornZh2NuNRBU933pvs7WXQ_K5QbPs0HvY/s728/Odd_Man_2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="376" data-original-width="728" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC8wfe0xJfCGtdoEmsRNCbpUW4E9XissEXRDS9gH-rb3z4P4EGFmHGc6JXs6RH4WjtZc1hC_7HiTGDX2N1zfge2nGTFBgdXDCVPWTJJAe7pKX00gDKg3C-HFL4tRFZDiJGAq4evo2i9kUPI6cxDn6h2P8wxmornZh2NuNRBU933pvs7WXQ_K5QbPs0HvY/w400-h205/Odd_Man_2.png" width="400" /></a></div><div><br /></div>While the single point differentiation persists between consecutive teams within the draft order, the differentiation between the first selector, and the last selector, has been reduced to: 9 (856 – 847).<br /><br />This trend continues across ever smaller league sizes: <b>7</b> (1024 – 1017).<div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjA7QY5hiPS_gAxSLNFuc5l-qwNU44Szr9oStCqU4Vy6QFxUo22uuBs943Gn8Gy7oFe5izB1BJKiQR02LMBXpXEftEYuEc8XfLRWaVisQwefTHaX_ZjOhyCihIwR01lIS1pSB5Rk0bJnHjQYlYTpaO55A6-DH0hBiIe4umEik_RglRnMeuMYTFvZftIXC4/s606/Odd_Man_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="376" data-original-width="606" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjA7QY5hiPS_gAxSLNFuc5l-qwNU44Szr9oStCqU4Vy6QFxUo22uuBs943Gn8Gy7oFe5izB1BJKiQR02LMBXpXEftEYuEc8XfLRWaVisQwefTHaX_ZjOhyCihIwR01lIS1pSB5Rk0bJnHjQYlYTpaO55A6-DH0hBiIe4umEik_RglRnMeuMYTFvZftIXC4/w400-h248/Odd_Man_1.png" width="400" /></a></div><div><br /></div><div>In each 
instance, we should expect the total differentiation of points between the first draft participant and the last draft participant (if optimal drafting occurred) to be equal to <b>N – 1</b>, where N equals the total number of draft participants within the league.<br /><br />All things being equal, if each team is managed optimally, we should expect the first team within each draft to finish first within each league. Second place would belong to the team which drafted second, third place to the team which drafted third, and so on. <br /><br />If all players are equally at risk of being injured on each fantasy team, then this occurrence does little to upset the overall ranking of teams by draft order. It must be remembered that teams which drafted earlier within the order will also possess better replacement players as compared to their competitors. Therefore, when injuries do occur, later drafting teams will be disproportionately impacted.<br /><br />I would imagine that as AI integration begins to seep into all aspects of existence, the opportunity for each team owner to draft with consistent optimization will further stratify the inherent edge attributed to serpentine draft position. As it stands currently, there is still an opportunity for lower draft order teams to compete if one or more of their higher order competitors blunder a selection. <br /><br />In any case, I hope that what I have written in this article helped to describe what I like to refer to as the “Odd Man Out” phenomenon. 
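<br /><br />The N – 1 differential can be verified with a quick simulation. The R sketch below assumes an idealized player pool in which each successive pick is worth exactly one point less than the pick before it, that every team always drafts the best available player, and that the draft runs for an odd number of rounds (with an even number of rounds, the alternating reversals cancel out exactly).

```r
# Simulate a serpentine (snake) draft over an idealized player pool where
# each successive pick is worth one point less than the one before it.
snake_totals <- function(n_teams, n_rounds, top_value = 1000) {
  values <- top_value - seq_len(n_teams * n_rounds) + 1  # best player first
  totals <- numeric(n_teams)
  pick <- 1
  for (rnd in seq_len(n_rounds)) {
    # Odd rounds draft 1..N; even rounds reverse the order to N..1
    draft_order <- if (rnd %% 2 == 1) 1:n_teams else n_teams:1
    for (team in draft_order) {
      totals[team] <- totals[team] + values[pick]
      pick <- pick + 1
    }
  }
  totals
}

# Differential between the best and worst draft slots, by league size:
sapply(c(12, 10, 8), function(n) {
  t <- snake_totals(n, 15)
  max(t) - min(t)
})
# 11 9 7 -- in each case, N - 1
```

These differentials match the 11-, 9-, and 7-point gaps observed in the drafting tables above.<br /><br />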
I hope to see you again soon with more of the statistical content which you crave.<br /><br />-RD</div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-68760275187690685122023-07-17T20:28:00.001-04:002023-07-17T20:33:29.394-04:00(R) Daily Fantasy Sports Line-up Optimizer (Basketball)I’ve been mulling over whether or not I should give away this secret sauce on my site, and I’ve come to the conclusion that anyone who seriously contends within the Daily Fantasy medium is probably already aware of this strategy. <br /><br /><div>Today, through the magic of R software, I will demonstrate how to utilize code to optimize your daily fantasy sports line-up. This particular example will be specific to the Yahoo daily fantasy sports platform, and to the sport of basketball.<br /><br />I also want to give credit where credit is due. <br /><br />The code presented below is a heavily modified variation of code initially created by Patrick Clark.<br /><br />The original code source can be found here: <a href="http://patrickclark.info/Lineup_Optimizer.html" rel="nofollow" target="_blank">http://patrickclark.info/Lineup_Optimizer.html<br /></a><br /><b><u>Example:</u></b><br /><br />First, you’ll need to access Yahoo’s Daily Fantasy page. I’ve created an NBA Free QuickMatch, which is a 1 vs. 
1 contest against an opponent where no money changes hands.<div><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbkscX9E_Kg6XICsyjxOpWT27lxR_CViu254d7y_yzUAtZaV1c-tn_76syIhOjYYg01Sba4qwqPj7Twqr_UHr-Fw_8AIZYmgz9xwIRu3v__YYhyirhka7cLv6dKj6Pj1uGG3NLJ57Zr2iTqnuUSokDmXj-U-vHZwFf-vf44jNOgUetxEpFpnTRu8jc/s1091/NBA1-1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="670" data-original-width="1091" height="244" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbkscX9E_Kg6XICsyjxOpWT27lxR_CViu254d7y_yzUAtZaV1c-tn_76syIhOjYYg01Sba4qwqPj7Twqr_UHr-Fw_8AIZYmgz9xwIRu3v__YYhyirhka7cLv6dKj6Pj1uGG3NLJ57Zr2iTqnuUSokDmXj-U-vHZwFf-vf44jNOgUetxEpFpnTRu8jc/w400-h244/NBA1-1.png" width="400" /></a><br /><br />This page will look a bit different during the regular season, as the NBA playoffs are currently underway. That aside, our next step is to download all of the current player data. This can be achieved by clicking on the “<b>i</b>” bubble icon.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2fVl6p7fn6FIo3-rClM1boXgCCD3WztqoR4Tic4pjT3f-a0fDn6XpNv5ku_4SXl3-sbHCgMGOWg61MxNHkb8egiQpJpY7EdZpGttEHp9XTark1n5zGh4igOIXfZfvLZ6WYFgci_2BB6Z2-Lfds6A8f37n6bG_C2GG9D_XN4FZzxDekqP5bFI1e_IY/s1162/NBA2.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="681" data-original-width="1162" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2fVl6p7fn6FIo3-rClM1boXgCCD3WztqoR4Tic4pjT3f-a0fDn6XpNv5ku_4SXl3-sbHCgMGOWg61MxNHkb8egiQpJpY7EdZpGttEHp9XTark1n5zGh4igOIXfZfvLZ6WYFgci_2BB6Z2-Lfds6A8f37n6bG_C2GG9D_XN4FZzxDekqP5bFI1e_IY/w400-h235/NBA2.png" width="400" /></a><br /><br />Next, click on the “<b>Export players list</b>” link. 
This will download the previously mentioned player data.<br /><br />The player data should resemble the (.csv) image below:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLartDNus5-E6ep1JO4F0DbdnBCqEPLEYjBhvwZ_HKmcWvTfmVC7jdVW4wzn3muHidvq2kJW7nGBgKKByLhE0gmFGetJQnWjh5Sazv5X1rsc3jE-_EAFLJNEGbdoMFa5vVfr9WqxHt_JRwQGDac4uPHy6Bq6V_IvPsrPXs-yijWymDHW3sRldGPhha/s865/NBA3.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="424" data-original-width="865" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLartDNus5-E6ep1JO4F0DbdnBCqEPLEYjBhvwZ_HKmcWvTfmVC7jdVW4wzn3muHidvq2kJW7nGBgKKByLhE0gmFGetJQnWjh5Sazv5X1rsc3jE-_EAFLJNEGbdoMFa5vVfr9WqxHt_JRwQGDac4uPHy6Bq6V_IvPsrPXs-yijWymDHW3sRldGPhha/w400-h195/NBA3.png" width="400" /></a><br /><br />Before proceeding to the next step, we need to do a bit of manual data cleanup. <br /><br />I removed from the data set any player who is injured or not starting. I also concatenated the First Name and Last Name fields and placed that concatenation within the ID variable. Next, I removed all variables except for the following: <b>ID</b> (newly modified), <b>Position</b>, <b>Salary</b>, and<b> FPPG </b>(Fantasy Points Per Game). 
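<br /><br />If you would rather script this cleanup than perform it by hand, the base R sketch below carries out the same steps. The column names ("First Name", "Last Name", "Injury Status") are assumptions about the Yahoo export's headers, and the two-row data frame is a toy stand-in for the downloaded file, so verify both against your own data.

```r
# Toy stand-in for the downloaded Yahoo player export; the column names
# here are assumed and may not match the real export exactly.
raw <- data.frame(
  `First Name`    = c("Marcus", "Bradley"),
  `Last Name`     = c("Smart", "Beal"),
  `Injury Status` = c("", "INJ"),
  Position        = c("PG", "SG"),
  Salary          = c(20, 43),
  FPPG            = c(29.8, 50.7),
  check.names = FALSE
)

# Keep only healthy players, concatenate the name fields into ID,
# and retain just the four variables the optimizer needs.
keep <- raw[raw$`Injury Status` == "", ]
keep$ID <- paste(keep$`First Name`, keep$`Last Name`)
PlayerPool <- keep[, c("ID", "Position", "Salary", "FPPG")]
```

The resulting PlayerPool data frame then feeds directly into the optimizer code.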
<br /><br />The results should resemble the following image:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyR8we0C_3QcICUgU2Zs-P5lk84PSAqbjQddE6kwZSqYUlChRVk9VFK8UPenqZjp45gx9ADcoIOyUHYem1nNhk7AZ6SeEa1l1ucSGVn5QC0qLzZrwvSxnJ8ddcWBWZMqo-_A5BmdR5cHtesQmMfYgrpKy-gJ2b7PxgNoVaJVWK7OjQ7xRErmc2HZhN/s405/NBA4.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="326" data-original-width="405" height="321" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyR8we0C_3QcICUgU2Zs-P5lk84PSAqbjQddE6kwZSqYUlChRVk9VFK8UPenqZjp45gx9ADcoIOyUHYem1nNhk7AZ6SeEa1l1ucSGVn5QC0qLzZrwvSxnJ8ddcWBWZMqo-_A5BmdR5cHtesQmMfYgrpKy-gJ2b7PxgNoVaJVWK7OjQ7xRErmc2HZhN/w400-h321/NBA4.png" width="400" /></a><br /><br />(Specific player data and all associated variables will differ depending on the date of download)<br /><br />Now that the data has been formatted, we’re ready to code!<b><br /><br />###################################################################<br /><br />library(lpSolveAPI)<br /><br />library(tidyverse)<br /><br /># It is easier to input the data as an Excel file if possible #<br /><br /># Player names (ID) have the potential to upset the .CSV format # <br /><br />library(readxl)<br /><br /># Be sure to set the player data file path to match your directory / file name # <br /><br />PlayerPool <- read_excel("C:/Users/Your_Modified_Players_List.xlsx")<br /><br /># Create some positional identifiers in the pool of players to simplify linear constraints #<br /><br /># This code creates new position column variables, and places a 1 if a player qualifies for a position #<br /><br />PlayerPool$PG_Check <- ifelse(PlayerPool$Position == "PG",1,0)<br /><br />PlayerPool$SG_Check <- ifelse(PlayerPool$Position == "SG",1,0)<br /><br />PlayerPool$SF_Check <- ifelse(PlayerPool$Position == "SF",1,0)<br /><br />PlayerPool$PF_Check <- ifelse(PlayerPool$Position == "PF",1,0)<br /><br />PlayerPool$C_Check 
<- ifelse(PlayerPool$Position == "C",1,0)<br /><br />PlayerPool$One <- 1<br /><br /># This code modifies the position columns so that each variable is a vector type #<br /><br />PlayerPool$PG_Check <- as.vector(PlayerPool$PG_Check)<br /><br />PlayerPool$SG_Check <- as.vector(PlayerPool$SG_Check)<br /><br />PlayerPool$SF_Check <- as.vector(PlayerPool$SF_Check)<br /><br />PlayerPool$PF_Check <- as.vector(PlayerPool$PF_Check)<br /><br />PlayerPool$C_Check <- as.vector(PlayerPool$C_Check)<br /><br /># This code orders each player ID by position # <br /><br />PlayerPool <- PlayerPool[order(PlayerPool$PG_Check),]<br /><br />PlayerPool <- PlayerPool[order(PlayerPool$SG_Check),]<br /><br />PlayerPool <- PlayerPool[order(PlayerPool$SF_Check),]<br /><br />PlayerPool <- PlayerPool[order(PlayerPool$PF_Check),]<br /><br />PlayerPool <- PlayerPool[order(PlayerPool$C_Check),]<br /><br /># Appropriately establish variables in order to perform the "solver" function #<br /><br />Num_Players <- length(PlayerPool$One)<br /><br />lp_model = make.lp(0, Num_Players) <br /><br />set.objfn(lp_model, PlayerPool$FPPG)<br /><br />lp.control(lp_model, sense= "max")<br /><br />set.type(lp_model, 1:Num_Players, "binary")<br /><br /># Total salary points available to the player #<br /><br /># In the case of Yahoo, the salary points are set to ($)200 #<br /><br />add.constraint(lp_model, PlayerPool$Salary, "<=",200)<br /><br /># Maximum / Minimum Number of Players necessary for each position type #<br /><br />add.constraint(lp_model, PlayerPool$PG_Check, "<=",3)<br /><br />add.constraint(lp_model, PlayerPool$PG_Check, ">=",1)<br /><br /># Maximum / Minimum Number of Players necessary for each position type #<br /><br />add.constraint(lp_model, PlayerPool$SG_Check, "<=",3) <br /><br />add.constraint(lp_model, PlayerPool$SG_Check, ">=",1)<br /><br /># Maximum / Minimum Number of Players necessary for each position type #<br /><br />add.constraint(lp_model, PlayerPool$SF_Check, "<=",3) <br /><br />add.constraint(lp_model, PlayerPool$SF_Check, ">=",1)<br /><br /># Maximum / Minimum Number of Players necessary for each position type #<br /><br />add.constraint(lp_model, PlayerPool$PF_Check, "<=",3)<br /><br />add.constraint(lp_model, PlayerPool$PF_Check, ">=",1)<br /><br /># Maximum / Minimum Number of Players necessary for each position type (only require one (C)enter) #<br /><br />add.constraint(lp_model, PlayerPool$C_Check, "=",1)<br /><br /># Total Number of Players Needed for the entire Fantasy Line-up #<br /><br />add.constraint(lp_model, PlayerPool$One, "=",8)<br /><br /># Perform the Solver function #<br /><br />solve(lp_model)<br /><br /># Projected_Score provides the projected score summed from the optimized projected line-up (FPPG) #<br /><br />Projected_Score <- crossprod(PlayerPool$FPPG,get.variables(lp_model))<br /><br />get.variables(lp_model)<br /><br /># The optimal_lineup data frame provides the optimized line-up selection #<br /><br />optimal_lineup <- subset(data.frame(PlayerPool$ID, PlayerPool$Position, PlayerPool$Salary), get.variables(lp_model) == 1)</b><br /><br />If we take a look at our:<br /><br /><b>Projected_Score</b><br /><br />We should receive an output which resembles the following:<br /><br /><i>> Projected_Score<br /> <span> </span>[,1]<br />[1,] 279.5<br /></i><br />Now, let’s take a look at our:<br /><br /><b>optimal_lineup</b><br /><br />Our output should resemble something like:<br /><br /><i> PlayerPool.ID PlayerPool.Position PlayerPool.Salary<br />3 Marcus Smart PG 20<br />51 Bradley Beal SG 43<br />108 Tyrese Haliburton SG 16<br />120 Jerami Grant SF 27<br />130 Eric Gordon SF 19<br />148 Brandon Ingram SF 36<br />200 Darius Bazley PF 19<br />248 Steven Adams C 20<br /></i><br />With the above information, we are prepared to set our line up.<br /><br />You could also run this line of code:<br /><br /><b>optimal_lineup <- subset(data.frame(PlayerPool$ID, PlayerPool$Position, PlayerPool$Salary, PlayerPool$FPPG), 
get.variables(lp_model) == 1)<br /><br />optimal_lineup</b><br /><br />Which provides a similar output that also includes point projections:<br /><br /><i> PlayerPool.ID PlayerPool.Position PlayerPool.Salary PlayerPool.FPPG<br />3 Marcus Smart PG 20 29.8<br />51 Bradley Beal SG 43 50.7<br />108 Tyrese Haliburton SG 16 26.9<br />120 Jerami Grant SF 27 38.4<br />130 Eric Gordon SF 19 30.7<br />148 Brandon Ingram SF 36 43.2<br />200 Darius Bazley PF 19 29.7<br />248 Steven Adams C 20 30.1</i><div><br />Summing up <b>PlayerPool.FPPG</b>, we reach the value: 279.5. This was the same value which we observed within the <b>Projected_Score</b> matrix.<br /><br /><b><u>Conclusion:</u></b><br /><br />While this article demonstrates a very interesting concept, I would be remiss if I did not advise you to <b><i>NOT</i></b> gamble on daily fantasy. This post was all in good fun, and for educational purposes only. By all means, defeat your friends and colleagues in free leagues, but do not turn your hard-earned money over to gambling websites. <br /><br />The code presented within this entry may provide you with a minimal edge, but shark players are able to make projections based on far more robust data sets as compared to league FPPG. <br /><br />In any case, the code above can be repurposed for any other daily fantasy sport (football, soccer, hockey, etc.). Remember: only play for fun and for free. 
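As a closing sanity check on the arithmetic above, the reason the crossprod() score matches the sum of the selected players' FPPG can be verified with a tiny self-contained sketch. The player pool and 0/1 decision vector below are made-up stand-ins for illustration, not the real data:

```r
# Tiny stand-in for the real player pool (hypothetical FPPG values)
FPPG <- c(29.8, 50.7, 26.9, 38.4, 30.7)
selection <- c(1, 1, 0, 1, 0)  # a hypothetical 0/1 solver decision vector

# crossprod() of the projections and the decision vector...
score_a <- as.numeric(crossprod(FPPG, selection))

# ...equals the plain sum of the selected players' projections
score_b <- sum(FPPG[selection == 1])

isTRUE(all.equal(score_a, score_b))
```

This is why summing the FPPG column of the optimized line-up reproduces the value held in the Projected_Score matrix.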
</div><div><br /></div><div>-RD <br /></div></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-29463055937240368742023-07-09T21:34:00.000-04:002023-07-10T18:39:49.394-04:00(R) Benford's Law<p>In today’s article, we will be discussing Benford’s Law, specifically as it is utilized as an applied methodology to assess financial documents for potential fraud: <br /><br />First, a bit about the phenomenon which Benford sought to describe: <br /><br /><i>The discovery of Benford's law goes back to 1881, when the Canadian-American astronomer Simon Newcomb noticed that in logarithm tables the earlier pages (that started with 1) were much more worn than the other pages. Newcomb's published result is the first known instance of this observation and includes a distribution on the second digit, as well. Newcomb proposed a law that the probability of a single number N being the first digit of a number was equal to log(N + 1) − log(N). <br /><br />The phenomenon was again noted in 1938 by the physicist Frank Benford, who tested it on data from 20 different domains and was credited for it. His data set included the surface areas of 335 rivers, the sizes of 3259 US populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an issue of Reader's Digest, the street addresses of the first 342 persons listed in American Men of Science and 418 death rates. The total number of observations used in the paper was 20,229. This discovery was later named after Benford (making it an example of Stigler's law). </i><br /><br />Source: <a href="https://en.wikipedia.org/wiki/Benford%27s_law">https://en.wikipedia.org/wiki/Benford%27s_law</a> <br /><br />So what does this actually mean in layman’s terms? 
<br /><br />Essentially, given a series of numerical elements from a similar source, we should expect the leading digits to occur in correspondence with a particular distribution pattern. <br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWGz-WjPol3z3j6mcXAuu7WEews2pKRFRgCtdNKD1Jl70ixOIxLvWBHtJivpLDRlTj3TnNWnzCoQiyWHJy5XjwvYaW8INAgZRprgDNmHHigAkjiE1GCSvsTxsgU_EcLYTVhsYpvoGk-zZu3TmVOIdV0Vb1z3YzDMEQmr8EVRBiXd3f24Q7GHd4puSK/s938/Benfords_Law_1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="710" data-original-width="938" height="303" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWGz-WjPol3z3j6mcXAuu7WEews2pKRFRgCtdNKD1Jl70ixOIxLvWBHtJivpLDRlTj3TnNWnzCoQiyWHJy5XjwvYaW8INAgZRprgDNmHHigAkjiE1GCSvsTxsgU_EcLYTVhsYpvoGk-zZu3TmVOIdV0Vb1z3YzDMEQmr8EVRBiXd3f24Q7GHd4puSK/w400-h303/Benfords_Law_1.png" width="400" /></a><br /><br />If a series of elements perfectly corresponds with Benford’s Law, then the elements within the series should follow the above pattern as it pertains to leading digit frequency. Ex. Numbers which begin with the digit “<b>1</b>” should occur 30.1% of the time. Numbers which begin with the digit “<b>2</b>” should occur 17.6% of the time. Numbers which begin with the digit “<b>3</b>” should occur 12.5% of the time. 
<br /><br />The distribution is derived as follows:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1zBG6l3VfXmsT2e1hpJgBw6QZtu8W8zDkD-nN25aZqDoSHCC4ceWquDXvPeMzJMWkhzifYTIRluI4j4yO_j-qCA1EmgPOLHSE74SXoSMbVD4oZXRPBBMNTHgeaasqdJ-M-0AdieL03JySrUvCWXvW5GFdtFB5WxIjZS7oUf-k3sE4uxfqbOZSLr_6/s495/Benfords_Law_2.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="229" data-original-width="495" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1zBG6l3VfXmsT2e1hpJgBw6QZtu8W8zDkD-nN25aZqDoSHCC4ceWquDXvPeMzJMWkhzifYTIRluI4j4yO_j-qCA1EmgPOLHSE74SXoSMbVD4oZXRPBBMNTHgeaasqdJ-M-0AdieL03JySrUvCWXvW5GFdtFB5WxIjZS7oUf-k3sE4uxfqbOZSLr_6/w400-h185/Benfords_Law_2.png" width="400" /></a><br /><br />The utilization of Benford’s Law is applicable to numerous scenarios:<br /><br />1. Accounting fraud detection<br /><br />2. Use in criminal trials<br /><br />3. Election data <br /><br />4. Macroeconomic data<br /><br />5. Price digit analysis <br /><br />6. Genome data <br /><br />7. Scientific fraud detection <br /><br />As it relates to screening for financial fraud, if the application of methodology related to the Benford’s Law Distribution returns a result in which the sample elements do not correspond with the distribution, then fraud is not necessarily the conclusion which we would immediately assume. However, the findings may indicate that additional scrutiny of the data is necessary. 
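As a quick sketch of the derivation above, the entire expected distribution can be generated in a couple of lines of R from the formula P(d) = log10(1 + 1/d):

```r
# Benford's law: the probability that the leading digit equals d
# is P(d) = log10(1 + 1/d), for d = 1, ..., 9
d <- 1:9
benford_probs <- log10(1 + 1 / d)

# Expressed as percentages, these match the familiar Benford table:
# 30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6
round(100 * benford_probs, 1)
```

Note that the nine probabilities sum to exactly 1, since the leading digit of any (non-zero) number must be one of 1 through 9.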
<br /><br /><b><u>Example</u>:</b><br /><br />Let’s utilize Benford’s Law to analyze Cloudflare’s (NET) Balance Sheet (12/31/2021).<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQBXCZWm0AuTfxdKl6EyDxrcvfzDpEx5Xmw3y1Ob2IW79dCNuS9fSmFx9QGS0_aCaIpcprbB1X2TkGxeaEGDItO5pib8MrThVkvDkwm0oTZM0UlqLQQJErO_s4JAhbYkHUPU5NbMcP42nhbzhJZ6qfR-itl6ki8AZCrYrPPIA71rMh8d92Qw8zWNe7/s528/Benfords_Law_3.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="528" data-original-width="518" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQBXCZWm0AuTfxdKl6EyDxrcvfzDpEx5Xmw3y1Ob2IW79dCNuS9fSmFx9QGS0_aCaIpcprbB1X2TkGxeaEGDItO5pib8MrThVkvDkwm0oTZM0UlqLQQJErO_s4JAhbYkHUPU5NbMcP42nhbzhJZ6qfR-itl6ki8AZCrYrPPIA71rMh8d92Qw8zWNe7/w393-h400/Benfords_Law_3.png" width="393" /></a><br /><br />Even though it’s an unnecessary step as it relates to our analysis, let’s first discern the frequency of each leading digit. 
These digits are underlined in red within the graphic above.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrKoUoBFV4VwUGLmHoR3wLo7I8SlU8auhFVNIXjom-VMuepQW79S4BU_mZP9l-5gh6_wRWGAOVxQ8PitzHxpn38jrLCn4EdSZP2kZwxOXAbatLE_R1CgKLUUtu4Txy6QzQJntR4Ro8dpQqYXjMuOj66H0QyWA1xdXHbQ50_4aJ4DE8aCW-MiCiRFqU/s326/Benfords_Law_Dist.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="326" data-original-width="284" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrKoUoBFV4VwUGLmHoR3wLo7I8SlU8auhFVNIXjom-VMuepQW79S4BU_mZP9l-5gh6_wRWGAOVxQ8PitzHxpn38jrLCn4EdSZP2kZwxOXAbatLE_R1CgKLUUtu4Txy6QzQJntR4Ro8dpQqYXjMuOj66H0QyWA1xdXHbQ50_4aJ4DE8aCW-MiCiRFqU/w174-h200/Benfords_Law_Dist.png" width="174" /></a><br /><br />What Benford’s Law seeks to assess is the comparison of the leading digits as they occurred within our experiment to our expectations as they exist within the Benford’s Law Distribution.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy5XKj_KREXyPpEqsfNWOdXVFMV9bL76jeqprT_SDJ9w7SVKM503YSXByFYoqwq_rJFNObNEeQfBlUEs6xHxviBZqubFp_0a4aLB4nhxdTsbvCvtCRlYTzBrN0ibzp9BMhJ-dGageAtqp1WIFz98Fmnjf0vIcDTdAKRv54SWTvZdatqTRFSBlFDJWq/s291/Benfords_Law_4.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="226" data-original-width="291" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy5XKj_KREXyPpEqsfNWOdXVFMV9bL76jeqprT_SDJ9w7SVKM503YSXByFYoqwq_rJFNObNEeQfBlUEs6xHxviBZqubFp_0a4aLB4nhxdTsbvCvtCRlYTzBrN0ibzp9BMhJ-dGageAtqp1WIFz98Fmnjf0vIcDTdAKRv54SWTvZdatqTRFSBlFDJWq/s1600/Benfords_Law_4.png" width="291" /></a><br /><br />The above table illustrates the frequency of occurrence of each leading digit within our analysis, versus the expected percentage frequency as stated by Benford’s Law. 
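The hand tally above can also be reproduced programmatically. The sketch below assumes the fourteen balance sheet entries used in the analysis which follows, extracts the leading digit of each, and tabulates the frequencies:

```r
# Element entries gathered from Cloudflare's (NET) Balance Sheet (12/31/2021)
NET <- c(2372071.00, 1556273.00, 815798.00, 1962675.00, 815798.00,
         134212.00, 791014.00, 1667291.00, 1974792.00, 791014.00,
         1293206.00, 845217.00, 323612.00, 323612.00)

# Extract the leading digit of each entry and tabulate its frequency
first_digits <- as.integer(substr(as.character(NET), 1, 1))
table(factor(first_digits, levels = 1:9))
```

The resulting counts (six 1s, one 2, two 3s, two 7s, three 8s) match the tally shown in the table.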
<br /><br />Now let’s perform the analysis:<br /><b><br /># H0: The first digits within the population counts follow Benford's law #<br /><br /># H1: The first digits within the population counts do not follow Benford's law #<br /><br /># requires benford.analysis #<br /><br />library(benford.analysis)<br /><br /># Element entries were gathered from Cloudflare’s (NET) Balance Sheet (12/31/2021) #<br /><br />NET <- c(2372071.00, 1556273.00, 815798.00, 1962675.00, 815798.00, 134212.00, 791014.00, 1667291.00, 1974792.00, 791014.00, 1293206.00, 845217.00, 323612.00, 323612.00)<br /><br /># Perform Analysis #<br /><br />trends <- benford(NET, number.of.digits = 1, sign = "positive", discrete=TRUE, round=1)<br /><br /># Display Analytical Output # <br /><br />trends<br /><br /># Plot Analytical Findings #<br /><br />plot(trends)</b><br /><br /><u>Which provides the output:</u><br /><br /><i>Benford object:<br /><br />Data: NET <br />Number of observations used = 14 <br />Number of obs. for second order = 10 <br />First digits analysed = 1<br /><br />Mantissa: <br /><br /> Statistic Value<br /><span> </span>Mean 0.51<br /><span> </span>Var 0.11<br /> Ex.Kurtosis -1.61<br /> <span> </span>Skewness 0.25<br /><br />The 5 largest deviations: <br /><br /> digits absolute.diff<br />1 8 2.28<br />2 1 1.79<br />3 2 1.47<br />4 4 1.36<br />5 7 1.19<br /><br />Stats:</i><br /><br /><b> Pearson's Chi-squared test<br /><br />data: NET<br />X-squared = 14.729, df = 8, p-value = 0.06464</b><br /><br /><i> Mantissa Arc Test<br /><br />data: NET<br />L2 = 0.092944, df = 2, p-value = 0.2722<br /><br />Mean Absolute Deviation (MAD): 0.08743516<br />MAD Conformity - Nigrini (2012): Nonconformity<br />Distortion Factor: 8.241894<br /><br /><b>Remember: Real data will never conform perfectly to Benford's Law. 
You should not focus on p-values!</b><br /><br />~ Graphical Output Provided by Function ~</i></p><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjju3np1juxTymNdM4OmaOe09Vqy9d39pnrIR6eEdDPdFp2RJEtmbl7z8nqexS0mL2_6KYr3oUMtT__tgy2XmHC3v3PeQjGqmo4T0p6iuTsHn3PlCKLmQaK5D77o66XDTLIdgu7waVCB3zVIDVjL1uh8debNBuLoy_M6Et75prD--sbf_Y0dOxldHER/s777/Benfords_Law_5.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="493" data-original-width="777" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjju3np1juxTymNdM4OmaOe09Vqy9d39pnrIR6eEdDPdFp2RJEtmbl7z8nqexS0mL2_6KYr3oUMtT__tgy2XmHC3v3PeQjGqmo4T0p6iuTsHn3PlCKLmQaK5D77o66XDTLIdgu7waVCB3zVIDVjL1uh8debNBuLoy_M6Et75prD--sbf_Y0dOxldHER/w400-h254/Benfords_Law_5.png" width="400" /></a><br /><br />(The most important aspects of the output are<b> bolded</b>) <br /><br /><b><u>Findings</u>:</b> <br /><br /><i> Pearson's Chi-squared test <br /><br />data: NET <br />X-squared = 14.729, df = 8, p-value = 0.06464 <br />Remember: Real data will never conform perfectly to Benford's Law. 
You should not focus on p-values!</i><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqMXVzCIaeD6AdKwUK6GpxN5FbirumjJ6LcTerZnnI-Q7X_T_jT2gGIuLC5vEYhcKCW15ilKPzQYjP951tmI71n6l6OdKGR47egD8zXhB-WIvHqP8k2kNTRozr7RcchaP3WPUK_KRr3WOZh62Ve1On6EhMyFoai0t3kQrBnAal-tikkdngrVCbgl_g/s493/Benfords_Law_6.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="254" data-original-width="493" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqMXVzCIaeD6AdKwUK6GpxN5FbirumjJ6LcTerZnnI-Q7X_T_jT2gGIuLC5vEYhcKCW15ilKPzQYjP951tmI71n6l6OdKGR47egD8zXhB-WIvHqP8k2kNTRozr7RcchaP3WPUK_KRr3WOZh62Ve1On6EhMyFoai0t3kQrBnAal-tikkdngrVCbgl_g/w400-h205/Benfords_Law_6.png" width="400" /></a><br /><br />A chi-square goodness of fit test was performed to examine whether the first digits of balance sheet items from the company Cloudflare (12/31/2021) adhere to Benford's law. Entries were found to be in adherence, with non-significance at the p < .05 level, χ2 (8, N = 14) = 14.73, p = 0.07.<br /><br />As it relates to the graphic, in ideal circumstances, each blue data bar should have its uppermost portion touching the broken red line. <br /><br /><b>Example(2):</b><br /><br />If you’d prefer to instead run the analysis simply as a chi-squared test which does not require the “<b>benford.analysis</b>” package, you can effectively utilize the following code. 
The image below demonstrates the concept being employed.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgShRcvLoujzX73JhQArSEPDSqoVxXAwNrF0hQD9zm19T0nFBCw2cTGjLfGQaz91ECBgFm0r4X-YuMHdvjPOIlFnWE1PlRPh0Mfn11Ws8ooWOiF7Jzw0g3QzQ7PE2cWr1GEx9D5uOWnn1kaku6wnLTXlshdn8K_nihvuYGeco4SJo4BQ_yAaMHmmZ4X/s789/Benfords_Law_7.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="313" data-original-width="789" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgShRcvLoujzX73JhQArSEPDSqoVxXAwNrF0hQD9zm19T0nFBCw2cTGjLfGQaz91ECBgFm0r4X-YuMHdvjPOIlFnWE1PlRPh0Mfn11Ws8ooWOiF7Jzw0g3QzQ7PE2cWr1GEx9D5uOWnn1kaku6wnLTXlshdn8K_nihvuYGeco4SJo4BQ_yAaMHmmZ4X/w400-h158/Benfords_Law_7.png" width="400" /></a><br /><br /><b>Model <- c(6, 1, 2, 0, 0, 0, 2, 3, 0)<br /><br />Results <- c(0.30102999566398100, 0.17609125905568100, 0.12493873660830000, 0.09691001300805650, 0.07918124604762480, 0.06694678963061320, 0.05799194697768670, 0.05115252244738130, 0.04575749056067510)</b><br /><br /><b>chisq.test(Model, p=Results, rescale.p = FALSE)<br /></b><br /><u>Which provides the output:</u><br /><br /> <i> Chi-squared test for given probabilities<br /><br />data: Model<br />X-squared = 14.729, df = 8, p-value = 0.06464</i><br /><br />Which are the same findings that we encountered while performing the analysis previously.</p><p>That’s all for now! Stay studious, Data Heads! 
</p><p>-RD</p>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-45589790498253097102023-07-01T12:12:00.000-04:002023-07-01T12:12:53.227-04:00Money Hustle II<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3L1cZPn_4kwcQOgxPhZYDoUkuyGEc07JoflBoJt_SHjl4naOriV_myOWam2eeKREuG41dcD_0rlOTQjPta0AWmviKUgmqoTOWJVM0OGRRI99IWuzZr_UrvYpyAmNetHPsILbW1NIfHdCeUvPYJdBcfggNEKMBSYOZkoyWVUtkjQ_rL5sY0sYinh23Tds/s421/MH%20II.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="421" height="227" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3L1cZPn_4kwcQOgxPhZYDoUkuyGEc07JoflBoJt_SHjl4naOriV_myOWam2eeKREuG41dcD_0rlOTQjPta0AWmviKUgmqoTOWJVM0OGRRI99IWuzZr_UrvYpyAmNetHPsILbW1NIfHdCeUvPYJdBcfggNEKMBSYOZkoyWVUtkjQ_rL5sY0sYinh23Tds/s320/MH%20II.png" width="320" /></a></div><div style="text-align: center;"><br /></div><div>Following up on the prior article, a reader of this site wrote to me and asked (paraphrasing):<br /><br /><i>“I really enjoyed your last entry as it pertains to randomly selected portfolios. However, I remain skeptical. How would another random selection of stocks perform within a historical bear market? Also, how would these stocks perform across a longer duration of time?”</i><br /><br />Great questions! We would not be scientists if we didn’t attempt to replicate our results! <br /><br />We’ll start with a new list of random equities. The same rules apply. No commodity funds or bond funds will be included within our Random Portfolio. We will be going back to the late 1990’s, to the zenith of the dot com bubble, and then into the lethargy of the early 2000’s. 
This experiment will be imperfect since, amongst other confounding factors, our random stock picker can only pick from equities which are still public and still in existence. Any equity which did not make it to the present, unfortunately, cannot be included within our experiment.<div><br /><b>#### WARNING – INVESTMENT OF ANY SORT BEARS RISK! THIS ARTICLE IS NOT FINANCIAL ADVICE! DO NOT REPLICATE THIS EXPERIMENT AND EXPECT TO MAKE MONEY! ####</b><br /><br />To fairly decide which equities our fund would hold, I utilized the websites:<br /><br /><a href="https://www.rayberger.org/random-stock-picker/">https://www.rayberger.org/random-stock-picker/</a><br /><br /><b>~AND~</b><br /><br /><a href="https://www.randstock.ca/selector">https://www.randstock.ca/selector</a><br /><br />So, let’s meet our new components:<br /><br /><b>Fair Isaac Corporation (FICO)</b> - Fair Isaac Corporation develops analytic, software, and data decisioning technologies and services that enable businesses to automate, enhance, and connect decisions in the Americas, Europe, the Middle East, Africa, and the Asia Pacific. Sector(s): Technology<br /><br /><b>Par Technology Corporation (PAR)</b> - PAR Technology Corporation, together with its subsidiaries, provides technology solutions to the restaurant and retail industries worldwide. Sector(s): Software Application<br /><br /><b>IMAX Corporation (IMAX)</b> - IMAX Corporation, together with its subsidiaries, operates as a technology platform for entertainment and events worldwide. Sector(s): Communication Services<br /><br /><b>Spectrum Brands Holdings, Inc. (SPB) </b>- Spectrum Brands Holdings, Inc. operates as a branded consumer products company worldwide. It operates through three segments: Home and Personal Care; Global Pet Care; and Home and Garden. Sector(s): Consumer Defensive<br /><br /><b>Patterson Companies, Inc. (PDCO) </b>- Patterson Companies, Inc. 
engages in distribution of dental and animal health products in the United States, the United Kingdom, and Canada. Sector(s): Healthcare<br /><br /><b>The Walt Disney Company (DIS)</b> - The Walt Disney Company, together with its subsidiaries, operates as an entertainment company worldwide. Sector(s): Communication Services<br /><br /><b>Regis Corporation (RGS) </b>- Regis Corporation owns, operates, and franchises hairstyling and hair care salons in the United States, Canada, Puerto Rico, and the United Kingdom. Sector(s): Consumer Cyclical<br /><br /><b>Synovus Financial Corp (SNV)</b> - Synovus Financial Corp. operates as the bank holding company for Synovus Bank that provides commercial and consumer banking products and services. Sector(s): Financial Services<br /><br /><b>Royal Bank of Canada (RY)</b> - Royal Bank of Canada operates as a diversified financial service company worldwide. Sector(s): Financial Services<br /><br /><b>Incyte Corporation (INCY)</b> - Incyte Corporation, a biopharmaceutical company, engages in the discovery, development, and commercialization of therapeutics for hematology/oncology, and inflammation and autoimmunity areas in the United States, Europe, Japan, and internationally. 
Sector(s): Healthcare<br /><br />With a purchase date of each equity being set at 1/1/1998 (closing price), our fictitious Random Fund performs as shown below:</div><div><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBkQ34cKvP0afz7o34Z-PUTlnnsiVE7JoN9KIKwqLaHfZ-OsPlPx0l3QfG4ApH2sXAU8MKwZY9XJUiODKvJim8XUjAFKBwMismBxF-1H_B50Q1ZGfKiRYkTKp9WXdyXEQcTlWsRnnZUpWCg0BthpVxWxir1DVVM1-hcDWZv2Uv58EUzWG2unClKQ4L/s797/Random_Fund_A_1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="373" data-original-width="797" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBkQ34cKvP0afz7o34Z-PUTlnnsiVE7JoN9KIKwqLaHfZ-OsPlPx0l3QfG4ApH2sXAU8MKwZY9XJUiODKvJim8XUjAFKBwMismBxF-1H_B50Q1ZGfKiRYkTKp9WXdyXEQcTlWsRnnZUpWCg0BthpVxWxir1DVVM1-hcDWZv2Uv58EUzWG2unClKQ4L/w400-h188/Random_Fund_A_1.png" width="400" /></a><br /><br />Again, I chose a few benchmarks to compare this performance against: <br /><br /><b>Fidelity Magellan Fund (FMAGX) </b>– A famous actively managed mutual fund which possesses the following strategy, “The Fund seeks capital appreciation. Fidelity Management & Research may buy "growth" stocks or "value" stocks or a combination of both. They rely on fundamental analysis of each issuer and its potential for success in light of its current financial condition, its industry position, and economic and market conditions.” <br /><br />Source: <a href="https://www.marketwatch.com/investing/fund/fmagx">https://www.marketwatch.com/investing/fund/fmagx</a> <br /><br /><b>Vanguard Total Stock Market Index Fund (VTSAX) </b>– Description provided, “Created in 1992, Vanguard Total Stock Market Index Fund is designed to provide investors with exposure to the entire U.S. equity market, including small-, mid-, and large-cap growth and value stocks. The fund’s key attributes are its low costs, broad diversification, and the potential for tax efficiency. 
Investors looking for a low-cost way to gain broad exposure to the U.S. stock market who are willing to accept the volatility that comes with stock market investing may wish to consider this fund as either a core equity holding or your only domestic stock fund.” <br /><br />Source: <a href="https://investor.vanguard.com/investment-products/mutual-funds/profile/vtsax">https://investor.vanguard.com/investment-products/mutual-funds/profile/vtsax</a> <br /><br /><b>### And Introducing a New Challenger! ### </b><br /><br /><b>Columbia Seligman Technology & Info Fd;A (SLMCX) </b>– “The Fund seeks to provide shareholders with capital gain. The Fund will invest at least 80% of its net assets in securities of companies operating in the communications, information and related industries, including companies operating in the information technology and telecommunications sectors.” <br /><br />Source: <a href="https://www.marketwatch.com/investing/fund/slmcx">https://www.marketwatch.com/investing/fund/slmcx</a> <br /><br /><b><u>Why (SLMCX) and not the previous competitive benchmark (AIEQ)? </u></b><br /><br />As the <b>AI Powered Equity ETF (AIEQ)</b> did not exist in the 1990’s, I decided to select a fund which is managed by a company which once employed Charles Kadlec. Who is Charles Kadlec? 
He is an author and financier, having written the book: <u>Dow 100,000: Fact or Fiction</u>.<div><br /></div><div class="separator" style="clear: both; text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6VOKNcVIOe8HbcfkXLInh7tVOpRRTQ5xJ1XoqdkAMgay9aw9Bao_PxAfQNtQfmwDN3od_Huy2SFvbeaV2Zi2h21qC4XWNquGCktMWIMlekbMV5l8aFm4qmuGYwPCJpOc5bPtT9lL8BNSlk70jIg56SsU8pjBNxJu60P5JoAM403OGESgj2yyiwUlT/s631/Random_Fund_A_2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="631" height="228" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6VOKNcVIOe8HbcfkXLInh7tVOpRRTQ5xJ1XoqdkAMgay9aw9Bao_PxAfQNtQfmwDN3od_Huy2SFvbeaV2Zi2h21qC4XWNquGCktMWIMlekbMV5l8aFm4qmuGYwPCJpOc5bPtT9lL8BNSlk70jIg56SsU8pjBNxJu60P5JoAM403OGESgj2yyiwUlT/s320/Random_Fund_A_2.png" width="320" /></a></div><br /><u>Dow 100,000: Fact or Fiction</u> was released in September 1999, having likely been composed throughout 1998. So, in the spirit of the technological optimism of that era, which still exists within the hearts of AIEQ investors, I chose the Columbia Seligman Technology and Information Fund Class A (SLMCX) as our final benchmark. 
</div><br />First, let’s consider our returns over the span of a five year period:<br /><br /><div><p style="font-family: Helvetica; font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-size: 11px; font-stretch: normal; font-style: normal; font-variant-alternates: normal; font-variant-caps: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOdxJWDuHpDJBdpxWc0Z989PANQpo_AcfMjZDHR2Q6S3fV3BdyyES88GQg1_QGfarrUkPSVSgRFYKrMnKcNYxgUcESGhTws_5TnfQT5TiCvFnRHbQwy8rimTu3NzCmVT1FzlGsyoFXXGGVDI1hJoW_C-y_vsCTNNJifu_U800StlerYdcF915ll3gT/s480/Random_Fund_A_3.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="288" data-original-width="480" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOdxJWDuHpDJBdpxWc0Z989PANQpo_AcfMjZDHR2Q6S3fV3BdyyES88GQg1_QGfarrUkPSVSgRFYKrMnKcNYxgUcESGhTws_5TnfQT5TiCvFnRHbQwy8rimTu3NzCmVT1FzlGsyoFXXGGVDI1hJoW_C-y_vsCTNNJifu_U800StlerYdcF915ll3gT/w400-h239/Random_Fund_A_3.png" width="400" /></a></p></div><div><br /></div><div>As was mentioned within the prior article, we should typically expect actively managed funds to hold together more consistently throughout market downturns. In this case, the downturn was the collapse of the tech-bubble. 
</div><div><br /></div><div>As we stretch our assessment period out to the present, our returns are as follows:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT4DwvDvP6kKddpfIIjUdyZx8pMJ_wSpkCozZT3lrB2ivts55BW5ePc0hFLcxDGgNhyoKsMbmewpDi88YSwre6V95Gqdni7XwpLLbQcljvp4h8z_vDnF3wH_VeY3CdBYTkwQ2kEyYNHlaGi-JWLzSYLQ_Pwcx_lvxC5JLU5bppg0UDY7PafElkEJTP6jc/s481/Random_Fund_A_4.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="288" data-original-width="481" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT4DwvDvP6kKddpfIIjUdyZx8pMJ_wSpkCozZT3lrB2ivts55BW5ePc0hFLcxDGgNhyoKsMbmewpDi88YSwre6V95Gqdni7XwpLLbQcljvp4h8z_vDnF3wH_VeY3CdBYTkwQ2kEyYNHlaGi-JWLzSYLQ_Pwcx_lvxC5JLU5bppg0UDY7PafElkEJTP6jc/w400-h239/Random_Fund_A_4.png" width="400" /></a><br /><br />Over a larger period of time, our poor Random Fund got clobbered. The total stock market index fund outperformed the actively managed fund – Magellan. However, in this instance, the (managed) tech sector heavy SLMCX fund destroyed all competitors. 
<br /><br />Let’s take a look at the top holdings of the well performing SLMCX:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgk0WFpSTokWwMWIQMltg1iR-wyrph5h-jE9_kSKVwbh8AeeqW6mCLVkwa_5ZYu4Iaz7gv0Uu3knPjl53LpsQasCqmxVf5KZ72zrMMB7EDkwQLEEIUyTFaVClELp88dasmDIAHyhS2XkY_75mEShjF9flUxk05BkWW_5uUeL6Zqx7SCe5lT8FSGQflQ/s561/Random_Fund_A_6.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="448" data-original-width="561" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgk0WFpSTokWwMWIQMltg1iR-wyrph5h-jE9_kSKVwbh8AeeqW6mCLVkwa_5ZYu4Iaz7gv0Uu3knPjl53LpsQasCqmxVf5KZ72zrMMB7EDkwQLEEIUyTFaVClELp88dasmDIAHyhS2XkY_75mEShjF9flUxk05BkWW_5uUeL6Zqx7SCe5lT8FSGQflQ/w400-h320/Random_Fund_A_6.png" width="400" /></a><br /><br />This mix heavily resembles the top holdings of Fidelity’s NASDAQ Composite Index Fund:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6XT-AYNw5OWHAWFk0lbDk8JdHgio91wdf3WktuDCqwZMydV3JEFc719hE78adG-6YjaaAH6c0p51fJxjfB0ofEU2dyrnk0UTnrFPJszi7iYwrbKVGIWnLlzLv3lvPWfzB7xtzh3Re92kb_jVlDGn67lCS7uJIKxVv_O9TDjO-wGFlC6qkXU74bTsR/s563/Random_Fund_A_7.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="447" data-original-width="563" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6XT-AYNw5OWHAWFk0lbDk8JdHgio91wdf3WktuDCqwZMydV3JEFc719hE78adG-6YjaaAH6c0p51fJxjfB0ofEU2dyrnk0UTnrFPJszi7iYwrbKVGIWnLlzLv3lvPWfzB7xtzh3Re92kb_jVlDGn67lCS7uJIKxVv_O9TDjO-wGFlC6qkXU74bTsR/w400-h318/Random_Fund_A_7.png" width="400" /></a><br /><br />However, SLMCX is a bit heavier in chip sector allocation.<br /><br />Though the gains of SLMCX are impressive in comparison, it would appear that SLMCX's composition is mostly derived from the NASDAQ composite. When compared against the broad US index, SLMCX is a stud. 
When compared against the NASDAQ composite over a 5-year period, SLMCX underperforms.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3HLrLQ2sildDV5ufrikGHQ7t8zzkwo6SVHx71F5ic15-8avrx2LKI02wUySo2l33oHOPOd4-22vaUywQqmYWhgcWJv5fQ3aBwHioddwkHZTMwEmvrXGZ1AxmeMYx2Jaj7Fxw5et2gq5iXmZuUkBevSsARrW5EE7l4BwN2k95ZTGVZpT0Ytwrm4E-2/s702/Random_Fund_A_8.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="552" data-original-width="702" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3HLrLQ2sildDV5ufrikGHQ7t8zzkwo6SVHx71F5ic15-8avrx2LKI02wUySo2l33oHOPOd4-22vaUywQqmYWhgcWJv5fQ3aBwHioddwkHZTMwEmvrXGZ1AxmeMYx2Jaj7Fxw5et2gq5iXmZuUkBevSsARrW5EE7l4BwN2k95ZTGVZpT0Ytwrm4E-2/w400-h314/Random_Fund_A_8.png" width="400" /></a><br /><br />Why?<br /><br />As it pertains to the SLMCX and the NASDAQ both trouncing their competition, it might have something to do with the composition of US markets.<br /><br />For the past few decades, the US technology sector has benefitted from both a lax regulatory environment and significant taxpayer investment. National Defense spending allocations often find their way into tech sector contracts. Research grant money and public infrastructure investments also help drive rapid growth. <br /><br />Random Fund melted on the pavement in comparison to both its index and managed counterparts, most likely due to its holdings not being predominantly comprised of composite components.<br /><br />Managed funds are forced to dink and dunk, spending more time out of the market than their index counterparts. These transactions cause taxes and other costs to be assessed. There is also the management fee. Composites themselves are almost a self-fulfilling prophecy. Stocks included within a composite appreciate in value when funds which benchmark the composite are purchased. 
The stocks themselves also can be purchased individually, which further drives mutual appreciation.</div><div><br /></div><div>So why do investors even consider actively managed funds / hedge funds as investment options? </div><div><br /></div><div>1. Diversification. Typically, wealthier individuals prefer to very broadly diversify their capital allocation. To assist in achieving aspects of an individualized allocation strategy, managed funds offer the opportunity to seek out alternative or exotic investments. </div><div><br /></div><div>2. Liquidity. As most financial crises, both individual and societal, are caused by liquidity contractions, savvy investors with excess capital may want to maintain both excess liquidity and market exposure at the cost of potential sub-alpha returns. This offset is accepted for the sake of liquidity access in times of financial uncertainty. </div><div><br /></div><div>3. Financial Planning. Financial advisors can assist in personal financial guidance and estate planning. In some instances, the advisor relationship entails that the advisee be invested in financial products provided by the advisor's firm. These funds, while sometimes bearing a premium, may assist the client in maintaining the above-listed attributes. 
This is achieved while also providing the client with a personalized financial plan.</div><div><br /></div><div>That is it for this article.</div><div><br /></div><div>I'll see you next week!</div><div><br /></div>-RD<br /><br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-18469384418472330712023-06-26T19:11:00.000-04:002023-06-27T10:02:26.564-04:00 Money HustleFollowing up on prior entries related to the stock market and market timing strategies, I’d like to present the following article for review: <br /><br /><a href="https://prosperitythinkers.com/personal-finance/three-monkeys-and-cat-pick-stocks/">https://prosperitythinkers.com/personal-finance/three-monkeys-and-cat-pick-stocks/</a> <br /><br />This article will be the topic for today’s post. <br /><br />Returning again to the wisdom of Burton Malkiel, who was the subject of our last article, we find this sage of finance being quoted as stating, <br /><br /><b><i> “A blindfolded monkey throwing darts at a newspaper’s financial pages could select a portfolio that would do just as well as one carefully selected by experts.” </i></b><br /><br />It would seem that some researchers took this quote a bit too literally, as the article suggests, and decided to perform variations of the above-described experiment. <br /><br />The article discusses these stock-picking contests in detail. Apparently, in different instances, a cat, school children, and dart throwers were employed to select differing investment portfolios. In each instance described, the seemingly random selection of equities, regardless of the employed methodologies, outperformed both actively managed funds and broad indexes. <br /><br />I decided to perform my own trial research as it relates to this sort of investment approach.<div><br /><b>#### WARNING – INVESTMENT OF ANY SORT BEARS RISK! THIS ARTICLE IS NOT FINANCIAL ADVICE! 
DO NOT REPLICATE THIS EXPERIMENT AND EXPECT TO MAKE MONEY! #### </b><br /><br />First, I chose a few financial benchmarks. <br /><br /><b>AI Powered Equity ETF (AIEQ)</b> – An ETF which is described as, “AIEQ uses artificial intelligence to analyze and identify US stocks believed to have the highest probability of capital appreciation over the next 12 months, while exhibiting volatility similar to the overall US market. The fund selects 30 to 125 constituents and has no restrictions on the market cap of its included securities. The model suggests weights based on capital appreciation potential and correlation to other included companies, subject to a 10% cap per holding. It is worth noting that while AIEQ relies heavily on its quantitative model, the fund is actively-managed, and follows no index.” <br /><br />Source: <a href="https://www.etf.com/AIEQ">https://www.etf.com/AIEQ</a> <br /><br /><b>Fidelity Magellan Fund (FMAGX)</b> – A famous actively managed mutual fund which possesses the following strategy, “The Fund seeks capital appreciation. Fidelity Management & Research may buy "growth" stocks or "value" stocks or a combination of both. They rely on fundamental analysis of each issuer and its potential for success in light of its current financial condition, its industry position, and economic and market conditions.” <br /><br />Source: <a href="https://www.marketwatch.com/investing/fund/fmagx">https://www.marketwatch.com/investing/fund/fmagx</a> <br /><br /><b>Vanguard Total Stock Market Index Fund (VTSAX) </b>– Description provided, “Created in 1992, Vanguard Total Stock Market Index Fund is designed to provide investors with exposure to the entire U.S. equity market, including small-, mid-, and large-cap growth and value stocks. The fund’s key attributes are its low costs, broad diversification, and the potential for tax efficiency. Investors looking for a low-cost way to gain broad exposure to the U.S. 
stock market who are willing to accept the volatility that comes with stock market investing may wish to consider this fund as either a core equity holding or your only domestic stock fund.” <br /><br />Source: <a href="https://investor.vanguard.com/investment-products/mutual-funds/profile/vtsax">https://investor.vanguard.com/investment-products/mutual-funds/profile/vtsax</a> <br /><br />With these benchmarks defined, I set off to create my own randomly established equity fund. To fairly decide which equities my fund would hold, I utilized two websites: <br /><br /><a href="https://www.rayberger.org/random-stock-picker/">https://www.rayberger.org/random-stock-picker/</a> </div><div><br /></div><div><b>~AND~</b><br /><br /><a href="https://www.randstock.ca/selector">https://www.randstock.ca/selector</a> <br /><br />The only equity selections which I outright rejected from inclusion were fixed income ETFs, and ETFs which sought to replicate the performance of a commodity. <br /><br />All funds would receive an equal allocation of capital ($1,000), and the initial issue price of my Random Fund would be set at $10.00 a share. <br /><br /><b>Oshkosh Corporation (OSK)</b> - Oshkosh Corporation designs, manufactures, and markets specialty trucks and access equipment vehicles worldwide. Sector(s): Industrials <br /><br /><b>Franklin FTSE Brazil ETF (FLBR) </b>- The FTSE Brazil Capped Index is based on the FTSE Brazil Index and is designed to measure the performance of Brazilian large- and mid-capitalization stocks. Sector(s): ETF <br /><br /><b>Public Storage (PSA)</b> - Public Storage, a member of the S&P 500 and FT Global 500, is a REIT that primarily acquires, develops, owns, and operates self-storage facilities. Sector(s) - Real Estate <br /><br /><b>Dorian LPG Ltd. (LPG)</b> - Dorian LPG Ltd., together with its subsidiaries, engages in the transportation of liquefied petroleum gas (LPG) through its LPG tankers worldwide. 
The company owns and operates very large gas carriers (VLGCs). Sector(s) - Energy <br /><br /><b>Delta Air Lines, Inc. (DAL) </b>- Delta Air Lines, Inc. provides scheduled air transportation for passengers and cargo in the United States and internationally. Sector(s) – Industrials <br /><br /><b>Grupo Industrial Saltillo, S.A.B. de C.V. (SALT) (MX) </b>- Grupo Industrial Saltillo, S.A.B. de C.V. engages in the design, manufacture, wholesale, and marketing of products for automotive, construction, and houseware industries in Mexico, Europe, and Asia. Sector(s) - Consumer Cyclical <br /><br /><b>Equity LifeStyle Properties, Inc. (ELS)</b> - We are a self-administered, self-managed real estate investment trust (REIT) with headquarters in Chicago. Sector(s) – Real Estate <br /><br /><b>SPDR Portfolio S&P 500 ETF (SPLG) </b>- Under normal market conditions, the fund generally invests substantially all, but at least 80%, of its total assets in the securities comprising the index. Sector(s) – ETF <br /><b><br />DHT Holdings, Inc. (DHT)</b> - DHT Holdings, Inc., through its subsidiaries, owns and operates crude oil tankers primarily in Monaco, Singapore, and Norway. As of March 16, 2023, it had a fleet of 23 very large crude carriers. Sector(s) - Energy <br /><br /><b>Norfolk Southern Corporation (NSC) </b>- Norfolk Southern Corporation, together with its subsidiaries, engages in the rail transportation of raw materials, intermediate products, and finished goods in the United States. 
Sector(s) – Industrials <br /><br />With a purchase date of each equity being set at 5/1/2019 (closing price), our fictitious Random Fund performed as shown below:<br /><br /></div><div style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqwQRi5iiK9Z0N8H2vJ5neysJt4nhnRkH8RgNajF7EnBn6nITs1LdNQrhZf5sIhBtL2-E7k3i6dvm7JU-hoSntAOrmk3mKQVyB4mtc2iqiyjIRaabeqx5A-J9dOYKQSWzk8JlfS8bn3GRnmjnQkqZKz9niDdUxLBC5f5dntzhKWsUOuIRcWRHNywim/s1174/Random_Fund_0.png" style="clear: left; display: inline; margin-bottom: 1em; margin-left: 1em; text-align: center;"><img border="0" data-original-height="364" data-original-width="1174" height="122" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqwQRi5iiK9Z0N8H2vJ5neysJt4nhnRkH8RgNajF7EnBn6nITs1LdNQrhZf5sIhBtL2-E7k3i6dvm7JU-hoSntAOrmk3mKQVyB4mtc2iqiyjIRaabeqx5A-J9dOYKQSWzk8JlfS8bn3GRnmjnQkqZKz9niDdUxLBC5f5dntzhKWsUOuIRcWRHNywim/w400-h122/Random_Fund_0.png" width="400" /></a></div><div style="text-align: left;"> </div><div>Now let’s compare the fund’s performance against our previously decided upon benchmarks: <br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgj39_gEGZKXVftaMPzlhnN1qBXatEO0Hw063Lw0rxRwTwrJ314sn117UIIVaPmGny191AzTn0_yIearz7wGAJKyHfeEEkC-HH_tXB-X7OveYGvVPWrGK5v1wVwV83Xt_O1MYPcV9yLGwPEAGRyDT-BIWlCRlFFmHYuVjJLh3yKmj-2eYyKQ9hikTYd/s460/Random_Fund_1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="150" data-original-width="460" height="129" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgj39_gEGZKXVftaMPzlhnN1qBXatEO0Hw063Lw0rxRwTwrJ314sn117UIIVaPmGny191AzTn0_yIearz7wGAJKyHfeEEkC-HH_tXB-X7OveYGvVPWrGK5v1wVwV83Xt_O1MYPcV9yLGwPEAGRyDT-BIWlCRlFFmHYuVjJLh3yKmj-2eYyKQ9hikTYd/w400-h129/Random_Fund_1.png" width="400" /></a><br /><br />Graphing the performance of each fund across multiple years: <br /><br /><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha-DQr0syKMkvu42Apykhneu0gLQ8zXA-95bKiaScrbrB7LHdaTl8Y7Nc4u5_ju-WQWQr8cZxOxkz6Vj0pLk4o5120UssaHBuC9ekrx5j8l0y_2PgZ0ufUCDLeI0rgDOQdYwqLIBV8te1Zyv8QqN_WtEboZs7c7IJKbk49HdooXtQzA7pci-myOfuk/s480/Random_Fund_2.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="288" data-original-width="480" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha-DQr0syKMkvu42Apykhneu0gLQ8zXA-95bKiaScrbrB7LHdaTl8Y7Nc4u5_ju-WQWQr8cZxOxkz6Vj0pLk4o5120UssaHBuC9ekrx5j8l0y_2PgZ0ufUCDLeI0rgDOQdYwqLIBV8te1Zyv8QqN_WtEboZs7c7IJKbk49HdooXtQzA7pci-myOfuk/w400-h239/Random_Fund_2.png" width="400" /></a><br /><br />Or looking at returns over the span of a five year period: <br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKliPnOxLxBW_PnVoq0eyV6Qi4a4vmwTFjCVVrUjnm6jE09urq1OTJbw69trFTtDlSiIDvmxh52JmBZ3g0ma9YqwLMbycgUQQgozQqxeuW1q_4WggFdMG0ciMPaUPJtbAYtl01HiH5lRxh6EmJs3gYrmWjNpSqbAArLyJ3_gvS8RgZ-b-yoM9es1vZ/s480/Random_Fund_3.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="288" data-original-width="480" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKliPnOxLxBW_PnVoq0eyV6Qi4a4vmwTFjCVVrUjnm6jE09urq1OTJbw69trFTtDlSiIDvmxh52JmBZ3g0ma9YqwLMbycgUQQgozQqxeuW1q_4WggFdMG0ciMPaUPJtbAYtl01HiH5lRxh6EmJs3gYrmWjNpSqbAArLyJ3_gvS8RgZ-b-yoM9es1vZ/w400-h239/Random_Fund_3.png" width="400" /></a><br /><br />Strangely enough, our Random Fund outperforms the advisor managed fund (Magellan Fund), the broad based index fund (VTSAX), and the AI managed fund (AIEQ). While actively managed funds typically underperform index benchmarks for a multitude of reasons, I found it incredibly odd that my e-monkey Random Fund outperformed even the index itself. This is also what happened to be the case in the similarly conducted experiments mentioned within the article. 
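As an aside, the return arithmetic behind these comparisons is simple to replicate. The R sketch below is my own illustration; the input values are hypothetical placeholders, not the actual figures from the tables above.

```r
# Return arithmetic behind the fund comparisons above.
# The example inputs are hypothetical placeholders, not actual fund figures.

# Total return over the holding period, expressed as a percentage
total_return <- function(start_value, end_value) {
  (end_value - start_value) / start_value * 100
}

# Compound annual growth rate (CAGR), expressed as a percentage
cagr <- function(start_value, end_value, years) {
  ((end_value / start_value)^(1 / years) - 1) * 100
}

total_return(10, 15)  # e.g., a $10.00 issue price growing to $15.00 -> 50
cagr(10, 15, 4)       # the same move annualized over 4 years
```

Total return rewards the full move; CAGR is what makes multi-year comparisons between funds fair.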
<br /><br />Let’s think of a few informed reasons as to why this might be: <br /><br />1. Actively managed funds often turn over equities with far greater frequency when compared to their index counterparts. Index funds typically owe the majority of their equity turnover to modifications made to the underlying index. This churn creates both tax liabilities and missed dividend opportunities. <br /><br />2. Our Random Fund contains fewer components, and is less balanced than its competitors. Therefore, we would expect the fund to be more heavily impacted by variance. In bull markets, I would expect such a selection of equities to outperform an underlying index. However, in bear markets, the inverse should hold true; our Random Fund would likely lose more value than its contemporaries. <br /><br />3. The time frame which is being utilized to assess performance is both short in duration and directionally positive for equities. <br /><br />4. Actively managed funds attempt to protect investors from downside, which also limits the upside potential for returns. <br /><br />5. As it relates to #4, actively managed funds must also keep greater amounts of cash and cash equivalents on hand. This equates to time out of the market. <br /><br />6. Actively managed funds have higher management fees, which are utilized to compensate fund managers. <br /><br />7. Index components are themselves popular equities. This means that incoming investment money chases the price of individual equities upward, whether through individual stock purchases or through periodic fund purchases. <br /><br />8. Randomly selecting equities is far less biased than making an “informed selection”. Such bias often manifests in vastly overestimating one's abilities, and underestimating the abilities of others. Such overestimation may cause an individual manager to go overweight in a particular sector or individual equity, and in doing so, miss opportunities elsewhere. 
As indexes are broad, there is always exposure to opportunity within a bull market. This, combined with the other previously numbered aspects, likely partially explains the underperformance of actively managed funds. <br /><br />9. As it relates to point #8: in a bull market, over a short time span, randomly selecting a small number of equities may be the best method of employing a strategy which generates alpha, as bias will be eliminated and beta will be increased. <br /><br />10. Active fund managers may not possess the fiduciary latitude to allow their investment strategies to materialize. As annual alpha is always the metric against which high-net-worth clients measure all performance, an informed, but otherwise risk-incurring, position must deliver results in the intermediate term, or risk liquidation at a loss of both capital and opportunity. </div><div><br /></div><div>That's all for today.</div><div class="separator" style="clear: both; text-align: left;"></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Until next time, stay curious data heads!</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-74119478802013438342023-06-17T15:03:00.005-04:002023-06-17T18:58:54.753-04:00(R) Is Wall Street Random? 
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN3rSmlajpih8cINejhI45vpG6uOd7inOGTcq9qQl7AzACf5yi1Onu81IyAs0LDU9mGWsuEP-Xj3uM_9-ffOIN5eqNyN54lRJ3uf2L0YLua4hzeKulYsrf0tSL8DRwwYbCAnZKngVWqlaVTMxaXUEJCuVUAa1Rp9_-oBHRtnxHb3artuiEpp5X--KZ/s954/Fuller_vs_Perron.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="328" data-original-width="954" height="138" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN3rSmlajpih8cINejhI45vpG6uOd7inOGTcq9qQl7AzACf5yi1Onu81IyAs0LDU9mGWsuEP-Xj3uM_9-ffOIN5eqNyN54lRJ3uf2L0YLua4hzeKulYsrf0tSL8DRwwYbCAnZKngVWqlaVTMxaXUEJCuVUAa1Rp9_-oBHRtnxHb3artuiEpp5X--KZ/w400-h138/Fuller_vs_Perron.png" width="400" /></a></div><div><br /></div>Anyone who has spent any serious time within the investment world, has probably at some point, encountered the book, <u>A Random Walk Down Wall Street</u>. While the book does contain some interesting historical anecdotes, and disproves numerous methods of stock picking quackery, the title itself refers to the following theory: <br /><br /><i>Burton G. Malkiel, an economics professor at Princeton University and writer of <u>A Random Walk Down Wall Street</u>, performed a test where his students were given a hypothetical stock that was initially worth fifty dollars. The closing stock price for each day was determined by a coin flip. If the result was heads, the price would close a half point higher, but if the result was tails, it would close a half point lower. Thus, each time, the price had a fifty-fifty chance of closing higher or lower than the previous day. Cycles or trends were determined from the tests. Malkiel then took the results in chart and graph form to a chartist, a person who "seeks to predict future movements by seeking to interpret past patterns on the assumption that 'history tends to repeat itself'." 
The chartist told Malkiel that they needed to immediately buy the stock. Since the coin flips were random, the fictitious stock had no overall trend. Malkiel argued that this indicates that the market and stocks could be just as random as flipping a coin. </i><br /><br />Source: <a href="https://en.wikipedia.org/wiki/Random_walk_hypothesis">https://en.wikipedia.org/wiki/Random_walk_hypothesis</a> <br /><br />It would seem that within the field of contemporary finance, some critics believe that Malkiel's book is premised upon a flawed theory. In this article, we will perform our own analysis in order to determine which side is correct in their assumption. This isn’t in any way meant to provide further evidence to either side of the age-old Managed Fund vs. Index Fund argument, but instead to take the evidence which we have as it pertains to this proposed theory, and see if it withstands a thorough statistical assessment.<br /><div><br /><b><u>Random Walking </u></b><br /><br />To begin our foray into proving / disproving “The Random Walk Hypothesis”, let’s take a random walk through R-Studio. <br /><br />First, we’ll set a number of sample observations (n = 101). Then, we’ll perform the same experiment that Malkiel performed with his students. We’ll do this by randomly drawing one of two numbers (-1, 1), with 1 equating to a step upward, and -1 to a step downward. <br /><br />However, we’ll make a few slight modifications to our experiment. As prices for an equity index cannot (theoretically) go negative, or in most cases, reach zero only to rebound, I’ve added a few caveats to our simulation. <br /><br />We will be creating a random walk. However, in every instance in which our random walk would take us below a zero threshold, the absolute value of the outcome will instead be returned. 
For example, in the case of a typical random walk, the values (1,-1,1,-1,-1,-1) produce the corresponding cumulative elements (1,0,1,0,-1,-2), as each new value is added to the running sum of the prior elements. <br /><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><br /><b><u>Example</u>:</b><br /><br />Random Walk Generated Values = {1,-1,1,-1,-1,-1} <br /><br />0 + <b>1</b> = 1 <br /><br />1 – <b>1</b> = 0 <br /><br />0 + <b>1</b> = 1 <br /><br />1 – <b>1</b> = 0 <br /><br />0 – <b>1</b> = -1 <br /><br />-1 – <b>1</b> = -2 <br /><br />Thus, our cumulative elements are: <br /><br />0 + 1 = <b>1 </b><br /><br />1 – 1 = <b>0 </b><br /><br />0 + 1 =<b> 1</b> <br /><br />1 – 1 = <b>0 </b><br /><br />0 – 1 = <b>-1 </b><br /><br />-1 – 1 = <b>-2 </b><br /><br />In our variation of the simulation, instead of returning negative values, we will only return the absolute values of the cumulative elements. <br /><br /><b><u>Example</u></b>:<br /><br />|1| = <b>1 </b><br /><br />|0| = <b>0 </b><br /><br />|1| = <b>1</b> <br /><br />|0| = <b>0 </b><br /><br />|-1| = <b>1 </b><br /><br />|-2| = <b>2 </b></p><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><br />Another modification that we will be making is that every element of 0 will be changed to the value of 1. 
</p><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><br /><b><u>Example</u></b>: <br /><br />1 = <b>1 </b><br /><br />0 = <b>1 <br /></b><br />1 = <b>1 </b><br /><br />0 = <b>1 </b><br /><br />1 = <b>1 </b><br /><br />2 = <b>2 <br /></b><br /><b># Set the random seed to 7 so that this example’s outcome can be reproduced # <br /><br />set.seed(7) <br /><br /># 101 random elements are needed for our example # <br /><br />n <- 101 <br /><br /># Create a random walk with caveats (absolute values returned for negative numbers) # <br /><br />Random_Walk <- abs(cumsum(sample(c(-1, 1), n, TRUE))) <br /><br /># Further modify the random walk values (values of 0 will be modified to 1) # <br /><br />Random_Walk <- ifelse(Random_Walk == 0, 1, Random_Walk) <br /><br /># We’ll be attempting to simulate Dow Index returns from 1923 – 2023 # <br /><br />year <- 1923:2023 <br /><br /># Graph the Random_Walk simulation # <br /><br />Random_Walk_Plot <- data.frame(year, Random_Walk) <br /><br />plot(Random_Walk_Plot, type = "l", xlim = c(1923, 2023), ylim = c(0, 30), <br /><br /> <span> </span>col = "blue", xlab = "n", ylab = "Rw")</b></p><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><b><br /></b></p><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; 
font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;">This produces the following graphic:</p><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><br /></p><div class="separator" style="clear: both; font-weight: bold; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz5fukpkGQCLAPwDv2LlM7M860--hRO6kDRws6knu0oIZj4YIesuNz8CCBM3sIIsl5hOHOSTFB-QFeZsR2hx-52LOqJbO568fTmhZRkgwFZKajjlGlmEeoCUckT7e6gzpbglZvh5K5C0-ebA8zm1AawMywiQ0-JVjjy-YaZcwBmzy22N0oK4tx24SR/s636/Random_Rplot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="636" data-original-width="540" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz5fukpkGQCLAPwDv2LlM7M860--hRO6kDRws6knu0oIZj4YIesuNz8CCBM3sIIsl5hOHOSTFB-QFeZsR2hx-52LOqJbO568fTmhZRkgwFZKajjlGlmEeoCUckT7e6gzpbglZvh5K5C0-ebA8zm1AawMywiQ0-JVjjy-YaZcwBmzy22N0oK4tx24SR/w340-h400/Random_Rplot.png" width="340" /></a></div><div class="separator" style="clear: both; font-weight: bold; text-align: center;"><br /></div>Which in some ways resembles:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlKBMWuknbwvObcqv9QBwEy44M8kC8E-7_gGx2ARv_fO0CcTMxGwC0Y1ZqB3snQWO58XOFT4veJOaDwzswMdgtUESNlpaeQyZUzLWnIL1SgaBJdsZ93RBjQZC8lucIV9YuG2AlGbzaUVvAiin4yEH1py0AfxugckyUNw4bogZTguIz4Ldj3lPkRuZG/s853/Dow_Jones_0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" 
data-original-height="544" data-original-width="853" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlKBMWuknbwvObcqv9QBwEy44M8kC8E-7_gGx2ARv_fO0CcTMxGwC0Y1ZqB3snQWO58XOFT4veJOaDwzswMdgtUESNlpaeQyZUzLWnIL1SgaBJdsZ93RBjQZC8lucIV9YuG2AlGbzaUVvAiin4yEH1py0AfxugckyUNw4bogZTguIz4Ldj3lPkRuZG/w400-h254/Dow_Jones_0.png" width="400" /></a></div><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><br /></p><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;">Let’s overlay a resized version of the random walk graphic against the Dow Jones Industrial Average’s annualized returns:<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYG4WKKKyDjdPjeR_wv_GzZVnCc7Gt0H-uIAZO5e9dYfVwhj11yd0Xl-HsG-i73TBm7jnzewAxgzohA-gyl4CrVA0T2zYmztfbR8qjiv4fBqd4lMYvZa76Zb3IlNfv0mfVDElcjF91f0zG8D_cBBfb61Mm4vqAvLugX4XWlKJAhiKO4jk8BxwB9vhW/s804/Dow_Jones_Overlay.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="517" data-original-width="804" height="258" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYG4WKKKyDjdPjeR_wv_GzZVnCc7Gt0H-uIAZO5e9dYfVwhj11yd0Xl-HsG-i73TBm7jnzewAxgzohA-gyl4CrVA0T2zYmztfbR8qjiv4fBqd4lMYvZa76Zb3IlNfv0mfVDElcjF91f0zG8D_cBBfb61Mm4vqAvLugX4XWlKJAhiKO4jk8BxwB9vhW/w400-h258/Dow_Jones_Overlay.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;">Obviously, this proves nothing. It only demonstrates that, by selectively choosing a randomly generated pattern, one can draw aesthetic comparisons to an existing pattern. <br /><br />Instead of taking the Random Walk Hypothesis at face value, or postulating ill-informed criticisms, let’s attack this hypothesis like good data scientists. First, we’ll set aside amateurish comparative assessment, and run a few tests in order to reach a well-researched conclusion. <br /><br />To test whether or not the market is random, we’ll need real world (US) market data. Of the available indexes, I decided to choose the Dow Jones Industrial Average. The reasons supporting this decision are as follows: </p><p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px 0px 8px;"><br />1. The S&P 500 Stock Composite Index was not created until March of 1957. 
Therefore, not as many data points are available as compared to the Dow Jones Industrial Average. <br /><br />2. The NASDAQ Composite Index was not created until February of 1971. It also lacks the broad market exposure which is found within the Dow Jones Industrial Average. <br /><br />The data gathered to perform the following analysis, which also provided the Dow Jones Annual Average graphic above, originated from the source below.<br /><br />Source: <a href="https://www.macrotrends.net/1319/dow-jones-100-year-historical-chart">https://www.macrotrends.net/1319/dow-jones-100-year-historical-chart</a></p><div><br /><b># Dow Jones Industrial Average Years Vector #<br /><br />Dow_Year <- c(2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970, 1969, 1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961, 1960, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950, 1949, 1948, 1947, 1946, 1945, 1944, 1943, 1942, 1941, 1940, 1939, 1938, 1937, 1936, 1935, 1934, 1933, 1932, 1931, 1930, 1929, 1928, 1927, 1926, 1925, 1924, 1923)<br /></b><br /><b># Dow Jones Industrial Average Annual Closing Price Vector #</b><br /><br /><b>Dow_Close <- c(33301.87, 33147.25, 36338.3, 30606.48, 28538.44, 23327.46, 24719.22, 19762.6, 17425.03, 17823.07, 16576.66, 13104.14, 12217.56, 11577.51, 10428.05, 8776.39, 13264.82, 12463.15, 10717.5, 10783.01, 10453.92, 8341.63, 10021.57, 10787.99, 11497.12, 9181.43, 7908.3, 6448.27, 5117.12, 3834.44, 3754.09, 3301.11, 3168.83, 2633.66, 2753.2, 2168.57, 1938.83, 1895.95, 1546.67,<br />1211.57, 1258.64, 1046.54, 875, 963.99, 838.74, 805.01, 831.17, 1004.65, 852.41, 616.24, 850.86, 1020.02, 890.2, 838.92, 800.36, 943.75, 905.11, 785.69, 969.26, 874.13, 762.95, 652.1, 731.14, 615.89,<br 
/>679.36, 583.65, 435.69, 499.47, 488.4, 404.39, 280.9, 291.9, 269.23, 235.41, 200.13, 177.3, 181.16, 177.2, 192.91, 152.32, 135.89, 119.4, 110.96, 131.13, 150.24, 154.76, 120.85, 179.9, 144.13, 104.04, 99.9, 59.93, 77.9, 164.58, 248.48, 300, 200.7, 157.2, 151.08, 120.51, 95.52)</b><br /><br /><b># Combine both vectors into a singular data frame #<br /><br />Dow_Data_Frame <- data.frame(Dow_Year, Dow_Close)<br /><br /># Preview the data frame #<br /><br />Dow_Data_Frame</b><br /><br />This produces the output:<br /><br /><i>> Dow_Data_Frame<br /> Dow_Year Dow_Close<br />1 2023 33301.87<br />2 2022 33147.25<br />3 2021 36338.30<br />4 2020 30606.48<br />5 2019 28538.44<br />6 2018 23327.46<br />7 2017 24719.22<br />8 2016 19762.60<br /></i><br />Though we aren’t testing this hypothesis directly through the application of a singular methodology, our hypothesis for this general experiment would resemble something like:<br /><br /><b>H0 (null): The Dow Jones Industrial Average Index’s annual returns are NOT random.</b><br /><br /><b>Ha (alternative): The Dow Jones Industrial Average Index’s annual returns are random.</b><br /><br />The primary test that we will perform is the Phillips-Perron Unit Root Test. This particular method assesses time series data for order of integration. In simplified terms, order of integration is the minimum number of differences required to obtain a covariance-stationary series. 
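To make the idea of order of integration concrete, here is a brief R illustration of my own (not part of the original analysis): a simple random walk is integrated of order one, since a single round of differencing recovers the stationary sequence of steps.

```r
# A random walk is integrated of order 1: differencing it once
# recovers the stationary sequence of i.i.d. steps.
set.seed(7)

steps <- sample(c(-1, 1), 100, replace = TRUE)  # stationary i.i.d. steps
walk  <- cumsum(steps)                          # I(1): non-stationary level

# One difference undoes the cumulative sum (up to the first element)
all(diff(walk) == steps[-1])  # TRUE
```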
In the case of the Phillips-Perron Unit Root Test, we will be utilizing this methodology to assess for random walk potential.<br /><br /><b># The package: ‘tseries’ must be downloaded and enabled in order to utilize the adf.test() function used below; the PP.test() function is included within base R #<br /><br />library(tseries)<br /><br /># Phillips-Perron Unit Root Test - A methodology utilized to test data for random walk potential #<br /><br /># Null - The time series IS integrated of order (not-random) #<br /><br /># Alternative - The time series is NOT integrated of order (random) #<br /><br />PP.test(Dow_Data_Frame$Dow_Close)</b><br /><br />This produces the output:<br /><br /> <i><span> </span>Phillips-Perron Unit Root Test<br /><br />data: Dow_Data_Frame$Dow_Close<br />Dickey-Fuller = -3.6678, Truncation lag parameter = 4, p-value = 0.03055</i><br /><br />The secondary analysis which we will perform on our time series data is the Augmented Dickey-Fuller Test. This test assesses data for stationarity. <br /><br /><b># Augmented Dickey-Fuller Test - A methodology utilized to test data for stationarity # <br /><br /># Null - Data is NOT stationary #<br /><br /># Alternative - Data IS stationary #<br /><br />adf.test(Dow_Data_Frame$Dow_Close)</b><br /><br /></div><div>This produces the output:<br /><br /><i> <span> </span>Augmented Dickey-Fuller Test<br /><br />data: Dow_Data_Frame$Dow_Close<br />Dickey-Fuller = -5.0152, Lag order = 4, p-value = 0.01<br />alternative hypothesis: stationary<br /><br />Warning message:<br />In adf.test(Dow_Data_Frame$Dow_Close) :<br /> p-value smaller than printed p-value</i><br /><br />Assuming an alpha value of .05, we would reject the null hypothesis in both instances. Thus, we would conclude with 95% confidence that the annual Dow Jones Industrial Average closing prices are not integrated of order, and are also stationary. The combination of these results would therefore indicate that The Dow Jones Industrial Average returns are random. 
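As a sanity check of my own (not part of the original write-up), we can apply the same two tests to a series which is a random walk by construction. For such a series we would typically expect large p-values, i.e., a failure to reject either null hypothesis, though any individual seed can produce an atypical draw.

```r
# Apply the same tests to a series that IS a random walk by construction.
# Large p-values (failure to reject) are the typical result here.
library(tseries)  # provides adf.test(); PP.test() ships with base R

set.seed(7)
pure_rw <- cumsum(rnorm(101))  # 101 steps of a Gaussian random walk

PP.test(pure_rw)
adf.test(pure_rw)
```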
<br /><br />I believe that the Dickey-Fuller test provides much more interesting insight. Unless a type I error was committed, we would eventually expect to witness either a Seneca Cliff event or a parabolic downturn within the market, given a long enough time frame. WOULD, EXPECT, EVENTUALLY, and UNLESS being the key terms here. (Don’t time the market or trade on the basis of a singular statistical methodology).<br /><br />Some of you might be wondering why The Wald-Wolfowitz Test was not utilized. As a reminder, this particular method is only applicable to factor series data. However, if we were to inappropriately apply it in this instance, it would resemble the following:<br /><br /><b># THIS IS NOT APPROPRIATE FOR OUR EXAMPLE #<br /><br /># THIS CODE IS FOR DEMONSTRATION PURPOSES ONLY #<br /><br /># The package: ‘trend’ must be downloaded and enabled in order to utilize the ww.test() function #<br /><br />library(trend)<br /><br /># Wald-Wolfowitz Test #<br /><br /># Null - Each element in the sequence is (independently) drawn from the same distribution (not-random) #<br /><br /># Alternative - Each element in the sequence is not (independently) drawn from the same distribution (random) #<br /><br />Dow_Close_Factor <- as.factor(Dow_Data_Frame$Dow_Close)<br /><br />ww.test(Dow_Close_Factor)</b><br /><br />So that is it for today’s article, Data Heads. It would seem that Malkiel is vindicated, at least as it pertains to the methodologies which we applied within this particular entry. I’ll be back again soon with more data content.
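For the curious, the runs-test idea behind Wald-Wolfowitz can be sketched in Python when applied to the kind of binary series it actually suits: the signs of period-over-period moves. This is a minimal, illustrative sketch on made-up data; the z-statistic relies on a normal approximation that assumes a reasonably long sequence.

```python
import math

def runs_test(signs):
    """Wald-Wolfowitz runs test on a +1/-1 sequence (e.g. up/down moves).
    Returns the observed run count and an approximate z-statistic."""
    n1 = signs.count(1)
    n2 = signs.count(-1)
    # A new run begins whenever the sign flips.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n)) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    return runs, z

# Signs of hypothetical year-over-year index changes.
moves = [1, 1, -1, 1, -1, -1, 1, 1, 1, -1]
runs, z = runs_test(moves)
print(runs)  # 6
```

A z-statistic near zero (as here) means the observed number of runs is close to what a random ordering would produce.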
<br /><br />-RD</div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-6919489766985958022022-12-13T18:51:00.009-05:002022-12-13T19:02:10.148-05:00(R) Stein’s Paradox / The James-Stein Estimator<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1nooGqz4h0LWGJ0Gz31oPyGyfgb1jy1jY8TvGFKAU-5h-yKGOyAWvVKmkiU9C7vJYdBLXA4qmaRY8wJevhFrnqbQvg4h3WvZClcwBWHvnnaEWhonpxtD6g1OGFmDhNC4fx9VESsWOb0K2z_VPHTZbwpOiJDYubSz_85ATytweMIRwQYqwQ4vc-BTc/s1280/Okabe.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="1280" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1nooGqz4h0LWGJ0Gz31oPyGyfgb1jy1jY8TvGFKAU-5h-yKGOyAWvVKmkiU9C7vJYdBLXA4qmaRY8wJevhFrnqbQvg4h3WvZClcwBWHvnnaEWhonpxtD6g1OGFmDhNC4fx9VESsWOb0K2z_VPHTZbwpOiJDYubSz_85ATytweMIRwQYqwQ4vc-BTc/w400-h225/Okabe.jpg" width="400" /></a></div><div><br /></div>Imagine a situation in which you were provided with data samples from numerous independent populations. Now what if I told you, that combining all of the samples into a single equation, is the best methodology for estimating the mean of each population.<br /><br />Hold on. <br /><br />Hold on. <br /><br />Wait. <br /><br />You’re telling me, that combining independently sampled data into a single pool, from independent sources, can provide assumptions as it pertains to the source of each sample? <br /><br />Yes! <br /><br />And this methodology provides a better estimator than other available conventional methods? <br /><br />Yes again. <br /><br />This was the conversation which divided the math world in 1956. 
<br /><br />Here is an article detailing the phenomenon and findings of Charles Stein, from Scientific American (.PDF Warning): <br /><br /><a href="https://efron.ckirby.su.domains//other/Article1977.pdf">https://efron.ckirby.su.domains//other/Article1977.pdf</a> <br /><br />Since we have computers, let’s give the James-Stein Estimator a little test-er-roo. In the digital era, we are no longer forced to accept hearsay proofs. <br /><br />(The code below is a heavily modified and simplified version of code which was originally queried from: <a href="https://bookdown.org/content/922/james-stein.html">https://bookdown.org/content/922/james-stein.html</a>)<div><br /><b>################################################################################## <br /><br />### Stein’s Paradox / The James-Stein Estimator ### <br /><br />## We begin by creating 4 independent samples generated from normally distributed data sources ## <br /><br />## Each sample is comprised of random numbers ## <br /><br /># 100 Random Numbers, Mean = 500, Standard Deviation = 155 # <br /><br />Ran_A <- rnorm(100, mean=500, sd=155) <br /><br /># 100 Random Numbers, Mean = 50, Standard Deviation = 22 # <br /><br />Ran_B <- rnorm(100, mean=50, sd= 22) <br /><br /># 100 Random Numbers, Mean = 1, Standard Deviation = 2 # <br /><br />Ran_C <- rnorm(100, mean=1, sd = 2) <br /><br /># 100 Random Numbers, Mean = 1000, Standard Deviation = 400 # <br /><br />Ran_D <- rnorm(100, mean=1000, sd=400) <br /><br /># I went ahead and sampled a few of the elements from each series which were generated by my system # <br /><br />testA <- c(482.154, 488.831, 687.691, 404.691, 604.8, 639.283, 315.656) <br /><br />testB <- c(53.342841, 63.167245, 47.223326, 44.532218, 53.527203, 40.459877, 83.823073) <br /><br />testC <-c(-1.4257942504, 2.2265732374, -0.6124066829, -1.7529138598, -0.0156957983, -0.6018709735 ) <br /><br />testD <- c(1064.62403, 1372.42996, 976.02130, 1019.49588, 570.84984, 82.81143, 517.11726,
1045.64377) <br /><br /># We now must create a series which contains all of the sample elements # <br /><br />testall <- c(testA, testB, testC, testD) <br /><br /># Then we will take the mean measurement of each sampled series # <br /><br />MLEA <- mean(testA) <br /><br />MLEB <- mean(testB) <br /><br />MLEC <- mean(testC) <br /><br />MLED <- mean(testD) <br /><br /># Next, we will derive the mean of the combined sample elements # <br /><br />p_ <- mean(testall) <br /><br /># We must assign to ‘N’ the number of sets which we are assessing # <br /><br />N <- 4 <br /><br /># We must also derive the median of the combined sample elements # <br /><br />medianden <- median(testall) <br /><br /># sigma2 = p_ * (1 - p_) / medianden # <br /><br />sigma2 <- p_ * (1-p_) / medianden <br /><br /># Now we’re prepared to calculate the assumed population mean of each sample series # <br /><br />c_A <- p_+(1-((N-3)*sigma2/(sum((MLEA-p_)^2))))*(MLEA-p_) <br /><br />c_B <- p_+(1-((N-3)*sigma2/(sum((MLEB-p_)^2))))*(MLEB-p_) <br /><br />c_C <- p_+(1-((N-3)*sigma2/(sum((MLEC-p_)^2))))*(MLEC-p_) <br /><br />c_D <- p_+(1-((N-3)*sigma2/(sum((MLED-p_)^2))))*(MLED-p_) <br /><br />################################################################################## <br /><br /># Predictive Squared Error # <br /><br />PSE1 <- (c_A - 500) ^ 2 + (c_B - 50) ^ 2 + (c_C - 1) ^ 2 + (c_D - 1000) ^ 2 <br /><br />######################## <br /><br /># Predictive Squared Error # <br /><br />PSE2 <- (MLEA- 500) ^ 2 + (MLEB - 50) ^ 2 + (MLEC - 1) ^ 2 + (MLED - 1000) ^ 2 <br /><br />######################## <br /><br />1 - 28521.5 / 28856.74 <br /><br />################################################################################## <br /></b><br /><i>1 - 28521.5 / 28856.74 = 0.01161739</i></div><div><br /></div><div>So, we can conclude, through the utilization of predictive squared error (PSE) as an accuracy assessment technique, that Stein’s Methodology (AKA The James-Stein Estimator) provided a 1.16%
better estimation of the population mean for each series, as compared to the mean of each sample series assessed independently. <br /><br />Charles Stein really was a pioneer in the field of statistics as he discovered one of the first instances of dimension reduction. <br /><br />If we consider our example data sources below: </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIF_Ce4N-2T_MoQ6SN0SfDp4VeSR6Juh7DtA4AG31Uhjo0wbTR0ptMxrwCnRrA2QYxQogIrjeAZODYo1sydbIvI1cmRFhU719NLCxR-30O1gEbx78u_x_KqAv4ZY7-CtWmLCbCfPKUaA8MQO3F1G-UXYxnggR5PoteD4CNJaE9b9kw0OcFHBwtZdVf/s975/Stein_0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="435" data-original-width="975" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIF_Ce4N-2T_MoQ6SN0SfDp4VeSR6Juh7DtA4AG31Uhjo0wbTR0ptMxrwCnRrA2QYxQogIrjeAZODYo1sydbIvI1cmRFhU719NLCxR-30O1gEbx78u_x_KqAv4ZY7-CtWmLCbCfPKUaA8MQO3F1G-UXYxnggR5PoteD4CNJaE9b9kw0OcFHBwtZdVf/w640-h285/Stein_0.png" width="640" /></a></div><div><div class="separator" style="clear: both; text-align: center;"><br /></div>Applying the James-Stein Estimator to the data samples from each series’ source reduces the innate distance which exists between each sample. In simpler terms, this essentially equates to all elements within each sample being shifted towards a central point.
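The textbook form of the James-Stein shrinkage can be sketched in Python as follows. Note that this sketch treats sigma-squared as a known quantity and uses made-up illustrative values, rather than the median-based approximation employed in the R code above; it simply demonstrates the shrinkage of each observed mean toward the grand mean.

```python
# Observed sample means for k independent groups (illustrative values).
observed = [510.0, 48.0, 1.5, 980.0]
sigma2 = 25.0  # assumed known variance of each observed mean

k = len(observed)
grand_mean = sum(observed) / k
ss = sum((x - grand_mean) ** 2 for x in observed)

# James-Stein shrinkage factor: 1 - (k - 3) * sigma^2 / sum of squares.
factor = 1 - (k - 3) * sigma2 / ss

# Pull every observed mean toward the grand mean by the common factor.
shrunk = [grand_mean + factor * (x - grand_mean) for x in observed]

print(shrunk)
```

Because the factor is strictly less than one, every estimate lands between its observed mean and the grand mean, while the relative ordering of the groups is preserved.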
<br /><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs6XMHkiUpqP-Q5Lw6dD1dswWEbBbV4hvwKxDCujpwNSenkTFuGVylO90ZDwk8ViS3qptLjavdge-IMZrOKuzDk8AfIvw5DvDqMHA-h8ID1RD34NLuGocMZZmLpcOIEcdCPEUbjFhRUdO1o-f-nj3aoVEpfcIdgJ5Q0FcpkoH95eM4chnfxVPRofbd/s674/Stein_1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="401" data-original-width="674" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs6XMHkiUpqP-Q5Lw6dD1dswWEbBbV4hvwKxDCujpwNSenkTFuGVylO90ZDwk8ViS3qptLjavdge-IMZrOKuzDk8AfIvw5DvDqMHA-h8ID1RD34NLuGocMZZmLpcOIEcdCPEUbjFhRUdO1o-f-nj3aoVEpfcIdgJ5Q0FcpkoH95eM4chnfxVPRofbd/w400-h238/Stein_1.png" width="400" /></a></div></blockquote><div><br />Series elements which were already in close proximity to the mean, now move slightly closer to the mean. Series elements which were originally far from the mean, move much closer to the mean. These outside elements still maintain their order, but they are brought closer to their fellow series peers. This shifting of the more extreme elements within a series, is what makes the James-Stein Estimator so novel in design, and potent in application. <br /><br />This one really blew my noggin when I first discovered and applied it. 
<br /><br />For more information on this noggin-blowing technique, please check out: <br /><br /><a href="https://www.youtube.com/watch?v=cUqoHQDinCM">https://www.youtube.com/watch?v=cUqoHQDinCM</a><br /><br /></div><div><b>~ and ~ </b><br /><a href="https://www.statisticshowto.com/james-stein-estimator/"><br />https://www.statisticshowto.com/james-stein-estimator/</a></div><div><br /></div><div>That's all for today.</div><div><br /></div><div>Come back again soon for more perspective altering articles.</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-50122796639483528512022-10-28T18:54:00.005-04:002022-10-28T18:59:22.542-04:00(Python) The Number: 13 (Happy, Lucky, Primes) - A Spooky Special!This Halloween, we’ll be covering a very spooky topic. <div><br />I feel that the number “<b>13</b>”, for far too long, has been unfairly maligned. <br /><br />Today, it will have its redemption. <br /><br />Did you know that the number “<b>13</b>”, by some definitions, is both a happy and lucky number? Let’s delve deeper into each definition, and together discover why this number deserves much more respect than it currently receives.<div><br /></div><div><b><u>Happy Numbers</u></b><br /> <br />In number theory, a happy number is a number which eventually reaches 1 when repeatedly replaced by the sum of the squares of its digits.
* </div><div><br /></div><div><b>Example:</b> <br /><br />For instance, <b>13 </b>is a happy number because: <br /><br />(1 * 1) + (3 * 3) = 10 <br /><br />(1 * 1) + (0 * 0) = 1 <br /><br />and the number <b>19</b> is also a happy number because: <br /><br />(1 * 1) + (9 * 9) = 82 <br /><br />(8 * 8) + (2 * 2) = 68 <br /><br />(6 * 6) + (8 * 8) = 100 <br /><br />(1 * 1) + (0 * 0) + (0 * 0) = 1 <br /><br /><i>*- <a href="https://en.wikipedia.org/wiki/Happy_number">https://en.wikipedia.org/wiki/Happy_number</a></i></div><div><br /></div><div><b><u>Lucky Numbers</u></b></div><br />In number theory, a lucky number is a natural number in a set which is generated by a certain "sieve". *<br /><br />In the case of our (lucky) number generation process, we will be utilizing the, "the sieve of Josephus Flavius". <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3G6zZf-mIs_UuT09qoWD24JXkL3tzBPR__AmhxgfVCGfY581EmnILZwP9tIwly20xw_qkeCLCS9zAMkyDQMEyZabIKdnyTtmSwFqk2ryK-U3wrIxT1-vQhMaHl9I_rwHdiI8AAZNKuEXd8AwJ3RB2VKaw1HF7mul11xVfqGpPn8x5RMmqc1L35Owi/s279/Flavius.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="279" data-original-width="219" height="279" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3G6zZf-mIs_UuT09qoWD24JXkL3tzBPR__AmhxgfVCGfY581EmnILZwP9tIwly20xw_qkeCLCS9zAMkyDQMEyZabIKdnyTtmSwFqk2ryK-U3wrIxT1-vQhMaHl9I_rwHdiI8AAZNKuEXd8AwJ3RB2VKaw1HF7mul11xVfqGpPn8x5RMmqc1L35Owi/s1600/Flavius.png" width="219" /></a></div><br /><b>Example:</b><br /><div><br /></div><div>Beginning with a list of integers from 1 – 20:<br /><br />{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20} <br /><br />We will remove all even numbers: <br /><div><br /></div><div>1, <b>2</b>, 3, <b>4</b>, 5, <b>6</b>, 7,<b> 8</b>, 9, <b>10</b>, 11, <b>12</b>, 13, <b>14</b>, 15, <b>16</b>, 17, <b>18</b>, 19, <b>20</b><br /><br />Leaving: <br 
/><br />1, 3, 5, 7, 9, 11, 13, 15, 17, 19 <br /><br />The first remaining number after the number “<b>1</b>” is the number “<b>3</b>”. Therefore, every third number within the list must be removed: <br /><br />1, 3, <b>5</b>, 7, 9, <b>11</b>, 13, 15, <b>17</b>, 19 <br /><br />Leaving: <br /><br />1, 3, 7, 9, 13, 15, 19 <br /><br />Next, we will remove each seventh entry within the remaining list, as the number “<b>7</b>” is the value which occurs subsequent to “<b>3</b>”: <br /><br />1, 3, 7, 9, 13, 15, <b>19 <br /></b><br />Leaving: <br /><br />1, 3, 7, 9, 13, 15 <br /><br />If we were to continue with this process, each ninth entry would also be subsequently removed from the remaining list, as the number “<b>9</b>” is the value which occurs subsequent to “<b>7</b>”. Since only 6 elements remain from our initial set, the process ends here. <br /><br />We can then conclude that the following numbers are indeed lucky: <br /><br /> 1, 3, 7, 9, 13, 15<div><br /></div><div><i>*- <a href="https://en.wikipedia.org/wiki/Lucky_number">https://en.wikipedia.org/wiki/Lucky_number</a> </i><br /><br /><b style="text-decoration: underline;">Prime Numbers</b> <br /><br />A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers.
* <br /><br />13 fits this categorization, as it can only be factored down to a product of 13 and 1.<br /><br /><i>*- <a href="https://en.wikipedia.org/wiki/Prime_number">https://en.wikipedia.org/wiki/Prime_number</a> <br /></i><br /><b><u>(Python) Automating the Process </u></b><br /><br />Now that I have hopefully explained each concept in an understandable way, let’s automate some of these processes.</div></div></div><div><br /><b><u>Happy Numbers </u></b><br /><br /><b># Create a list of happy numbers between 1 and 100 # </b><br /><br /><b># https://en.wikipedia.org/wiki/Happy_number # <br /><br /># This code is a modified variation of the code found at: # <br /><br /># https://www.phptpoint.com/python-program-to-print-all-happy-numbers-between-1-and-100/ # <br /><br /><br /></b></div><div><b># Python program to print all happy numbers between 1 and 100 # </b><br /><br /><br /><b># isHappyNumber() will determine whether a number is happy or not #</b><br /><br /><b>def isHappyNumber(num): <br /></b></div><div><b><br /> <span> </span>rem = sum = 0; </b><br /><br /><br /><b># Calculates the sum of squares of digits #<br /><br /> <span> </span>while(num > 0): <br /><br /> <span> <span> </span></span>rem = num%10; <br /><br /> <span> <span> </span></span>sum = sum + (rem*rem); <br /><br /> <span> <span> </span></span>num = num//10; <br /><br /> <span> </span>return sum; </b><br /><br /><br /><b># Displays all happy numbers between 1 and 100 # <br /><br />print("List of happy numbers between 1 and 100: \n 1"); <br /></b><br /><br /><br /><b># for i in range(1, 101):, always utilize n+1 as it pertains to the number of element entries within the set # <br /><br /># Therefore, for our 100 elements, we will utilize 101 as the range variable entry # <br /></b><br /><br /><b>for i in range(1, 101): <br /><br /> <span> </span>result = i; <br /><br /><br /></b></div><div><b><span> </span>while(result != 1 and result != 4): <br /><br /> <span> <span> </span></span>result = 
isHappyNumber(result); <br /><br /> <span> <span> </span></span>if(result == 1): <br /><br /> <span> <span> <span> </span></span></span>print(i); </b><br /><br /><br /><u>Console Output:</u> <br /><br /><i>List of happy numbers between 1 and 100: <br />1 <br />7 <br />10 <br />13 <br />19 <br />23 <br />28 <br />31 <br />32 <br />44 <br />49 <br />68 <br />70 <br />79 <br />82 <br />86 <br />91 <br />94 <br />97 <br />100</i><br /><br /><b># Code which verifies whether a number is a happy number # <br /><br /># Code Source: # https://en.wikipedia.org/wiki/Happy_number # <br /><br /># This process is unfortunately two steps # </b></div><div><b><br /></b></div><div><br /><b>def pdi_function(number, base: int = 10): <br /><br /> <span> </span>"""Perfect digital invariant function.""" <br /><br /> <span> </span>total = 0 <br /><br /> <span> </span>while number > 0: <br /><br /> <span> <span> </span></span>total += pow(number % base, 2) <br /><br /> <span> <span> </span></span>number = number // base <br /><br /> return total <br /><br /><br />def is_happy(number: int) -> bool: <br /><br /> <span> </span>"""Determine if the specified number is a happy number.""" <br /><br /> <span> </span>seen_numbers = set() <br /><br /> <span> </span>while number > 1 and number not in seen_numbers: <br /><br /> <span> <span> </span></span>seen_numbers.add(number) <br /><br /> <span> <span> </span></span>number = pdi_function(number) <br /><br /> <span> </span>return number == 1 </b><br /><br /><br /><b># First, we must run the initial function on the number in question # <br /><br /># This function will calculate the number’s perfect digital invariant value # <br /><br /># Example, for 13 # </b><br /><br /><b>pdi_function(13)</b><br /><br /><u>Console Output:</u> <br /><br /><i>10 </i><br /><br /><b># The output value of the first function must then be input into the subsequent function, in order to determine whether or not the tested value (ex. 
13) can appropriately be deemed “happy”. #<br /><br />is_happy(10) </b><br /><br /><u>Console Output:</u> <br /><br /><i>True </i><br /><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /><b><u>Lucky Numbers</u></b> <br /><br /><b># https://en.wikipedia.org/wiki/Lucky_number # </b><br /><br /><b># The code below will determine whether or not a number is "lucky", as defined by the above definition of the term #</b><br /><br /><b># The variable ‘number_check’ must be set equal to the number which we wish to assess # <br /><br />number_check = 99 </b><br /><br /><br /><b># Python code to convert list of # <br /><br /># string into sorted list of integer # <br /><br /># https://www.geeksforgeeks.org/python-program-to-convert-list-of-integer-to-list-of-string/ # </b><br /><br /><br /><b># List initialization <br /><br />list_int = list(range(1,(number_check + 1),1)) </b><br /><br /><br /><b># mapping <br /><br />list_string = map(str, list_int) </b><br /><br /><br /><b># Printing sorted list of integers <br /><br />numbers = (list(list_string)) <br /></b><br /><br /><b># https://stackoverflow.com/questions/64956140/lucky-numbers-in-python # </b><br /><br /><br /><b>def lucky_numbers(numbers): <br /><br /> <span> </span>index = 1 <br /><br /> <span> </span>next_freq = int(numbers[index]) <br /><br /> <span> </span>while int(next_freq) < len(numbers): <br /><br /> <span> <span> </span></span>del numbers[next_freq-1::next_freq] <br /><br /> <span> <span> </span></span>print(numbers) <br /><br /> <span> <span> </span></span>if str(next_freq) in numbers: <br /><br /> <span> <span> <span> </span></span></span>index += 1 <br /><br /> <span> <span> <span> </span></span></span>next_freq = int(numbers[index]) <br /><br /> else: <br /><br /> <span> </span>next_freq = int(numbers[index]) <br /><br /> return <br /><br /><br />lucky_numbers(numbers) <br /></b><br /><br /><u>Console Output:</u> <br /><br />['1', '3', '5', '7', '9', '11', '13', '15', '17', '19',
'21', '23', '25', '27', '29', '31', '33', '35', '37', '39', '41', '43', '45', '47', '49', '51', '53', '55', '57', '59', '61', '63', '65', '67', '69', '71', '73', '75', '77', '79', '81', '83', '85', '87', '89', '91', '93', '95', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '19', '21', '25', '27', '31', '33', '37', '39', '43', '45', '49', '51', '55', '57', '61', '63', '67', '69', '73', '75', '79', '81', '85', '87', '91', '93', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '27', '31', '33', '37', '43', '45', '49', '51', '55', '57', '63', '67', '69', '73', '75', '79', '85', '87', '91', '93', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '45', '49', '51', '55', '63', '67', '69', '73', '75', '79', '85', '87', '93', '97', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '49', '51', '55', '63', '67', '69', '73', '75', '79', '85', '87', '93', '99'] <br /><br />['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '49', '51', '63', '67', '69', '73', '75', '79', '85', '87', '93', '99'] </p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">The output of this function returns a series of numbers up to and including the number which is being assessed. 
Therefore, from this function's application, we can conclude that the following numbers are "lucky":</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><i>['1', '3', '7', '9', '13', '15', '21', '25', '31', '33', '37', '43', '49', '51', '63', '67', '69', '73', '75', '79', '85', '87', '93', '99'] </i><br /><br />(Only consider the final output as valid, as all other outputs are generated throughout the reduction process)<br /><br /><u><b>Prime Numbers</b></u> <br /><br /><b># The code below is rather self-explanatory #</b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><br /></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b># It is utilized to generate a list of prime numbers included within the range() function # <br /></b><br /><b># Source: https://stackoverflow.com/questions/52821002/trying-to-get-all-prime-numbers-in-an-array-in-python # <br /><br />checkMe = range(1, 100) <br /><br />primes = [] <br /><br />for y in checkMe[1:]: <br /><br /> <span> </span>x = y <br /><br /> <span> </span>dividers = [] <br /><br /> <span> </span>for x in range(2, x): <br /><br /> <span> <span> </span></span>if (y/x).is_integer(): <br /><br /> <span> <span> <span> </span></span></span>dividers.append(x) <br /><br /> if len(dividers) < 1: <br /><span> </span><br /> <span> </span>primes.append(y) <br /><br />print("\n"+str(checkMe)+" has "+str(len(primes))+" primes") <br /><br />print(primes) <br /></b><br /><u>Console Output:</u> <br /><br /><i>range(1, 100) has 25 primes <br />[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]</i><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><i><br /></i></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><u>Conclusion</u></b></p><p style="font-stretch: normal; line-height: normal; margin:
0px;"><b><u><br /></u></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">Having performed all of the previously mentioned tests and functions, I hope that you have been provided with enough adequate information to reconsider <b>13</b>'s unlucky status. Based upon my number theory research, I feel that enough evidence exists to at least relegate the number <b>13</b> to the status of "misunderstood".</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b>13 </b>isn't to be feared or avoided. It actually shares unique company amongst other "<b>Happy Primes</b>":</p><br />7, 13, 19, 23, 31, 79, 97, 103, 109, 139, 167, 193, 239, 263, 293, 313, 331, 367, 379, 383, 397, 409, 487<p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><u><br /></u></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">Even intermingling with the company of "<b>Lucky Primes</b>":</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">3, 7, 13, 31, 37, 43, 67, 73, 79, 127, 151, 163, 193, 211, 223, 241, 283, 307, 331, 349, 367, 409, 421, 433, 463, 487</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">And being a member of the very exclusive group, the "<b>Happy Lucky Primes</b>":</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">7, 13, 31, 79, 193, 331, 367, 409, 487</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><b><u><br /></u></b></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">----------------------------------------------------------------------------------------------------------------------------- <br /><br />I 
wish you all a very happy and safe holiday.</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">I'll see you next week with more (non-spooky) content!</p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br />-RD</p></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-72483959986085384412022-10-24T19:09:00.000-04:002022-10-24T19:09:37.732-04:00(Python) Machine Learning – Keras – Pt. VThroughout the previous articles, we thoroughly explored the various machine learning techniques which employ tree methodologies as their primary mechanism. We also discussed the evolution of machine learning techniques, and how gradient boosting eventually came to overtake the various forest models as the preferred standard. However, the gradient boosted model was soon replaced by the Keras model. The latter still remains the primary method of prediction at this present time. <br /><br />Keras differs from all of the other models in that it does not utilize the tree or forest methodologies as its primary mechanism of prediction. Instead, Keras employs something similar to a binary categorical method, in that, an observation is fed through the model, and at each subsequent layer prior to the output, Keras decides what the observation is, and what the observation is not. This may sound somewhat complicated, and in all manners, it truly is. However, what I am attempting to illustrate will become less opaque as you continue along with the exercise. <br /><br />A final note prior to delving any further, Keras is a member of a machine learning family known as deep learning. Deep learning can essentially be defined as an algorithmic analysis of data which can evaluate non-linear relationships. This analysis also provides dynamic model re-calibration throughout the modeling process. 
<br /><br /><b><u>Keras Illustrated</u></b> <br /><br />Below is a sample illustration of a Keras model which possesses a continuous dependent variable. The series of rows on the left represents the observational data which will be sent through the model so that it may “learn”. Each circle represents what is known as a “neuron”, and each column of circles represents what is known as a “layer”. The sample illustration has 4 layers. The leftmost layer is known as the input layer, the middle two layers are known as the hidden layers, and the rightmost layer is referred to as the output layer.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSHf4HSgRjy6KbE6qG_BODTXGg58ZQBUjrZ1fTCo3bgtZEI9IraIMeeuUgfCJkRFJFyM28Gpf1tZH4Awf-43VvtqRrZ4ZZDbGfMuAi4dlMwaKnSqGxDqlaaUsof0k_jCPXz18X414ZzOOKdQFym2eMhFSsRVT1iVrG-oyP1TrK5-idd0KFykZDpaFS/s900/Keras_Nue.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="553" data-original-width="900" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSHf4HSgRjy6KbE6qG_BODTXGg58ZQBUjrZ1fTCo3bgtZEI9IraIMeeuUgfCJkRFJFyM28Gpf1tZH4Awf-43VvtqRrZ4ZZDbGfMuAi4dlMwaKnSqGxDqlaaUsof0k_jCPXz18X414ZzOOKdQFym2eMhFSsRVT1iVrG-oyP1TrK5-idd0KFykZDpaFS/w400-h246/Keras_Nue.png" width="400" /></a></div><div><br /></div>Due to the model’s continuous dependent variable classification, it will only possess a single neuron within its output layer.
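To make the neuron and layer vocabulary concrete, below is a hand-rolled sketch of a single forward pass through a simplified 4-3-1 network (one hidden layer rather than the two shown in the illustration). The weights are made-up illustrative values, not anything a trained Keras model would actually produce.

```python
def relu(x):
    # Standard rectified linear activation.
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    # One fully connected layer: each neuron computes a weighted
    # sum of every input, adds its bias, and applies the activation.
    return [activation(sum(w * i for w, i in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Illustrative weights for a 4 -> 3 -> 1 network.
hidden_w = [[0.2, -0.1, 0.4, 0.0],
            [0.5, 0.5, -0.3, 0.1],
            [-0.2, 0.3, 0.1, 0.2]]
hidden_b = [0.1, -0.2, 0.0]
output_w = [[1.0, -1.0, 0.5]]
output_b = [0.3]

x = [5.1, 3.5, 1.4, 0.2]  # one observation (e.g. iris measurements)
hidden = dense(x, hidden_w, hidden_b, relu)
output = dense(hidden, output_w, output_b, lambda v: v)  # linear output
print(output)  # approximately [-1.965]
```

Training then consists of nudging those weight values, which is what the forward and backward propagation steps described next accomplish.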
If the dependent variable were categorical, it would have an appearance similar to the graphic below:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyAfhrdF8jSOeEmldo8-lN3NlRIuz3BfaQorYzsd5pUBKLEHI6s0t8rYn1mGXLMdD9nuWXe032oeeGeKWjiQbYeh2BmDPzpAIxMqRwLthF1Cz9XAs8uzzQ5RCXvUrITlP__9wZGslIRlJL1E8z52Y8wvTpIIaMfStwo7AIxvwK1IM1EG9HHNjSe1K9/s900/Keras_Nue2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="553" data-original-width="900" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyAfhrdF8jSOeEmldo8-lN3NlRIuz3BfaQorYzsd5pUBKLEHI6s0t8rYn1mGXLMdD9nuWXe032oeeGeKWjiQbYeh2BmDPzpAIxMqRwLthF1Cz9XAs8uzzQ5RCXvUrITlP__9wZGslIRlJL1E8z52Y8wvTpIIaMfStwo7AIxvwK1IM1EG9HHNjSe1K9/w400-h246/Keras_Nue2.png" width="400" /></a></div><div><br /></div><div><b><u>How the Model Works</u></b> <br /><br />Without getting overly specific (as many other resources exist which provide detailed explanations as they pertain to the model’s innermost mechanisms), the training of the model occurs throughout two steps: the first being <b>“Forward Propagation”</b>, and the second being <b>“Backward Propagation”</b>. Each node which exists beyond the input layer, sans the output layer, measures for potential interactions amongst variables. <br /><br />Each node is initially assigned a value. Those values shift as training data is processed through the model from the left to the right (forward propagation), and are further, but more specifically, modified as the same data is then passed back through the model from the right to the left (back propagation). The entire training data set is not processed in a simultaneous manner; instead, for the sake of allocating computing resources, the data is split into smaller sets known as batches. Batch size impacts learning significantly.
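The batching concept can be sketched as follows, utilizing the default size of 32 observations; the final batch simply holds whatever remainder is left over.

```python
def make_batches(observations, batch_size=32):
    """Split a dataset into consecutive batches; the last batch may be smaller."""
    return [observations[i:i + batch_size]
            for i in range(0, len(observations), batch_size)]

data = list(range(100))           # 100 training observations
batches = make_batches(data)

print(len(batches))               # 4
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

During each epoch, the network's weights are updated once per batch rather than once per observation or once per full dataset pass.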
With a smaller batch size, gradient estimates become noisier, which can hinder a model’s predictive capacity. However, there are certain scenarios in which a lower batch size is advantageous, as this gradient noise can act as a form of regularization. The default size for a batch is 32 observations. <br /><br />In many ways, the method in which the model functions is analogous to the way in which a clock operates. Each training observation shifts a certain aspect of a neuron’s value, with the neuron’s final value being representational of all of the prior shifts. <br /><br />A few other terms are also worth mentioning, as the selection of each is integral to model creation: <br /><br /><b>Optimizer</b> – This specifies the algorithm which will be utilized to update the model’s weights as errors occur. <br /><br /><b>Epoch</b> – This indicates the number of times which the observational data will be passed through the model during the training process. <br /><br /><b>Loss Function</b> – This indicates the algorithm which will be utilized to determine how errors are penalized within the model. <br /><br /><b>Metric</b> - A metric is a function which is utilized to assess the performance of a model. However, unlike the Loss Function, it does not impact model training, and is only utilized to perform post-hoc analysis. <br /><br /><b><u>Model Application</u></b> <br /><br />As with any auxiliary Python library, the library must first be downloaded and enabled prior to its utilization.
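As a toy illustration of the Optimizer, Epoch, and Loss Function terms defined above, the loop below applies plain gradient descent (a simple stand-in for the fancier optimizers Keras offers) to fit a single weight over 50 epochs; all values are illustrative.

```python
# Data following a single-feature linear relationship y = 2 * x (no noise).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0                # initial weight
learning_rate = 0.01   # optimizer step size

def mse(w):
    # Loss function: mean squared error over the dataset.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

losses = []
for epoch in range(50):  # each epoch is one full pass over the data
    # Gradient of the MSE loss with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad        # optimizer update
    losses.append(mse(w))            # track loss per epoch

print(round(w, 3))  # approaches 2.0
```

Keras performs this same loop internally, except with many weights at once and with the gradient computed per batch via backpropagation.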
To achieve this within the Jupyter Notebook platform, we will employ the following lines of code: <br /><br /><b># Import ‘pip’ to install auxiliary packages # <br /><br />import pip <br /><br /># Install ‘TensorFlow’ to act as the underlying mechanism of the Keras UI # <br /><br />pip.main(['install', 'TensorFlow']) <br /><br /># Import pandas to enable data frame utilization # <br /><br />import pandas <br /><br /># Import numpy to enable numpy array utilization # <br /><br />import numpy <br /><br /># Import the general Keras library # <br /><br />import keras <br /><br /># Import tensorflow to act as the ‘backend’ # <br /><br />import tensorflow <br /><br /># Enable option for categorical analysis # <br /><br />from keras.utils import to_categorical <br /><br />from keras.models import Sequential <br /><br />from keras.layers import Activation, Dense <br /><br /># Enable the creation of confusion matrices with the ‘sklearn.metrics’ library # <br /><br />from sklearn.metrics import confusion_matrix <br /><br /># Enable the ability to save and load models with the ‘load_model’ option # <br /><br />from keras.models import load_model </b><br /><br />With all of the appropriate libraries downloaded and enabled, we can begin building our sample model. <br /><br /><b><u>Categorical Dependent Variable Model</u></b> <br /><br />For the following examples, we will be utilizing a familiar data set, the<b> “iris”</b> data set, which is available within the R platform. <br /><br /><b># Import the data set (in .csv format), as a pandas data frame # <br /><br />filepath = "C:\\Users\\Username\\Desktop\\iris.csv" <br /><br />iris = pandas.read_csv(filepath) </b><br /><br />First, we will randomize the observations within the data set. Observational data should always be randomized prior to model creation. 
<br /><br /><b># Shuffle the data frame # <br /><br />iris = iris.sample(frac=1).reset_index(drop=True) </b><br /><br />Next, we will remove the dependent variable entries from the data frame and modify the structure of the new data frame to consist only of independent variables. <br /><br /><b>predictors = iris.drop(['Species'], axis = 1).values <br /></b><br />Once this has been achieved, we must modify the variables contained within the original data set so that the categorical outcomes are designated by integer values. <br /><br />This can be achieved through the utilization of the following code: <br /><br /><b># Modify the dependent variable so that each entry is replaced with a corresponding integer # <br /><br />from pandasql import * <br /><br />pysqldf = lambda q: sqldf(q, globals()) <br /><br />q = """ <br /><br />SELECT *, <br /><br />CASE <br /><br />WHEN (Species = 'setosa') THEN '0' <br /><br />WHEN (Species = 'versicolor') THEN '1' <br /><br />WHEN (Species = 'virginica') THEN '2' <br /><br />ELSE 'UNKNOWN' END AS SpeciesNum <br /><br />from iris; <br /><br />""" <br /><br />df = pysqldf(q) <br /><br />print(df) <br /><br />iris0 = df </b><br /><br />Next, we must make a few further modifications. <br /><br />First, we must modify the dependent variable type to integer. <br /><br />After such, we will identify this variable as being representative of a categorical outcome. <br /><br /><b># Modify the dependent variable type from string to integer # <br /><br />iris0['SpeciesNum'] = iris0['SpeciesNum'].astype('int') <br /><br /># Modify the variable type to categorical # <br /><br />target = to_categorical(iris0.SpeciesNum) </b><br /><br />We are now ready to build our model! <br /><br /><b># We must first specify the model type # <br /><br />model = Sequential() <br /><br /># Next, we will determine the number of input columns, which defines the shape of the input layer. 
# <br /><br />n_cols = predictors.shape[1] <br /><br /># This next line specifies the traits of the input layer # <br /><br />model.add(Dense(100, activation = 'relu', input_shape = (n_cols, ))) <br /><br /># This line specifies the traits of the hidden layer # <br /><br />model.add(Dense(100, activation = 'relu')) <br /><br /># This line specifies the traits of the output layer # <br /><br />model.add(Dense(3, activation = 'softmax')) <br /><br /># Compile the model by adding the optimizer, the loss function type, and the metric type # <br /><br /># If the model’s dependent variable is binary, utilize the ‘binary_crossentropy' loss function # <br /><br />model.compile(optimizer = 'adam', loss='categorical_crossentropy', <br /><br /> metrics = ['accuracy']) </b><br /><br />With our model created, we can now go about training it with the necessary information. <br /><br />As was the case with prior machine learning techniques, only a portion of the original data frame will be utilized to train the model. <br /><br /><b>model.fit(predictors[1:100,], target[1:100,], shuffle=True, batch_size= 50, epochs=100) <br /></b><br />With the model trained, we can now test its effectiveness by applying it to the remaining data observations. 
<br /><br /><b># Create a data frame to store the un-utilized observational data # <br /><br />iristestdata = iris0[101:150] <br /><br /># Create a data frame to store the model predictions for the un-utilized observational data # <br /><br />predictions = model.predict_classes(predictors[101:150]) <br /><br /># Create a confusion matrix to assess the model’s predictive capacity # <br /><br />cm = confusion_matrix(iristestdata['SpeciesNum'], predictions) <br /><br /># Print the confusion matrix results to the console output window # <br /><br />print(cm) </b><br /><br />Console Output: <br /><br /><i>[[16 0 0] <br /> [ 0 17 2] <br /> [ 0 0 14]] </i></div><div><br /><b><u>Continuous Dependent Variable Model</u></b> <br /><br />Each situation dictates the type of model which should be applied. As was the case with previous machine learning methodologies, the Keras package also contains functionality which allows for continuous dependent variable types. <br /><br />The steps for applying this model methodology are as follows: <br /><br /><b># Import the 'iris' data frame # <br /><br />filepath = "C:\\Users\\Username\\Desktop\\iris.csv" <br /><br />iris = pandas.read_csv(filepath) <br /><br /># Shuffle the data frame # <br /><br />iris = iris.sample(frac=1).reset_index(drop=True) </b><br /><br />In the subsequent lines of code, we will first identify the model’s dependent variable, <b>‘Sepal.Length’</b>. This variable and its corresponding observations will be held within the new variable <b>‘target’</b>. Next, we will create the variable <b>‘predictors’</b>. This variable will be comprised of all of the variables contained within the <b>‘iris0’</b> data frame, with the exception of the <b>‘Sepal.Length’</b> variable. The new data frame will be stored in matrix format. Finally, we will again define the <b>‘n_cols’</b> variable. 
<br /><br /><b>target = iris['Sepal.Length'] <br /><br /># Drop Species Name # <br /><br />iris0 = iris.drop(columns=['Species']) <br /><br /># Drop the dependent variable ‘Sepal.Length’ # <br /><br />predictors = iris0.drop(['Sepal.Length'], axis = 1).values <br /><br />n_cols = predictors.shape[1] </b><br /><br />We are now ready to build our model! <br /><br /><b># We must first specify the model type # <br /><br />modela = Sequential() <br /><br /># Next, we will determine the number of input columns, which defines the shape of the input layer. # <br /><br />n_cols = predictors.shape[1] <br /><br /># This next line specifies the traits of the input layer # <br /><br />modela.add(Dense(100, activation = 'relu', input_shape=(n_cols,))) <br /><br /># This line specifies the traits of the hidden layer # <br /><br />modela.add(Dense(100, activation = 'relu')) <br /><br /># This line specifies the traits of the output layer # <br /><br />modela.add(Dense(1)) <br /><br /># Compile the model by adding the optimizer and the loss function type # <br /><br />modela.compile(optimizer = 'adam', loss='mean_squared_error') </b><br /><br />With the model created, we must now train it with the following code: <br /><br /><b>modela.fit(predictors[1:100,], target[1:100], shuffle=True, epochs=100) </b><br /><br />As was the case with the prior examples, we will only be utilizing a sample of the original data frame for the purposes of model training. 
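As a rough sanity check on the architecture just defined, the number of trainable parameters can be tallied by hand; each dense layer holds (inputs + 1) × units values, the extra one being the bias term. A sketch (assuming the three predictor columns which remain after dropping ‘Species’ and ‘Sepal.Length’):

```python
# Parameter tally for an architecture like the one defined above: each
# Dense layer holds (inputs + 1) * units trainable values; the +1 is bias
n_cols = 3  # Sepal.Width, Petal.Length, Petal.Width remain as predictors

layer_sizes = [n_cols, 100, 100, 1]  # input, two hidden layers, output

params_per_layer = [(layer_sizes[i] + 1) * layer_sizes[i + 1]
                    for i in range(len(layer_sizes) - 1)]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
```

This kind of tally is what Keras reports through `model.summary()`, and it explains why most of the weights sit between the two hidden layers.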
<br /><br />With the model created and trained, we can now test its effectiveness by applying it to the remaining data observations.<br /><br /><b>from sklearn.metrics import mean_squared_error <br /><br />from math import sqrt <br /><br />predictions = modela.predict(predictors) <br /><br />rms = sqrt(mean_squared_error(target, predictions)) <br /><br />print(rms) <br /></b><br /><b><u>Model Functionality</u> </b><br /><br />In some ways, the Keras modeling methodology shares similarities with the hierarchical cluster model. The main differentiating factors, in addition to the underlying mechanism, are the dynamic aspects of the Keras model. <br /><br />Each Keras neuron represents a relationship between independent data variables within the training set. These relationships exhibit macro phenomena which may not be immediately observable within the context of the initial data. When finally providing an output, the model considers which macro phenomena illustrated the strongest indication of identification. The Keras model still relies on generalities to make predictions; therefore, certain factors which are exhibited within the observational relationships are held in higher regard. This phenomenon is known as weighting, as each neuron is assigned a weight which is adjusted as the training process occurs. <br /><br />The logistic regression methodology functions in a similar manner as it pertains to assessing variable significance. Again however, we must consider the many differentiating attributes of each model. In addition to weighting latent variable phenomena, the Keras model is able to assess for non-linear relationships. Both attributes are absent within the aforementioned model, as logistic regression only assesses for linear relationships and can only provide values for variables explicitly found within the initial data set. 
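The neuron weighting described above can be made concrete with a small NumPy sketch of a single forward pass; the layer sizes, random weights, and ReLU activation here are illustrative assumptions, not values drawn from the trained model:

```python
import numpy as np

def relu(x):
    # ReLU activation: negative weighted sums become zero
    return np.maximum(0, x)

# One observation with 4 input features (e.g. the four iris measurements)
x = np.array([5.1, 3.5, 1.4, 0.2])

# Randomly initialized weights and biases for a 3-node hidden layer
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
b1 = np.zeros(3)

# Forward propagation: each hidden node forms a weighted sum of the
# inputs (its current "value"), then applies the activation function
hidden = relu(x @ W1 + b1)

# Output layer: a weighted sum of the hidden activations
W2 = rng.normal(size=(3, 1))
output = hidden @ W2

print(hidden, output)
```

Training consists of nudging the entries of `W1` and `W2` after each batch so that `output` moves toward the observed target.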
<br /><br />The <b>Sequential()</b> model type, which was specified within the build process, is one of the many model options available within the Keras package. The sequential option differs from the other model types in that it creates a network in which each neuron within each layer is connected to each neuron within each subsequent layer. </div><div><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSxE2K1J7gUrxSSzXx-_jD_oFN7aAnwkdEKgHOUTVVCY1X48Ynxumq0swD6oe-sAYxNKtfOIiaVDOca4iGqHIqG0huQq8DDghtwPsFK4-zpNplcmtYnFHTA_glSCClf4Yo4dalF3IljPgasYbEPr10ApzeOj_2EKSBZPXKvgo_xDSmf3I6oCJl5D9V/s505/Keras_Nue3.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="505" data-original-width="311" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSxE2K1J7gUrxSSzXx-_jD_oFN7aAnwkdEKgHOUTVVCY1X48Ynxumq0swD6oe-sAYxNKtfOIiaVDOca4iGqHIqG0huQq8DDghtwPsFK4-zpNplcmtYnFHTA_glSCClf4Yo4dalF3IljPgasYbEPr10ApzeOj_2EKSBZPXKvgo_xDSmf3I6oCJl5D9V/w246-h400/Keras_Nue3.png" width="246" /></a></div></blockquote></blockquote></blockquote></blockquote><p> <b><u>Other Characteristics of the Keras Model</u></b></p><div>Depending on the size of the data set which acted as the training data for the model, significant time may be required to re-generate a model after a session is terminated. To avoid this unnecessary re-generation process, functions exist which enable the saving and reloading of model information. 
<br /><br /><b># Saving and Loading Model Data Requires # <br /><br />from keras.models import load_model <br /><br /># To save a model # <br /><br />modelname.save("C:\\Users\\filename.h5") <br /><br /># To load a model # <br /><br />modelname = load_model("C:\\Users\\filename.h5") <br /></b><br />It should be mentioned that, as it pertains to Keras models, you do possess the ability to train existing models with additional data should the need arise. <br /><br />For instance, if we wished to train our categorical iris model (“model”) with additional iris data, we could utilize the following code: <br /><br /><b> model.fit(newpredictors[100:150,], newtargets[100:150,], shuffle=True, batch_size= 50, epochs=100) </b><br /><br />At the time of this article’s creation, there are unresolved errors pertaining to learning rate fluctuation within re-loaded Keras models. Currently, a provisional fix has been suggested*, in which the "<b>adam"</b> optimizer is re-configured for re-loaded models. This re-configuring keeps all of the "<b>adam"</b> optimizer’s default settings, with the exception of the learning rate, which is significantly lowered. The purpose of this shift is to account for the differentiation in learning rates which occurs in established models. 
<br /><br /><b># Specifying Optimizer Traits Requires # <br /><br />from keras import optimizers <br /><br /># Re-configure Optimizer # <br /><br />liladam = optimizers.Adam(lr=0.00001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False) <br /><br /># Utilize Custom Optimizer # <br /><br />model.compile(optimizer = liladam, loss='categorical_crossentropy', <br /><br /> metrics = ['accuracy']) </b><br /><br /><i>*Source - <a href="https://github.com/keras-team/keras/issues/2378" target="_blank">https://github.com/keras-team/keras/issues/2378 </a></i><br /><br /><b><u>Graphing Models</u></b><br /><br />As we cannot easily determine the innermost workings of a Keras model, the best method of visualization can be achieved by graphing the learning output. <br /><br />Prior to training the model, we will modify the typical fitting function to resemble something similar to the lines of code below: <br /><br /><b>history = model.fit(predictors[1:100,], target[1:100,], shuffle=True, epochs= 110, batch_size = 100, validation_data =(predictors[100:150,] , target[100:150,])) </b><br /><br />What this code enables is the creation of the data variable<b> “history”</b>, in which data pertaining to the model training process will be stored.<b> “validation_data” </b>instructs the Python library to assess the specified data within the context of the model after each epoch. This does not impact the learning process. The way in which this assessment will be analyzed is determined by the selection of the <b>“metrics”</b> option specified within the <b>model.compile()</b> function. <br /><br />If the above code is initiated, the model will be trained. To view the categories in which the model training history was organized upon being saved within the <b>“history” </b>variable, you may utilize the following lines of code. 
<br /><br /><b>history_dict = history.history <br /><br />history_dict.keys() </b><br /><br />This produces the console output: <br /><br /><i>dict_keys(['val_loss', 'val_acc', 'loss', 'acc']) <br /></i><br />To set the appropriate axis lengths for our soon to be produced graph, we will initiate the following line: <br /><br /><b>epochs = range(1, len(history.history['acc']) + 1) <br /></b><br />Graphing requires the matplotlib library, and if we are utilizing Jupyter Notebook, we should also modify the graphic output size: <br /><br /><b>import matplotlib.pyplot as plt <br /><br />plt.rcParams["figure.figsize"] = [16,9] </b><br /><br />We are now prepared to create our outputs. The first graphic can be prepared with the following code: <br /><br /><b># Plot training & validation accuracy values # <br /><br /># (This graphic cannot be utilized to track the validation process of continuous data models) # <br /><br />plt.plot(epochs, history.history['acc'], 'b', label = 'Training acc') <br /><br />plt.plot(epochs, history.history['val_acc'], 'bo', color = 'orange', label = 'Validation acc') <br /><br />plt.title('Training and validation accuracy') <br /><br />plt.ylabel('Accuracy') <br /><br />plt.xlabel('Epoch') <br /><br />plt.legend(loc='upper left') <br /><br />plt.show() </b><br /><br />This produces the output:<br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTF8ZOYGVDzjjVzBfbPezztVIQ7yVuoBNlJ4mYQ1tFyR3Z9chcRE1lF-BFk0Qv7yCG3tvPVCzikUnVymLXnEyqDoPYxc6WkCCKSKmqAcPKNzSHM4NNPCYmlV-hDF2ISO0gzSAmEq20iL1nO69f3DFFe3pB8CKErWaSmKDHLYE_8wQ8zFWdn9OXHpPX/s947/Keras_graphic1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="550" data-original-width="947" height="233" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTF8ZOYGVDzjjVzBfbPezztVIQ7yVuoBNlJ4mYQ1tFyR3Z9chcRE1lF-BFk0Qv7yCG3tvPVCzikUnVymLXnEyqDoPYxc6WkCCKSKmqAcPKNzSHM4NNPCYmlV-hDF2ISO0gzSAmEq20iL1nO69f3DFFe3pB8CKErWaSmKDHLYE_8wQ8zFWdn9OXHpPX/w400-h233/Keras_graphic1.png" width="400" /></a></div><div><br /></div><b><u>Interpreting this Graphic</u></b> <br /><br />What this graphic illustrates is the level of accuracy with which the model predicts results. The solid blue line represents the data which was utilized to train the model, and the orange dotted line represents the data which is being utilized to test the model’s predictability. It should be evident that throughout the training process, the predictive capacity of the model improves as it pertains to both training and validation data. If a large gap emerged, similar to the gap which is observed from epoch # 20 to epoch # 40, we would assume that this divergence of data is indicative of <b>“overfitting”</b>. This term is utilized to describe a model which can predict training results accurately, but struggles to predict outcomes when applied to new data. 
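The divergence described above can also be detected numerically rather than visually; a sketch, assuming a hypothetical per-epoch validation-loss series shaped like the ‘val_loss’ list which Keras stores within history.history:

```python
# A hypothetical per-epoch validation-loss series, shaped like the
# 'val_loss' list which Keras records within history.history
val_loss = [1.10, 0.80, 0.62, 0.55, 0.51, 0.50, 0.52, 0.57, 0.63, 0.70]

# The epoch at which validation loss bottoms out; sustained increases
# beyond this point suggest the model has begun to overfit
best_epoch = val_loss.index(min(val_loss)) + 1  # epochs are 1-indexed

print(best_epoch)
```

This is the same signal the early stopping feature discussed later in the article exploits automatically.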
<br /><br />The second graphic can be prepared with the following code: <br /><br /><b># Plot training & validation loss values # <br /><br />plt.plot(epochs, history.history['loss'], 'b', label = 'Training loss') <br /><br />plt.plot(epochs, history.history['val_loss'], 'bo', label = 'Validation loss', color = 'orange') <br /><br />plt.title('Training and validation loss') <br /><br />plt.ylabel('Loss') <br /><br />plt.xlabel('Epoch') <br /><br />plt.legend(loc='upper left') <br /><br />plt.show()</b><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXIHoRjrsOazxs_yuELKIw_pREdGnVXq-MM1XGFWbkPUR1TidRR9ZYB6NKOxKQ1an1E3NDCJGPLiTmFC3MFIGecIZb83RmZ8Qg_zU4GZd1m7MFS2Zvodd0WiH3k8Oos6iUYSrvLZ3Fdv0ZDMHoZeseGEXDanXoLe-NOAHtm1pi1HkP09NMAcNcVUy4/s947/Keras_graphic2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="550" data-original-width="947" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXIHoRjrsOazxs_yuELKIw_pREdGnVXq-MM1XGFWbkPUR1TidRR9ZYB6NKOxKQ1an1E3NDCJGPLiTmFC3MFIGecIZb83RmZ8Qg_zU4GZd1m7MFS2Zvodd0WiH3k8Oos6iUYSrvLZ3Fdv0ZDMHoZeseGEXDanXoLe-NOAHtm1pi1HkP09NMAcNcVUy4/w400-h233/Keras_graphic2.png" width="400" /></a></div><div><br /></div><b><u>Interpreting this Graphic</u></b> <br /><br />This graphic illustrates the improvement of the model over time. The solid blue line represents the data which was utilized to train the model, and the orange dotted line represents the data which is being utilized to test the model’s predictability. If a gap does not emerge between the lines throughout the training process, it is advisable to set the number of epochs to a figure which, after subsequent graphing occurs, demonstrates a flat plateauing of both lines. 
<br /><br /><b><u>Reproducing Model Training Results</u></b> <br /><br />If Keras is being utilized, and TensorFlow is the methodology selected to act as a “backend”, then the following lines of code must be utilized to guarantee reproducibility of results. <br /><br /><b># Any number could be utilized within each function # <br /><br /># Initiate the RNG of the general python library # <br /><br />import random <br /><br /># Initiate the RNG of the numpy package # <br /><br />import numpy.random <br /><br /># Set a random seed as it pertains to the general python library # <br /><br />random.seed(777) <br /><br /># Set a random seed as it pertains to the numpy library # <br /><br />numpy.random.seed(777) <br /><br /># Initiate the RNG of the tensorflow package # <br /><br />from tensorflow import set_random_seed <br /><br /># Set a random seed as it pertains to the tensorflow library # <br /><br />set_random_seed(777) <br /><br /><u>Missing Values in Keras </u></b><br /><br />Much like the previous models discussed, Keras has difficulties as it relates to variables which contain missing observational values. If a Keras model is trained on data which contains missing variable values, the training process will occur without interruption; however, the missing values will be analyzed under the assumption that they are representative of a measurement. Meaning that the library will<b> NOT</b> automatically assume that the value is a missing value, and from such, estimate a placeholder value based on other variable observations within the set. <br /><br />To make assumptions for the missing values based on the process described above, we must utilize the <b>Imputer()</b> class from the Python library: <b>“sklearn”</b>. 
Sample code which can be utilized for this purpose can be found below: <br /><b><br />from sklearn.preprocessing import Imputer <br /><br />imputer = Imputer() <br /><br />transformed_values = imputer.fit_transform(predictors) </b><br /><br />Additional details pertaining to this function, its utilization, and its underlying methodology, can be found within the previous article: <b>“(R) Machine Learning - The Random Forest Model – Pt. III”</b>. <br /><br />Having tested this method of variable generation on sets which I purposely modified, I can attest that its capability for achieving such is excellent. After generating fictitious placeholder values and then subsequently utilizing the Keras package to create a model, comparatively speaking, I saw no differentiation between the predicted results related to each individual set. <br /><br /><b style="text-decoration: underline;">Early Stopping</b> <br /><br />There may be instances which necessitate the creation of a model that will be applicable to a very large data set. This essentially, in most cases, guarantees a very long training time. To help shorten this process, we can utilize an <b>“early stopping monitor”</b>. <br /><br />First, we must import the feature from the callbacks module: <br /><br /><b>from keras.callbacks import EarlyStopping </b><br /><br />Next we will create and define the parameters pertaining to the feature: <br /><br /><b># If model improvement stagnates after 2 epochs, the fitting process will cease # <br /><br />early_stopping_monitor = EarlyStopping(monitor='loss', patience = 2, min_delta=0, verbose=0, mode='auto', baseline=None, restore_best_weights=True) <br /></b><br />Many of the options present within the code above are defaults. However, there are a few worth mentioning. <br /><br /><b>monitor = ‘loss’</b> - This option is specifically instructing the function to monitor the loss value during each training epoch. 
<br /><br /><b>patience = 2</b> – This option is instructing the function to cease training if the loss value ceases to decline after 2 epochs. <br /><br /><b>restore_best_weights=True</b> – This option is instructing the function to retain the model weights from the best-performing epoch, rather than the weights from the final epoch, once training ceases. The subsequent training values will be discarded. <br /><br />With the early stopping feature defined, we can add it to the training function below: <br /><br /><b>history = model.fit(predictors[1:100,], target[1:100,], shuffle=True, epochs=100, callbacks =[early_stopping_monitor], validation_data =(predictors[100:150,] , target[100:150,])) </b><br /><br /><b><u>Final Thoughts on Keras</u></b> <br /><br />In my final thoughts pertaining to the Keras model, I would like to discuss the pros and cons of the methodology. Keras is, without doubt, the machine learning model type which possesses the greatest predictive capacity. Keras can also be utilized to identify images, which is a feature that is lacking within most other predictive models. However, despite these accolades, Keras does fall short in a few categories. <br /><br />For one, the mathematics which act as the mechanism for the model’s predictive capacity are incredibly complex. As a result of such, model creation can only occur within a digital medium. With this complexity comes an inability to easily verify or reproduce results. Additionally, creating the optimal model configuration as it pertains to the number of neurons, layers, epochs, etc., becomes almost a matter of personal taste. This sort of heuristic approach is negative for the field of machine learning, statistics, and science in general. <br /><br />Another potential flaw relates to the package documentation. The website for the package is poorly organized. 
The videos created by researchers who attempt to provide instruction are also poorly organized, riddled with heuristic approaches, and scuttled by a severe lack of awareness. It would seem that no single individual truly understands how to appropriately utilize all of the features of the Keras package. In my attempts to properly understand and learn the Keras package, I purchased the book, <i>Deep Learning with Python</i>, written by the package’s creator, Francois Chollet. This book was also poorly organized, and suffered from the assumption that the reader could inherently understand the writer’s thoughts. <br /><br />This being said, I do believe that the future of statistics and predictive analytics lies parallel with the innovations demonstrated within the Keras package. However, the package is so new that not a single individual, including the creator, has had the opportunity to fully utilize and document its potential. In this lies latent opportunity for the patient individual to prosper by pioneering the sparse landscape. <br /><br />It is my opinion that at this current time, the application of the Keras model should be paired with other traditional statistical and machine learning methodologies. This pairing of multiple models will enable the user and potential outside researchers to gain a greater understanding as to what may be motivating the Keras model’s predictive outputs. </div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-25645155042821402842022-10-21T16:17:00.002-04:002022-10-21T16:22:14.791-04:00(R) Machine Learning - Gradient Boosted Algorithms – Pt. IV Of all the models which have been discussed thus far, the most complicated and the most effective of the models which utilize the tree methodology are the models which belong to a primary sub-group known as <b>“gradient boosted algorithms”</b>. 
<br /><br />Gradient boosted models are similar to the random forest model; the primary difference between the two is that gradient boosted models synthesize their individual trees differently. Whereas random forests seek to minimize errors through a randomization process, gradient boosted models address the errors of each tree as it is created. Meaning that each tree is re-assessed after its creation occurs, and the subsequent tree is optimized based on acknowledgement of the prior tree’s errors. <br /><br /><b><u>Model Creation Options</u></b> <br /><br />As the gradient boosted algorithm possesses components of all of the previously discussed model methodologies, its internal mechanism is complex by design, and the model consequently presents a greater number of options. These options can remain at their default assignments, in which case they will assume predetermined values in accordance with the surrounding circumstances. However, if you would like to customize the model’s synthesis, the following options are available for such: <br /><br /><b><u>distribution</u></b> – This option refers to the distribution type which the model will assume when analyzing the data utilized within the model design process. The following distribution types are available within the <b>“gbm”</b> package: <b>“gaussian”</b>, <b>“laplace”</b>, <b>“tdist”</b>, <b>“bernoulli”</b>, <b>“huberized”</b>, <b>“adaboost”</b>, <b>“poisson”,</b> <b>“coxph”</b>, <b>“quantile” </b>and <b>“pairwise”</b>. If this option is not explicitly indicated by the user, the system will automatically decide between <b>“gaussian”</b> and <b>“bernoulli”</b>, as to which distribution type best suits the model data. 
<br /><br /><b><u>n.minobsinnode</u></b> – Integer specifying the minimum number of observations in the terminal nodes of the trees. <br /><br /><b><u>n.trees</u></b> – The number of trees which will be utilized to create the final model. <br /><br /><b><u>interaction.depth</u></b> - Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1. <br /><br /><b><u>cv.folds</u></b> – Specifies the number of “cross-validation” folds to perform. This option essentially provides additional model output in the form of additional testing results. Similar output is generated by default within the random forest model package. <br /><br /><b><u>shrinkage</u></b> - A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually works, but a smaller learning rate typically requires more trees. Default is 0.1. <br /><br /><b><u>Optimizing a Model with the “CARET” Package</u></b> <br /><br />For the everyday analyst, being confronted with the task of appropriately assigning values to the aforementioned fields can be disconcerting. This task is also undertaken with the understanding that by incorrectly assigning a variable field, an individual can vastly compromise the validity of a model’s results. Thankfully, the <b>“CARET”</b> package exists to assist us with our model optimization needs. <br /><br /><b>“CARET”</b> is an auxiliary package with numerous uses, chief among them a function which can be utilized to assess model optimization prior to synthesis. 
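The sequential error-correction which defines gradient boosting, along with the roles of n.trees, interaction.depth, and shrinkage, can be illustrated with a minimal hand-rolled boosting loop; this Python sketch uses scikit-learn's decision tree as a stand-in and is not the "gbm" implementation itself (the data is synthetic and illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Parameters named after the gbm options described above
n_trees = 50      # n.trees
shrinkage = 0.1   # shrinkage (the learning rate)
max_depth = 2     # roughly analogous to interaction.depth

# Begin with a constant prediction, then fit each new tree to the
# residual errors left behind by the ensemble built so far
prediction = np.full_like(y, y.mean())
trees = []
for _ in range(n_trees):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    prediction += shrinkage * tree.predict(X)
    trees.append(tree)

# Training error shrinks as each tree corrects its predecessors
initial_mse = np.mean((y - y.mean()) ** 2)
final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

The shrinkage factor deliberately scales down each tree's contribution, which is why a smaller learning rate typically requires more trees.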
In the case of our example, we will be utilizing the following packages to demonstrate this capability: <br /><br /><b># With the “CARET” package downloaded and enabled # <br /><br /># With the “e1071” package downloaded and enabled # </b><br /><br />With the above packages downloaded and enabled, we can run the following <b>“CARET”</b> function to generate console output pertaining to the various model types which <b>“CARET”</b> can be utilized to optimize: <br /><br /><b># List different models which the train() function can optimize # <br /><br />names(getModelInfo()) </b><br /><br />The console output is too voluminous to present in its entirety within this article. However, a few notable options which warrant mentioning as they pertain to previously discussed methodologies are: <br /><br />rf – Which refers to the random forest model. <br /><br />treebag – Which refers to the bootstrap aggregation model. <br /><br />glm – Which refers to the generalized linear model. <br /><br />(and) <br /><br />gbm – Which refers to the gradient boosted model. <br /><br />Let’s start by regenerating the random sets which are comprised of random observations from our favorite <b>“iris”</b> set. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. 
# <br /><br />train(Species~.,data=raniris[1:100,], method = "gbm") </b><br /><br />This produces a voluminous amount of console output; however, the portion of the output which we will focus upon is the bottommost section. <br /><br />This output should resemble something similar to: <br /><br /><i>Tuning parameter 'shrinkage' was held constant at a value of 0.1 <br />Tuning parameter 'n.minobsinnode' was held constant at a value of 10 <br />Accuracy was used to select the optimal model using the largest value. <br />The final values used for the model were n.trees = 50, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10. </i><br /><br />From this information, we discover the optimal parameters with which to establish a gradient boosted model. <br /><br />In this particular case: <br /><br />n.trees = 50 <br /><br />interaction.depth = 2 <br /><br />shrinkage = 0.1 <br /><br />n.minobsinnode = 10 <br /><br /><b style="text-decoration: underline;">A Real Application Demonstration (Classification)</b> <br /><br />With the optimal parameters discerned, we may continue with the model building process. The model created for this example is of the classification type. For a classification model type, the “multinomial” option should typically be specified. 
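Incidentally, rather than reading the optimal values off of the console output, they can be pulled from the object returned by train(), and the searched combinations can themselves be specified through a tuning grid. The following is a minimal sketch; the grid values are illustrative only:

```r
# With the "CARET" package downloaded and enabled #

# Define the candidate parameter combinations to be searched (illustrative values) #
tunegrid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = c(1, 2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

# Store the tuning results within an object instead of printing them #
fit <- train(Species ~ ., data = raniris[1:100,], method = "gbm",
             tuneGrid = tunegrid, verbose = FALSE)

# The single best combination of tuning parameters #
fit$bestTune

# The full table of combinations which were evaluated #
fit$results
```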
<br /><br /><b># Create Model # <br /><br />model <- gbm(Species ~., data = raniris[1:100,], distribution = 'multinomial', n.trees = 50, interaction.depth = 2, shrinkage = 0.1, n.minobsinnode = 10) <br /><br /># Test Model # <br /><br />modelprediction <- predict(model, n.trees = 50, newdata = raniris[101:150,] , type = 'response') <br /><br /># View Results # <br /><br />modelprediction0 <- apply(modelprediction, 1, which.max) <br /><br /># View Results in a readable format # <br /><br />modelprediction0 <- colnames(modelprediction)[modelprediction0] <br /><br /># Create Confusion Matrix # <br /><br />table(raniris[101:150,]$Species, predicted = modelprediction0) <br /></b><br /><u>Console Output:</u> <br /><br /> predicted <br /> setosa versicolor virginica <br />setosa 19 0 0 <br /> versicolor 0 13 2 <br /> virginica 0 2 14 <br /><br /><b style="text-decoration: underline;">A Real Application Demonstration (Continuous Dependent Variable</b><b><span style="text-decoration: underline;">)</span> </b><br /><br />As was the case with the previous example, we will again be utilizing the <b>train()</b> function within the <b>“CARET”</b> package to determine model optimization. As it pertains to continuous dependent variables, the <b>“gaussian”</b> option should be specified if the data is normally distributed, and the <b>“tdist” </b>option should be specified if the data is non-parametric. <br /><br /><b># Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. 
# <br /><br />model <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", method = "gbm") <br /><br />model </b><br /><br /><u>Console Output:</u> <br /><br />Stochastic Gradient Boosting <br /><br />100 samples <br /> 3 predictor <br /><br />No pre-processing <br />Resampling: Bootstrapped (25 reps) <br />Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... <br />Resampling results across tuning parameters: <br /><br /><i> interaction.depth n.trees RMSE Rsquared MAE <br /> 1 50 0.4256570 0.7506086 0.3316030 <br /> 1 100 0.4083072 0.7623251 0.3258838 <br /> 1 150 0.4067113 0.7607363 0.3270202 <br /> 2 50 0.4241599 0.7471639 0.3347628 <br /> 2 100 0.4184793 0.7466858 0.3335772 <br /> 2 150 0.4212821 0.7427328 0.3369379 <br /> 3 50 0.4248178 0.7433384 0.3345428 <br /> 3 100 0.4260524 0.7391382 0.3385778 <br /> 3 150 0.4278416 0.7345970 0.3398392 </i><br /><br /><i>Tuning parameter 'shrinkage' was held constant at a value of 0.1 <br />Tuning parameter 'n.minobsinnode' was held constant at a value of 10 <br />RMSE was used to select the optimal model using the smallest value. <br />The final values used for the model were n.trees = 150, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10. 
</i><br /><br /><b># Optimal Model Parameters # <br /><br /># n.trees = 150 # <br /><br /># interaction.depth = 1 # <br /><br /># shrinkage = 0.1 # <br /><br /># n.minobsinnode = 10 # <br /><br /># Create Model # <br /><br />tmodel <- gbm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", n.trees = 150, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10) <br /><br /># Test Model # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response') <br /><br /># Compute the Root Mean Squared Error (RMSE) of model testing data # <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[101:150,]$Sepal.Length, tmodelprediction) <br /><br /># Compute the Root Mean Squared Error (RMSE) of model training data # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response') <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[1:100,]$Sepal.Length, tmodelprediction) </b><br /><br /><u>Console Output:</u> <br /><br /><i>[1] 0.4060854 <br /><br />[1] 0.3144518 </i><br /><br /><b># Mean Absolute Error # <br /><br /># Create MAE function # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br /><br /># Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA # <br /><br /># Utilize MAE function on model testing data # <br /><br /># Regenerate Predictions # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response') <br /><br /># Generate Output # <br /><br />MAE(raniris[101:150,]$Sepal.Length, tmodelprediction) <br /><br /># Utilize MAE function on model training data # <br /><br /># Regenerate Predictions # <br /><br />tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response') <br /><br /># Generate Output # <br /><br
/>MAE(raniris[1:100,]$Sepal.Length, tmodelprediction) </b><br /><br /><u>Console Output:</u> <br /><br /><i>[1] 0.3320722 <br /><br />[1] 0.2563723 </i><br /><br /><b><u>Graphing and Interpreting Output</u></b> <br /><br />The following method creates an output which quantifies the importance of each variable within the model. The type of analysis which determines the variable importance depends on the model type specified within the initial function. In the case of each model, the code samples below produce the subsequent outputs: <br /><br /><b># Multinomial Model # <br /><br />summary(model)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMfFlkmkL6zrZVyE1sPSBvhNz7SzLuDlOuMzDHpGHzfBR8QjbvVXwPVvq827y9dWSm6Z_vLLq8ES9AOr0DrEXSVAgKGUv7LFRME79Oj9YIFOHW74f7Zvqd2WJxqRgsc_tSoEFhhUBjrnj5B_CKxWzkjg4FnJcX__g93VuTb8feWl3j-KKilsHC5dsY/s583/GBM1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="493" data-original-width="583" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMfFlkmkL6zrZVyE1sPSBvhNz7SzLuDlOuMzDHpGHzfBR8QjbvVXwPVvq827y9dWSm6Z_vLLq8ES9AOr0DrEXSVAgKGUv7LFRME79Oj9YIFOHW74f7Zvqd2WJxqRgsc_tSoEFhhUBjrnj5B_CKxWzkjg4FnJcX__g93VuTb8feWl3j-KKilsHC5dsY/w400-h338/GBM1.png" width="400" /></a></div><div><br /></div><u>Console Output:</u> <br /><br /><i> var rel.inf <br />Petal.Length Petal.Length 59.0666833 <br />Petal.Width Petal.Width 38.6911265 <br />Sepal.Width Sepal.Width 2.1148704 <br />Sepal.Length Sepal.Length 0.1273199 </i><br /><br /><br /><b>####################################### <br /><br /># T-Distribution Model # <br /><br />summary (tmodel)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzVbBmvn_C8vtmIQz77frRAXUlu_yTONs424YVpXBQGSEj54iSWjx5sacBiCn7bD4FNrKabQAQDbDwqA8TgfRbF7cFNed4f5uUkE7-FFiDwRnV_aC9K4_VYx3Dob_rG2dsBhaRc3IAx8UcjpR-XH1Gte47WdhBiJnk5-QSxGMvtAn838vpr0pqR_XV/s588/GBM2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="506" data-original-width="588" height="341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzVbBmvn_C8vtmIQz77frRAXUlu_yTONs424YVpXBQGSEj54iSWjx5sacBiCn7bD4FNrKabQAQDbDwqA8TgfRbF7cFNed4f5uUkE7-FFiDwRnV_aC9K4_VYx3Dob_rG2dsBhaRc3IAx8UcjpR-XH1Gte47WdhBiJnk5-QSxGMvtAn838vpr0pqR_XV/w400-h341/GBM2.png" width="400" /></a></div><div><br /><u>Console Output:</u> <br /><i><br /> var rel.inf <br />Petal.Length Petal.Length 74.11473 <br />Sepal.Width Sepal.Width 14.18743 <br />Petal.Width Petal.Width 11.69784</i></div><div><i><br /></i></div><div>That's all for now.</div><div><br /></div><div>I'll see you next time, Data Heads!</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-15162184907209897942022-10-13T10:30:00.005-04:002022-10-13T22:34:28.397-04:00(R) Machine Learning - The Random Forest Model – Pt. IIIWhile unsupervised machine learning methodologies were enduring their initial genesis, the Random Forest Model ruled the machine learning landscape as the best predictive model type available. In this article, we will review the Random Forest Model. If you haven’t done so already, I would highly recommend reading the prior articles pertaining to Bagging and Tree Modeling, as these articles illustrate many of the internal aspects which together converge into the Random Forest model methodology.<br /><br /><b><u>What is a Random Forest and How is it Different?</u></b><br /><br />The random forest method of model creation contains certain elements of both the bagging and standard tree methodologies. 
The random forest sampling step is similar to that of the bagging model. Also, in a similar manner, the random forest model is comprised of numerous individual trees, with the output figure being the majority consensus reached as data is passed through each individual tree model. The only real differentiating factor present within the random forest model is the initial nodal split designation, which occurs immediately following the model’s root pathway.<br /><br />For example, if the following data frame was structured and prepared to serve as a random forest model’s foundation, the first step which would occur during the initial algorithmic process would be random sampling. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-eqKZNfAcK3FCfKlGq170pcash-VDAqVUKI47CQJaoBDKyKvtEuQyOEU1-rwI32xzPGFdgbAEim3D5-tmb4OT1WaV93II-G3CeKR9T7IzoUGY62gOU-WSmQNMyki8VqYm9l3AztSmKI6gYiQxGPEeSCNddMTwGwJZhveSGOd5B96v2uEmtQf2mriI/s505/RF0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="253" data-original-width="505" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-eqKZNfAcK3FCfKlGq170pcash-VDAqVUKI47CQJaoBDKyKvtEuQyOEU1-rwI32xzPGFdgbAEim3D5-tmb4OT1WaV93II-G3CeKR9T7IzoUGY62gOU-WSmQNMyki8VqYm9l3AztSmKI6gYiQxGPEeSCNddMTwGwJZhveSGOd5B96v2uEmtQf2mriI/s16000/RF0.png" /></a></div><div><br /></div><div>Like the bagging model’s sampling process, the performance of this step might also resemble:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9y6v-kyBeRlDdckAzIfAjVZqhGtb3OVoZHtc2hmDW_q84K-SmL_QZLErHJKuV-8LGOiLRKvM2yMQrxZklldMbMmTUf4AOvG2HXD1_UnH4rdoaGrzmIlqJaSL91aOBuBM-EiO_dktHXqJluWbzSsdvR8fo9dnK9uTDksklZJuKMiJAml2lo7G1n_D9/s817/RF_01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="289" 
data-original-width="817" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9y6v-kyBeRlDdckAzIfAjVZqhGtb3OVoZHtc2hmDW_q84K-SmL_QZLErHJKuV-8LGOiLRKvM2yMQrxZklldMbMmTUf4AOvG2HXD1_UnH4rdoaGrzmIlqJaSL91aOBuBM-EiO_dktHXqJluWbzSsdvR8fo9dnK9uTDksklZJuKMiJAml2lo7G1n_D9/w640-h226/RF_01.png" width="640" /></a></div><div><br /></div>As was previously mentioned, the main differentiating factor which separates the random forest model from the other models whose parts it incorporates is the manner in which the initial nodal split is designated. In the bagging model, numerous individual trees are created, and each tree is created from the same algorithmic equation as it is applied to each individual data set. In this manner, the optimization pattern is static, while the data for each set is dynamic. <br /><br />As it pertains to the random forest model, after the creation of each individual set has been established, a pre-selected number of independent variable categories are designated at random from each set; this selection will be assessed by the algorithm, with the most optimal pathway being ultimately selected from amongst the selection of pre-determined variables. <br /><br />For example, we’ll assume that the number of pre-designated variables which will be selected prior to the creation of each individual tree is 3. If this were the case, each tree within the model will have its initial nodal designation decided by whichever of the three variables is optimal for performing the initial filtering process. The other two variables which are not selected are then considered for additional nodal splits, along with all of the other variables which the model finds particularly worthy. 
<br /><br />With this in mind, a set of variables which would consist of three randomly selected independent variables, might resemble the following as it relates to the initial nodal split:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc7AHZahkp-9_WECfPmt_BYEof5-Q-3lhc0yEL5yQcb8wwaP-6MAGFQ2ijbZYn7XKnkSzdmQMFPLRdtNOAAaKONIu41I5tDmVsAKEGyGOemvq8eK6dgXKTdZCTVu9OdxmvWC4tZ62a3Fu5Bu_k-zSEWtlkv5eunbk_adSHNIFwYKNNxfRtM9hHL7q6/s599/RForest.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="244" data-original-width="599" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc7AHZahkp-9_WECfPmt_BYEof5-Q-3lhc0yEL5yQcb8wwaP-6MAGFQ2ijbZYn7XKnkSzdmQMFPLRdtNOAAaKONIu41I5tDmVsAKEGyGOemvq8eK6dgXKTdZCTVu9OdxmvWC4tZ62a3Fu5Bu_k-zSEWtlkv5eunbk_adSHNIFwYKNNxfRtM9hHL7q6/w640-h260/RForest.png" width="640" /></a></div><div><br /></div>In this case, the blank node’s logical discretion would be selected from the optimal selection of a single variable from the set: {Sepal.Length, Sepal.Width, Petal.Length}. <br /><br />One variable would be selected from the set, with the other two variables then being returned to the larger set of all other variables from the initial data frame. From this larger set, all additional nodes would be established based on the optimal placement values determined by the underlying algorithm. 
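The random designation described above can be mimicked directly in R for illustration’s sake. The following is a minimal sketch; randomForest() performs this selection internally, so this is purely demonstrative:

```r
# Pool of independent variables within the "iris" data frame #
predictors <- setdiff(names(iris), "Species")

# Randomly designate 3 candidate variables for the initial nodal split #
set.seed(454)
sample(predictors, 3)
```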
<br /><br /><b><u>The Decision-Making Process</u></b> <br /><br />In a manner which exactly resembles the bagging-bootstrap aggregation method described within the prior article, the predictive output figure consists of the majority consensus reached as data is passed through each individual tree model.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-8qYwS70xlvOQNcPMEwg_1irPLA1nDN7ViYp0642oAsHHHZXPZ9Y8ngWcTIw5uwfMCa_KjD31H8uVy7xastMZY9AT5-K7yq2049jxhN4S24gAjZBohiJh53aE4OJuJ8VpcK9Ue9WWpzjLNkWL5Vq1zgxu82vfdy9YrGgO3FAzK_0P2hr3Rdepb6Pm/s625/RF_3x.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="245" data-original-width="625" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-8qYwS70xlvOQNcPMEwg_1irPLA1nDN7ViYp0642oAsHHHZXPZ9Y8ngWcTIw5uwfMCa_KjD31H8uVy7xastMZY9AT5-K7yq2049jxhN4S24gAjZBohiJh53aE4OJuJ8VpcK9Ue9WWpzjLNkWL5Vq1zgxu82vfdy9YrGgO3FAzK_0P2hr3Rdepb6Pm/w640-h250/RF_3x.png" width="640" /></a></div><div><br /></div>The above graphical representation illustrates observation 8 being passed through the model. The model, being comprised of three separate decision trees, which were synthesized from three separate data sets, produces three different internal outcomes. The majority consensus of these outcomes is what is eventually returned to the user as the ultimate product of the model. <br /><br /><b><u>A Real Application Demonstration (Classification)</u></b> <br /><br />Again, we will utilize the "iris" data set which comes embedded within the R data platform. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. 
# <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "randomForest" downloaded and enabled # <br /><br /># Create the model # <br /><br />mod <- randomForest(Species ~., data= raniris[1:100,], type = "class") <br /><br /># View the model summary # <br /><br />mod</b><br /><br /><u>Console Output:</u> <br /><br /><i>Call: <br /> randomForest(formula = Species ~ ., data = raniris[1:100, ], type = "class") <br /> Type of random forest: classification <br /> Number of trees: 500 <br />No. of variables tried at each split: 2 <br /><br /> OOB estimate of error rate: 4% <br />Confusion matrix: <br /> setosa versicolor virginica class.error <br />setosa 31 0 0 0.00000000 <br />versicolor 0 34 1 0.02857143 <br />virginica 0 3 31 0.08823529 </i><br /><br /><b><u>Deciphering the Output</u> </b><br /><br /><b>Call:</b> - The formula which initially generated the console output. <br /><br /><b>Type of random forest: Classification </b>– The model type applied to the data frame passed through the “randomForest()” function. <br /><br /><b>Number of trees: 500 </b>– The number of individual trees of which the data model is comprised. <br /><br /><b>No. of variables tried at each split: 2</b> – The number of randomly selected variables considered as candidates at each nodal split. <br /><br /><b>OOB estimate of error rate: 4%</b> - The percentage of erroneous predictions which were discovered as a result of passing OOB (out of bag) data through the completed model. <br /><br /><b>Class.error</b> – The percentage which appears within the rightmost column represents the number of incorrectly categorized observations within the row divided by the total number of observations within the row. <br /><br /><b><u>OOB and the Confusion Matrix</u></b> <br /><br />OOB is an abbreviation for “Out of Bag”. 
As each individual tree is established within the random forest model, a portion of the observations from the original data set will, as a consequence of the sampling method, not be selected for inclusion within that tree’s subset. To generate both the OOB estimate of the error rate and the confusion matrix within the object summary, this withheld data is passed through each individual tree once the tree is created. Through an internal tallying and consensus methodology, the confusion matrix presents an estimate of the predictions for all observations which existed within the initial data set; however, not every observation was assessed by every tree within the series. The consensus is that this test of predictive capacity is superior to testing the complete model with the data which trained it. However, due to the complexity innate within the methodology, which makes explaining findings to others difficult, I will often also run the standard prediction function as well. 
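The figures within the object summary can also be accessed programmatically. A minimal sketch, assuming the “mod” object created above:

```r
# With the package "randomForest" downloaded and enabled #

# The OOB confusion matrix, including the class.error column #
mod$confusion

# The OOB error rate after the final tree (first column of the error-rate matrix) #
mod$err.rate[nrow(mod$err.rate), 1]

# Overall OOB accuracy: correct OOB classifications divided by total observations #
sum(diag(mod$confusion[, 1:3])) / sum(mod$confusion[, 1:3])
```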
<br /><br /><b># View model classification results with training data # <br /><br />prediction <- predict(mod, raniris[1:100,], type="class") <br /><br />table(raniris[1:100,]$Species, predicted = prediction ) <br /><br /> # View model classification results with test data # <br /><br />prediction <- predict(mod, raniris[101:150,], type="class") <br /><br />table(raniris[101:150,]$Species, predicted = prediction ) </b><br /><br /><u>Console Output (1):</u> <br /><br /><i> predicted <br /> setosa versicolor virginica <br /> setosa 31 0 0 <br /> versicolor 0 35 0 <br /> virginica 0 0 34 </i><br /><br /><u>Console Output (2):</u> <br /><br /><i> predicted <br /> setosa versicolor virginica <br /> setosa 19 0 0 <br /> versicolor 0 13 2 <br /> virginica 0 2 14 </i><br /><br />As you probably already noticed, the “Console Output (1)” values differ from those produced within the object’s Confusion Matrix. This is a result of the phenomenon which was just previously discussed. <br /><br />To further illustrate this concept, if I were to change the number of trees to be created to: 2, thus, overriding the package default, the Confusion Matrix will lack enough observations to reflect the total number of observations within the initial set. The result would be the following: <br /><br /><b># With the package "randomForest" downloaded and enabled # <br /><br /># Create the model # <br /><br />mod <- randomForest(Species ~., data= raniris[1:100,], ntree= 2, type = "class") <br /><br /># View the model summary # <br /><br />mod </b><br /><br /><i>Call: <br /> randomForest(formula = Species ~ ., data = raniris[1:100, ], ntree = 2, type = "class") <br /> Type of random forest: classification <br /> Number of trees: 2 <br />No. 
of variables tried at each split: 2 </i><br /><br /><i> OOB estimate of error rate: 3.57% <br />Confusion matrix: <br /> setosa versicolor virginica class.error <br />setosa 15 0 0 0.00000000 <br />versicolor 0 19 0 0.00000000 <br />virginica 0 2 20 0.09090909 </i><br /><br /><b><u>Peculiar Aspects of randomForest</u></b> <br /><br />There are a few particular aspects of the randomForest package which differ from the previously discussed packages. One of these is how the randomForest() function assesses variables within a data frame. Specifically, the package function requires that variables which will be analyzed have their types specifically assigned. <br /><br />To address this, we must first view the data type to which each variable is assigned. <br /><br />This can be accomplished with the following code: <br /><br /><b>str(raniris) <br /></b><br />Which produces the output: <br /><br /><i>'data.frame': 150 obs. of 5 variables: <br /> $ Sepal.Length: num 5 5.6 4.6 6.4 5.7 7.7 6 5.8 6.7 5.6 ... <br /> $ Sepal.Width : num 3.4 2.5 3.6 3.1 2.5 3.8 3 2.7 3.1 3 ... <br /> $ Petal.Length: num 1.5 3.9 1 5.5 5 6.7 4.8 5.1 4.4 4.5 ... <br /> $ Petal.Width : num 0.2 1.1 0.2 1.8 2 2.2 1.8 1.9 1.4 1.5 ... <br /> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 1 3 3 3 3 3 2 2 ... </i><br /><br />While this data frame does not require additional modification, if there was a need to change or assign variable types, this can be achieved through the following lines of code: <br /><br /><b># Change variable type to numeric (continuous) # <br /><br />dataframe$contvar <- as.numeric(dataframe$contvar) <br /><br /># Change variable type to categorical # <br /><br />dataframe$catvar <- as.factor(dataframe$catvar) </b><br /><br />Another unique differentiation which applies to the randomForest() function is the way in which it handles missing observational variable entries. 
You may recall from when we were previously building tree models within the <b>“rpart”</b> package, that the model methodology included within such contained an internal algorithm which assessed missing variable observational values, and assigned those values <b>“surrogate values”</b> based on other similar variable observations. <br /><br />Unfortunately, the randomForest() function requires that the user take a more manual approach as it pertains to working around, or otherwise including, these observational values within the eventual model. <br /><br />First, be sure that all variables within the model are appropriately assigned to the correct corresponding data types. <br /><br />Next, you will need to impute the data. To achieve this, you will need to utilize the following code for each variable column which is absent data. <br /><br /><b># Impute missing variable values # <br /><br />rfImpute(variablename ~., data=dataframename, iter = 500) </b><br /><br />This function instructs the randomForest package library to create new variable entries for whatever the specified variable may be by considering similar entries contained within other variable columns. “iter = “ specifies the number of iterations to utilize when accomplishing this task, as this method of variable generation requires the creation of numerous tree models. A maximum of 6 iterations is typically enough to accomplish this task, even if your data frame is colossal; however, I err on the side of extreme caution. 
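One point worth noting before applying the function: rfImpute() returns a completed copy of the entire data frame (with the response variable as the first column), rather than a single imputed column, so its result is normally assigned back to the data frame as a whole. A minimal sketch, assuming a copy of <b>“iris”</b> with artificially introduced NA values:

```r
# With the package "randomForest" downloaded and enabled #

# Create a copy of "iris" which contains missing observational values #
nairis <- iris
nairis[c(5, 25, 50), "Sepal.Length"] <- NA

# Impute all missing predictor values; the response variable (Species) must be complete #
set.seed(454)
nairis <- rfImpute(Species ~ ., data = nairis, iter = 6)

# Confirm that no NA values remain within the data frame #
sum(is.na(nairis))
```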
<br /><br />Though it’s unnecessary, let’s apply this function to each variable within our “iris” data frame: <br /><br /><b>raniris[1:100,]$Sepal.Length <- rfImpute(Sepal.Length ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Sepal.Width <- rfImpute(Sepal.Width ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Petal.Length <- rfImpute(Petal.Length ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Petal.Width <- rfImpute(Petal.Width ~., data=raniris[1:100,], iter = 500) <br /><br />raniris[1:100,]$Species <- rfImpute(Species ~., data=raniris[1:100,], iter = 500) </b><br /><br />You will receive the error message: <br /><br /><i>Error in rfImpute.default(m, y, ...) : No NAs found in m </i><br /><br />Which correctly indicates that there were no NA values to be found in the initial set. <br /><br /><b><u>Variables to Consider for Initial Nodal Split</u></b> <br /><br />The randomForest package has, embedded within its namesake function, a default assignment for the number of variables which are considered as candidates at each nodal split. This value can be modified by the user for optimal utilization of the model’s capabilities. The functional option to specify this modification is <b>“mtry”</b>. <br /><br />How would a researcher decide what the optimal value of this option ought to be? Thankfully, a YouTube user named <b>StatQuest with Josh Starmer</b> has created the following code to assist us with this decision. 
<br /><br /><b># Optimal mtry assessment # <br /><br /># vector(length = ) must equal the number of independent variables within the function # <br /><br /># for(i in 1: ) must have a value which equals the number of independent variables within the function # <br /><br />oob.values <- vector(length = 4) <br /><br />for(i in 1:4) { <br /><br /> temp.model <- randomForest(Species ~., data=raniris[1:100,], mtry=i, ntree=1000) <br /><br /> oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1] <br /><br />} <br /><br /># View the object # <br /><br />oob.values </b><br /><br /><u>Console Output</u> <br /><br /><i>[1] 0.04 0.04 0.04 0.04</i> <br /><br />The values produced are the OOB error rates which are associated with each number of variable inclusions. <br /><br />Therefore, the leftmost value would be the OOB error rate with one variable included within the model. The rightmost value would be the OOB error rate with four variables included within the model. <br /><br />In the case of our model, as there is no change in OOB error as the number of variables considered at each split varies, the option “mtry” can remain unaltered. However, if for whatever reason, we wished to consider a set of 3 random variables at each split within our model, we would utilize the following code: <br /><br /><b>mod <- randomForest(Species ~., data= raniris[1:100,], mtry= 3, type = "class") </b><br /><br /><b><u>Graphing Output</u></b> <br /><br />There are numerous ways to graphically represent the inner aspects of a random forest model as its components work in tandem to generate a predictive analysis. In this section, we will review two of the simplest methods for generating illustrative output. <br /><br />The first method creates a general error plot of the model. 
This can be achieved through the utilization of the following code: <br /><br /><b># Plot model # <br /><br />plot(mod) <br /><br /># include legend # <br /><br />layout(matrix(c(1,2),nrow=1), <br /><br /> width=c(4,1)) <br /><br />par(mar=c(5,4,4,0)) #No margin on the right side <br /><br />plot(mod, log="y") <br /><br />par(mar=c(5,0,4,2)) #No margin on the left side <br /><br />plot(c(0,1),type="n", axes=F, xlab="", ylab="") <br /><br /># “col=” and “fill=” must both be set to one plus the total number of independent variables within the model # <br /><br />legend("topleft", colnames(mod$err.rate),col=1:4,cex=0.8,fill=1:4) <br /><br /># Source of Inspiration: <a href="https://stackoverflow.com/questions/20328452/legend-for-random-forest-plot-in-r">https://stackoverflow.com/questions/20328452/legend-for-random-forest-plot-in-r</a> # </b><br /><br />This produces the following output: <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr2xW_Q8x3Xlwm9Zh8sMDSaQd-Nco3DniDC7XQ8JsyoO4A1DlKNQ0kE3_h0Rgc3ZPWsqN4QmnaOQB8_1n5rl7qFKAfanwsRMMsu1JE506diGujimbQwps9L5hcwPtJaW_dcd7sFXSYQQDKCQtEs_0GlRvr4SgKg_x_S0F7ZxVxxAUfiMbOV6i0Qsvc/s868/RForest2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="868" data-original-width="718" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr2xW_Q8x3Xlwm9Zh8sMDSaQd-Nco3DniDC7XQ8JsyoO4A1DlKNQ0kE3_h0Rgc3ZPWsqN4QmnaOQB8_1n5rl7qFKAfanwsRMMsu1JE506diGujimbQwps9L5hcwPtJaW_dcd7sFXSYQQDKCQtEs_0GlRvr4SgKg_x_S0F7ZxVxxAUfiMbOV6i0Qsvc/w529-h640/RForest2.png" width="529" /></a></div><div><br /></div>This next method creates an output which quantifies the importance of each variable within the model. The type of analysis which determines the variable importance depends on the model type specified within the initial function. 
In the case of our classification model, the following graphical output is produced from the line of code below: <br /><br /><b>varImpPlot(mod)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7tFWDmq_PUyOuXrX1RtA-5YADudkaoq1CHbph35EEhUcOQoOjCgSUtdRpefA76oz8ggrsRYiY3sNeJNtCUM9nWWywzXv9YoYgUBUPTtybDN1KTGe-GJEdQxuetovXjk6N_PKPIUF_fP8Q5pvFDLSWXmIQih5nlDSqJW3AnOt0ihpg7OvWLAeAVqVy/s557/RForest3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="294" data-original-width="557" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7tFWDmq_PUyOuXrX1RtA-5YADudkaoq1CHbph35EEhUcOQoOjCgSUtdRpefA76oz8ggrsRYiY3sNeJNtCUM9nWWywzXv9YoYgUBUPTtybDN1KTGe-GJEdQxuetovXjk6N_PKPIUF_fP8Q5pvFDLSWXmIQih5nlDSqJW3AnOt0ihpg7OvWLAeAVqVy/s16000/RForest3.png" /></a></div><div><br /><b><u>A Real Application Demonstration (ANOVA)</u></b> <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the values in which the random set is ordered # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "randomForest" downloaded and enabled # <br /><br /># Create the model # <br /><br />anmod <- randomForest(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova") <br /></b><br />Like the previously discussed methodologies, you also have the option of utilizing Root Mean Squared Error, or Mean Absolute Error, to analyze the model’s predictive capacity. 
<br /><br /><b># Compute the Root Mean Squared Error (RMSE) of model training data # <br /><br />prediction <- predict(anmod, raniris[1:100,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[1:100,]$Sepal.Length, prediction) <br /><br /># Compute the Root Mean Squared Error (RMSE) of model testing data # <br /><br />prediction <- predict(anmod, raniris[101:150,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[101:150,]$Sepal.Length, prediction) <br /><br /># Mean Absolute Error # <br /><br /># Create MAE function # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br /><br /># Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA # <br /><br /># Generate model predictions # <br /><br />anprediction <- predict(anmod, raniris[1:100,]) <br /><br /># Utilize MAE function on model training data # <br /><br />MAE(raniris[1:100,]$Sepal.Length, anprediction) <br /><br /># Mean Absolute Error # <br /><br />anprediction <- predict(anmod, raniris[101:150,]) <br /><br /># Utilize MAE function on model testing data # <br /><br />MAE(raniris[101:150,]$Sepal.Length, anprediction) </b><br /><br /><u>Console Output (RMSE)</u> <br /><br /><i>[1] 0.2044091 <br /><br />[1] 0.3709858 </i><br /><br /><u>Console Output (MAE)</u> <br /><br /><i>[1] 0.2215909 <br /><br />[1] 0.2632491 </i><br /><br /> Just like the classification variation of the random forest model, graphical outputs can also be created to illustrate the internal aspects of the ANOVA version of the model. 
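For readers who prefer to see what the rmse() function and the MAE() helper are doing under the hood, here is a minimal base-R sketch. The actual and predicted vectors below are fabricated for illustration; they are not the iris model's output.

```r
# Illustrative actual vs. predicted values (made up for this sketch)
actual    <- c(5.0, 5.5, 6.1, 4.9)
predicted <- c(5.2, 5.4, 5.9, 5.1)

# Root Mean Squared Error: square the residuals, average them, take the root
rmse_val <- sqrt(mean((actual - predicted)^2))

# Mean Absolute Error: average the absolute residuals
mae_val <- mean(abs(actual - predicted))
```

Both measures are in the same units as the dependent variable; RMSE penalizes large individual errors more heavily than MAE does.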
<br /><br /><b># Plot model # <br /><br />plot(anmod)</b></div><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu6a1NezRetETof1Tm_U30uidFLjCq_zyBbPoVL0YYG9QoR5whDXt5rYQdqdJjKwAi0K1UZRhYsvixsjj-L0uC3lEsIqNMSBrS4ULynLPHPc2xdxQW1xkBh3NRezT8JqAOLTg4KmfhabppEVWW8KZzq3khyukrroFGPjXYHL2H5lNY6rf4K5UxIGya/s517/RForest4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="293" data-original-width="517" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu6a1NezRetETof1Tm_U30uidFLjCq_zyBbPoVL0YYG9QoR5whDXt5rYQdqdJjKwAi0K1UZRhYsvixsjj-L0uC3lEsIqNMSBrS4ULynLPHPc2xdxQW1xkBh3NRezT8JqAOLTg4KmfhabppEVWW8KZzq3khyukrroFGPjXYHL2H5lNY6rf4K5UxIGya/w640-h362/RForest4.png" width="640" /></a></div><div><br /></div><b># Measure variable significance # <br /><br />varImpPlot(anmod)</b><div><b><br /></b></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQmXeLw-ISzJZPUTG-Z4xn6t3iHY5kjBq-C5wDPvbLYo2DuLBVmxRqwuyHAVFrDg5dDShg3gXqtdRT-kmzf1RoTOpjskQ22xgirtD9wBkUEYUXP0n4pNwzEQn0EuZugBowmIFzxoRMuuJelfIpufzW2kX3L8sse6HxLKlhe-RUVWVEEQylPZ_Hh8g4/s707/RForest5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="293" data-original-width="707" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQmXeLw-ISzJZPUTG-Z4xn6t3iHY5kjBq-C5wDPvbLYo2DuLBVmxRqwuyHAVFrDg5dDShg3gXqtdRT-kmzf1RoTOpjskQ22xgirtD9wBkUEYUXP0n4pNwzEQn0EuZugBowmIFzxoRMuuJelfIpufzW2kX3L8sse6HxLKlhe-RUVWVEEQylPZ_Hh8g4/w640-h266/RForest5.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">That's all for this entry, Data Heads.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div 
class="separator" style="clear: both; text-align: left;">We'll continue on the topic of machine learning next week.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Until then, stay studious!</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">-RD</div><div><br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-68109324435135815272022-10-08T17:45:00.003-04:002022-10-08T22:37:16.944-04:00(R) Machine Learning - Bagging, Boosting and Bootstrap Aggregation – Pt. IINow that you have a fundamental understanding of tree-based modeling, we can begin to discuss the concept of <b>"Bootstrap Aggregation"</b>. Both of the previously mentioned concepts will come to serve as compositional aspects of a separate model known as <b>"The Random Forest"</b>. This methodology will be discussed in the subsequent article.<br /><br />All three of these concepts fall under the heading of <b>"Machine Learning"</b>, specifically, supervised machine learning. <br /><br /><b>"Bagging" </b>is a portmanteau of <b>"Boot</b>strap <b>Agg</b>regation<b>"</b>. Bootstrap aggregation describes a methodology in which multiple randomized samples are drawn, with replacement, from a source data set. From each of these sample sets, a decision tree is created, into which test data is eventually passed. Each observation within the test data set is analyzed as it passes through the numerous nodes of each individual tree. The predictive output is the consensus of the results reached by a majority of the individual internal models. <b>"Boosting"</b>, though frequently mentioned alongside bagging, refers to a distinct family of sequential techniques, in which each successive model concentrates on the observations which its predecessors predicted poorly. 
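The consensus step can be sketched in a few lines of base R. This is a hypothetical illustration of the vote tally, not the package's internal code; the class labels are simply borrowed from the iris example used throughout this series.

```r
# Suppose three internal trees each cast a class prediction for a single
# test observation (values fabricated for illustration):
votes <- c("versicolor", "virginica", "versicolor")

# The class receiving the most votes becomes the ensemble's output
tally    <- table(votes)
majority <- names(tally)[which.max(tally)]
majority  # "versicolor"
```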
<br /><br /><b><u>How Bagging is Utilized<br /></u></b><br />As previously discussed, <b>"Bagging" </b>is a data sampling methodology. For demonstrative purposes, let's consider its application to a randomized version of the <b>"iris" </b>data frame. Here is a portion of the data frame as it currently exists within the "R" platform.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJyJYn8AIpR7HWH9BtRYP3zozxC1u9CmhjykcNKrgxyAPpJsOWciLGiyQUtmLZRCzjVk8d6jpLq7IsJq-tjNBcerp_oXfDUoSQ2QM1CzszvHyZlzwHqhe7PTGNSkPkOkhwuNn0WwpYboXqp5gAvMw0HBusBSQASls-FsdOmj-XV_a5qjNGnDtLi42V/s505/bag1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="253" data-original-width="505" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJyJYn8AIpR7HWH9BtRYP3zozxC1u9CmhjykcNKrgxyAPpJsOWciLGiyQUtmLZRCzjVk8d6jpLq7IsJq-tjNBcerp_oXfDUoSQ2QM1CzszvHyZlzwHqhe7PTGNSkPkOkhwuNn0WwpYboXqp5gAvMw0HBusBSQASls-FsdOmj-XV_a5qjNGnDtLi42V/s16000/bag1.png" /></a></div><div><br /></div>From this data frame, we could utilize the<b> "bagging" </b>methodology to create numerous subsets which contain aspects of the observations contained therein. This methodology will sample from the data frame a pre-determined number of times until it has created a single data subset. Once this task has been completed, the process is repeated until a pre-determined number of subsets have been created. Observations from the initial data frame can be sampled multiple times in order to build each individual subset. Therefore, each subset may contain multiple instances of the same observation. 
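The sampling step itself is easy to mimic in base R. The sketch below draws a single bootstrap subset from "iris"; the seed is arbitrary and used only for reproducibility.

```r
set.seed(454)                               # arbitrary seed for reproducibility
n <- nrow(iris)                             # 150 observations
boot_rows   <- sample(n, n, replace = TRUE) # draw n row indices, with replacement
boot_sample <- iris[boot_rows, ]

# Because the draw is with replacement, some rows appear multiple times
# within the subset while others are omitted entirely:
any(duplicated(boot_rows))
```

Repeating this draw m times yields the m subsets described above.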
<br /><br />A graphical representation of this process is illustrated below:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEVwnEEMBJuq6g3-LTk4fdsOJ0yuQ097Am3VD-hxgYn_OsbUX45vxaAmRwv9q4zF9VoaaVgXYAfLIrirB3H6KIGPpnaQ-1sJcywRkBVQA1TZpRy845gqFhw11m-Mim296NHJgtH5XFBl0REEw64kf6HgEKtGtb6vkH1P-bCVYOEECiKVNYKVQE3eLY/s817/bag2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="289" data-original-width="817" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEVwnEEMBJuq6g3-LTk4fdsOJ0yuQ097Am3VD-hxgYn_OsbUX45vxaAmRwv9q4zF9VoaaVgXYAfLIrirB3H6KIGPpnaQ-1sJcywRkBVQA1TZpRy845gqFhw11m-Mim296NHJgtH5XFBl0REEw64kf6HgEKtGtb6vkH1P-bCVYOEECiKVNYKVQE3eLY/w640-h226/bag2.png" width="640" /></a></div><div><br /></div>In the case of our illustrated example, three new data samples were created. Each new sample contains a similar number of observations; however, observations from the original data frame are not exclusive to any one set. Also, as demonstrated in the above graphic, data observations can repeat within the same sample. <br /><br /><b><u>Boosting Described</u></b> <br /><br />Once the new data samples have been created, an individual decision tree is fit to each newly created set. Strictly speaking, fitting one tree per bootstrap sample is part of the bagging procedure itself; "boosting", in the formal sense, refers to the separate sequential approach described above. Once each decision tree has been created, the model’s creation process is complete. <br /><br /><b><u>The Decision Making Process</u></b> <br /><br />With the model created, the process of predicting dependent variable values can be initiated. 
<br /><br />Remember that each decision tree was created from the observations of which its corresponding subset was composed.<div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQyQ6_E29WDC-6z3qH9n8TKZZ_QmuRQRr5y1G9sJNGf3_d5YtvOqKC62KnWw043yVQGSALVE6LUMcNvkQm2XmBlNVtPsuR2XxjW23kMoH5jBgiY90dVW2LZuTrDwHfpwZSkG_TeFkI1YD_fkc86bFyoTrWPBPoDsgpOgp8E3g51eHUkXItK5CKBFz/s625/bag3x.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="245" data-original-width="625" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQyQ6_E29WDC-6z3qH9n8TKZZ_QmuRQRr5y1G9sJNGf3_d5YtvOqKC62KnWw043yVQGSALVE6LUMcNvkQm2XmBlNVtPsuR2XxjW23kMoH5jBgiY90dVW2LZuTrDwHfpwZSkG_TeFkI1YD_fkc86bFyoTrWPBPoDsgpOgp8E3g51eHUkXItK5CKBFz/s16000/bag3x.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>The above graphical representation illustrates observation 8 being passed through the model. The model, being comprised of three separate decision trees, which were synthesized from three separate data subsets, produces three different internal outcomes. The consensus of these outcomes (the majority vote for classification, or the average for regression) is what is eventually returned to the user as the ultimate product of the model. <br /><br /><u style="font-weight: bold;">A Real Application Demonstration (Classification)</u> <br /><br />Again, we will utilize the <b>"iris"</b> data set which comes embedded within the R data platform. <br /><br />A short note on the standard notation utilized for this model type: <br /><br /><b>D = The training data set. <br /><br />n = The number of observations within the training data set. <br /><br />n′ = "n prime". The number of observations within each data subset. <br /><br />m = The number of subsets. </b><br /><br />In this example we will allow the bagging() function to perform its default behavior without specifying any additional options. 
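As a quick aside on this notation: when n′ = n, the expected proportion of unique training observations captured by a single bootstrap sample approaches 1 − 1/e as n grows, which a short simulation can confirm. The seed and sample size below are arbitrary.

```r
set.seed(454)
n <- 100000
draw <- sample(n, n, replace = TRUE)        # one bootstrap sample of size n
frac_unique <- length(unique(draw)) / n     # share of distinct originals captured

c(simulated = frac_unique, theoretical = 1 - exp(-1))
```

The roughly 36.8% of observations left out of each sample are what "out-of-bag" error estimates are computed from.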
If n′ = n, then each subset which is created from the training data set is expected to contain, on average, approximately (1 - 1/e) (≈63.2%) of the unique observations contained within the training data set. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the ordering of the random values # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "ipred" downloaded and enabled # <br /><br /># Create the model # <br /><br />mod <- bagging(Species ~., data= raniris[1:100,], type = "class") <br /><br /># View model classification results with training data # <br /><br />prediction <- predict(mod, raniris[1:100,], type="class") <br /><br />table(raniris[1:100,]$Species, predicted = prediction ) <br /><br /># View model classification results with test data # <br /><br />prediction <- predict(mod, raniris[101:150,], type="class") <br /><br />table(raniris[101:150,]$Species, predicted = prediction ) </b><br /><br /><u>Console Output (1):</u> <br /><br /><i> predicted <br /><br /> setosa versicolor virginica <br /><br /> setosa 31 0 0 <br /><br /> versicolor 0 35 0 <br /><br /> virginica 0 0 34 </i><br /><br /><u>Console Output (2):</u> <br /><br /><i> predicted <br /><br /> setosa versicolor virginica <br /><br />setosa 19 0 0 <br /><br />versicolor 0 13 2 <br /><br />virginica 0 2 14 </i><br /><br /><b><u>A Real Application Demonstration (ANOVA)</u></b> <br /><br />In this second example demonstration, all of the notational aspects of the model and the restrictions of the 
function still apply. However, in this case, the dependent variable is continuous, not categorical. To test the predictive capacity of the model, the Root Mean Squared Error and the Mean Absolute Error values are calculated. For more information as it pertains to the calculation and interpretation of these measurements of predictability, please consult the prior article. <br /><br /><b># Create a training data set from the data frame: "iris" # <br /><br /># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the ordering of the random values # <br /><br />raniris <- iris[order(rannum), ] <br /><br /># With the package "ipred" downloaded and enabled # <br /><br /># Create the model # <br /><br />anmod <- bagging(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova") <br /><br /># Compute the Root Mean Squared Error (RMSE) of model training data # <br /><br />prediction <- predict(anmod, raniris[1:100,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[1:100,]$Sepal.Length, prediction) <br /><br /># Compute the Root Mean Squared Error (RMSE) of model test data # <br /><br />prediction <- predict(anmod, raniris[101:150,], type="class") <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br />rmse(raniris[101:150,]$Sepal.Length, prediction) </b><br /><br /><u>Console Output (1) - Training Data:</u> <br /><br /><i>[1] 0.3032058 </i><br /><br /><u>Console Output (2) - Test Data:</u> <br /><br /><i>[1] 0.3427076 </i><br /><br /><b># Create a function to calculate Mean Absolute Error # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br /><br /># Compute the Mean Absolute Error (MAE) of model training data # <br /><br />anprediction <- predict(anmod, raniris[1:100,]) <br /><br />MAE(raniris[1:100,]$Sepal.Length, anprediction) <br /><br /># Compute the Mean Absolute Error (MAE) of model test data # <br /><br />anprediction <- predict(anmod, raniris[101:150,]) <br /><br />MAE(raniris[101:150,]$Sepal.Length, anprediction) </b><br /><br /><u>Console Output (1) - Training Data:</u> <br /><br /><i>[1] 0.2289299 <br /></i><br /><u>Console Output (2) - Test Data:</u> <br /><br /><i>[1] 0.2706003 <br /></i><br /><b><u>Conclusions</u></b> <br /><br />The method which the <b>bagging()</b> function implements was initially postulated by Leo Breiman, the same individual who co-created the tree model methodology. You will likely never be inclined to use this methodology as a standalone method of analysis. As was previously mentioned within this article, the justification for this topic’s discussion pertains solely to its applicability as an aspect of the random forest model. Therefore, from a pragmatic standpoint, if tree models are the model type which you wish to utilize when performing data analysis, you would either be inclined to select the basic tree model for its simplicity, or the random forest model for its enhanced ability. <br /></div><div><br /></div><div>That's all for today.</div><div><br /></div><div>I'll see you next week,</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-23605355899129620292022-09-25T14:56:00.004-04:002022-09-25T14:56:58.287-04:00(R) Machine Learning - Trees - Pt. I<p>This article will serve as the first of many articles which will be discussing the topic of Machine Learning. 
Throughout the series of subsequent articles published on this site, we will discuss Machine Learning as a topic, and the theories and algorithms which ultimately serve as the subject’s foundation. <br /><br />While I do not personally consider the equations embedded within the<b> “rpart” </b>package to be machine learning in the literal sense, those who act as authorities on such matters define them otherwise. By the definition postulated by the greater community, Tree-Based models represent an aspect of machine learning known as "supervised learning". What this essentially implies is that the software is trained on data for which the correct outcomes are already known: it implements a statistical solution to an evidence-based question posed by the user, after which the user has the opportunity to review the solution and the rationale, and make model edits where necessary. <br /><br />The functionality which is implemented within tree-based models is often drawn from an abstract or white paper written by mathematicians. Therefore, in many cases, the algorithms which ultimately animate the decision making process are too difficult, or too cumbersome, for a human being to apply by hand. This does not mean that such undertakings are impossible; however, given the time commitment, which depends on the size of the data frame being analyzed, the more pragmatic approach is to leave the process entirely to the machines which are designed to perform such functions. </p><b><u>Introducing Tree-Based Models with "rpart" </u></b><br /><br />Like the K-Means Cluster, <b>"rpart" </b>is reliant on an underlying algorithm which, due to its complexity, produces results which are difficult to verify. Meaning, that unlike a process such as categorical regression, much occurs outside of the observation of the user from a mathematical standpoint. Due to the nature of the analysis, no equation is output for the user to check, only the model itself. 
Without this proof of concept, the user can only assume that the analysis was appropriately performed, and that the model produced was the optimal variation necessary for future application. <br /><br />For the examples included within this article, we will be using the R data set <b>"iris"</b>. <div><br /><b><u>Preparing for Analysis </u></b><br /><br />Before we begin, you will need to download two separate auxiliary packages from the CRAN repository, those being: <br /><br /><b>"rpart" </b><br /><br />and <br /><br /><b>"rpart.plot" </b><br /><br />Once you have completed this task, we will move forward by reviewing the data set prior to analysis. <br /><br />This can be achieved by initiating the following functions: <br /><br /><b>summary(iris) <br /><br />head(iris) </b><br /><br />Since the data frame is initially sorted and organized by <b>"Species"</b>, prior to performing the analysis, we must take steps to randomize the data contained within the data frame.</div><div><br /><b><u>Justification for Randomization </u></b><br /><br />Presenting data to a machine which performs analysis through the utilization of an algorithm is somewhat analogous to teaching a young child. To better illustrate this concept, I will present a demonstrative scenario. <br /><br />Let's imagine, that for some particular reason, you were attempting to instruct a very young child on the topic of dogs, and to accomplish such, you presented the child with a series of pictures which consisted of only golden Labradors. As you might imagine, the child would walk away from the exercise with the notion that dogs, as an object, always consisted of the features associated with the Labradors of the golden variety. Instead of believing that a dog is a generalized descriptor which encompasses numerous minute and discretely defined features, the child will believe that all dogs are golden Labradors, and that golden Labradors are the only type of dog. 
<br /><br />Machines learn* in a similar manner. Each algorithm provides a distinct and unique methodology as it pertains to the overall outcome of the analysis; however, the typical algorithm, much like a human, develops a bias based solely on the data as it is initially presented. This is why randomization of data, which instead presents a diverse and robust summary of the data source, is so integral to the process. <br /><br />This method of randomization was inspired by the YouTube user: Jalayer Academy. A link to the video which describes this randomization technique can be found below. <br /><br /><i>* - or the algorithm that is associated with the application which creates the appearance of such. </i><br /><br /><b># Set randomization seed # <br /><br />set.seed(454) <br /><br /># Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. # <br /><br />rannum <- runif(nrow(iris)) <br /><br /># Order the data frame rows by the ordering of the random values # <br /><br />raniris <- iris[order(rannum), ]</b></div><div><br /></div><div><i>Jalayer Academy: <a href="https://www.youtube.com/watch?v=XLNsl1Da5MA">https://www.youtube.com/watch?v=XLNsl1Da5MA</a></i></div><div><br /><b><u>Training Data and The "rpart" Algorithm </u></b><br /><br />Before we apply the algorithm within the <b>"rpart" </b>package, there are two separate topics which I wish to discuss. <br /><br />The<b> "rpart"</b> algorithm, as was previously mentioned, is one of many machine learning methodologies which can be utilized to analyze data. The differentiating factor which separates methodologies is typically based on the underlying algorithm which is applied to the initial data frame. 
In the case of <b>"rpart"</b>, the methodology utilized was initially postulated by Breiman, Friedman, Olshen and Stone.</div><div><br /><b><u>Classification and Regression Trees </u></b><br /><br />On the topic of training data, let us again return to our previous child training example. When teaching a child, if utilizing the flash card method that was discussed prior, you may be inclined to set a few of the cards which you have designed aside. The reason for such, is that these cards could be utilized after the initial training, in order to test the child's comprehension of the subject matter. <br /><br />Most machines are trained in a similar manner. A portion of the initial data frame is typically set aside in order to test the overall strength of the model after the model's synthesis is complete. After passing the additional data through the model, a rough conclusion can be drawn as it pertains to the overall effectiveness of the model's design. </div><div><br /><b><u>Method of Application (categorical variable) </u></b><br /><br />As is the case as it pertains to linear regression, we must designate the dependent variable that we wish to predict. If the variable is a categorical variable, we will specify the <b>rpart() </b>function to include a method option of <b>"class"</b>. If the variable is a continuous variable, we will specify the <b>“rpart”</b> function to include a method option of <b>"anova"</b>. <br /><br />In this first case, we will attempt to create a model which, through the assessment of independent variables, properly predicts the species variable. <br /><br />The structure of the <b>rpart()</b> function is incredibly similar to the linear model function which is native within R. 
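For readers coming from lm(), the parallel looks roughly like this. The lm() call below runs in base R; the commented rpart() call is shown only for comparison and assumes the "rpart" package has been loaded.

```r
# Both functions share R's formula interface: response ~ predictors.
# A linear model (base R):
lm_mod <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
coef(lm_mod)  # intercept plus one coefficient per predictor

# The analogous tree call differs mainly in the function name and the
# method option (assumes library(rpart) has been run):
# tree_mod <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length +
#                     Petal.Width, data = iris, method = "class")
```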
<br /><br /><b>model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = raniris [1:100,], method="class")</b><br /><br /><b><u>Let's break this structure down: </u></b><br /><br /><b>Species</b> - Is the model's dependent variable. <br /><br /><b>Sepal.Length + Sepal.Width + Petal.Length + Petal.Width</b> - Are the model's independent variables. <br /><br /><b>data = raniris[1:100,] </b>- This option is specifying the data which will be included within the analysis. As we discussed previously, for the purposes of our model, only the first 100 row entries of the initial data frame will be included as the foundational aspects in which to structure the model. <br /><br /><b>method = "class"</b> - This option indicates to the computer that the dependent variable is categorical and not continuous. <br /><br />After running the above function, we are left with the newly created variable:<b> "model"</b>.</div><div><br /><b style="text-decoration: underline;">Conclusions</b> <br /><br />From this variable we can draw various conclusions. <br /><br />Running the variable: <b>"model"</b> within the terminal should produce the following console output: <br /><br /><i>n= 100 <br /><br />node), split, n, loss, yval, (yprob) <br /><br /> * denotes terminal node <br /><br />1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000) <br /><br />2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) * <br /><br />3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362) <br /><br />6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * <br /><br />7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) * </i><br /></div><div><br /><b><u>Let's break this structure down:</u></b><br /><br /><b>Structure Summary </b><br /><br /><b>n = 100</b> - This is the initial number of observations passed into the model. 
<br /><br /><b>Logic of the nodal split</b> – Example: Petal.Length>=2.45 <br /><br /><b>Total Observations Included within node</b> - Example: 69 <br /><br /><b>Observations which were incorrectly designated</b> - Example: 34 <br /><br /><b>Nodal Designation </b>– Example: versicolor <br /><br /><b>Percentage of categorical observations occupying each category </b>– Example: <i>(0.00000000 0.50724638 0.49275362) <br /></i><br /><b>The Structure Itself</b></div><div><br /><b><i>root 100 65 versicolor (0.31000000 0.35000000 0.34000000)</i> </b>- Root is the initial number of observations which are fed through the tree model, hence the term root. The numbers which are found within the parenthesis are the percentage breakdowns of the observations by category. <br /><br /><b><i>Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *</i> </b>- The first split which filters model data between two branches. The first branch sorts data to the left leaf, in which, 31 of the observations are setosa (100%). The condition which determines the discrimination of data is the Petal.Length (<2.45) variable value of the observation. The (*) symbol is indicating that the node is a terminal node. This means that this node leads to a leaf. <br /><br /><i><b>Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)</b> </i>- This branch indicates a split based on the right sided alternative to the prior condition. The initial number within the first set of numbers indicates the number of cases which remain prior to further sorting, and the subsequent number indicates the number of cases which are virginica (and not versicolor). The next set of numbers indicates the percentage of the remaining 69 cases which are versicolor (50.7%), and the percentage of the remaining 69 cases which are virginica (49.3%). <br /><br /><b><i>Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * </i></b>- This branch indicates a left split. 
The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, 35 of the observations are versicolor (95%) and 2 of the observations are virginica (5%). <br /><br /><b><i>Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *</i></b> - This branch indicates a right split alternative. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, 32 of the observations are virginica (100%), and 0 of the observations are versicolor (0%).</div><div><br />Further information, for inference, can be generated by running the following code within the terminal: <br /><b><br />summary(model) <br /></b><br />This produces the following console output: <br /><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px;"><br />(I have created annotations beneath each relevant portion of output)<br /><br /><i>Call: </i><br /><br /><i>rpart(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + </i><br /><br /><i> Petal.Width, data = raniris[1:100, ], method = "class") </i><br /><br /><i> n= 100 </i><br /><br /><i> CP nsplit rel error xerror xstd </i><br /><br /><b><i>1 0.4846154 0 1.00000000 1.26153846 0.05910576 </i><br /><br /><i>2 0.0100000 2 0.03076923 0.04615385 0.02624419 </i><br /><br />This portion of the output will be useful as we explore the process of "pruning" later in the article. 
</b><br /><br /><i>Variable importance <br /><br /> Petal.Width Petal.Length Sepal.Length Sepal.Width <br /><br /> 35 31 20 14 </i><br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px;"><b><i>Node number 1: 100 observations, complexity param=0.4846154 <br /><br /> predicted class=versicolor expected loss=0.65 P(node) =1 <br /><br /> class counts: 31 35 34 <br /><br /> probabilities: 0.310 0.350 0.340 <br /><br /> left son=2 (31 obs) right son=3 (69 obs) <br /><br /> Primary splits: <br /><br /> Petal.Length < 2.45 to the left, improve=32.08725, (0 missing) <br /><br /> Petal.Width < 0.8 to the left, improve=32.08725, (0 missing) <br /><br /> Sepal.Length < 5.55 to the left, improve=18.52595, (0 missing) <br /><br /> Sepal.Width < 3.05 to the right, improve=12.67416, (0 missing) <br /><br /> Surrogate splits: <br /><br /> Petal.Width < 0.8 to the left, agree=1.00, adj=1.000, (0 split) <br /><br /> Sepal.Length < 5.45 to the left, agree=0.89, adj=0.645, (0 split) <br /><br /> Sepal.Width < 3.35 to the right, agree=0.83, adj=0.452, (0 split) <br /></i></b><br /><br /><br /><b>The initial split from the root. </b><br /><br /><br /><br /><b><i>Node number 2: 31 observations <br /><br /> predicted class=setosa expected loss=0 P(node) =0.31 <br /><br /> class counts: 31 0 0 <br /><br /> probabilities: 1.000 0.000 0.000 </i></b><br /><br /><br /> <br /><br /><b>Filtered results which exist within the "setosa" leaf. </b><br /><br /><br /> <br /><br /><b><i>Node number 3: 69 observations, complexity param=0.4846154 <br /><br /> predicted class=versicolor expected loss=0.4927536 P(node) =0.69 <br /><br /> class counts: 0 35 34 <br /><br /> probabilities: 0.000 0.507 0.493 <br /><br /> left son=6 (37 obs) right son=7 (32 obs) </i></b><br /><br /><br /> <br /><br /><b>The results of the aforementioned split prior to being filtered through the petal width conditional. 
</b><br /><br /><br /> <br /><br /><i> Primary splits: <br /><br /> Petal.Width < 1.65 to the left, improve=30.708970, (0 missing) <br /><br /> Petal.Length < 4.75 to the left, improve=25.420120, (0 missing) <br /><br /> Sepal.Length < 6.35 to the left, improve= 7.401845, (0 missing) <br /><br /> Sepal.Width < 2.95 to the left, improve= 3.878961, (0 missing) <br /><br /> Surrogate splits: <br /><br /> Petal.Length < 4.75 to the left, agree=0.899, adj=0.781, (0 split) <br /><br /> Sepal.Length < 6.15 to the left, agree=0.754, adj=0.469, (0 split) <br /><br /> Sepal.Width < 2.95 to the left, agree=0.696, adj=0.344, (0 split) </i><br /><br /><b><i>Node number 6: 37 observations <br /><br /> predicted class=versicolor expected loss=0.05405405 P(node) =0.37 <br /><br /> class counts: 0 35 2 <br /><br /> probabilities: 0.000 0.946 0.054 </i></b><br /><br /><br /> <br /><br /><b>Filtered results which exist within the "versicolor" leaf. </b><br /><br /><br /> <br /><br /><b>Node number 7: 32 observations <br /><br /> predicted class=virginica expected loss=0 P(node) =0.32 <br /><br /> class counts: 0 0 32 <br /><br /> probabilities: 0.000 0.000 1.000 </b><br /><br /><br /> <br /><br /><b>Filtered results which exist within the "virginica" leaf. </b><br /><br /><br /><br /><br /><b><u>Visualizing Output with a Well Needed Illustration</u> </b><br /><br />If you got lost somewhere along the way during the prior section, don't be ashamed, it is understandable. I am not in any way operating under the pretense that any of this is intuitive or easily grasped. <br /><br />However, much of what I attempted to explain in the preceding paragraphs can best be summarized through the utilization of the <b>"rpart.plot" </b>package. 
<br /><br /><b># Model Illustration Code # <br /><br />rpart.plot(model, type = 3, extra = 101) </b><br /><br />Console Output:</p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px;"><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s671/rpart1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="405" data-original-width="671" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s16000/rpart1.png" /></a></div><div><br /></div>What is being illustrated in the graphic are the decision branches, and the leaves which ultimately serve as the destinations for the final categorical filtering process. <br /><br />The leaf <b>"setosa"</b> contains 31 observations which were correctly identified as <b>"setosa"</b> observations. These observations account for 31% of the total observational rows which were passed through the model. <br /><br />The leaf <b>"versicolor" </b>contains 35 observations which were correctly identified as <b>"versicolor"</b>, and 2 observations which were misidentified. The misidentified observations would instead belong within the <b>“virginica”</b> categorical leaf. The observations contained within the <b>"versicolor" </b>leaf, both correct and incorrect, account for 37% of the observational rows which were passed through the model. <br /><br />The leaf <b>"virginica" </b>contains 32 observations which were correctly identified as <b>"virginica"</b>. 
These observations account for 32% of the total observational rows which were passed through the model.</div><div><br /><b><u>Testing the Model</u></b> <br /><br />Now that our decision tree model has been built, let's test its predictive ability with the data which we left absent from our initial analysis. <br /><br /><b># Create "confusion matrix" to test model accuracy # <br /><br />prediction <- predict(model, raniris[101:150,], type="class")</b><br /><div><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 10px; min-height: 13px;"><br /><b>table(raniris[101:150,]$Species, predicted = prediction) </b><br /><br />A variable named <b>"prediction"</b> is created through the utilization of the <b>predict() </b>function. Passed to this function as options are: the model variable, the remaining rows of the randomized <b>"iris" </b>data frame, and the model type. <br /><br />Next, a table is created which compares what was predicted against what each observation actually is. The option <b>"predicted = " </b>will always equal your prediction variable. The numbers within the brackets <b>[101:150, ] </b>specify the rows of the randomized data frame which will act as test observations for the model. <b><i>“raniris” </i></b>is the data frame from which these observations will be drawn, and <b>“$Species”</b> specifies the data frame variable which will be assessed. <br /><br />The result of initiating the above lines of code produces the following console output: <br /><br /><i> predicted <br /> setosa versicolor virginica <br /> setosa 19 0 0 <br /> versicolor 0 13 2 <br /> virginica 0 2 14 </i><br /><br />This output table is known as a <b>“confusion matrix”</b>. Its purpose is to present the output in a readable format which illustrates the number of correctly predicted outcomes, and the number of incorrectly predicted outcomes, within each category. 
In this particular case, all setosa observations were correctly predicted. 13 versicolor observations were correctly predicted, with 2 observations misattributed as virginica observations. 14 virginica observations were correctly attributed, with 2 observations misattributed as versicolor categorical entries. <br /><br /><b><u>Method of Application (continuous variable)</u></b> <br /><br />Now that we’ve successfully analyzed categorical data, we will progress within our study by also demonstrating rpart’s capacity as it pertains to the analysis of continuous data. <br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px;">Again, we will be utilizing the <b>“iris”</b> data set. However, in this scenario, we will omit <b>“species”</b> from our model, and instead of attempting to identify the species of the iris in question, we will attempt to identify the sepal length of an iris plant based on its other attributes. Therefore, in this example, our dependent variable will be <b>“Sepal.Length”</b>. <br /><br />The main difference between the continuous data model and the categorical data model within the <b>“rpart” </b>package is the option which specifies the analytical methodology. Instead of specifying (method=”class”), we will instruct the package function to utilize (method=”anova”). 
Therefore, the function which will lead to creation of the model will resemble: <br /><br /><b>anmodel <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova") </b><br /><br />Once the model is built, let’s take a look at the summary of its internal aspects: <br /><br /><b>summary(anmodel) </b><br /><br />This produces the output: <br /></p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br /><i>Call: <br />rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, <br /> data = raniris[1:100, ], method = "anova") <br /> n= 100 <br /><br /> CP nsplit rel error xerror xstd <br />1 0.57720991 0 1.0000000 1.0240753 0.12908984 <br />2 0.12187301 1 0.4227901 0.4792432 0.07380297 <br />3 0.06212228 2 0.3009171 0.3499328 0.04643313 <br />4 0.03392768 3 0.2387948 0.2920761 0.04577809 <br />5 0.01783361 4 0.2048671 0.2920798 0.04349656 <br />6 0.01614077 5 0.1870335 0.2838212 0.04639387 <br />7 0.01092541 6 0.1708927 0.2792003 0.04602130 <br />8 0.01000000 7 0.1599673 0.2849910 0.04586765 <br /><br />Variable importance <br />Petal.Length Petal.Width Sepal.Width <br /> 46 37 17 <br /><br />Node number 1: 100 observations, complexity param=0.5772099 <br /> mean=5.834, MSE=0.614244 <br /> left son=2 (49 obs) right son=3 (51 obs) <br /> Primary splits: <br /> Petal.Length < 4.25 to the left, improve=0.57720990, (0 missing) <br /> Petal.Width < 1.15 to the left, improve=0.53758000, (0 missing) <br /> Sepal.Width < 3.35 to the right, improve=0.02830809, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 1.35 to the left, agree=0.96, adj=0.918, (0 split) <br /> Sepal.Width < 3.35 to the right, agree=0.65, adj=0.286, (0 split) <br /><br />Node number 2: 49 observations, complexity param=0.06212228 <br /> mean=5.226531, MSE=0.1786839 <br /> left son=4 (34 obs) right son=5 (15 obs) <br /> Primary splits: <br /> Petal.Length < 3.45 to the left, improve=0.4358197, (0 missing) <br /> 
Petal.Width < 0.35 to the left, improve=0.3640792, (0 missing) <br /> Sepal.Width < 2.95 to the right, improve=0.1686580, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 0.8 to the left, agree=0.939, adj=0.8, (0 split) <br /> Sepal.Width < 2.95 to the right, agree=0.878, adj=0.6, (0 split) <br /><br />Node number 3: 51 observations, complexity param=0.121873 <br /> mean=6.417647, MSE=0.3375317 <br /> left son=6 (39 obs) right son=7 (12 obs) <br /> Primary splits: <br /> Petal.Length < 5.65 to the left, improve=0.4348743, (0 missing) <br /> Sepal.Width < 3.05 to the left, improve=0.1970339, (0 missing) <br /> Petal.Width < 1.95 to the left, improve=0.1805629, (0 missing) <br /> Surrogate splits: <br /> Sepal.Width < 3.15 to the left, agree=0.843, adj=0.333, (0 split) <br /> Petal.Width < 2.15 to the left, agree=0.824, adj=0.250, (0 split) <br /><br />Node number 4: 34 observations, complexity param=0.03392768 <br /> mean=5.041176, MSE=0.1288927 <br /> left son=8 (26 obs) right son=9 (8 obs) <br /> Primary splits: <br /> Sepal.Width < 3.65 to the left, improve=0.47554080, (0 missing) <br /> Petal.Length < 1.35 to the left, improve=0.07911083, (0 missing) <br /> Petal.Width < 0.25 to the left, improve=0.06421307, (0 missing) <br /><br />Node number 5: 15 observations <br /> mean=5.646667, MSE=0.03715556 <br /><br />Node number 6: 39 observations, complexity param=0.01783361 <br /> mean=6.205128, MSE=0.1799737 <br /> left son=12 (30 obs) right son=13 (9 obs) <br /> Primary splits: <br /> Sepal.Width < 3.05 to the left, improve=0.1560654, (0 missing) <br /> Petal.Width < 2.05 to the left, improve=0.1506123, (0 missing) <br /> Petal.Length < 4.55 to the left, improve=0.1334125, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 2.25 to the left, agree=0.846, adj=0.333, (0 split) <br /><br />Node number 7: 12 observations <br /> mean=7.108333, MSE=0.2257639 <br /><br />Node number 8: 26 observations <br /> mean=4.903846, MSE=0.07344675 <br /><br />Node 
number 9: 8 observations <br /> mean=5.4875, MSE=0.04859375 <br /><br />Node number 12: 30 observations, complexity param=0.01614077 <br /> mean=6.113333, MSE=0.1658222 <br /> left son=24 (23 obs) right son=25 (7 obs) <br /> Primary splits: <br /> Petal.Length < 5.15 to the left, improve=0.19929710, (0 missing) <br /> Petal.Width < 1.45 to the right, improve=0.07411631, (0 missing) <br /> Sepal.Width < 2.75 to the left, improve=0.06794425, (0 missing) <br /> Surrogate splits: <br /> Petal.Width < 2.05 to the left, agree=0.867, adj=0.429, (0 split) <br /><br />Node number 13: 9 observations <br /> mean=6.511111, MSE=0.1054321 <br /><br /></i></p><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><i>Node number 24: 23 observations, complexity param=0.01092541 <br /> mean=6.013043, MSE=0.1620038 <br /> left son=48 (9 obs) right son=49 (14 obs) <br /> Primary splits: <br /> Petal.Width < 1.65 to the right, improve=0.18010500, (0 missing) <br /> Petal.Length < 4.55 to the left, improve=0.12257150, (0 missing) <br /> Sepal.Width < 2.75 to the left, improve=0.03274482, (0 missing) <br /> Surrogate splits: <br /> Petal.Length < 4.75 to the right, agree=0.783, adj=0.444, (0 split) <br /><br />Node number 25: 7 observations <br /> mean=6.442857, MSE=0.03673469 <br /><br />Node number 48: 9 observations <br /> mean=5.8, MSE=0.1466667 <br /><br />Node number 49: 14 observations <br /> mean=6.15, MSE=0.1239286 </i><br /><br />The largest distinguishing factor between the outputs is that instead of sorting observations into categories, <b>“rpart”</b> has partitioned the data by mean value. <b>“MSE”</b> is an abbreviation for <b>“Mean Squared Error”</b>, which measures the average squared deviation of a node’s values from the node mean. The larger this value is, the greater the spread of the data points within the node. </p><br />As always, the phenomenon which is demonstrated within the raw output will look better in graphical form. 
To create an illustration of the model, utilize the code below: <br /><br /><b># Note: rpart.plot will not round off the numerical figures within an ANOVA model’s output graphic # <br /><br /># For this reason, I have explicitly disabled the “roundint” option # <br /><br />rpart.plot(anmodel, extra = 101, type = 3, roundint = FALSE) </b><br /><br />This creates the following output:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR6HiFlrce51NqgzY7C0k5HTw1VxHRAXcnSMilWJkIF8T6chChx9G4LEb5ZHwnkfuyuXO83PLQPbDeY0R-2KI4GOfayzya01n5RWdrNL5lYVBojIiugSjOerxNAtKAnRuF-5n1mXkF4j7oOQ49bYPYCb01syOISp0VSEwZjFSYlAlKnRf57Jf5iSFK/s638/rpartan.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="430" data-original-width="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR6HiFlrce51NqgzY7C0k5HTw1VxHRAXcnSMilWJkIF8T6chChx9G4LEb5ZHwnkfuyuXO83PLQPbDeY0R-2KI4GOfayzya01n5RWdrNL5lYVBojIiugSjOerxNAtKAnRuF-5n1mXkF4j7oOQ49bYPYCb01syOISp0VSEwZjFSYlAlKnRf57Jf5iSFK/s16000/rpartan.png" /></a></div><div><br /></div>In the leaves at the bottom of the graphic, the topmost value represents the mean value, the n value represents the number of observations which occupy that assigned filtered category, and the percentage value represents the number of observations within that leaf divided by the number of observations within the entire set. <br /><br /><b><u>Testing the Model</u> </b><br /><br />Now that our decision tree model has been built, let's test its predictive ability with the data which was left absent from our initial analysis. <br /><br />When assessing non-categorical models for their predictive capacity, there are numerous methodologies which can be employed. In this article, we will be discussing two specifically. 
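Before detailing each metric individually, it may help to see both reduced to bare arithmetic. The sketch below is a minimal, language-agnostic illustration written in Python rather than R; the "actual" and "predicted" vectors are hypothetical values chosen for demonstration, not the iris results.

```python
import math

# Hypothetical measurement vectors (illustrative only, not the iris results)
actual = [5.1, 6.3, 5.8, 7.0]
predicted = [5.0, 6.0, 6.1, 6.5]

def mae(actual, predicted):
    # Mean Absolute Error: the average absolute difference
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root Mean Squared Error: square root of the average squared difference
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(round(mae(actual, predicted), 3))   # → 0.3
print(round(rmse(actual, predicted), 3))  # → 0.332
```

Note that on the same data the RMSE is never smaller than the MAE, and the gap between the two widens as the largest individual errors grow.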
<br /><br /><b><u>Mean Absolute Error</u> </b><br /><br />The first measure of predictive capacity that we will be discussing is known as the Mean Absolute Error. The Mean Absolute Error is the mean of the absolute differences between the predicted values and the actual observed values. <br /><br /><a href="https://en.wikipedia.org/wiki/Mean_absolute_error">https://en.wikipedia.org/wiki/Mean_absolute_error</a> <br /><br />Within the R platform, deriving this value can be achieved through the utilization of the following code: <br /><br /><b># Create predictive model # <br /><br />anprediction <- predict(anmodel, raniris[101:150,]) <br /><br /># Create MAE function # <br /><br />MAE <- function(actual, predicted) {mean(abs(actual - predicted))} <br /><br /># Function Source: <a href="https://www.youtube.com/watch?v=XLNsl1Da5MA">https://www.youtube.com/watch?v=XLNsl1Da5MA</a> # <br /><br /># Utilize MAE function # <br /><br />MAE(raniris[101:150,]$Sepal.Length, anprediction)</b></div><div><div><br /><u>Console Output:</u> <br /><br /><i>[1] 0.2976927 </i><br /><br />The above output indicates that there is, on average, a difference of 0.298 centimeters between the predicted value of sepal length and the actual value of sepal length. <br /><br /><b><u>Root Mean Squared Error </u></b><br /><br />The Root Mean Squared Error is another measure utilized to assess the predictive capacity of models. Like the Mean Absolute Error, this formula is applied to the observational values as they appear within the initial data frame, and the predicted observational values which are generated by the predictive model. <br /><br />However, the manner in which the output value is synthesized is less straightforward. 
The value itself is generated by solving for the square root of the average of the squared differences between the predicted observational values and the original observational values. As a result, the final output value of the Root Mean Squared Error is more difficult to interpret than its Mean Absolute Error counterpart. <br /><br />The Root Mean Squared Error is more sensitive to large differences between the predicted and observed values. With the Mean Absolute Error, given enough observations, the eventual output value is smoothed out enough to provide the appearance of less distance between individual values than is actually the case. However, as was previously mentioned, the Root Mean Squared Error maintains, through the method in which the eventual value is synthesized, a certain amount of distance variation regardless of the size of the set. <br /><br /><a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">https://en.wikipedia.org/wiki/Root-mean-square_deviation</a> <br /><br />Within the R platform, deriving this value can be achieved through the utilization of the following code:</div><br /><b># Create predictive model # <br /><br />anprediction <- predict(anmodel, raniris[101:150,]) <br /><br /># With the package "Metrics" downloaded and enabled # <br /><br /># Compute the Root Mean Squared Error (RMSE) of the model test data # <br /><br />rmse(raniris[101:150,]$Sepal.Length, anprediction) <br /></b><br /><u>Console Output:</u> <br /><br /><i>[1] 1.128444 </i><br /><br /><u><b>Decision Tree Nomenclature</b></u> <br /><br />As much of the terminology within the field of “machine learning” is applied across model types, it is important to understand the basic descriptive terms in order to familiarize oneself with the subject matter. 
<br /><br />In generating the initial graphic with the code: <br /><br /><b>rpart.plot(model, type = 3, extra = 101) </b><br /><br />We were presented with the illustration below:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s671/rpart1.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="405" data-original-width="671" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqW8pC70shJSzXA65uFIXUJa2JxJTpa48omZJxk2kr-M5xDqk98E8V4nIFcjJu-wnV_JPpVJabU794B-D7lM6fX9-aeOS8yvg30K8IObDh4DQI9MPvk2298yqlp2zB-cv6d6EUNOOHfH6ubcYVCrS2P6Kgp8_5QrRE8eYtQFtrrajEfiuRB5kKKC-8/s16000/rpart1.png" /></a></div></div><br />The <b>“rpart” </b>package, as it pertains to the model output provided, identifies each aspect of the model in the following manner: <br /><br /><b># Generate model output with the following code # <br /><br />model </b><br /><br /><i>> model <br />n= 100 </i><br /><br /><i>node), split, n, loss, yval, (yprob) <br />* denotes terminal node <br /><br />1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000) <br /> 2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) * <br /> 3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362) <br /> 6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * <br /> 7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) * <br /></i><br />If this identification was provided within a graphical representation of the model, the illustration would resemble the graphic below:<div><br /><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-BcXBnonkkeGlwIxr66l4Uh6MVLqrd_fyxL8wWlagLVodGz8shmbVfmunzkz5xgghs3Jz9aNCtHgOoiOg_U3lPGRE-rOwHIcAEXluhOafyWQeYObhP1ue3DR2gQtnB6fOCnp06OGoHm8d2AtDN-jZwkdbdNYxDy0nY0AGnd1GVn8ZOGbvpFAGmEaB/s681/rootnode2.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="394" data-original-width="681" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-BcXBnonkkeGlwIxr66l4Uh6MVLqrd_fyxL8wWlagLVodGz8shmbVfmunzkz5xgghs3Jz9aNCtHgOoiOg_U3lPGRE-rOwHIcAEXluhOafyWQeYObhP1ue3DR2gQtnB6fOCnp06OGoHm8d2AtDN-jZwkdbdNYxDy0nY0AGnd1GVn8ZOGbvpFAGmEaB/s16000/rootnode2.png" /></a></div><div><br /><div>However, the following graphic is a better representation of what each term is conventionally utilized to describe within the field of study. <br /><br /><b># Illustrate the model # <br /><br />rpart.plot(model)</b></div><div><b><br /></b></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWAiyd3qPIE7UnIcrh4ndTUup4ssn6VXCu0jtLx1aQXMzeoQLoBoGz_ZGhHbNaEkXP4H9GV3LMaV8etEMee5a5QkBwRCDAT4Eg3iDi_TUoe0ZtppWzFGQJ0yamcZhVM0TDs-iGuyqf5i2kYh6-twEOK_CJoUmJavN-aGRZ9OgnxxmQYuj2GzbtkbYY/s599/rootnode.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="487" data-original-width="599" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWAiyd3qPIE7UnIcrh4ndTUup4ssn6VXCu0jtLx1aQXMzeoQLoBoGz_ZGhHbNaEkXP4H9GV3LMaV8etEMee5a5QkBwRCDAT4Eg3iDi_TUoe0ZtppWzFGQJ0yamcZhVM0TDs-iGuyqf5i2kYh6-twEOK_CJoUmJavN-aGRZ9OgnxxmQYuj2GzbtkbYY/s16000/rootnode.png" /></a></div><br />The first graphic provides a much more pragmatic representation of the model, a representation which is perfectly in accordance with the manner in which the <b>rpart()</b> function summarizes the data. 
The latter graphic illustrates the technique which is traditionally synonymous with the way in which a model of this type would be represented. <br /><br />Therefore, if an individual were discussing this model with an outside researcher, they would refer to the model as possessing 3 leaves and 2 nodes. The tree being in possession of 1 root is essentially inherent. The term<b> “branches”</b> is the descriptor utilized to describe the black lines which connect the various other aspects of the model. However, like the root of the tree, the branches themselves do not warrant mention. In summary, when referring to a tree model, it is common practice to define it generally by the number of nodes and leaves it possesses.</div><div><br /><b style="text-decoration: underline;">Pruning with prune()</b> <br /><br />There will be instances in which you may wish to simplify a model by removing some of its extraneous nodes. Doing so can be motivated by either a desire to simplify the model, or by an attempt to optimize the model’s predictive capacity. <br /><br />We will apply the pruning function to the second example model that we previously created. <br /><br />First, we must find the CP value of the model that we wish to prune. 
This can be achieved through the utilization of the code: <br /><br /><b>printcp(anmodel) </b><br /><br />This presents the following console output: <br /><br /><i>> printcp(anmodel) </i><br /><br /><i>Regression tree: <br />rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, <br /> data = raniris[1:100, ], method = "anova") <br /><br />Variables actually used in tree construction: <br />[1] Petal.Length Petal.Width Sepal.Width <br /><br />Root node error: 61.424/100 = 0.61424 <br /><br />n= 100 <br /><br /> CP nsplit rel error xerror xstd <br />1 0.577210 0 1.00000 1.04319 0.133636 <br />2 0.121873 1 0.42279 0.52552 0.081797 <br />3 0.062122 2 0.30092 0.39343 0.051912 <br />4 0.033928 3 0.23879 0.32049 0.050067 <br />5 0.017834 4 0.20487 0.32167 0.050154 <br />6 0.016141 5 0.18703 0.29403 0.047955 <br />7 0.010925 6 0.17089 0.29242 0.048231 <br />8 0.010000 7 0.15997 0.29256 0.048205 </i><br /><br />Each row in the table represents a stage of the tree’s growth, with the initial row (1) representing the model’s root. The typical course of action for pruning an <b>“rpart”</b> tree is to first identify the row with the lowest cross-validation error (<b>xerror</b>). Once this value has been identified, we must make note of the row’s corresponding CP score (<b>0.010925</b>). It is this value which will be utilized within our pruning function to modify the model. <br /><br />With the above information ascertained, we can move forward in the pruning process by initiating the following code within the R console. <br /><br /><b>prunedmodel <- prune(anmodel, 0.010925) </b><br /><br />In the case of our example, due to the small CP value, no modifications were made to the original model. However, this is not always the case. I encourage you to experiment with this function as it pertains to your own <b>rpart</b> models; the best way to learn is through repetition. 
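The selection rule described above, scan the xerror column and take the CP value of the row with the smallest entry, reduces to a simple lookup. The following sketch, written in Python for concreteness, re-implements that lookup against the cptable figures printed by printcp(anmodel) above.

```python
# cptable rows transcribed from the printcp(anmodel) output:
# (CP, nsplit, rel error, xerror)
cptable = [
    (0.577210, 0, 1.00000, 1.04319),
    (0.121873, 1, 0.42279, 0.52552),
    (0.062122, 2, 0.30092, 0.39343),
    (0.033928, 3, 0.23879, 0.32049),
    (0.017834, 4, 0.20487, 0.32167),
    (0.016141, 5, 0.18703, 0.29403),
    (0.010925, 6, 0.17089, 0.29242),
    (0.010000, 7, 0.15997, 0.29256),
]

# Select the row with the lowest cross-validation error (xerror);
# its CP value is what would then be passed to prune()
best = min(cptable, key=lambda row: row[3])
print(best[0])  # → 0.010925
```

This mirrors the manual inspection performed above: the minimum xerror (0.29242) occurs at row 7, whose CP of 0.010925 is the value supplied to prune().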
<br /><br /><b><u>Dealing with Missing Values</u></b> <br /><br />Typically, when analyzing real world data sets, there will be instances in which different variable observation values are absent. You should not let this all too common occurrence hinder your model ambitions. Thankfully, within the rpart function, there exists a mechanism for dealing with missing values. However, this mechanism only applies to observations which are missing independent variable values; observations which are missing the dependent variable should be removed prior to analysis. <br /><br />After testing the functionality of this method with data sets from which I had previously removed portions of data, there appeared to be very little impact on model creation or prediction capacity. The algorithms which animate the data functions also exist in such a manner that incomplete data sets can be passed through the model to generate predictions. <br /><br />From reading various articles and the manual associated with the rpart package, missing independent values appear to be handled through <b>“surrogate”</b> splits: splits on other variables which closely mimic the primary split, and which are applied in its place when the primary splitting variable is absent from an observation. <br /><br /><b><u>Conclusion</u></b> <br /><br />The basic tree model, as it is discussed within the contents of this article, is often passed over in favor of the random forest model. However, as you will observe in future articles, the basic tree model is not without merit, as, due to its singular nature, it is the easier model to explain and conceptually visualize. Both of the latter concepts are extremely valuable as it relates to data presentation and research publication. In the next article we will be discussing <b>“Bagging”</b>. Until then, stay subscribed, Data Heads. 
<br /></div></div><br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-39095513535894063572022-09-21T17:44:00.002-04:002022-09-21T17:44:26.502-04:00(Python) Enabling the Nvidia GPU - Tensor Flow and Keras UtilizationThough this website does not typically feature articles related to Tensor Flow or Keras (machine learning libraries), due to reader requests, in this entry, I will illustrate how to enable Nvidia GPU utilization as it pertains to the aforementioned packages. <br /><br />I will be operating under the following assumptions: <br /><br /><div>1. You are in possession of a Windows PC which contains a Nvidia GPU. </div><div><br />2. You are relatively familiar with the Tensor Flow and Keras libraries. </div><div><br /></div><div>3. You are utilizing the Anaconda Python distribution. <br /><br />If the above assumptions are correct, then I have designated 4 steps towards the completion of this process: <br /><br /></div><div>1. Installation of the most recent Nvidia GPU drivers. </div><div><br /></div><div>2. Installation of the<b> “tensorflow-gpu” </b>package library. <br /><br /></div><div>3. Installation of the CUDA toolkit. <br /><br /></div><div>4. Trouble-shooting. <br /><br /><b><u>Installing the most recent Nvidia drivers: </u></b><br /><br />This step is relatively self explanatory. If you are in possession of a computer which contains a Nvidia GPU, you should have the following program located on your hard drive: <b>“GeForce Experience”</b>. Depending on the type of Nvidia GPU which you possess, the name of the program may vary. To locate this program, or a similar program which achieves the same result, search: <b>“Nvidia”</b>, from the desktop start bar. <br /><br />After you have launched the Nvidia desktop interface, you will be asked to create a Nvidia Account. To achieve this, enter the appropriate information within the coinciding menu prompts. 
Once this has been completed, follow the link within the confirmation e-mail to finalize the creation of your user account. <br /><br />With your new Nvidia account created, you will possess the ability to access the latest driver updates within the Nvidia console interface. Be sure to update all of the drivers which are listed prior to proceeding to the next step.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8xgW1vCVy3aoTXh24ZgS702hIhKmwJ6ErHZ0S5LluUrFcd4Gfu98owbzhA0yf3I57KgVfH5maP5SpP2kocP4bXl1QuTdL4G-fuUTz2P95MESp7jA_hQK4MF9EZDrOM_qaFl7Vu5M2RNkd9e2ACo6zQcpofB-jxLMtAZdHuSdln59VLVRyk_2RI8rk/s789/CUDA_0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="216" data-original-width="789" height="110" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8xgW1vCVy3aoTXh24ZgS702hIhKmwJ6ErHZ0S5LluUrFcd4Gfu98owbzhA0yf3I57KgVfH5maP5SpP2kocP4bXl1QuTdL4G-fuUTz2P95MESp7jA_hQK4MF9EZDrOM_qaFl7Vu5M2RNkd9e2ACo6zQcpofB-jxLMtAZdHuSdln59VLVRyk_2RI8rk/w400-h110/CUDA_0.png" width="400" /></a></div><br /><b><u>Installing the TensorFlow GPU Package Library </u></b><br /><br />Completing this simple pre-requisite can be achieved by either: <br /><br />A. Running the following code within the Jupyter Notebook programming environment: <br /><br /><b>import pip <br /><br />pip.main(['install', 'tensorflow-gpu']) <br /></b><br />B. Running the following code within the <b>“Anaconda Prompt”</b>: <br /><br /><b>conda install tensorflow-gpu </b><br /><br /><i>To reach the <b>“Anaconda Prompt”</b> terminal, type “Anaconda Prompt” into the Windows desktop search bar. <br /><br /></i><b><u>Installing the CUDA toolkit </u></b><br /><br />You are now prepared to complete the final pre-requisite, which is the most complicated of all of the required steps. 
<br /><br />First, you must click the link below: <br /><br /><a href="https://developer.nvidia.com/cuda-downloads">https://developer.nvidia.com/cuda-downloads</a> <br /><br />The address above will direct you to the Nvidia webpage. <br /><br />Select the appropriate options which pertain to your operating system from the list of selections. Doing such, will present a download link to the version of the CUDA software which is best suited for your PC. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGB5ujA4PGEOY5TOBpdeFk1hfDPrJsOns3vLEQQdBg70lQX4l-OZvIyY9dMiHHwcgah7qeSB3q_Xwx_94rVJ2wlS79gyg5rWOSC1UAAtd3zKnElT0BmbZdRyljbmRCUIuu5lyEGwGAVKW1T0hquKqTj7P7tp3YEictiI-QBoz8mFdBuOBsTYD2nNjy/s937/CUDA_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="937" data-original-width="925" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGB5ujA4PGEOY5TOBpdeFk1hfDPrJsOns3vLEQQdBg70lQX4l-OZvIyY9dMiHHwcgah7qeSB3q_Xwx_94rVJ2wlS79gyg5rWOSC1UAAtd3zKnElT0BmbZdRyljbmRCUIuu5lyEGwGAVKW1T0hquKqTj7P7tp3YEictiI-QBoz8mFdBuOBsTYD2nNjy/w395-h400/CUDA_1.png" width="395" /></a></div><div><br /></div><div>The rest of the installation process is relatively straight-forward. 
<br /><br />The product of the download will produce the file below: </div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg72VQHg21sU_WVeTV3nEoFkln0Q95uoOuFiXNRtJtkvGmBK7IdibQ_hHrjHsn6WCdDUzmkM24IRCZ7Ne8HPpsRumHzHCRhLtz7Qxj9JrjR50VVNov981TZ75-bMX12oou6gs7iOWmOdR7EckoEOptJZsaE083Ewq2s0ampEPfdZob4kbLMJOyOuJZp/s60/CUDA_2A.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="55" data-original-width="60" height="55" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg72VQHg21sU_WVeTV3nEoFkln0Q95uoOuFiXNRtJtkvGmBK7IdibQ_hHrjHsn6WCdDUzmkM24IRCZ7Ne8HPpsRumHzHCRhLtz7Qxj9JrjR50VVNov981TZ75-bMX12oou6gs7iOWmOdR7EckoEOptJZsaE083Ewq2s0ampEPfdZob4kbLMJOyOuJZp/s1600/CUDA_2A.png" width="60" /></a></div><br /><i>(File name will vary based on operating system and version selection) </i><br /><br />Double clicking this file icon will begin the installation process.</div><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_XOLevAEm6TkK3sdK1HX85LfGPrRT2n--xIoQc3cmQAK7M0W1Lgk5nDDR7YsrEb13Wb4et8OqTMqFUQeSdnpwwRSn1_gCSZ9opl3u_3iZdVgDABVkG59RDzf2BsRivLBMrOYLMJce0P6uFPrTL5LaQYGkR6BljHmNYt0eILS-toamvKV-Ed8NJ6YY/s416/CUDA_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="163" data-original-width="416" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_XOLevAEm6TkK3sdK1HX85LfGPrRT2n--xIoQc3cmQAK7M0W1Lgk5nDDR7YsrEb13Wb4et8OqTMqFUQeSdnpwwRSn1_gCSZ9opl3u_3iZdVgDABVkG59RDzf2BsRivLBMrOYLMJce0P6uFPrTL5LaQYGkR6BljHmNYt0eILS-toamvKV-Ed8NJ6YY/w400-h156/CUDA_3.png" width="400" /></a></div></div><div><br /></div>After clicking through the associated options, the following screen should appear:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj1upE9sHury2sfLY1Njzb_ub0TPDkst-dK9j6W6fCYiltFi3ZPRCTtxJiZWPUzig291-TLmmn19ZnzUTsY4dg2_SuMkwl-xNsSF839fVa_lrLcfulqjqWinWiXNCErLBS13Ek3Hu3xsjwC5BDtBeJhyr5aiGTF749_2V5Ig67BKghd6wxGAmhreWh/s594/CUDA_4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="444" data-original-width="594" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj1upE9sHury2sfLY1Njzb_ub0TPDkst-dK9j6W6fCYiltFi3ZPRCTtxJiZWPUzig291-TLmmn19ZnzUTsY4dg2_SuMkwl-xNsSF839fVa_lrLcfulqjqWinWiXNCErLBS13Ek3Hu3xsjwC5BDtBeJhyr5aiGTF749_2V5Ig67BKghd6wxGAmhreWh/w400-h297/CUDA_4.png" width="400" /></a></div><div><br /></div>If you do not have Microsoft Visual Studio installed on your PC, you will be presented with an installation error. However, if you do not intend to utilize the CUDA software for Visual Studio development, you can continue the installation process without issue. <br /><br />Once the process is fully completed, the following shortcut icon should appear on your PC’s desktop:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg62bb1Wbs9wI-fddI5KGt2F-wfVWvhX9CiQkqzqumBHszM7XJ2VQ9FHmwhH5rLOvRcmzArSWWPyZXZUL_qx4XyXJRNX4y7d80iwZVtG6msbv8D-hzxA9X8expNofh7OJ4xrmmphBSXwd6HdKDAQk_EESw0yEnT3gp-eKmaZ60Zf6uV8lVTF-n4ZkLU/s96/CUDA_5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="77" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg62bb1Wbs9wI-fddI5KGt2F-wfVWvhX9CiQkqzqumBHszM7XJ2VQ9FHmwhH5rLOvRcmzArSWWPyZXZUL_qx4XyXJRNX4y7d80iwZVtG6msbv8D-hzxA9X8expNofh7OJ4xrmmphBSXwd6HdKDAQk_EESw0yEnT3gp-eKmaZ60Zf6uV8lVTF-n4ZkLU/s1600/CUDA_5.png" width="77" /></a></div><div><br /></div>I would now advise that you restart your PC prior to implementing GPU utilization within your machine learning 
projects. <br /><br /><b><u>Troubleshooting </u></b><br /><br />I’ve found that GPU-enabled TensorFlow projects, at least in my experience, tend to be more error-prone from session to session. However, I accept this shortcoming due to the significant speed increase enabled by GPU utilization. <br /><br />Utilizing the newly installed GPU implementation is relatively simple, as the implementation is automatically assumed within the model structure. Meaning, altering pre-existing machine learning project code is unnecessary as it pertains to GPU optimization. If you run code which was previously created to utilize the Keras and TensorFlow libraries, then the computer will automatically assume that you now want the analysis performed through the GPU hardware architecture. <br /><br />To ensure that GPU functionality is enabled, you may run the following lines of code within the Anaconda coding platform: <br /><br /><b>from tensorflow.python.client import device_lib <br /><br />from keras import backend as K <br /><br />print(device_lib.list_local_devices()) <br /><br />K.tensorflow_backend._get_available_gpus() <br /></b><br />This should produce output which includes the term: <b>‘GPU’</b>. If this is the case, then GPU utilization has been successfully enabled. <br /><br />If, for whatever reason, errors occur as they relate to Keras or TensorFlow implementation following the installation of the prior programs, try any of the following steps to remedy the issue. <br /><br />1. Restart the Anaconda platform and Jupyter Notebook. <br /><br />2. Uninstall and re-install both the tensorflow and tensorflow-gpu libraries from the Anaconda Prompt (command line). 
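A note of caution: the backend helper shown above belongs to the older TensorFlow 1.x / standalone Keras era and was removed in TensorFlow 2.x. As a hedged sketch (assuming a TensorFlow 2.x installation), a rough modern equivalent of the same check looks like this:

```python
# Sketch assuming TensorFlow 2.x is installed; tf.config.list_physical_devices()
# replaces the deprecated K.tensorflow_backend helper shown above.
def gpu_available():
    """Return True when TensorFlow can see at least one GPU device."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return False
    return len(tf.config.list_physical_devices('GPU')) > 0

print(gpu_available())
```

If this prints True, GPU utilization is available to your models without any further code changes.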
This can be achieved by utilizing the code below: <br /><br /><b>conda uninstall tensorflow <br /><br />conda uninstall tensorflow-gpu <br /><br />conda install tensorflow <br /><br />conda install tensorflow-gpu </b><br /><br />Assuming that these remedies resolved whatever issue was present, you should now be prepared to experience the blazing speed enabled by Nvidia GPU utilization. <br /><br />That’s all for this entry. <br /><br />Stay busy, Data Heads!Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.comtag:blogger.com,1999:blog-1608768736913930926.post-20378421144402363462021-07-18T11:40:00.010-04:002021-07-18T11:45:57.839-04:00Pivot Tables (MS-Excel)You didn’t honestly believe that I would continue to write articles without mentioning every analyst’s favorite Excel technique, did you? <br /><br /><b><u>Example / Demonstration</u>: </b><br /><br />For this demonstration, we are going to be utilizing the <b>“Removing Duplicate Entries (MS-Excel).csv”</b> data file. This file can be found within the GitHub data repo (data uploaded: July 12, 2018). 
If you are too lazy to navigate over to the repo site, the raw .csv data can be found down below:<br /><br /><i>VARA,VARB,VARC,VARD<br />Mike,1,Red,Spade<br />Mike,2,Blue,Club<br />Mike,1,Red,Spade<br />Troy,2,Green,Diamond<br />Troy,1,Red,Heart<br />Archie,2,Orange,Heart<br />Archie,2,Yellow,Diamond<br />Archie,2,Orange,Heart<br />Archie,1,Red,Spade<br />Archie,1,Blue,Spade<br />Archie,2,Red,Club<br />Archie,2,Red,Club<br />Jack,1,Red,Diamond<br />Jack,2,Blue,Diamond<br />Jack,2,Blue,Diamond<br />Rob,1,Green,Club<br />Rob,2,Orange,Spade<br />Brad,1,Red,Heart<br />Susan,2,Blue,Heart<br />Susan,2,Yellow,Club<br />Susan,1,Pink,Heart<br />Seth,2,Grey,Heart<br />Seth,1,Green,Club<br />Joanna,2,Pink,Club<br />Joanna,1,Green,Spade<br />Joanna,1,Green,Spade<br />Bertha,2,Grey,Diamond<br />Bertha,1,Grey,Diamond<br />Liz,1,Green,Spade</i><br /><br />Let’s get started! <br /><br />First, we’ll take a nice look at the data as it exists within MS-Excel:<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-eXRAaqVNGbI/YPREvq4JfeI/AAAAAAAABXw/eYi5NeUSynk3grAlE_TMbY0lUya_O4dIQCLcBGAsYHQ/s628/Pivot_0.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="628" data-original-width="291" height="400" src="https://1.bp.blogspot.com/-eXRAaqVNGbI/YPREvq4JfeI/AAAAAAAABXw/eYi5NeUSynk3grAlE_TMbY0lUya_O4dIQCLcBGAsYHQ/w185-h400/Pivot_0.png" width="185" /></a></div><div><br /></div>Now we’ll pivot to excellence! 
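As a hedged aside before we begin: if you would rather explore the same data programmatically, the raw .csv text above can be loaded directly into Python with pandas (assuming pandas is installed; the pivoting itself will still be done in Excel below):

```python
import io

import pandas as pd

# The raw .csv text from the post, pasted verbatim.
csv_text = """VARA,VARB,VARC,VARD
Mike,1,Red,Spade
Mike,2,Blue,Club
Mike,1,Red,Spade
Troy,2,Green,Diamond
Troy,1,Red,Heart
Archie,2,Orange,Heart
Archie,2,Yellow,Diamond
Archie,2,Orange,Heart
Archie,1,Red,Spade
Archie,1,Blue,Spade
Archie,2,Red,Club
Archie,2,Red,Club
Jack,1,Red,Diamond
Jack,2,Blue,Diamond
Jack,2,Blue,Diamond
Rob,1,Green,Club
Rob,2,Orange,Spade
Brad,1,Red,Heart
Susan,2,Blue,Heart
Susan,2,Yellow,Club
Susan,1,Pink,Heart
Seth,2,Grey,Heart
Seth,1,Green,Club
Joanna,2,Pink,Club
Joanna,1,Green,Spade
Joanna,1,Green,Spade
Bertha,2,Grey,Diamond
Bertha,1,Grey,Diamond
Liz,1,Green,Spade"""

# read_csv() accepts any file-like object, so StringIO lets us skip saving a file.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # 29 rows, 4 columns
```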
<br /><br />The easiest way to start building pivot tables is to utilize the <b>“Recommended PivotTables” </b>option button located within the <b>“Insert”</b> menu, listed within Excel’s ribbon menu.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-AeTxI-0CD-c/YPRE8Dfo_QI/AAAAAAAABX0/8w7CB-7bTSswjstGd--7LfVM8NtANYhZACLcBGAsYHQ/s294/Pivot_1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="285" data-original-width="294" src="https://1.bp.blogspot.com/-AeTxI-0CD-c/YPRE8Dfo_QI/AAAAAAAABX0/8w7CB-7bTSswjstGd--7LfVM8NtANYhZACLcBGAsYHQ/s0/Pivot_1.png" /></a></div><div><br />This should bring up the menu below:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-hZD0z_FL58s/YPRFHOcIk0I/AAAAAAAABX4/W4FiFpRnt1gjECTIfWngbi25xyqQqvagwCLcBGAsYHQ/s702/Pivot_2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="634" data-original-width="702" height="361" src="https://1.bp.blogspot.com/-hZD0z_FL58s/YPRFHOcIk0I/AAAAAAAABX4/W4FiFpRnt1gjECTIfWngbi25xyqQqvagwCLcBGAsYHQ/w400-h361/Pivot_2.png" width="400" /></a></div><div><br /></div>Go ahead and select all row entries, across all variable columns. <br /><br />Once this has been completed, click <b>“OK”</b>. <br /><br />This should generate the following menu:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-xQ94Y2Jw5QM/YPRFOLMJb7I/AAAAAAAABYA/k38frTPutiM1-OILmfIQdnHqkQjfHN1ngCLcBGAsYHQ/s541/Pivot_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="510" data-original-width="541" height="378" src="https://1.bp.blogspot.com/-xQ94Y2Jw5QM/YPRFOLMJb7I/AAAAAAAABYA/k38frTPutiM1-OILmfIQdnHqkQjfHN1ngCLcBGAsYHQ/w400-h378/Pivot_3.png" width="400" /></a></div><div><br /></div>Let’s break down each recommendation. 
<br /><br /><b>“Sum of VARB by VARD” </b>– This table is summing the total of the numerical values contained within <b>VARB</b>, as they correspond with <b>VARD </b>entries. <br /><br /><b>“Count of VARA by VARD”</b> – This table is counting the number of <b>VARA</b> entries which correspond with each category within variable column <b>VARD</b>. <br /><br /><b>“Sum of VARB by VARC”</b> – This table is summing the total of numerical values contained within <b>VARB</b>, as they correspond with <b>VARC</b> entries.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Ywr03FZOT00/YPRFTW-IjZI/AAAAAAAABYE/m31u0lXpzFctjRZfLUM-4tlJR8_hLZDegCLcBGAsYHQ/s548/Pivot_4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="515" data-original-width="548" height="376" src="https://1.bp.blogspot.com/-Ywr03FZOT00/YPRFTW-IjZI/AAAAAAAABYE/m31u0lXpzFctjRZfLUM-4tlJR8_hLZDegCLcBGAsYHQ/w400-h376/Pivot_4.png" width="400" /></a></div><div><br /></div><b>“Count of VARA by VARC”</b> – This table is counting the number of <b>VARA</b> entries which correspond with each category within variable column <b>VARC</b>. <br /><br /><b>“Sum of VARB by VARA” </b>– This table is summing the total of the numerical values contained within <b>VARB</b>, as they correspond with <b>VARA</b> entries. <br /><br />Now, there may come a time in which none of the above options match exactly what you are looking for. 
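As a hedged aside for Python users, the first two styles of recommendation can be reproduced with the pandas pivot_table() function. This sketch assumes pandas is installed and uses only a few rows of the example data:

```python
import pandas as pd

# A few rows of the example data from the post.
df = pd.DataFrame({
    "VARA": ["Mike", "Mike", "Troy", "Troy", "Brad"],
    "VARB": [1, 2, 2, 1, 1],
    "VARD": ["Spade", "Club", "Diamond", "Heart", "Heart"],
})

# "Sum of VARB by VARD": total the VARB values within each VARD category.
sum_by_vard = df.pivot_table(values="VARB", index="VARD", aggfunc="sum")

# "Count of VARA by VARD": count the VARA entries within each VARD category.
count_by_vard = df.pivot_table(values="VARA", index="VARD", aggfunc="count")

print(sum_by_vard)
print(count_by_vard)
```

With these few rows, the "Heart" category sums to 2 (1 + 1) and counts 2 entries, exactly as the Excel recommendations would report.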
In this case, you will want to utilize the<b> “PivotTable”</b> option button, located within the<b> “Insert”</b> menu, listed within Excel’s ribbon menu.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-eWEvgpsnfJc/YPRFdDn8LMI/AAAAAAAABYI/Qbr64P9t9j4H-llmJPBXKJ9vOofkhIOhACLcBGAsYHQ/s306/Pivot_5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="306" data-original-width="227" src="https://1.bp.blogspot.com/-eWEvgpsnfJc/YPRFdDn8LMI/AAAAAAAABYI/Qbr64P9t9j4H-llmJPBXKJ9vOofkhIOhACLcBGAsYHQ/s0/Pivot_5.png" /></a></div><div><br />Go ahead and select all row entries, across all variable columns. <br /><br />Change the option button to <b>“New Worksheet”</b>, instead of <b>“Existing Worksheet”</b>. <br /><br />Once this has been completed, click <b>“OK”</b>. <br /><br />Once this has been accomplished, you’ll be graced with a new menu, on a new Excel sheet (same workbook).</div><div><br /></div><div>I won’t go into every single output option that you have available, but I will list a few you may want to try yourself. Each output variation can be created by dragging and dropping the variables listed within the topmost box, in varying order, into the boxes below: </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8uLR8QY1JLk/YPRMpnEXb_I/AAAAAAAABZM/SLxHGfBaFc4Jbf9c6dlSm11KK2A7CeSZgCLcBGAsYHQ/s748/Pivot_7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="748" data-original-width="347" height="400" src="https://1.bp.blogspot.com/-8uLR8QY1JLk/YPRMpnEXb_I/AAAAAAAABZM/SLxHGfBaFc4Jbf9c6dlSm11KK2A7CeSZgCLcBGAsYHQ/w185-h400/Pivot_7.png" width="185" /></a></div><div><br />If <b>VARA</b> and <b>VARC</b> are both added to Rows, you will view the categorical occurrences of variable entries from <b>VARC</b>, with <b>VARA</b> acting as the unique ID. 
<br /><br />Order matters in each pivot table variable designation box. <br /><br />So, if we reverse the position of <b>VARA</b> and <b>VARC</b>, and instead list <b>VARC</b> first, followed by <b>VARA</b>, then we will see a table which lists the categorical occurrences of <b>VARA</b>, with <b>VARC</b> acting as a unique ID. <br /><br />If we include <b>VARA</b> and <b>VARC</b> as rows (in that order), and set the values variable to Sum of <b>VARB</b>, then the output will more closely resemble an accounting sheet, in which the numerical values (<b>VARB</b>) corresponding with each <b>VARA</b> entry are summed, categorized by <b>VARC</b>. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-iPUl8afBfwc/YPRG5qZjGmI/AAAAAAAABYg/Ztr8XebQVrMopxkSOPZcYn7EpK4XoS2DQCLcBGAsYHQ/s646/Pivot_11.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="646" data-original-width="172" height="400" src="https://1.bp.blogspot.com/-iPUl8afBfwc/YPRG5qZjGmI/AAAAAAAABYg/Ztr8XebQVrMopxkSOPZcYn7EpK4XoS2DQCLcBGAsYHQ/w106-h400/Pivot_11.png" width="106" /></a></div><div style="text-align: center;"><br /></div>If we instead wanted the count, as opposed to the sum, we could click on the drop-down arrow located next to <b>“Count of VARB”</b>, which presents the following options:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-tZBQ7k6ElPw/YPRHQYnX7kI/AAAAAAAABYs/iIcdKvDcYpohWGLSZ_EVQ2aFQ16yqmKcQCLcBGAsYHQ/s401/Pivot_8.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="401" data-original-width="347" height="320" src="https://1.bp.blogspot.com/-tZBQ7k6ElPw/YPRHQYnX7kI/AAAAAAAABYs/iIcdKvDcYpohWGLSZ_EVQ2aFQ16yqmKcQCLcBGAsYHQ/s320/Pivot_8.png" /></a></div><div style="text-align: center;"><br /></div>From the options listed, we will select <b>“Value Field Settings”</b>.<br /><br />This presents the 
following menu, from which we will select <b>“Count”</b>.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-C1T4Rmy-T7E/YPRHo2YmSHI/AAAAAAAABY0/YEI7gBM69dcVsuVSN3LctU8YOrJah3JoACLcBGAsYHQ/s390/Pivot_9.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="333" data-original-width="390" src="https://1.bp.blogspot.com/-C1T4Rmy-T7E/YPRHo2YmSHI/AAAAAAAABY0/YEI7gBM69dcVsuVSN3LctU8YOrJah3JoACLcBGAsYHQ/s320/Pivot_9.png" width="320" /></a></div><div><br /></div>The result of following the previously listed steps is illustrated below:<br /><br /><div style="text-align: center;"><a href="https://1.bp.blogspot.com/-Oufu6b3hi2Y/YPRGra32p2I/AAAAAAAABYc/IoDiZ6ARsq4Z17YqKT6O7PBN5BaKKdS-gCLcBGAsYHQ/s639/Pivot_10.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="639" data-original-width="177" height="400" src="https://1.bp.blogspot.com/-Oufu6b3hi2Y/YPRGra32p2I/AAAAAAAABYc/IoDiZ6ARsq4Z17YqKT6O7PBN5BaKKdS-gCLcBGAsYHQ/w111-h400/Pivot_10.png" width="111" /></a></div><div style="text-align: center;"><br /></div><div style="text-align: left;">The Pivot Table creation menu also allows for further customization through the addition of column variables.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">In the case of our example, we will make the following modifications to our table output:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8gHSCiLKLMk/YPRIRelQgcI/AAAAAAAABY8/8kja4AehEeMqbHaY1S83Y5L-uOlot62igCLcBGAsYHQ/s354/Pivot_x1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="354" data-original-width="345" height="320" src="https://1.bp.blogspot.com/-8gHSCiLKLMk/YPRIRelQgcI/AAAAAAAABY8/8kja4AehEeMqbHaY1S83Y5L-uOlot62igCLcBGAsYHQ/s320/Pivot_x1.png" 
/></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><b>VARC</b> will now be designated as a column variable, <b>VARA</b> will be a row variable, and the count of <b>VARB </b>will be our values variable. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The result of these modifications is shown below:</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-NXkYi3Nfvro/YPRIye-F1lI/AAAAAAAABZE/MK6yzHZ3UTYBmtearq0QWWSpiDM_aotAQCLcBGAsYHQ/s550/Pivot_x2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="292" data-original-width="550" height="211" src="https://1.bp.blogspot.com/-NXkYi3Nfvro/YPRIye-F1lI/AAAAAAAABZE/MK6yzHZ3UTYBmtearq0QWWSpiDM_aotAQCLcBGAsYHQ/w400-h211/Pivot_x2.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Our output is now a table which lists the count of each occurrence of each color (<b>VARC</b>), as each color corresponds with each individual listed (<b>VARA</b>) within the original data set. 
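This row-by-column count layout has a direct analogue in pandas as well; a sketch (again assuming pandas, and using only a few rows of the example data):

```python
import pandas as pd

# A few rows of the example data: each person (VARA) and a color (VARC).
df = pd.DataFrame({
    "VARA": ["Mike", "Mike", "Mike", "Troy", "Troy", "Brad"],
    "VARC": ["Red", "Blue", "Red", "Green", "Red", "Red"],
})

# Rows = VARA, columns = VARC, cells = count of occurrences,
# mirroring a pivot table with a column variable designated.
ct = pd.crosstab(df["VARA"], df["VARC"])
print(ct)
```

Here Mike's "Red" cell holds 2, and any person/color pairing that never occurs (such as Brad/Blue) holds 0, just as the Excel output displays blank or zero counts.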
</div><br /><div class="separator" style="clear: both; text-align: left;">In conclusion, the pivot table option within MS-Excel offers a variety of display outputs which can be utilized to present statistical summary data.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The most important skill to develop as it pertains to this feature is the ability to ascertain when a pivot table is necessary for your data project needs.</div><div class="separator" style="clear: both; text-align: left;"><br /></div>So with that, we will end this article.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">I will see you next time, Data Head.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">-RD<br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-90336341986793670742021-07-14T23:35:00.003-04:002021-07-14T23:44:08.013-04:00Getting to Know the GreeksIn today’s article, we are going to go a bit off the beaten path and discuss, The Greek Alphabet!<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-i1a5JO7k-NI/YO-mMqhCllI/AAAAAAAABXk/IzXL2FF_F-si6Os2eOtFBNjcoBlFyQ8YgCLcBGAsYHQ/s334/Plato.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="229" data-original-width="334" src="https://1.bp.blogspot.com/-i1a5JO7k-NI/YO-mMqhCllI/AAAAAAAABXk/IzXL2FF_F-si6Os2eOtFBNjcoBlFyQ8YgCLcBGAsYHQ/s320/Plato.png" width="320" /></a></div><div><br />You might be wondering, why the sudden change of subject content…?<br /><br />In order to truly master the craft of data science, you will be required to stretch your mind in creative ways. 
The Greek Alphabet is utilized throughout the fields of statistics, mathematics, finance, computer science, astronomy and other western intellectual pursuits. For this reason, it really ought to be taught in elementary schools. However, to my knowledge, in most cases, it is not. <br /><br />The Romans borrowed heavily from Greek Civilization, and contemporary western civilization borrowed heavily from the Romans. Therefore, to truly be a person of culture, you should learn the Greek Alphabet, and really, as much as you possibly can about Ancient Greek Culture. This includes the legends, heroes, and philosophers. We might be getting more into this in other articles, but for today, we will be sticking to the alphabet. <br /><br /><b><u>The Greek Alphabet </u></b><br /><br />The best way to learn the Greek alphabet is to be Greek (j/k, but not really). In all other cases, application is probably the best way to commit various Greek letters, as symbols, to memory. <br /><br />I would recommend drawing each letter in order, uppercase, and lowercase, and saying the name of the letter as it is written. <br /><br />Let’s try this together! <br /><br /><b>Α α (Alpha) (Pronounced: AL-FUH)</b> - Utilized in statistics as the symbol which denotes the significance level. In finance, it is the percentage return of an investment above or below a predetermined index. <br /><br /><b>B β (Beta) (Pronounced: BAY-TUH)</b> - In statistics, this symbol is utilized to represent type II errors. In finance, it is utilized to determine asset volatility. <br /><br /><b>Γ γ (Gamma) (Pronounced: GAM-UH)</b> - In physics, this symbol is utilized to represent particle decay (Gamma Decay). There also exists Alpha Decay, and Beta Decay. The type of decay situationally differs depending on the circumstances. <br /><br /><b>Δ δ (Delta) (Pronounced: DEL-TUH)</b> - This is currently the most common strain of the novel coronavirus (7/2021). 
In the field of chemistry, uppercase Delta is utilized to symbolize heat being added to a reaction. <br /><br /><b>Ε ε (Epsilon) (Pronounced: EP-SIL-ON)</b> - “Machine Epsilon” is utilized in computer science as a way of dealing with floating point values and their assessment within logical statements. <br /><br /><b>Ζ ζ (Zeta) (Pronounced: ZAY-TUH)</b> - The most common utilization assignment which I have witnessed for this letter, is its designation as the variable which represents the Riemann Zeta Function (number theory). <br /><br /><b>Η η (Eta) (Pronounced: EE-TUH)</b> - I’ve mostly seen this letter designated as the variable for the Dedekind eta function (number theory). <br /><br /><b>Θ θ (Theta) (Pronounced: THAY-TUH)</b> - Theta is utilized as the symbol to represent a pentaquark, a transient subatomic particle. <br /><br /><b>Ι ι (Iota) (Pronounced: EYE-OH-TUH)</b> - I’ve never seen this symbol utilized for anything outside of astronomical designations. Maybe if you make it big in science, you could give Iota the love that it so deserves. <br /><br /><b>Κ κ (Kappa) (Pronounced: CAP-UH) </b>- Kappa is the chosen variable designation for Einstein’s gravitational constant. <br /><br /><b>Λ λ (Lambda) (Pronounced: LAMB-DUH)</b> - A potential emergent novel coronavirus variant (7/2021). Lowercase Lambda is also utilized throughout the Poisson Distribution function. <br /><br /><b>Μ μ (Mu) (Pronounced: MEW)</b> - Lowercase Mu is utilized to symbolize the mean of a population (statistics). In particle physics, it can also be applied to represent the elementary particle: Muon. <br /><br /><b>Ν ν (Nu) (Pronounced: NEW)</b> - As a symbol, this letter represents degrees of freedom (statistics). <br /><br /><b>Ξ ξ (Xi) (Pronounced: SEE) </b>- In mathematics, uppercase Xi can be utilized to represent the Riemann Xi Function. 
<br /><br /><b>Ο ο (Omicron) (Pronounced: OM-IH-CRON)</b> - A symbol which does not get very much love, or use, unlike its subsequent neighbor… <br /><br /><b>Π π (Pi) (Pronounced: PIE) </b>- In mathematics, lowercase Pi often represents the mathematical real transcendental constant ≈ 3.1415…etc. <br /><br /><b>Ρ ρ (Rho) (Pronounced: ROW) </b>- In the Black-Scholes model, Rho represents the rate of change of a portfolio with respect to interest rates. <br /><br /><b>Σ σ (Sigma) (Pronounced: SIG-MA) </b>- Lower case Sigma represents the standard deviation of a population (statistics). Upper case sigma represents a sum function (mathematics). <br /><br /><b>Τ τ (Tau) (Pronounced: TAW)</b> - Lower case Tau represents an elementary particle within the field of particle physics. <br /><br /><b>Υ υ (Upsilon) (Pronounced: EEP-SIL-ON)</b> - Does not really get very much use… <br /><br /><b>Φ φ (Phi) (Pronounced: FAI) </b>- Lowercase Phi is utilized to represent the Golden Ratio. <br /><br /><b>Χ χ (Chi) (Pronounced: KAI) </b>- Lower case Chi is utilized as a variable throughout the Chi-Square distribution function. <br /><br /><b>Ψ ψ (Psi) (Pronounced: PSY)</b> - Lower case Psi is used to represent the (generalized) positional states of a qubit within a quantum computer.</div><div><br /><b>Ω ω (Omega) (Pronounced: OH-MEG-UH)</b> - Utilized for just about everything.</div><br />Αυτα για τωρα. Θα σε δω την επόμενη φορά! (That’s it for now. I’ll see you next time!)<div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-76559890037969361732021-06-25T11:30:00.002-04:002021-06-25T11:34:13.663-04:00(R) The Levene's TestIn today’s article we will be discussing a technique which is not specifically interesting or pragmatically applicable. Still, for the sake of true data science proficiency, today we will be discussing, <b>THE LEVENE'S TEST! 
</b><br /><br />The Levene's Test is utilized to compare the variances of two separate data sets. <br /><br />So naturally, our hypothesis would be: <br /><br /><b>Null Hypothesis:</b> The variance measurements of the two data sets do not significantly differ. <br /><br /><b>Alternative Hypothesis:</b> The variance measurements of the two data sets do significantly differ. <br /><br /><b><u>The Levene's Test Example</u>:</b><br /><br /><b># The leveneTest() Function is included within the “car” package # <br /><br />library(car) <br /><br />N1 <- c(70, 74, 76, 72, 75, 74, 71, 71) <br /><br /> N2 <- c(74, 75, 73, 76, 74, 77, 78, 75) <br /><br />N_LEV <- c(N1, N2) <br /><br />group <- as.factor(c(rep(1, length(N1)), rep(2, length(N2)))) <br /><br />leveneTest(N_LEV, group) <br /><br /># The above code is a modification of code provided by StackExchange user: ocram. # <br /><br /># Source https://stats.stackexchange.com/questions/15722/how-to-use-levene-test-function-in-r # <br /></b><div><br />This produces the output: <br /><br /><i>Levene's Test for Homogeneity of Variance (center = median) <br /> Df F value Pr(>F) <br />group 1 1.7677 0.2049 <br /> 14 <br /></i><br />Since the p-value of the output exceeds .05, we will not reject the null hypothesis (alpha = .05). <br /><br /><b><u>Conclusions</u>: </b><br /><br />The Levene’s Test for Equality of Variances did not indicate a significant differentiation in the variance measurement of Sample N1, as compared to the variance measurement of Sample N2, F(1,14) = 1.78, p= .21. <br /><br />So, what is the overall purpose of this test? Meaning, when would its application be appropriate? The Levene’s Test is typically utilized as a pre-test prior to the application of the standard T-Test. However, it is uncommon to structure a research experiment in this manner. Therefore, the Levene’s Test is more so something which is witnessed within the classroom, and not within the field. 
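For readers working outside of R, the same median-centered variant of the test (matching leveneTest()'s default) is also available in Python through scipy; a sketch, assuming scipy is installed:

```python
from scipy import stats

# The same two samples used in the R example above.
N1 = [70, 74, 76, 72, 75, 74, 71, 71]
N2 = [74, 75, 73, 76, 74, 77, 78, 75]

# center='median' matches R's leveneTest() default (the Brown-Forsythe variant).
stat, p = stats.levene(N1, N2, center='median')
print(round(stat, 4), round(p, 4))  # matches the R output: 1.7677 0.2049
```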
<br /><br />Still, if you find yourself in circumstances in which this test is requested, know that it is often required to determine whether a standard T-Test is applicable. If variances are found to be unequal, a Welch’s T-Test is typically preferred as an alternative to the standard T-Test. <br /><br />----------------------------------------------------------------------------------------------------------------------------- <br /><br />I promise that my next article will be more exciting. <br /><br />Until next time. <br /><br />-RD<br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-75809943870570482482021-06-18T10:50:00.004-04:002021-06-18T10:59:03.863-04:00(R) Imputing Missing Data with the MICE() Package<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-bMkdAJMAANI/YMJsQxkfdWI/AAAAAAAABWI/mg8J6fDAYIIYEtWsywiW1hBZllYxKEKfwCLcBGAsYHQ/s220/HouseMouse.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="106" data-original-width="220" src="https://1.bp.blogspot.com/-bMkdAJMAANI/YMJsQxkfdWI/AAAAAAAABWI/mg8J6fDAYIIYEtWsywiW1hBZllYxKEKfwCLcBGAsYHQ/s0/HouseMouse.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>In today’s article we are going to discuss basic utilization of the MICE package. <br /><br />The MICE (Multivariate Imputation by Chained Equations) package assists with imputing missing values within shoddily assembled data frames. <br /><br />In the world of data science, the real world, not the YouTube world, or the classroom world, data often comes down in a less than optimal state. More often than not, this is the reality of the matter. <br /><br />Now, it would be easy to throw up your hands and say, “I CAN’T PERFORM ANY SORT OF ANALYSIS WITH ALL OF THESE MISSING VARIABLES”,
<br /><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br /><b>~OR~</b><br /></p><div class="separator" style="clear: both;"><div style="text-align: center;"><a href="https://1.bp.blogspot.com/-PGQflZAapDU/YMJsZnHOxKI/AAAAAAAABWM/vbf2_M6z4IQmJGuweB_s1260QLExHRcXwCLcBGAsYHQ/s464/DeleteAll.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="236" data-original-width="464" src="https://1.bp.blogspot.com/-PGQflZAapDU/YMJsZnHOxKI/AAAAAAAABWM/vbf2_M6z4IQmJGuweB_s1260QLExHRcXwCLcBGAsYHQ/s320/DeleteAll.png" width="320" /></a></div><div style="text-align: center;"><i style="text-align: left;">(Don’t succumb to temptation!) </i></div></div><br />Unfortunately, for you, the data scientist, whoever passed you this data expects a product and not your excuses. <br /><br />Fortunately, for all of us, there is a way forward. <br /><br /><b><u>Example</u></b>:<br /><br />Let’s say that you were given this small data set for analysis:<div><br /><div style="text-align: left;"> <a href="https://1.bp.blogspot.com/-5jA3b1zAxO4/YMJskB8oL5I/AAAAAAAABWU/9naKWUrXs6Q_c99qYTR_ZxnRKaxapDwCACLcBGAsYHQ/s453/DataFrameB.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="453" data-original-width="359" height="320" src="https://1.bp.blogspot.com/-5jA3b1zAxO4/YMJskB8oL5I/AAAAAAAABWU/9naKWUrXs6Q_c99qYTR_ZxnRKaxapDwCACLcBGAsYHQ/s320/DataFrameB.png" /></a></div><br />The data is provided in an .xls format, because why wouldn’t it be?<br /><br />For the sake of not having you download an example data file, I have re-coded this data into the R format.<br /><br /><b># Create Data Frame: "SheetB" #<br /><br />VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA , 1, NA, 0, 0, 0, 0)<br /><br />VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)<br /><br />VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)<br /><br />VarD <- c(70, 80, NA, 
87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)<br /><br />VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)</b><br /><b><br />SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)</b><br /><br />If you would like to see a version of the initial example file without the missing values, the code to create this data frame is below:<br /><br /><b># Create Data Frame: "SheetA" #<br /><br />VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)<br /><br />VarB <- c(20, 16, 20, 4, 8, 17, 13, 6, 2, 18, 12, 17, 13, 9, 14, 18, 6, 13, 5, 2)<br /><br />VarC <- c(2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 4, 4, 1, 4, 3, 1, 2, 3, 1)<br /><br />VarD <- c(70, 80, 90, 87, 79, 60, 61, 75, 92, 67, 62, 93, 74, 80, 91, 51, 64, 33, 77, 50)<br /><br />VarE <- c(980, 800, 983, 925, 821, 978, 881, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)<br /><br />SheetA <- data.frame(VarA, VarB, VarC, VarD, VarE)</b><br /><br />In our example, we’ll assume that the sheet which contains all values is unavailable to you (<b>“SheetA”</b>). Therefore, to perform any sort of meaningful analysis, you will need to either delete all observations which contain missing data variables (DON’T DO IT!), or run an imputation function.<br /><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-1BtIb3hYvqg/YMJs4YQ_BQI/AAAAAAAABWg/3SwZx3zKMyM2sYJgRkt6VpZP49Odsa1SQCLcBGAsYHQ/s250/LogMouse.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="139" data-original-width="250" src="https://1.bp.blogspot.com/-1BtIb3hYvqg/YMJs4YQ_BQI/AAAAAAAABWg/3SwZx3zKMyM2sYJgRkt6VpZP49Odsa1SQCLcBGAsYHQ/s0/LogMouse.png" /></a></div><br /><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;">We will opt to do the latter, and the function which we will utilize is the <b>mice() </b>function. 
<br /><br />First, we will initialize the appropriate library: <br /><b><br /># Initialize Library # <br /><br />library(mice) <br /></b><br />Next, we will run the imputation function contained within the library. <br /><br /><b># Perform Imputation # </b><br /><br /><b>SheetB_Imputed <- mice(SheetB, m=1, maxit = 50, method = 'pmm', seed = 500)</b><br /><br /><b>SheetB</b>:<b> </b>the data frame which is being passed to the function. <br /><br /><b>m = 1</b>: the number of imputed data frame variations which will be generated by the mice function. One is all that is necessary here. <br /><br /><b>maxit</b>: the maximum number of iterations which will occur as the mice function calculates what it determines to be the optimal value for each missing cell. <br /><br /><b> method</b>: the imputation method to be utilized. 'pmm' is predictive mean matching, which fills each missing cell with an observed value borrowed from a donor row possessing a similar predicted mean. <br /><br /><b>seed</b>: the mice() function partially relies on randomness to generate missing values, so setting a seed makes the imputation reproducible. The seed value itself can be whatever value you determine to be appropriate. 
<br /><br />After performing the above function, you should be greeted with the output below: <br /><i><br />iter imp variable <br /> 1 1 VarA VarB VarC VarD VarE <br /> 2 1 VarA VarB VarC VarD VarE <br /> 3 1 VarA VarB VarC VarD VarE <br /> 4 1 VarA VarB VarC VarD VarE <br /> 5 1 VarA VarB VarC VarD VarE <br /> 6 1 VarA VarB VarC VarD VarE<br /> 7 1 VarA VarB VarC VarD VarE <br /> 8 1 VarA VarB VarC VarD VarE <br /> 9 1 VarA VarB VarC VarD VarE <br /> 10 1 VarA VarB VarC VarD VarE <br /> 11 1 VarA VarB VarC VarD VarE <br /> 12 1 VarA VarB VarC VarD VarE <br /> 13 1 VarA VarB VarC VarD VarE <br /> 14 1 VarA VarB VarC VarD VarE <br /> 15 1 VarA VarB VarC VarD VarE <br /> 16 1 VarA VarB VarC VarD VarE <br /> 17 1 VarA VarB VarC VarD VarE <br /> 18 1 VarA VarB VarC VarD VarE <br /> 19 1 VarA VarB VarC VarD VarE <br /> 20 1 VarA VarB VarC VarD VarE <br /> 21 1 VarA VarB VarC VarD VarE <br /> 22 1 VarA VarB VarC VarD VarE <br /> 23 1 VarA VarB VarC VarD VarE <br /> 24 1 VarA VarB VarC VarD VarE <br /> 25 1 VarA VarB VarC VarD VarE <br /> 26 1 VarA VarB VarC VarD VarE <br /> 27 1 VarA VarB VarC VarD VarE <br /> 28 1 VarA VarB VarC VarD VarE <br /> 29 1 VarA VarB VarC VarD VarE <br /> 30 1 VarA VarB VarC VarD VarE <br /> 31 1 VarA VarB VarC VarD VarE <br /> 32 1 VarA VarB VarC VarD VarE <br /> 33 1 VarA VarB VarC VarD VarE <br /> 34 1 VarA VarB VarC VarD VarE <br /> 35 1 VarA VarB VarC VarD VarE <br /> 36 1 VarA VarB VarC VarD VarE <br /> 37 1 VarA VarB VarC VarD VarE <br /> 38 1 VarA VarB VarC VarD VarE <br /> 39 1 VarA VarB VarC VarD VarE <br /> 40 1 VarA VarB VarC VarD VarE <br /> 41 1 VarA VarB VarC VarD VarE <br /> 42 1 VarA VarB VarC VarD VarE <br /> 43 1 VarA VarB VarC VarD VarE <br /> 44 1 VarA VarB VarC VarD VarE <br /> 45 1 VarA VarB VarC VarD VarE <br /> 46 1 VarA VarB VarC VarD VarE <br /> 47 1 VarA VarB VarC VarD VarE <br /> 48 1 VarA VarB VarC VarD VarE <br /> 49 1 VarA VarB VarC VarD VarE <br /> 50 1 VarA VarB VarC VarD VarE <br /></i></p><p 
style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br />The output is informing you that 50 iterations were performed in order to produce a single imputed data set.<br /><br />The code below builds a completed data frame: all of the originally observed values are retained, while the newly estimated values occupy the cells which were previously blank. <br /><br /><b># Assign Original Values with Imputations to Data Frame # <br /><br />SheetB_Imputed_Complete <- complete(SheetB_Imputed) </b><br /><br />The outcome should resemble something like:<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-D0hsed6_Z1g/YMJtDdwHW2I/AAAAAAAABWk/udrP9oEtcfYVCg_rjTrzgf6tUa2qmVvDgCLcBGAsYHQ/s486/DataFrameBImputations.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="486" data-original-width="332" height="320" src="https://1.bp.blogspot.com/-D0hsed6_Z1g/YMJtDdwHW2I/AAAAAAAABWk/udrP9oEtcfYVCg_rjTrzgf6tUa2qmVvDgCLcBGAsYHQ/s320/DataFrameBImputations.png" /></a></div><div style="text-align: center;"><i>(Beautiful!)</i></div><p style="font-stretch: normal; line-height: normal; margin: 0px 0px 8px;"><br />A quick warning: the <b>mice()</b> function cannot be utilized on data frames which contain unencoded categorical (character string) variable entries. <br /><br />An example of this:<br /><br /> <a href="https://1.bp.blogspot.com/-y6WUjCmuJ1k/YMJtMnYiXuI/AAAAAAAABWo/U9bLPf02Mnk8-EXyCjpkfE4E8LKYIjungCLcBGAsYHQ/s453/DataFrameC.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="453" data-original-width="360" height="320" src="https://1.bp.blogspot.com/-y6WUjCmuJ1k/YMJtMnYiXuI/AAAAAAAABWo/U9bLPf02Mnk8-EXyCjpkfE4E8LKYIjungCLcBGAsYHQ/s320/DataFrameC.png" /></a><br /><br />To get <b>mice() </b>to work correctly on this data set, you must recode "<b>VARC"</b> prior to proceeding. 
You could do this by changing each instance of "<b>Spade"</b> to 1, "<b>Club"</b> to 2, <b>“Diamond" </b>to 3, and "<b>Heart" </b>to 4. Alternatively, converting the column to a factor with as.factor() will also allow mice() to proceed, as the package applies imputation methods suited to categorical data when it encounters factor variables. <br /><br />For more information as it relates to this function, please check out this <a href="https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html" target="_blank">link</a>. <br /><br />That’s all for now, internet.<br /><br />-RD<br /></p></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-66659182036579343022021-06-12T10:48:00.004-04:002021-06-12T10:50:47.930-04:00(R) 2-Sample Test for Equality of ProportionsIn today’s article we are going to revisit, in greater detail, a topic which was reviewed in a prior article. <br /><br />The 2-Sample Test for Equality of Proportions seeks to assess whether one survey group’s response rate differs significantly from another’s. <br /><br />To illustrate the application of this methodology, I will utilize a prior example which was previously published to this site (10/15/2017). <br /><br /><b><u>Example</u>: </b><br /><br /><i>A pollster took a survey of 1300 individuals, the results of which indicated that 600 were in favor of candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor of candidate A. At a 10% significance level, is there evidence that the candidate's popularity has decreased? 
</i><br /><br /><b># Model Hypothesis # <br /><br /># H0: p1 - p2 = 0 #</b><div><b><br /> # (The proportions are the same) # </b><div><b><br /># Ha: p1 - p2 > 0 # <br /><br /># (Proportion 1 is greater - i.e., popularity has decreased) # <br /><br /># Note: prop.test() defaults to a two-sided alternative; specify #<br /># alternative = "greater" to test the directional hypothesis above #<br /><br /># Disable Scientific Notation in R Output #<br /> <br /> options(scipen = 999) <br /><br /># Model Application # <br /><br />prop.test(x = c(600,500), n=c(1300,1500), conf.level = .95, correct = FALSE) </b><br /><br />Which produces the output: <br /><br /><i>2-sample test for equality of proportions without continuity correction <br /><br />data: c(600, 500) out of c(1300, 1500) <br />X-squared = 47.991, df = 1, p-value = 0.000000000004281 <br />alternative hypothesis: two.sided <br />95 percent confidence interval: <br /> 0.09210145 0.16430881 <br />sample estimates: <br /> prop 1 prop 2 <br />0.4615385 0.3333333 <br /><br /></i>We are now prepared to state the details of our model’s application, and the subsequent findings and analysis which occurred as a result of such. <br /><br /><b><u>Conclusions</u>: </b><br /><br />A 2-Sample Test for Equality of Proportions without Continuity Correction was performed to analyze whether the poll survey results for Candidate A significantly differed from subsequent poll survey results gathered weeks later. A 10% significance level was assumed. 
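Before stating the findings, note that the X-squared value in the output is simply the square of the familiar pooled two-proportion z statistic. A quick cross-check of the arithmetic — sketched here in Python with only the standard library; the same algebra underlies R's prop.test():

```python
import math

# The chi-square statistic reported by prop.test (without continuity
# correction) equals the square of the pooled two-proportion z statistic.
x1, n1 = 600, 1300   # first poll: in favor / surveyed
x2, n2 = 500, 1500   # second poll: in favor / surveyed
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
print(round(z * z, 3))  # 47.991, matching the X-squared output above
```

The recomputed p-value lands in the same vanishingly small neighborhood as R's 0.000000000004281, which is why the result is significant at any conventional level.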
<br /><br />There was a significant difference in Candidate A’s favorability score from the initial poll findings: 46% (600/1300), as compared to Candidate A’s favorability score from the subsequent poll findings: 33% (500/1500); χ2 (1, N = 2800) = 47.99, p < .001.<br /></div></div><div><br /></div><div>-----------------------------------------------------------------------------------------------------------------------------</div><div><br /></div><div>That's all for now.</div><div><br /></div><div>I'll see you next time, Data Heads.</div><div><br /></div><div>-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-51364545410228856962021-06-05T21:10:00.003-04:002021-06-05T21:20:44.148-04:00(R) Pearson’s Chi-Square Test Residuals and Post Hoc AnalysisIn today’s article, we are going to discuss Pearson Residuals. A Pearson Residual is a product of post hoc analysis. These values can be utilized to further assess Pearson’s Chi-Square Test results. <br /><br />If you are unfamiliar with The Pearson’s Chi-Square Test, or with what post hoc analysis typically entails, I would encourage you to do further research prior to proceeding. 
<br /><br /><b><u>Example</u>:</b><br /><br />To demonstrate this post hoc technique, we will utilize a prior article’s example: <br /><br /><b>The "Smoking : Obesity" Pearson’s Chi-Squared Test Demonstration.</b><br /><br /><b># To test for independence # <br /><br /> Model <- matrix(c(5, 1, 2, 2), <br /><br /> nrow = 2,<br /> <br /> dimnames = list("Smoker" = c("Yes", "No"),<br /> <br /> "Obese" = c("Yes", "No")))<br /> <br /> # To run the chi-square test #<br /> <br /> # 'correct = FALSE' disables the Yates’ continuity correction #<br /> <br /> chisq.test(Model, correct = FALSE)</b><br /><br />This produces the output:<br /> <br /><i> Pearson's Chi-squared test<br /> <br /> data: Model<br /> X-squared = 1.2698, df = 1, p-value = 0.2598</i><div><br />From the output provided, we can easily conclude that our results were not significant. <br /><br />However, let’s delve a bit deeper into our findings. <br /><br />First, let’s take a look at the matrix of the model. <br /><br /><b>Model</b><br /><br /><i> Obese <br />Smoker Yes No <br /> Yes 5 2 <br /> No 1 2 </i><br /><p style="font-stretch: normal; line-height: normal; margin: 0px;"><br />Now, let’s take a look at the expected model values. <br /><br /><b>chi.result <- chisq.test(Model, correct = FALSE) <br /><br />chi.result$expected </b><br /><br /><i> Obese <br />Smoker Yes No <br /> Yes 4.2 2.8 <br /> No 1.8 1.2 </i><br /></p><br />What does this mean? <br /><br />The values above are the counts which we would expect to observe if smoking status and obesity were entirely independent of one another - that is, if the null hypothesis were exactly true. 
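The expected counts come straight from the table margins: each cell's expectation is its row total times its column total, divided by the grand total. A quick cross-check — sketched here in Python for illustration; chi.result$expected is the R equivalent:

```python
# Expected counts under independence: E[i][j] = row_total[i] * col_total[j] / N
observed = [[5, 2],   # Smoker: Yes
            [1, 2]]   # Smoker: No
row_totals = [sum(row) for row in observed]        # [7, 3]
col_totals = [sum(col) for col in zip(*observed)]  # [6, 4]
n = sum(row_totals)                                # 10
expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)  # [[4.2, 2.8], [1.8, 1.2]]
```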
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-o6LCL1Kvv3U/YLwbgpj0itI/AAAAAAAABV4/_3OJmgIONfsqkYtE2mLoLxGwt3AGPyqUwCLcBGAsYHQ/s288/Karl_Pearson.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="219" src="https://1.bp.blogspot.com/-o6LCL1Kvv3U/YLwbgpj0itI/AAAAAAAABV4/_3OJmgIONfsqkYtE2mLoLxGwt3AGPyqUwCLcBGAsYHQ/s0/Karl_Pearson.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><i>(Karl Pearson)</i></div><br />From the previously derived values, we can derive the Pearson Residual Values.<br /><div><br /><b>print(chi.result$residuals) </b><br /><br /><i> Obese <br />Smoker Yes No <br /> Yes 0.3903600 -0.4780914 <br /> No -0.5962848 0.7302967 <br /></i><br />What we are specifically looking for, as it pertains to the residual output, are values which are greater than +2, or less than -2. If such values were present in any of the above matrix entries, it would indicate a cell whose observed count deviates substantially from its expected count - a sign that the model may have been inappropriately applied given the circumstances of the collected observational data. <br /><br />The matrix values themselves, in the residual matrix, are the observed categorical values minus the expected values, divided by the square root of the expected values. 
</div><div><br />Thus: <b>Standard Residual = (Observed Value – Expected Value) / Square Root of Expected Value</b><br /><br /><b><u>Observed Values </u></b><br /><br /> Obese <br />Smoker Yes No <br /> Yes 5 2 <br /> No 1 2 <br /><br /><b><u>Expected Values </u></b><br /><br /> Obese <br />Smoker Yes No <br /> Yes 4.2 2.8 <br /> No 1.8 1.2</div><br />(5 – 4.2) / √ 4.2 = 0.3903600 <div><br />(1 – 1.8) / √ 1.8 = -0.5962848 <br /><br /></div><div>(2 – 2.8) / √ 2.8 = -0.4780914 <br /><br /></div><div>(2 – 1.2) / √ 1.2 = 0.7302967 <br /><br /><b><u>~ OR ~ </u></b><br /><br /><b>(5 - 4.2) / sqrt(4.2) <br /><br />(1 - 1.8) / sqrt(1.8) <br /><br />(2 - 2.8) / sqrt(2.8) <br /><br />(2 - 1.2) / sqrt(1.2) </b><br /><br /><i>[1] 0.39036 <br />[1] -0.5962848 <br />[1] -0.4780914 <br />[1] 0.7302967</i></div><br />The Pearson Residual Values (0.39036…etc.) are the raw residuals re-scaled to have an approximate standard deviation of one. It is for this reason that any value greater than +2, or less than -2, would indicate a misapplication of the model. Or, at the very least, indicate that more observational values ought to be collected prior to the model being applied again.<br /><br /><b><u>The Fisher’s Exact Test as a Post Hoc Analysis for The Pearson's Chi-Square Test </u></b><br /><br />Let’s take our example one step further by applying The Fisher’s Exact Test as a method of post hoc analysis. <br /><br />Why would we do this? <br /><br />Had our Chi-Square Test findings been significant, we might have considered a Fisher’s Exact Test as a method of further substantiating that evidence. <br /><br />A Fisher’s Exact Test is more conservative in application as compared to the Chi-Square Test. For this reason, the Fisher’s Exact Test will generally yield a higher p-value than its Chi-Square counterpart. 
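Both the residual matrix and the exact test's p-value can be reproduced from first principles. A compact cross-check, sketched here in Python with only the standard library: the two-sided exact p-value sums the hypergeometric probability of every table sharing the observed margins that is no more likely than the observed table (which, to my understanding, is also how R's fisher.test() defines it for a 2x2 table).

```python
import math

observed = [[5, 2], [1, 2]]           # Smoker (rows) x Obese (columns)
expected = [[4.2, 2.8], [1.8, 1.2]]   # as reported by chi.result$expected

# Pearson residual: (observed - expected) / sqrt(expected)
residuals = [[(o - e) / math.sqrt(e) for o, e in zip(o_row, e_row)]
             for o_row, e_row in zip(observed, expected)]

# Two-sided exact p-value: sum the probability of every table with these
# margins whose probability does not exceed that of the observed table.
r1, c1, n = 7, 6, 10                  # first row total, first column total, N
def table_prob(a):                    # a = count in the upper-left cell
    return math.comb(c1, a) * math.comb(n - c1, r1 - a) / math.comb(n, r1)
p_obs = table_prob(observed[0][0])
p_value = sum(table_prob(a)
              for a in range(max(0, r1 - (n - c1)), min(r1, c1) + 1)
              if table_prob(a) <= p_obs * (1 + 1e-7))
print([[round(r, 4) for r in row] for row in residuals], p_value)
```

The residuals reproduce R's 0.3904, -0.4781, -0.5963, and 0.7303, and the exact p-value comes out to 0.5 — the same figure that fisher.test() reports below.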
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-fk-35lHxDt0/YLwdtI0MRMI/AAAAAAAABWA/BBmgv1hHmqAqwHKuAjHJKK0mCpFWWdamwCLcBGAsYHQ/s378/R_Fisher.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="378" data-original-width="277" height="320" src="https://1.bp.blogspot.com/-fk-35lHxDt0/YLwdtI0MRMI/AAAAAAAABWA/BBmgv1hHmqAqwHKuAjHJKK0mCpFWWdamwCLcBGAsYHQ/s320/R_Fisher.png" /></a></div><div style="text-align: center;"><i>(Sir Ronald Fisher</i><i>)</i></div><br /><b>fisher.result <- fisher.test(Model) <br /><br />print(fisher.result$p.value) </b><br /><br /><i>[1] 0.5 <br /></i><br /><div><i>(Yikes!)</i></div><div><br /><u><b>Conclusions </b></u><br /><br />Now that we have considered our analysis every which way, we can state our findings in APA Format. <br /><br />This would resemble the following: <br /><br />A chi-square test of independence was performed to examine the relation between smoking and obesity. The relation between these variables was not found to be significant: χ2 (1, N = 10) = 1.27, p > .05. </div>
</p><div><br /></div><div>-----------------------------------------------------------------------------------------------------------------------------</div><br />I hope that you found all of this helpful and entertaining. <br /><br />Until next time, <br /><br />-RD<div><div><div><div><div><br /></div></div></div></div></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-23976497720373503432020-11-09T20:54:00.010-05:002020-11-09T21:00:08.517-05:00(R) Cohen’s d In today’s entry, we are going to discuss Cohen’s d, what it is, and when to utilize it. We will also discuss how to appropriately apply the methodology needed to derive this value, through the utilization of the R software package. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uDmaBoW3_vw/X6nvSwdS0jI/AAAAAAAABUA/sehzljcss1sMej3PEz0hFrBDyT0VwTvwwCLcBGAsYHQ/s572/Drake.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="513" data-original-width="572" height="358" src="https://1.bp.blogspot.com/-uDmaBoW3_vw/X6nvSwdS0jI/AAAAAAAABUA/sehzljcss1sMej3PEz0hFrBDyT0VwTvwwCLcBGAsYHQ/w400-h358/Drake.png" width="400" /></a></div><br /><div style="text-align: center;"><i>(SPSS does not contain the innate functionality necessary to perform this calculation)</i></div><br /><b><u>Cohen’s d - (What it is)</u>:</b><br /><br />Cohen’s d is utilized as a method to assess the magnitude of impact as it relates to two sample groups which are subject to differing conditions. For example, if a two sample t-test was being implemented to test a single group which received a drug, against another group which did not receive the drug, then the p-value of this test would determine whether or not the findings were significant.<br /><br /><i>Cohen’s d would measure the magnitude of the potential impact</i><i>. 
</i><br /><br /><b><u>Cohen’s d - (When to use it)</u>: </b><br /><br />In your statistics class. <br /><br />You could also utilize this test to perform post-hoc analysis as it relates to the ANOVA model and the Student’s T-Test. However, I have never witnessed the utilization of this test outside of an academic setting. <br /><br /><b><u>Cohen’s d – (How to interpret it)</u>: </b><br /><br />General Interpretation Guidelines: <br /><br />Greater than or equal to 0.2 = small <br />Greater than or equal to 0.5 = medium <br />Greater than or equal to 0.8 = large <br /><br /><b><u>Cohen’s d – (How to state your findings)</u>: </b><br /><br />The effect size for this analysis (d = x.xx) was found to exceed Cohen’s convention for a [small, medium, large] effect (d = .xx). <br /><br /><b><u>Cohen’s d – (How to derive it)</u>:</b><br /><br /><b># Within the R-Programming Code Space # <br /><br />################################## <br /><br /># length of sample 1 (x) # <br />lenx <- <br /># length of sample 2 (y) # <br />leny <- <br /># mean of sample 1 (x) # <br />meanx <- <br /># mean of sample 2 (y)# <br />meany <- <br /># SD of sample 1 (x) # <br />sdx <- <br /># SD of sample 2 (y) # <br />sdy <- <br /><br />varx <- sdx^2 <br />vary <- sdy^2 <br />lx <- lenx - 1 <br />ly <- leny - 1 <br />md <- abs(meanx - meany) ## mean difference (numerator) <br />csd <- lx * varx + ly * vary <br />csd <- csd/(lx + ly) <br />csd <- sqrt(csd) ## common sd computation <br />cd <- md/csd ## cohen's d <br /><br />cd <br /><br />################################## </b><br /><br /><b># The above code is a modified version of the code found at: # <br /><br /># https://stackoverflow.com/questions/15436702/estimate-cohens-d-for-effect-size #</b><br /><br /><b><u>Cohen’s d – (Example)</u></b>: <br /><br /><b>FIRST WE MUST RUN A TEST IN WHICH COHEN’S d CAN BE APPLIED AS AN APPROPRIATE POST-HOC TEST METHODOLOGY.</b><div><b> <br />Two Sample T-Test</b><br /> <br /> This test is utilized if you 
randomly sample different sets of items from two separate control groups. <br /><br /><b> Example:</b><br /> <br /> A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:</div><div><br /> 70, 74, 76, 72, 75, 74, 71, 71<br /> <br /> He then measures the temperature of samples to which the chemical was not applied.<br /> <br /> 74, 75, 73, 76, 74, 77, 78, 75<br /> <br /> Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?<br /> <br /> For this, we will use the code:<div><br /><b> N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)<br /> <br /> N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)<br /> <br /> t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)</b><br /> <br /> Which produces the output:</div><div><br /><i>Two Sample t-test <br /><br />data: N2 and N1 <br />t = 2.4558, df = 14, p-value = 0.02773 <br />alternative hypothesis: true difference in means is not equal to 0 <br />95 percent confidence interval: <br /> 0.3007929 4.4492071 <br />sample estimates: <br />mean of x mean of y <br /> 75.250 72.875 </i><br /><br /><b> # Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #</b><br /> <br /><b> # An additional option is available when running a two sample t-test, The Welch Two Sample T-Test. To utilize this option while performing a t-test, the "var.equal = TRUE" must be changed to "var.equal = FALSE". The output produced from a Welch Two Sample t-test is slightly more robust, as the test does not assume equal variances between the two samples. 
#<br /></b> <br /> From this output we can conclude:<br /> <br /> With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.</div></div><br /><b><u>Application of Cohen’s d</u></b> <br /><br /><b>length(N1) # 8 # <br />length(N2) # 8 # <br /><br />mean(N1) # 72.875 # <br />mean(N2) # 75.25 # <br /><br />sd(N1) # 2.167124 # <br />sd(N2) # 1.669046 # <br /><br /># length of sample 1 (x) # <br />lenx <- 8 <br /># length of sample 2 (y) # <br />leny <- 8 <br /># mean of sample 1 (x) # <br />meanx <- 72.875 <br /># mean of sample 2 (y)# <br />meany <- 75.25 <br /># SD of sample 1 (x) # <br />sdx <- 2.167124 <br /># SD of sample 2 (y) # <br />sdy <- 1.669046 <br /><br />varx <- sdx^2 <br />vary <- sdy^2 <br />lx <- lenx - 1 <br />ly <- leny - 1 <br />md <- abs(meanx - meany) ## mean difference (numerator) <br />csd <- lx * varx + ly * vary <br />csd <- csd/(lx + ly) <br />csd <- sqrt(csd) ## common sd computation <br />cd <- md/csd ## cohen's d <br /><br />cd </b><br /><br />Which produces the output: <br /><br /><i>[1] 1.227908</i><div><i><br /></i></div><div><b>################################## </b><i><br /></i><br />From this output we can conclude: <br /><br />The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80). <br /><br /><b>Combining both conclusions, our final written product would resemble: </b><br /><br /> With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water. <br /><br />The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80). <br /><br /></div><div>And that is it for this article. 
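As a postscript: the assignment-by-assignment computation above generalizes into a small reusable function. A sketch is given below, written in Python purely for illustration; the pooled-standard-deviation formula itself is language-agnostic and translates directly into an R function.

```python
import math

def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation -- the same arithmetic
    as the assignment-by-assignment version above, computed from raw data."""
    nx, ny = len(x), len(y)
    mean_x, mean_y = sum(x) / nx, sum(y) / ny
    var_x = sum((v - mean_x) ** 2 for v in x) / (nx - 1)  # sample variance
    var_y = sum((v - mean_y) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2))
    return abs(mean_x - mean_y) / pooled_sd

n1 = [70, 74, 76, 72, 75, 74, 71, 71]
n2 = [74, 75, 73, 76, 74, 77, 78, 75]
print(round(cohens_d(n1, n2), 4))  # 1.2279
```

(The sixth decimal place differs slightly from the 1.227908 printed above, because the hand calculation plugs in means and standard deviations which were themselves rounded.)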
<br /><br />Until next time, <br /><br />-RD</div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-25937816034542672552020-10-16T23:22:00.007-04:002020-10-16T23:27:19.962-04:00(R) Fisher’s Exact Test In today’s entry, we are going to briefly review <b>Fisher’s Exact Test</b>, and its appropriate application within the R programming language. <br /><br />Fisher’s Exact Test, like the F-Distribution which underlies the F-Test, is the work of Sir Ronald Fisher. Unlike the F-Test, however, the exact test computes its p-value directly from the hypergeometric distribution, rather than by comparing a test statistic against a reference distribution. <div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-wANIfKQjV8I/X4pgr0H_UqI/AAAAAAAABTg/u_3HU2q8t9Itu_4OQ6RS0jVjqvtTkqqtwCLcBGAsYHQ/s308/Fisher1016.JPG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="308" data-original-width="220" src="https://1.bp.blogspot.com/-wANIfKQjV8I/X4pgr0H_UqI/AAAAAAAABTg/u_3HU2q8t9Itu_4OQ6RS0jVjqvtTkqqtwCLcBGAsYHQ/s0/Fisher1016.JPG" /></a></div><div style="text-align: center;"><i>(The Man)</i></div><div style="text-align: center;"><i><br /></i></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-AeP7Tv1jxoE/X4pg71qIRoI/AAAAAAAABTo/R-xTULHNAFA2scHBJwvStW2XWsDj9vySwCLcBGAsYHQ/s1024/Fisher1016.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="300" src="https://1.bp.blogspot.com/-AeP7Tv1jxoE/X4pg71qIRoI/AAAAAAAABTo/R-xTULHNAFA2scHBJwvStW2XWsDj9vySwCLcBGAsYHQ/w400-h300/Fisher1016.png" width="400" /></a></div><div style="text-align: center;"><i>(The Distribution)</i></div><div style="text-align: center;"><br /></div>The Fisher’s Exact Test is very similar to The Chi-Squared Test. Both tests are utilized to assess categorical data classifications. 
The Fisher’s Exact Test was designed specifically for 2x2 contingency sorted data, though more rows could theoretically be added if necessary. A general rule for selecting the appropriate test for the given circumstances (Fisher’s Exact vs. Chi-Squared) pertains directly to the sample size: if any expected cell count within the contingency table would fall below 5, a Fisher’s Exact Test is the more appropriate choice. <br /><br />The test itself was created for the purpose of studying small observational samples. For this reason, the test is considered to be “conservative”, as compared to The Chi-Squared Test. Or, in layman’s terms, you are less likely to reject the null hypothesis when utilizing a Fisher’s Exact Test, as the test errs on the side of caution. As previously mentioned, the test was designed for smaller observational series, therefore, its conservative nature is a feature, not an error. <br /><br />Let’s give it a try in today’s…</div><br /><b><u>Example: </u></b><br /><br />A professor instructs two classes on the subject of Remedial Calculus. He believes, based on a book that he recently completed, that students who consume avocados prior to taking an exam will generally perform better than students who did not consume avocados prior to taking an exam. To test this hypothesis, the professor has one of his classes consume avocados prior to a very difficult pass/fail examination. The other class does not consume avocados, and also completes the same examination. He collects the results of his experiment, which are as follows: <br /><br />Class 1 (Avocado Consumers) <br /><br />Pass: 15 <br /><br />Fail: 5 <br /><br />Class 2 (Avocado Abstainers) <br /><br />Pass: 10 <br /><br />Fail: 15 <br /><br />It is also worth mentioning that the professor will be assuming an alpha value of .05. 
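Because the rule of thumb refers to expected rather than observed counts, it is worth checking the professor's table before choosing a test. A quick check — sketched here in Python for illustration; chisq.test(Model)$expected returns the same matrix in R:

```python
# The "fewer than 5" rule of thumb concerns EXPECTED counts, which come
# straight from the table margins: E = row_total * column_total / N.
observed = [[15, 5],    # Class 1 (avocado consumers): pass, fail
            [10, 15]]   # Class 2 (avocado abstainers): pass, fail
row_totals = [sum(row) for row in observed]        # [20, 25]
col_totals = [sum(col) for col in zip(*observed)]  # [25, 20]
n = sum(row_totals)                                # 45
expected = [[r * c / n for c in col_totals] for r in row_totals]
print(min(min(row) for row in expected))  # ~8.89 -- every cell clears 5
```

Every expected count here is comfortably above 5, so the chi-squared test would also have been defensible; the exact test simply represents the more cautious choice.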
<br /><br /><b># The data must first be entered into a matrix # <br /><br />Model <- matrix(c(15, 10, 5, 15), nrow = 2, ncol=2) <br /><br /># Let’s examine the matrix to make sure everything was entered correctly # <br /><br />Model</b><br /><br /><u>Console Output: </u><br /><div><i><br /></i></div><div><i> [,1] [,2] <br />[1,] 15 5 <br />[2,] 10 15 </i><br /><br /><b># Now to apply Fisher’s Exact Test # <br /><br />fisher.test(Model) <br /></b><br /><u>Console Output: <br /></u><br /> <i><span> <span> </span></span>Fisher's Exact Test for Count Data <br /><br />data: Model <br />p-value = 0.03373 <br />alternative hypothesis: true odds ratio is not equal to 1 <br />95 percent confidence interval: <br /> 1.063497 20.550173 <br />sample estimates: <br />odds ratio <br /> 4.341278</i><br /><br /><u><b>Findings: </b></u><br /><br />Fisher’s Exact Test was applied to our experimental findings for analysis. The results of such indicated a significant relationship as it pertains to avocado consumption and examination success: 75% (15/20), as compared to non-consumption and examination success: 40% (10/25); (p = .03). <br /><br />If we were to apply the Chi-Squared Test to the same data matrix, we would receive the following output: <br /><br /><b># Application of Chi-Squared Test to prior experimental observations # </b><br /><br /><b>chisq.test(Model, correct = FALSE)</b><br /><br /><u>Console Output: </u><br /><br /><i> Pearson's Chi-squared test <br /><br />data: Model <br />X-squared = 5.5125, df = 1, p-value = 0.01888</i>
<br /><br /><b><u>Findings:</u> </b></div><div><br /></div><div>As you might have expected, the application of the Chi-Squared Test yielded an even smaller p-value! If we were to utilize this test in lieu of The Fisher’s Exact Test, our results would also demonstrate significance. <br /><br />That is all for this entry. <br /><br />Thank you for your patronage. <br /><br />I hope to see you again soon. <br /><br />-RD<br /></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-40129508545004665022020-10-14T17:58:00.002-04:002020-10-14T18:00:42.658-04:00Why Isn’t My Excel Function Working?! (MS-Excel)Even an old data scientist can learn a new trick every once in a while. <br /><br />Today was such a day.<br /><br />Imagine my shock, as I spent about two and a half hours trying to get the most basic MS-Excel Functions to correctly execute. <br /><br />This brings us to today’s example.<br /><br />I’m not sure if this is now a default option within the latest version of Excel, or why this option would even exist, however, I feel that it is my duty to warn you of its existence.<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-b9styC0PrRA/X4dx7I4nrPI/AAAAAAAABS4/ayrX50GIQRYHMvfqfVdXKjTakyyl0C-BACLcBGAsYHQ/s306/1012A.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="115" data-original-width="306" src="https://1.bp.blogspot.com/-b9styC0PrRA/X4dx7I4nrPI/AAAAAAAABS4/ayrX50GIQRYHMvfqfVdXKjTakyyl0C-BACLcBGAsYHQ/s16000/1012A.png" /></a></div><div><br /></div>For the sake of this demonstration, we’ll hypothetically assume that you are attempting to write a <b>=COUNTIF </b>function within cell: <b>C2</b>, in order to assess the value contained within cell: <b>A2</b>. 
If we were to drag this formula to the cells beneath: <b>C2</b>, in order to apply the function to cells: <b>C3 </b>and <b>C4</b>, a misapplication occurs, as the value <b>“Car”</b> is not contained within <b>A3</b> or <b>A4</b>, and yet, the value <b>1 </b>is returned. <br /><br />If this “error” arises, it is likely due to the option <b>“Manual”</b> being pre-selected within the <b>“Calculation Options”</b> drop-down menu, which is itself contained within the<b> “Formulas”</b> ribbon menu. To remedy this situation, change the selection to <b>“Automatic”</b> within the <b>“Calculation Options”</b> drop-down. <br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-jFISEQOLqf8/X4dy8djCwUI/AAAAAAAABTE/m4FiEnJWDbYmeZlY3YGeeghfCeKgEpg0gCLcBGAsYHQ/s1056/1012B.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="207" data-original-width="1056" height="78" src="https://1.bp.blogspot.com/-jFISEQOLqf8/X4dy8djCwUI/AAAAAAAABTE/m4FiEnJWDbYmeZlY3YGeeghfCeKgEpg0gCLcBGAsYHQ/w400-h78/1012B.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><i>(Click on image to enlarge)</i></div><div><br /></div>The result should be the previously expected outcome:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-MNyB5D4K_0o/X4dzVa9Nn9I/AAAAAAAABTM/AQbtrqYHl48lwLEiKzrUJuKQNBJc7tU1ACLcBGAsYHQ/s245/1012C.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="110" data-original-width="245" src="https://1.bp.blogspot.com/-MNyB5D4K_0o/X4dzVa9Nn9I/AAAAAAAABTM/AQbtrqYHl48lwLEiKzrUJuKQNBJc7tU1ACLcBGAsYHQ/s16000/1012C.png" /></a></div><div><br /></div>Instead of accidentally and unknowingly encountering this error/feature in a way which is detrimental to your research, I would always recommend checking that <b>“Calculation Options”</b> is set to <b>“Automatic”</b>,<b> 
</b>prior to beginning your work within MS-Excel.<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-KPcQq0L3JhI/X4dzqbw7Q8I/AAAAAAAABTU/PxARehn-RQU9pAe1GfUgNoORAz2AHHUrACLcBGAsYHQ/s259/1012D.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="180" data-original-width="259" src="https://1.bp.blogspot.com/-KPcQq0L3JhI/X4dzqbw7Q8I/AAAAAAAABTU/PxARehn-RQU9pAe1GfUgNoORAz2AHHUrACLcBGAsYHQ/s0/1012D.png" /></a></div><br />I hope that you found this article useful. <br /><br />I’ll see you in the next entry. <br /><br />-RD <br />Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0tag:blogger.com,1999:blog-1608768736913930926.post-57997942540476579562020-10-06T23:28:00.005-04:002020-10-06T23:31:54.761-04:00Averaging Across Variable Columns (SPSS)There may be a more efficient way to perform this task, as simpler functionality exists within other programming languages; however, I have not been able to discover a non-<b>“ad-hoc”</b> method for performing it within SPSS. 
<br /><br />We will assume that we are operating within the following data set:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-apvBDdzNEX4/X30zjhdRW0I/AAAAAAAABSM/C-V69eU_OLovvfBgE0SuMxWFv2jw3FaygCLcBGAsYHQ/s344/A10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="257" data-original-width="344" src="https://1.bp.blogspot.com/-apvBDdzNEX4/X30zjhdRW0I/AAAAAAAABSM/C-V69eU_OLovvfBgE0SuMxWFv2jw3FaygCLcBGAsYHQ/s16000/A10.6.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Which possesses the following data labels:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-dRstvQfcyr4/X30z2247HrI/AAAAAAAABSU/MqVH5Zb9B1wjSuz4OThvxhCACA88b0KkgCLcBGAsYHQ/s350/B10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="258" data-original-width="350" src="https://1.bp.blogspot.com/-dRstvQfcyr4/X30z2247HrI/AAAAAAAABSU/MqVH5Zb9B1wjSuz4OThvxhCACA88b0KkgCLcBGAsYHQ/s16000/B10.6.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Assuming that all variables are on a similar scale, we could create a new variable by utilizing the code below: <br /><br /><b>COMPUTE CatSum=MEAN(VarA, <br />VarB, <br />VarC). <br />EXECUTE. <br /></b><br />This new variable will be named <b>“CatSum”</b>. For each row of data, it will contain the mean of that row’s values across the variables: (<b>“VarA”</b>, <b>“VarB”</b>, <b>“VarC”</b>). 
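For readers who think more readily in other languages, the row-wise behavior of SPSS’s <b>MEAN()</b> function can be sketched in plain Python. The values below are hypothetical stand-ins (the actual values live in the screenshots above), but the logic is the same:

```python
from statistics import mean

# Hypothetical rows standing in for the example's VarA, VarB, VarC values
rows = [
    {"VarA": 1.0, "VarB": 4.0, "VarC": 7.0},
    {"VarA": 2.0, "VarB": 5.0, "VarC": 8.0},
    {"VarA": 3.0, "VarB": 6.0, "VarC": 9.0},
]

# Analogous to: COMPUTE CatSum=MEAN(VarA, VarB, VarC).
# Each row receives the mean of its own three values.
cat_sum = [mean((r["VarA"], r["VarB"], r["VarC"])) for r in rows]
print(cat_sum)  # [4.0, 5.0, 6.0]
```

This is only a sketch of the computation, not SPSS itself; note that one mean is produced per row, not per column.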
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-1KVLXcGlTXs/X300R55rkuI/AAAAAAAABSc/9LS64FaKsesHMYq6rcN_tIpXOYv9D590QCLcBGAsYHQ/s430/C10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="259" data-original-width="430" src="https://1.bp.blogspot.com/-1KVLXcGlTXs/X300R55rkuI/AAAAAAAABSc/9LS64FaKsesHMYq6rcN_tIpXOYv9D590QCLcBGAsYHQ/s16000/C10.6.png" /></a></div><div><br /></div> To generate the mean value of our newly created <b>“CatSum” </b>variable, we would execute the following code: <br /><br /><b>DESCRIPTIVES VARIABLES=CatSum <br /> /STATISTICS=MEAN STDDEV. </b><div><br />This produces the output:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-GMeDNaTlGhI/X300j_B8QkI/AAAAAAAABSk/I4KtlkVu1VkSjdRYeL7TTnLXV8xGjw8-ACLcBGAsYHQ/s426/D10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="426" src="https://1.bp.blogspot.com/-GMeDNaTlGhI/X300j_B8QkI/AAAAAAAABSk/I4KtlkVu1VkSjdRYeL7TTnLXV8xGjw8-ACLcBGAsYHQ/s16000/D10.6.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br />To reiterate what we are accomplishing by performing this task, we are simply generating the overall mean of the row-wise means of the variables: <b>“VarA”</b>, <b>“VarB”</b>, <b>“VarC”</b>. 
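The <b>DESCRIPTIVES</b> step is simply a mean (and a sample standard deviation) taken over the new column. Continuing the hypothetical Python sketch from above:

```python
from statistics import mean, stdev

# Hypothetical row-wise means ("CatSum") carried over from the previous step
cat_sum = [4.0, 5.0, 6.0]

# Analogous to: DESCRIPTIVES VARIABLES=CatSum /STATISTICS=MEAN STDDEV.
# stdev() is the sample standard deviation (n - 1 denominator),
# which matches what DESCRIPTIVES reports.
print(mean(cat_sum))
print(stdev(cat_sum))
```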
<br /><br />Another way to conceptually envision this process is to imagine that we are placing all of the variables together into a single column:<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-30_c_QnCkOc/X3003ZZOk_I/AAAAAAAABSs/lRmctvQXC-c5NVirOKHljCXm1iQsIqo5ACLcBGAsYHQ/s741/E10.6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="741" data-original-width="87" height="640" src="https://1.bp.blogspot.com/-30_c_QnCkOc/X3003ZZOk_I/AAAAAAAABSs/lRmctvQXC-c5NVirOKHljCXm1iQsIqo5ACLcBGAsYHQ/w74-h640/E10.6.png" width="74" /></a></div><br />After which, we generate the mean value of the column which contains all of the combined observational values. Because every row possesses a value for each of the three variables, the mean of this single stacked column equals the mean of <b>“CatSum”</b>. <br /><br />And that, is that!<br /><br />At least, for this article. <br /><br />Stay studious in the interim, Data Heads! <br /><br />- RD<div><div><div><br /></div></div></div>Data Scientisthttp://www.blogger.com/profile/06009197702473566650noreply@blogger.com0