Reflections of a Data Scientist: Model and Method Utilization

Many model types, methods, and techniques are demonstrated on this website. In this entry, I categorize each of them and briefly describe the scenario in which it is appropriately utilized.

(Tests of Normality)

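Q-Q Plot – A graph which plots the quantiles of the sample against the quantiles of a theoretical normal distribution. Points which fall close to a straight line suggest normality.

P-P Plot – A graph which plots the cumulative probabilities of the sample against those of a theoretical normal distribution. As with the Q-Q Plot, points close to a straight line suggest normality.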

Shapiro-Wilk Normality Test – A hypothesis test which is utilized to test data for normality. A small p-value indicates that the data departs from a normal distribution.
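As an illustration of what a Q-Q Plot actually draws, here is a minimal sketch which computes the plotted points using only the Python standard library. The function name `qq_points`, the plotting positions `(i - 0.5) / n`, and the sample values are my own assumptions for this example; plotting packages may use slightly different conventions.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal distribution fitted with the sample mean and standard deviation.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n keep every probability strictly inside (0, 1).
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(reference.inv_cdf(p), x) for p, x in zip(probs, ordered)]

sample = [4.9, 5.1, 5.0, 4.7, 5.3, 5.2, 4.8, 5.0]  # made-up measurements
for theoretical, observed in qq_points(sample):
    print(f"{theoretical:6.3f}  {observed:6.3f}")
```

If the two columns rise together along a roughly straight line, the sample is consistent with normality.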

(Tests Related to Parametric Model Variable Correlation)

Variance Inflation Factor – A method which tests the independent variables of a model for multicollinearity, meaning correlation amongst the independent variables themselves.

(Pearson) Coefficient of Correlation – A method which measures the strength and direction of the linear relationship between two continuous variables.

Partial Correlation - A method which is utilized to measure the correlation between two variables, while also controlling for a third variable.

Distance Correlation – A method which tests model variables for correlation through the utilization of a Euclidean distance formula.

Canonical Correlation – A method which assesses the correlation between two sets of variables by forming a linear combination of the variables within each set, and then correlating those combinations.
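The Pearson and Partial Correlation entries above can be sketched in a few lines. This is a minimal illustration, assuming a single control variable and using the standard first-order partial correlation formula; the function names `pearson` and `partial_corr` and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def partial_corr(x, y, z):
    """Correlation of x and y after controlling for a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up data: z is unrelated to x and y, so controlling for it changes nothing.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
z = [1.0, -1.0, -1.0, 1.0]
print(pearson(x, y), partial_corr(x, y, z))
```

When the control variable is correlated with both x and y, the partial correlation will differ from the plain Pearson value.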

(Tests Related to Non-Parametric Model Variable Correlation)

Spearman’s Rank Correlation – A non-parametric alternative to the Pearson correlation. This method is utilized in circumstances when either the relationship between the variables is monotonic but not linear, or the data contained within the samples is ordinal. An example of ordinal data – “survey response data which asked the respondent to rank a particular item on a scale of 1-10”.

Kendall Rank Correlation Coefficient – Like Spearman’s rho, Kendall’s Tau is utilized in circumstances when either the relationship between the variables is monotonic but not linear, or the data contained within the samples is ordinal.
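To show how the two rank-based measures differ from Pearson, here is a minimal standard-library sketch. Spearman's rho is computed as the Pearson correlation of the ranks, and Kendall's tau is computed in its simplest "tau-a" form, which applies no correction for ties (an assumption of this example; statistical packages usually report tau-b). All function names are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# A monotonic but non-linear relationship (y = x squared).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman_rho(x, y), kendall_tau_a(x, y))
```

On this data the rank-based measures report a perfect relationship, while the Pearson coefficient falls short of 1 because the relationship is not linear.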

(Tests of Significance Amongst Groups)

One Sample T-Test – This test is utilized to compare a sample mean to a specific value. It is used when the dependent variable is measured at the interval or ratio level.

Two Sample T-Test – This test functions in the same manner as the above test. However, instead of comparing a sample mean to a specific value, it compares the means of two independent groups, each randomly sampled from a separate population.

The Welch Two Sample T-Test – This test functions in the same manner as the above test. The only difference being, this method does not assume that the two groups have equal variances. It is the safer choice when the group variances, and often the group sizes, differ.

Paired T-Test – Similar in composition to the Two Sample T-Test, this test is utilized if you are measuring the same set of subjects twice, for example, once before and once after a treatment.
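The differences amongst the two sample, Welch, and paired variants are easiest to see in how each test statistic is built. The following is a minimal standard-library sketch which returns only the t statistic and the degrees of freedom (looking up the p-value is left to a statistics package). The function names and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance

def pooled_t(a, b):
    """Two Sample T-Test statistic, pooling the two group variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Welch statistic: variances are kept separate, df from Welch-Satterthwaite."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

def paired_t(first, second):
    """Paired T-Test statistic: a one sample test on the per-subject differences."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up measurements
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
print(paired_t(group_a, group_b[:4] + [7.0]))
```

When the two groups have the same size and variance, as in this made-up data, the pooled and Welch results coincide; they diverge as the variances diverge.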

(Analysis of Variance “ANOVA”)

Analysis of Variance – Also known as ANOVA, this method is utilized to test for significant differences amongst the means of multiple sample groups. It does so by comparing the variance between the groups to the variance within the groups. In many ways, this test is similar to a t-test. However, ANOVA allows for the comparison of more than two groups.

One Way Analysis of Variance (ANOVA) – An ANOVA model containing a single independent variable.

Two Way Analysis of Variance (ANOVA) – An ANOVA model containing two independent variables, which can also test for an interaction between them.

Repeated-Measures Analysis of Variance (ANOVA) – An ANOVA model in which the dependent variable is measured multiple times on the same subjects, for example, at several points in time.
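For the One Way ANOVA described above, the F statistic is the ratio of between-group variability to within-group variability. Here is a minimal standard-library sketch; the function name `one_way_anova_f` and the three groups are made up for illustration, and converting F to a p-value is left to a statistics package.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a One Way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variability of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

print(one_way_anova_f([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]))
```

A large F indicates that the group means differ by more than the spread within the groups would explain.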

(Exotic Analysis of Variance “ANOVA” Variants)

Analysis of Covariance (ANCOVA) – An ANOVA model which also adjusts for a covariate, a continuous variable which may influence the dependent variable.

https://statistics.laerd.com/spss-tutorials/ancova-using-spss-statistics.php

Random Effects Analysis of Variance – An ANOVA model in which the groups being compared are treated as a random sample from a greater population of possible groups, so that inferences can be made about that population.

https://stat.ethz.ch/education/semesters/as2015/anova/06_Random_Effects.pdf

Multivariate Analysis of Variance (MANOVA) – An ANOVA model containing multiple dependent variables.

https://statistics.laerd.com/spss-tutorials/one-way-manova-using-spss-statistics.php

Multivariate Analysis of Covariance (MANCOVA) – An ANOVA model containing multiple dependent variables, which also adjusts for a covariate which may influence those dependent variables.

https://statistics.laerd.com/spss-tutorials/one-way-mancova-using-spss-statistics.php

(Test of Significance for Nonparametric Data)

Friedman Test (Repeated-Measures Analysis of Variance) – The nonparametric alternative to a one way Repeated-Measures ANOVA test. For independent groups, the Kruskal-Wallis test is the usual nonparametric alternative to a One Way ANOVA test.

Wilcoxon Signed Rank Test (One Sample T-Test, Paired T-Test) – The nonparametric alternative to the One Sample T-Test, and the Paired T-Test.

Mann-Whitney U Test (Two Sample T-Test) – The nonparametric alternative to the Two Sample T-Test.
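The Mann-Whitney U Test works on the ordering of the observations rather than their values. As a minimal sketch (my own illustration, with a hypothetical function name and made-up data), the U statistic can be computed by counting, across every cross-group pair, how often one group's value exceeds the other's.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b) by direct pair counting; a tied pair counts as 0.5."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    # The two statistics always sum to the number of cross-group pairs.
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

A U value near zero for one group indicates that its observations sit almost entirely below those of the other group.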

(Tests of Significance for Categorical Data)

Chi-Square – A test which assesses whether two categorical variables are associated, by comparing the observed counts in a contingency table to the counts expected if the variables were independent.

McNemar's Test – A test for paired categorical data, limited to two categories measured on two occasions (a 2 x 2 table). This test is typically utilized for before-and-after studies, such as drug trials.
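Both of the tests above reduce to a statistic computed from a table of counts. This is a minimal standard-library sketch, assuming the uncorrected form of each statistic (no continuity correction); the function names and the tables are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

def mcnemar_statistic(table):
    """McNemar statistic for a paired 2 x 2 table; only discordant pairs matter."""
    b, c = table[0][1], table[1][0]
    return (b - c) ** 2 / (b + c)

print(chi_square_statistic([[10, 20], [20, 10]]))
print(mcnemar_statistic([[10, 5], [15, 20]]))
```

Note that McNemar's statistic ignores the diagonal cells entirely, since subjects whose category did not change carry no information about the direction of change.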

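(Metric to Assess Rate of Agreement Amongst Two Entities)

Cohen’s Kappa – A metric which measures the rate of agreement between two entities (raters) classifying the same items, corrected for the agreement which would be expected by chance.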
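Cohen's Kappa compares the observed rate of agreement with the rate expected if both raters assigned labels independently. A minimal standard-library sketch, with a hypothetical function name and made-up ratings:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, per label.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

rater_a = ["yes", "yes", "no", "no"]
rater_b = ["yes", "no", "no", "no"]
print(cohens_kappa(rater_a, rater_b))
```

A value of 1 indicates perfect agreement, while a value of 0 indicates agreement no better than chance.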

(Measure of Consistency Amongst Groups of Survey Questions)

Cronbach’s Alpha – Cronbach’s Alpha is primarily utilized to measure the internal consistency of response data collected from sociological surveys. Specifically, it assesses how closely the responses to a group of interrelated survey questions vary together, which indicates whether those questions measure the same underlying concept.
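Cronbach's Alpha is computed from the variance of each question and the variance of the respondents' total scores. The following is a minimal standard-library sketch, assuming the usual formula alpha = k / (k - 1) * (1 - sum of item variances / variance of totals); the function name and the survey scores are made up for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per question, each ordered by respondent."""
    k = len(items)
    # Total score of each respondent across all questions.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Three questions answered by five respondents (made-up 1-5 scale responses).
survey = [
    [2.0, 3.0, 4.0, 4.0, 5.0],
    [1.0, 3.0, 3.0, 4.0, 5.0],
    [2.0, 2.0, 4.0, 5.0, 5.0],
]
print(cronbach_alpha(survey))
```

Values closer to 1 indicate that the questions move together and likely measure the same concept.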

(Tests Pertaining to Stationarity and Random Walks)

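Dickey-Fuller Test – A methodology of analysis utilized to test time series data for stationarity, by testing for the presence of a unit root.

Phillips-Perron Unit Root Test – A methodology utilized to test time series data for a unit root, which is the defining property of a random walk.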

(Comparison of Outcome Variables)

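Two Step Cluster – A method which groups observations into clusters through a two stage technique: observations are first assigned to many small pre-clusters, which are then merged into the final clusters.

K-Means – A method which groups observations into a pre-specified number of clusters (k), assigning each observation to the cluster with the nearest mean.

Hierarchical Cluster – A method which groups observations through the utilization of a hierarchical technique, successively merging (or splitting) clusters to form a tree.

K-Nearest Neighbor – A method which predicts the outcome of an observation from the outcomes of the k most similar observations, with similarity determined by the values of the model’s independent variables.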
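To illustrate the K-Means entry above, here is a minimal sketch of the standard iterative procedure (Lloyd's algorithm) using only the standard library. It assumes the starting centroids are supplied by the caller and runs a fixed number of iterations; real implementations choose starting points and stopping rules more carefully. The function names and the points are hypothetical.

```python
from math import dist  # available in Python 3.8 and later

def assign(points, centroids):
    """Place each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # An empty cluster keeps its previous centroid.
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

With this made-up data, the two centroids settle on the two visibly separate groups of points after the first iteration.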

(Reduction of Independent Variables through Variable Synthesis)

Dimension Reduction – A method which creates a smaller set of new variables, with values that are determined by the original values of the independent model variables. Principal component analysis is a common example.

(Impact Assessment)

TURF Analysis – Total Unduplicated Reach and Frequency analysis, a method typically utilized for product and design studies. This technique identifies the combination of offerings which reaches the largest share of a target demographic.

(Survival Analysis)

Survival Analysis - A statistical methodology which measures the probability of an event occurring within a group over a period of time.
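One common way to estimate survival probabilities over time is the Kaplan-Meier estimator. The entry above does not name a specific technique, so treat this as one illustrative choice rather than the only approach. The sketch below uses only the standard library; the function name and the data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time an event occurred.

    observed[i] is True if subject i experienced the event at durations[i],
    and False if the subject was censored (left the study) at that time.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # Multiply by the share of at-risk subjects who survived past time t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Four subjects; the second was censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [True, False, True, True]))
```

Censored subjects reduce the number at risk without counting as events, which is what distinguishes this estimate from a simple proportion.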

(Sample Distribution Tests)

The Wald Wolfowitz Test - A method for analyzing a single data set in order to determine whether the elements within the data set were sampled independently.

The Wald Wolfowitz Test (2-Sample) - A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

The Kolmogorov-Smirnov Test – A method for analyzing a single data set in order to determine whether the data was sampled from a specified reference distribution, most commonly a normal distribution.

The Kolmogorov-Smirnov Test (2-Sample) - A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.
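The two sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distribution functions of the two data sets. A minimal standard-library sketch, with a hypothetical function name and made-up data (the p-value lookup is omitted):

```python
from bisect import bisect_right

def ks_two_sample_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at an observed value, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

print(ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic of 0 means the two empirical distributions are identical, while a statistic of 1 means the samples do not overlap at all.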

(Outcome Models – Conditions for Utilization)

Linear Regression – Continuous outcome variable. Continuous independent variable(s).

General Linear Mixed Models – Continuous outcome variable. Any type of independent variable(s).

Logistic Regression Analysis – Binary outcome variable. Categorical or continuous independent variable(s).

Discriminant Analysis – Categorical outcome variable (two or more groups). Continuous independent variable(s).

Loglinear Analysis - Binary outcome variable. Categorical independent variable(s).

Partial Least Squares Regression – Any type of outcome variable. Any type of independent variable(s).

Polynomial Regression – Continuous outcome variable. Continuous independent variable(s).

Multinomial Logistic Regression – Categorical outcome variable with more than two unordered categories. Categorical or continuous independent variable(s).

Ordinal Logistic Regression – Ordinal outcome variable. Categorical or continuous independent variable(s).

Probit Regression – Binary outcome variable. Categorical or continuous input variable(s).

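2-Stage Least Squares Regression – Continuous outcome variable. Continuous independent variable(s), at least one of which is correlated with the error term and is therefore replaced by predictions from instrumental variables.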
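As a concrete instance of the simplest model in the list above, Linear Regression with a single continuous independent variable, here is a minimal ordinary least squares sketch using only the standard library. The function name `fit_line` and the data are made up for illustration; models with several independent variables are normally fit with a statistics package.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return intercept, slope

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # lies exactly on y = 1 + 2x
print(fit_line(x, y))
```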

Sunday, August 4, 2019
