Correlation Research
- Relations between Variables
- Magnitude
- Reliability
- P-value
- Small relations can only be demonstrated reliably in large samples
Test for Normality
- ANOVA/MANOVA
- Nonparametrics
- Test statistics - normal-theory (t, F, Chi-square)
Differences between independent groups. Normal distributions -> t-test for independent samples
Nonparametric alternatives to the t-test for independent samples:
- the Mann-Whitney U test
- the Kolmogorov-Smirnov two-sample test and Wald-Wolfowitz
- Analysis of variance (ANOVA)
- Analysis of covariance
- Multivariate ANOVA
Differences between dependent groups.
Normal Distributions
- T-Test for Dependent Samples
Nonparametric alternatives to this test are
the Sign test and
Wilcoxon's matched pairs test.
If the variables of interest are dichotomous in nature (i.e., "pass" vs. "no pass") then McNemar's Chi-square test is appropriate.
If there are more than two variables that were measured in the same sample
- then we would customarily use repeated measures ANOVA.
Nonparametric alternatives to this method are
- Friedman's two-way analysis of variance
- and Cochran Q test (if the variable was measured in terms of categories, e.g., "passed" vs. "failed").
- Cochran Q is particularly useful for measuring changes in frequencies (proportions) across time
Relationships between variables.
To express a relationship between two variables one usually computes the correlation coefficient.
Nonparametric equivalents to the standard correlation coefficient are Spearman R, Kendall Tau, and coefficient Gamma.
If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. "female") appropriate nonparametric statistics for testing the relationship between the two variables are
- the Chi-square test,
- the Phi coefficient,
- and the Fisher exact test.
In addition, a simultaneous test for relationships between multiple cases is available:
- Kendall coefficient of concordance.
This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli.
Nonparametric Correlations
The following are three commonly used nonparametric correlation coefficients: Spearman R, Kendall Tau, and Gamma. Violating normality assumptions has less grave consequences than previously thought, although this can only be shown on a case-by-case basis.
- Normal vs Nonparametric Methods
- Distribution Tables (Z, t, Chi-square, F tables)
- Charting differences between F/t tests, ANOVA/ANCOVA, t-test/ANOVA, 1- vs 2-way ANOVA
- Pearson product-moment
- Partial correlation -Confounding variable
- Coefficient of determination
* stats.zscore(df)
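A minimal sketch of the `stats.zscore(df)` call above, assuming `df` is a hypothetical numeric DataFrame:

```python
import pandas as pd
from scipy import stats

# hypothetical numeric DataFrame
df = pd.DataFrame({"height": [160, 172, 181, 168, 175],
                   "weight": [55, 70, 82, 63, 74]})

# z-score each column: (value - column mean) / column standard deviation
z = stats.zscore(df)
print(pd.DataFrame(z, columns=df.columns))
```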
Two major types of statistics: inferential and descriptive. Inferential statistics use more complex calculations to infer trends and make predictions. A descriptive analysis should be performed before moving on to inference.
First and foremost, show the descriptives: confidence intervals on scatter plots, Pearson coefficients, linear regression.
Group by group: total X's in Y
The Chi-Square test helps you determine if two discrete variables are associated
Too much overlapping in the x-axis labels renders the whole plot useless.
https://blog.socialcops.com/academy/resources/cross-tabulation-how-why/
"Also known as contingency tables or cross tabs, cross tabulation groups variables to understand the correlation between different variables. It also shows how correlations change from one variable grouping to another. It is usually used in statistical analysis to find patterns, trends, and probabilities within raw data.""Cross tabulation is usually performed on categorical data — data that can be divided into mutually exclusive groups."
"The Chi-Square test helps you determine if two discrete variables are associated. If there's an association, the distribution of one variable will differ depending on the value of the second variable. But if the two variables are independent, the distribution of the first variable will be similar for all values of the second variable."
Cross tabulation:
- preprocess?
- figure out which 2 features you want to crosstab
- create a column for every category of feature 1
- determine how you will aggregate (sum, count, avg, max, min, product)
- create a record for every unique value of feature 1
  - create a column for the unique label's name
- aggregate the feature 1 columns
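The steps above map closely onto `pandas.crosstab`; a minimal sketch on hypothetical `gender` and `outcome` features:

```python
import pandas as pd

# hypothetical categorical data
df = pd.DataFrame({
    "gender":  ["male", "female", "female", "male", "female", "male"],
    "outcome": ["passed", "passed", "failed", "failed", "passed", "passed"],
})

# one row per category of the first feature, one column per category of the second,
# aggregated with a count (the default); use values= and aggfunc= for other aggregations
table = pd.crosstab(df["gender"], df["outcome"])
print(table)
```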
Comparison of means: t-test
The t-test is used in many ways in statistics. The more common uses are (1) comparing one mean with a known mean, (2) testing whether two means are distinct, (3) testing whether the means from matched pairs are equal. Also called Student's t test (equal variances) or Welch's t test (unequal variances). Applicable for means from one sample, two independent samples, and paired samples. See analysis of variance for comparing more than two means. The t-test is also used in other contexts.
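A minimal sketch of the three uses with `scipy.stats`, on made-up samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, size=30)   # hypothetical sample 1
b = rng.normal(5.5, 1.2, size=30)   # hypothetical sample 2

# (1) one mean vs. a known mean
print(stats.ttest_1samp(a, popmean=5.0))

# (2) two independent means (Student's: equal_var=True, Welch's: equal_var=False)
print(stats.ttest_ind(a, b, equal_var=False))

# (3) matched pairs (e.g., before/after on the same subjects)
print(stats.ttest_rel(a, b))
```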
The (Pearson) correlation coefficient is a measure of the strength of the linear relationship between two interval or numeric variables. Other correlation coefficients exist to measure the relationship between two ordinal variables, such as Spearman's rank correlation coefficient. The strongest values of the correlation coefficient are 1 and -1 (perfect relationship); a value of 0 indicates no relationship. The t-test is used to test whether a sample Pearson correlation differs from 0.
The (Pearson) chi-square coefficient is primarily used with one or two categorical variables. The coefficient is a measure of difference between observed and expected scores.
One categorical variable: In the case of one categorical variable, the test measures whether the observed values can reasonably come from a known distribution (or model). In other words, the observed values are compared to the expected values from this known distribution. In such cases the test is primarily used for model testing.
Two categorical variables: In the case of two categorical variables, the expected values usually are the values under the null hypothesis that there is no relationship between the two variables. The chi-square coefficient of two variables is therefore a measure of relationship.
The chi-square coefficient is tested by comparing it with the chi-square distribution given the degrees of freedom. Other coefficients to measure the relationship between two variables in two-way contingency tables exist as well (for a list, see for instance the SPSS output with Crosstabs).
Note that if possible exact p-values are preferred over the standard (asymptotic) ones.
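A minimal sketch of the two-variable case with `scipy.stats.chi2_contingency`, on a hypothetical 2x2 table of observed counts; the last line shows the Fisher exact test as the exact alternative:

```python
import numpy as np
from scipy import stats

# observed counts: rows = passed/failed, columns = male/female (hypothetical)
observed = np.array([[30, 45],
                     [20, 25]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)
print(expected)   # expected counts under the null hypothesis of no relationship

# Fisher's exact test for 2x2 tables (an exact p-value)
print(stats.fisher_exact(observed))
```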
In several mostly elementary situations when the assumptions of parametric tests cannot be met, one may resort to non-parametric tests rather than parametric tests such as the t-test, the Pearson correlation test, analysis of variance, etc. In such situations the power of the non-parametric or distribution-free tests is often as good as the parametric ones or better. It often is a good idea to use both types of tests if they are available and compare the resulting p-values. If these values are roughly the same there is little to worry about, if they are different there is something to be sorted out.
Unfortunately, appropriate non-parametric techniques are not available for all comparable parametric techniques (see, however, the method selection charts for comparable tests).
The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed.
The correlation coefficient is a function of the covariance. The correlation coefficient is equal to the covariance divided by the product of the standard deviations of the variables. Therefore, a positive covariance always results in a positive correlation and a negative covariance always results in a negative correlation.
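A quick numeric check of that identity with numpy, on two made-up variables:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)   # hypothetical linearly related variable

cov_xy = np.cov(x, y)[0, 1]                              # sample covariance
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)   # the two values should agree
```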
https://lindeloev.github.io/tests-as-linear/
Parametric vs NonParametric Tests
| Parametric tests (means) | Nonparametric tests (medians) |
|---|---|
| 1-sample t test | 1-sample Sign, 1-sample Wilcoxon |
| 2-sample t test | Mann-Whitney test |
| One-Way ANOVA | Kruskal-Wallis, Mood's median test |
| Factorial DOE with one factor and one blocking variable | Friedman test |
| BASIS FOR COMPARISON | T-TEST | F-TEST |
|---|---|---|
| Meaning | The t-test is a univariate hypothesis test applied when the standard deviation is not known and the sample size is small. | The F-test is a statistical test that determines the equality of the variances of two normal populations. |
| Test statistic | The t-statistic follows Student's t-distribution under the null hypothesis. | The F-statistic follows Snedecor's F-distribution under the null hypothesis. |
| Application | Comparing the means of two populations. | Comparing two population variances. |
| BASIS FOR COMPARISON | ANOVA | ANCOVA |
|---|---|---|
| Meaning | ANOVA is a process of examining the differences among the means of multiple groups of data for homogeneity. | ANCOVA is a technique that removes the impact of one or more metric-scaled undesirable variables from the dependent variable before undertaking research. |
| Uses | Both linear and non-linear models are used. | Only the linear model is used. |
| Includes | Categorical variables. | Categorical and interval variables. |
| Covariate | Ignored. | Considered. |
| BG variation | Attributes Between Group (BG) variation to treatment. | Divides Between Group (BG) variation into treatment and covariate. |
| WG variation | Attributes Within Group (WG) variation to individual differences. | Divides Within Group (WG) variation into individual differences and covariate. |
| BASIS FOR COMPARISON | T-TEST | ANOVA |
|---|---|---|
| Meaning | The t-test is a hypothesis test used to compare the means of two populations. | ANOVA is a statistical technique used to compare the means of more than two populations. |
| Test statistic | (x̄ − µ) / (s / √n) | Between-sample variance / within-sample variance |
ONE WAY ANOVA VS TWO WAY ANOVA
| BASIS FOR COMPARISON | ONE WAY ANOVA | TWO WAY ANOVA |
|---|---|---|
| Meaning | One-way ANOVA is a hypothesis test used to test the equality of three or more population means simultaneously using variance. | Two-way ANOVA is a statistical technique in which the interaction between two factors influencing a variable can be studied. |
| Independent variable | One | Two |
| Compares | Three or more levels of one factor. | The effect of multiple levels of two factors. |
| Number of observations | Need not be the same in each group. | Need to be equal in each group. |
| Design of experiments | Needs to satisfy only two principles. | All three principles need to be satisfied. |
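Building on the ANOVA comparisons above, a minimal one-way ANOVA sketch with scipy on three hypothetical groups, with the Kruskal-Wallis test as the nonparametric counterpart listed earlier:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(10.0, 2.0, size=20)   # three hypothetical groups
g2 = rng.normal(11.0, 2.0, size=20)
g3 = rng.normal(12.5, 2.0, size=20)

# one-way ANOVA: F = between-group variance / within-group variance
print(stats.f_oneway(g1, g2, g3))

# nonparametric alternative on the same groups
print(stats.kruskal(g1, g2, g3))
```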
Compare Distribution Tables
Compared to probability calculators (e.g., the one included in STATISTICA), the traditional format of distribution tables has the advantage of showing many values simultaneously and thus enables the user to quickly explore ranges of probabilities.
Standard Normal (Z) Table
The Standard Normal distribution is used in various hypothesis tests including tests on single means, the difference between two means, and tests on proportions.
Student's t Table
The shape of the Student's t distribution is determined by the degrees of freedom; its shape changes as the degrees of freedom increase. For more information on how this distribution is used in hypothesis testing, see the t-test for independent samples and the t-test for dependent samples in Basic Statistics and Tables.
Chi-Square Table
Like the Student's t-distribution, the Chi-square distribution's shape is determined by its degrees of freedom; its shape changes as the degrees of freedom increase (e.g., 1, 2, 5, 10, 25, 50). For examples of hypothesis tests that use the Chi-square distribution, see statistics in crosstabulation tables in Basic Statistics and Tables as well as Nonlinear Estimation. See also the Chi-square distribution.
F Distribution Tables
The F distribution is a right-skewed distribution used most commonly in Analysis of Variance (see ANOVA/MANOVA). The F distribution is a ratio of two Chi-square distributions, and a specific F distribution is denoted by the degrees of freedom for the numerator Chi-square and the degrees of freedom for the denominator Chi-square, e.g., F(10, 10).
Gamma, exponential, binomial, Bernoulli, Poisson
3 Pearson and Spearman correlation
The Spearman rank correlation is a Pearson correlation on rank-transformed x and y: rank(y) = β0 + β1·rank(x), H0: β1 = 0 (the reported value is 'rho').
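A quick numeric check of that equivalence with scipy, on made-up monotonic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = x ** 3 + rng.normal(scale=0.5, size=50)   # monotonic but nonlinear (hypothetical)

rho, p = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(rho, r_on_ranks)   # identical up to floating-point error
```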
Although some correlations are fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations but not know which are the strongest.
While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (r squared) is equal to the percent of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point: an r of .5 means 25% of the variation is related (.5 squared = .25); an r of .7 means 49% of the variance is related (.7 squared = .49).
A correlation report can also show a second result for each test: statistical significance. The significance level tells you how likely it is that the correlations reported are due to chance in the form of random sampling error. If you are working with small sample sizes, choose a report format that includes the significance level; this format also reports the sample size.
https://www.medcalc.org/manual/correlation.php The P-value is the probability that you would have found the current result if the correlation coefficient were in fact zero (null hypothesis). If this probability is lower than the conventional 5% (P < 0.05), the correlation coefficient is called statistically significant. The 95% confidence interval (CI) for the correlation coefficient is the range of values that contains the 'true' correlation coefficient with 95% confidence.
4 One mean
4.1 One-sample t-test and Wilcoxon signed-rank test. Model: a single number (intercept) predicts y: y = mx + b where x = 0. Null hypothesis: b = 0. The Wilcoxon signed-rank test is the same, just with the signed ranks of y instead of y itself: signed_rank(y) = β0.
4.2 Paired samples t-test and Wilcoxon matched pairs test. Model: a single number (intercept) predicts the pairwise differences: y2 − y1 = mx + b where x = 0. Null hypothesis: b = 0.
This means that there is just one value, y = y2 − y1, to predict, and it becomes a one-sample t-test on the pairwise differences. The visualization is therefore also the same as for the one-sample t-test. At the risk of overcomplicating a simple subtraction, you can think of these pairwise differences as slopes, which can also be represented as y-offsets. Similarly, the Wilcoxon matched pairs test only differs from the Wilcoxon signed-rank test in that it tests the signed ranks of the pairwise differences.
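A quick check of that equivalence with scipy, on hypothetical before/after scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y1 = rng.normal(50.0, 5.0, size=25)          # hypothetical "before" scores
y2 = y1 + rng.normal(1.0, 2.0, size=25)      # hypothetical "after" scores

# paired t-test vs. one-sample t-test on the pairwise differences
print(stats.ttest_rel(y2, y1))
print(stats.ttest_1samp(y2 - y1, popmean=0.0))   # same statistic and p-value

# the Wilcoxon matched-pairs test works on the same differences
print(stats.wilcoxon(y2, y1))
```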
5 Two means
5.1 Independent t-test and Mann-Whitney U. Independent t-test model: two means predict y.
y_i = β0 + β1·x_i, H0: β1 = 0, where x_i is an indicator (0 or 1) saying whether data point i was sampled from one or the other group. Indicator variables (also called "dummy coding") underlie a lot of linear models, and we'll take an aside to see how it works in a minute. The Mann-Whitney U test (also known as the Wilcoxon rank-sum test for two independent groups) is the same model to a very close approximation, just on the ranks of x and y instead of the actual values: rank(y_i) = β0 + β1·x_i.
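A sketch of the dummy-coding idea with scipy, assuming two made-up groups: regressing y on a 0/1 group indicator with `scipy.stats.linregress` reproduces the equal-variance independent t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(10.0, 2.0, size=30)   # hypothetical group 0
group_b = rng.normal(12.0, 2.0, size=30)   # hypothetical group 1

y = np.concatenate([group_a, group_b])
x = np.concatenate([np.zeros(30), np.ones(30)])   # dummy-coded group indicator

# slope = difference in group means; its p-value matches the Student's t-test
lm = stats.linregress(x, y)
print(lm.slope, lm.pvalue)
print(stats.ttest_ind(group_a, group_b, equal_var=True))
```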
5.2 Welch's t-test
This is identical to the (Student's) independent t-test above except that Student's assumes identical variances and Welch's t-test does not.
6 Three or more means
ANOVAs are linear models with (only) categorical predictors. They simply extend everything we did above, relying heavily on dummy coding.
Multivariate
Relationship Between Variables
- Correlation Coefficient
- Correlation Coefficient by Variable Combination
- Correlation Plot of Numerical Variables
Target based Analysis
Grouped Descriptive Statistics
Grouped Numerical Variables
Grouped Categorical Variables
Grouped Relationship Between Variables
Grouped Correlation Coefficient
Grouped Correlation Plot of Numerical Variables
Anova
ancova
Support
Confidence
Lift
Multivariate
- Correlation Coefficient
- Correlation Coefficient by Variable Combination
- Correlation Plot of Numerical Variables
- Target based Analysis
- Grouped Descriptive Statistics
- Grouped Numerical Variables
- Grouped Categorical Variables
- Grouped Relationship Between Variables
- Grouped Correlation Coefficient
- Grouped Correlation Plot of Numerical Variables
Major columns
Anova ancova
Support Confidence Lift
Bivariate distribution plots help us study the relationship between two variables by analyzing the scatter plot of their bivariate distribution.
If the denominators used to calculate the two percentages represent the same people, we use a one-sample t-test between percents to compare the two percents. If the denominators represent different people, we use the two-sample t-test between percents.
In a box plot, the median's distance from the quartiles represents skew; the whiskers represent spread (variance).
Data Assumptions: Parametric vs Nonparametric
http://www.statsoft.com/Textbook/Nonparametric-Statistics
Relations between variables - magnitude, reliability, P-value
Small relations can only be demonstrated reliably in large samples
Parametric
- Differences between independent groups -> T-Test for Independent Samples
- Differences between dependent groups -> T-Test for Dependent Samples, If there are more than two variables that were measured in the same sample, then we would customarily use repeated measures ANOVA.
- Relationships between variables -> correlation coefficient
Nonparametric
- Differences between independent groups -> Mann-Whitney U test, the Kolmogorov-Smirnov two-sample test, and the Wald-Wolfowitz runs test
- Differences between dependent groups -> Sign test and Wilcoxon's matched pairs test. If the variables of interest are dichotomous in nature (e.g., "pass" vs. "no pass"), then McNemar's Chi-square test is appropriate. If there are more than two variables that were measured in the same sample, then we would use Friedman's two-way analysis of variance and the Cochran Q test (if the variable was measured in terms of categories, e.g., "passed" vs. "failed"). Cochran Q is particularly useful for measuring changes in frequencies (proportions) across time.
- Relationships between variables -> Spearman R, Kendall Tau, and coefficient Gamma. If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. "female") appropriate nonparametric statistics for testing the relationship between the two variables are the Chi-square test, the Phi coefficient, and the Fisher exact test. In addition, a simultaneous test for relationships between multiple cases is available: Kendall coefficient of concordance. This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli
- http://www.statsoft.com/Textbook/Nonparametric-Statistics#correlations
Common Tests
Basic Statistics: Descriptive Statistics, Correlations, t-tests, frequency tables, cross tabulation
measures of location (mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.)
Parametric (Means) vs Nonparametric (Medians)
- 1-sample t test VS 1-sample Sign, 1-sample Wilcoxon
- 2-sample t test VS Mann-Whitney test
- One-Way ANOVA VS Kruskal-Wallis, Mood’s median test
T-Test
- Meaning: The t-test is a univariate hypothesis test applied when the standard deviation is not known and the sample size is small; it is used to compare the means of two populations.
- Test statistic: (x̄ − µ) / (s / √n), which follows Student's t-distribution under the null hypothesis.
- Application: Comparing the means of two populations.
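A sketch computing the one-sample t-statistic from the formula above and checking it against scipy (made-up sample; the hypothesized mean of 100 is an assumption for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(103.0, 15.0, size=25)   # hypothetical sample
mu = 100.0                              # hypothesized population mean

t_manual = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))
t_scipy, p = stats.ttest_1samp(x, popmean=mu)

print(t_manual, t_scipy, p)   # the two t values should agree
```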
F-Test
- Meaning: The F-test is a statistical test that determines the equality of the variances of two normal populations.
- Test statistic: The F-statistic follows Snedecor's F-distribution under the null hypothesis.
- Application: Comparing two population variances.
Anova
- Meaning - ANOVA is a process of examining the differences among the means of multiple groups of data for homogeneity; it is used to compare the means of more than two populations.
- Uses - Both linear and non-linear models are used.
- Includes - Categorical variables.
- Covariate - Ignored.
- BG variation - Attributes Between Group (BG) variation to treatment.
- WG variation - Attributes Within Group (WG) variation to individual differences.
- Test statistic - Between-sample variance / within-sample variance
Ancova
- Meaning - ANCOVA is a technique that removes the impact of one or more metric-scaled undesirable variables from the dependent variable before undertaking research.
- Uses - Only the linear model is used.
- Includes - Categorical and interval variables.
- Covariate - Considered.
- BG variation - Divides Between Group (BG) variation into treatment and covariate.
- WG variation - Divides Within Group (WG) variation into individual differences and covariate.
ONE WAY ANOVA
- Meaning - One-way ANOVA is a hypothesis test used to test the equality of three or more population means simultaneously using variance.
- Independent variable - One
- Compares - Three or more levels of one factor.
- Number of observations - Need not be the same in each group.
- Design of experiments - Needs to satisfy only two principles.
TWO WAY ANOVA
- Meaning - Two-way ANOVA is a statistical technique in which the interaction between two factors influencing a variable can be studied.
- Independent variable - Two
- Compares - The effect of multiple levels of two factors.
- Number of observations - Need to be equal in each group.
- Design of experiments - All three principles need to be satisfied.
ANOVA, ANCOVA, SARIMA, ARIMA, ARMA, spatial, temporal
ANOVA Requirements
- Normal Distribution
- Independent Samples/Groups
- The independent samples t-test requires the assumption of homogeneity of variance; run a test for homogeneity of variance, called Levene's test, whenever you run an independent samples t-test.
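A minimal sketch of that workflow with scipy (the two groups below are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
g1 = rng.normal(10.0, 2.0, size=40)   # hypothetical group 1
g2 = rng.normal(10.5, 4.0, size=40)   # hypothetical group 2, larger spread

# Levene's test for homogeneity of variance
stat, p = stats.levene(g1, g2)
print(stat, p)

# if Levene's p is small, variances look unequal -> prefer Welch's t-test
equal_var = p >= 0.05
print(stats.ttest_ind(g1, g2, equal_var=equal_var))
```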
Test Errors
Most researchers test a null hypothesis with alpha at .05.
Type 1 Error - erroneously rejecting the null hypothesis with a statistical analysis when the null hypothesis is in fact true in the population.
Single analysis: test the null hypothesis of equal mean IQs between adult males and adult females. This is done by testing the null hypothesis with an independent samples t-test; if the t-test p-value is less than .05, reject the null hypothesis of equal means.
The Bonferroni Correction is an adjustment applied to p-values that is 'supposed to be' applied when two or more statistical analyses have been performed on the same sample of data.
The family-wise error rate is the chance of erroneously rejecting the null hypothesis at least once amongst the family of analyses:
alpha (family-wise error rate) = 1 - (1 - .05)^(number of tests)
Approach 1: divide the per-analysis alpha by the number of statistical analyses performed (.05 / 3 = .017); any observed p-value less than the corrected alpha (.017) is declared statistically significant.
This can sometimes obliterate all statistically significant results.
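A small sketch of the family-wise error rate formula and the corrected threshold, using a hypothetical set of three p-values:

```python
n_tests = 3
alpha = 0.05

# chance of at least one false rejection across the family of tests
fwer = 1 - (1 - alpha) ** n_tests
print(fwer)               # ~0.143 for three tests at alpha = .05

# Approach 1: Bonferroni-corrected per-test alpha
alpha_corrected = alpha / n_tests
p_values = [0.004, 0.030, 0.020]   # hypothetical observed p-values
print([p < alpha_corrected for p in p_values])
```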
Common Error Measures:
- (root) Mean Squared Error - continuous data, sensitivity to outliers
- median absolute deviation - continuous data, often more robust
- sensitivity (Recall) - if you want few missed positives
- specificity - if you want few negatives called positives
- accuracy - weights false positives / negatives equally
- Concordance - one example is kappa
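A minimal sketch computing several of these measures with numpy, on made-up predictions:

```python
import numpy as np

# hypothetical continuous predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.8, 5.4, 2.0, 8.1, 3.9])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))      # sensitive to outliers
mad = np.median(np.abs(y_true - y_pred))             # more robust
print(rmse, mad)

# hypothetical binary classifications (1 = positive)
t = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([1, 0, 0, 1, 0, 1, 1, 0])

sensitivity = np.sum((p == 1) & (t == 1)) / np.sum(t == 1)   # recall
specificity = np.sum((p == 0) & (t == 0)) / np.sum(t == 0)
accuracy = np.mean(p == t)
print(sensitivity, specificity, accuracy)
```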
Challenges With Data
- Some of this was covered in the prior sections and will not be repeated here.
- Options tree with Pessimistic, Nominal & Optimistic ...
- Performance vs Risk vs Design analysis
Key issues: accuracy, overfitting, interpretability, computational speed.
Pay attention to: confounding variables, complicated interactions, skewness, outliers, nonlinear patterns, variance changes, units/scale issues, overloading, regression, correlation vs. causation
measures of effectiveness of the classifier:
- Predictive accuracy: How well does it predict the categories for new observations?
- Speed: What is the computational cost of using the classifier?
- Robustness: How well do the models created perform if data quality is low?
- Scalability: Does the classifier function efficiently with large amounts of data?
- Interpretability: Are the results understandable to users?
Confounder: a variable that is correlated with both the outcome and covariates
- confounders can change the regression line.
- detected with exploration
Descriptive statistics: location (mean, median, mode); spread (standard deviation, variance, range, IQR, absolute deviation, mean absolute difference, distance standard deviation, coefficient of variation, Gini coefficient); shape (skewness, kurtosis, distance skewness); dependence (Pearson, Spearman).
workhorse estimation technique of frequentist statistics: maximum likelihood estimation.
Changing variance - what can you do
- Box-Cox transform
- variance-stabilizing transform
- weighted least squares
- Huber-White standard errors
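A minimal sketch of the first option, the Box-Cox transform, with scipy (the skewed sample below is made up; Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # skewed, positive-valued (hypothetical)

# lambda is estimated by maximum likelihood when not supplied
x_transformed, lam = stats.boxcox(x)
print(lam)
print(x.std(), x_transformed.std())   # spread before and after the transform
```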
MISC
- P-values require knowing how many records are in the database.
- Outliers profoundly influence the slope of the regression line and the correlation coefficient.
- The correlation coefficient alone is not enough for decision making (scatterplots are always recommended).
- Preferred data format: yyyy-mm-dd'T'hh:mm:ss.mmm
Bias:
- Quantitative Approach to Outliers.
- Correlations in Non-homogeneous Groups
- Nonlinear relations between variables (Pearson r measures only linear relationships)
- Exploratory Examination of Correlation Matrices
- Casewise vs. Pairwise Deletion of Missing Data
- Overfitting: Pruning/ Cross Validation
- Breakdown Analysis
- Frequency Tables
- Cross Tabulation
- Marginal Frequencies
- Association Rules