Correlation Research
- Relations between Variables
- Magnitude
- Reliability
- P-value
- Small relations can only be demonstrated reliably in large samples
Test for Normality
- ANOVA/MANOVA
- Nonparametrics
- Test statistics - normal-theory (t, F, Chi-square)
Differences between independent groups. Normal distributions -> t-test for independent samples
Nonparametric alternatives to the t-test for independent samples:
- the Mann-Whitney U test
- the Kolmogorov-Smirnov two-sample test and Wald-Wolfowitz
- Analysis of variance (ANOVA)
- Analysis of covariance
- Multivariate ANOVA
Differences between dependent groups.
Normal Distributions
- T-Test for Dependent Samples
Nonparametric alternatives to this test are
the Sign test and
Wilcoxon's matched pairs test.
If the variables of interest are dichotomous in nature (i.e., "pass" vs. "no pass") then McNemar's Chi-square test is appropriate.
If there are more than two variables that were measured in the same sample
- then we would customarily use repeated measures ANOVA.
Nonparametric alternatives to this method are
- Friedman's two-way analysis of variance
- and Cochran Q test (if the variable was measured in terms of categories, e.g., "passed" vs. "failed").
- Cochran Q is particularly useful for measuring changes in frequencies (proportions) across time
Relationships between variables.
To express a relationship between two variables one usually computes the correlation coefficient.
Nonparametric equivalents to the standard correlation coefficient are Spearman R, Kendall Tau, and coefficient Gamma.
If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. "female") appropriate nonparametric statistics for testing the relationship between the two variables are
- the Chi-square test,
- the Phi coefficient,
- and the Fisher exact test.
In addition, a simultaneous test for relationships between multiple cases is available:
- Kendall coefficient of concordance.
This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli.
Nonparametric Correlations
The following are three commonly used nonparametric correlation coefficients: Spearman R, Kendall Tau, and Gamma. Violating normality assumptions has less grave consequences than previously thought, although this can only be shown on a case-by-case basis.
- Normal vs Nonparametric Methods
- Distribution Tables (Z, t, Chi-square, F tables)
- Charting differences between F/t tests, ANOVA/ANCOVA, t-test/ANOVA, 1- vs 2-way ANOVA
- Pearson product-moment
- Partial correlation -Confounding variable
- Coefficient of determination
* stats.zscore(df)
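A minimal sketch of the `stats.zscore(df)` call above, assuming `df` is a hypothetical numeric DataFrame:

```python
import pandas as pd
from scipy import stats

# hypothetical numeric DataFrame
df = pd.DataFrame({"height": [160, 172, 181, 168, 175],
                   "weight": [55, 70, 82, 63, 74]})

# z-score each column: (value - column mean) / column standard deviation
z = stats.zscore(df)
print(pd.DataFrame(z, columns=df.columns))
```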
Two major types of statistics: inferential and descriptive. Inferential statistics use more complex calculations to infer trends and make predictions. A descriptive analysis should be performed before moving on to inference.
First and foremost, show the descriptives: confidence intervals on scatter plots, Pearson coefficients, linear regression.
Group by group: total X's in Y
The Chi-Square test helps you determine if two discrete variables are associated
Too much overlapping in the x-axis labels renders the whole plot useless.
https://blog.socialcops.com/academy/resources/cross-tabulation-how-why/
"Also known as contingency tables or cross tabs, cross tabulation groups variables to understand the correlation between different variables. It also shows how correlations change from one variable grouping to another. It is usually used in statistical analysis to find patterns, trends, and probabilities within raw data.""Cross tabulation is usually performed on categorical data — data that can be divided into mutually exclusive groups."
"The Chi-Square test helps you determine if two discrete variables are associated. If there's an association, the distribution of one variable will differ depending on the value of the second variable. But if the two variables are independent, the distribution of the first variable will be similar for all values of the second variable."
Cross tabulation:
- preprocess?
- figure out which 2 features you want to crosstab
- create a column for every category of feature 1
- determine how you will aggregate (sum, count, avg, max, min, product)
- create a record for every unique value of feature 1
  - create a column for the unique label's name
- aggregate the feature 1 columns
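The steps above map closely onto `pandas.crosstab`; a minimal sketch on hypothetical `gender` and `outcome` features:

```python
import pandas as pd

# hypothetical categorical data
df = pd.DataFrame({
    "gender":  ["male", "female", "female", "male", "female", "male"],
    "outcome": ["passed", "passed", "failed", "failed", "passed", "passed"],
})

# one row per category of the first feature, one column per category of the second,
# aggregated with a count (the default); use values= and aggfunc= for other aggregations
table = pd.crosstab(df["gender"], df["outcome"])
print(table)
```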
Comparison of means: t-test
The t-test is used in many ways in statistics. The more common uses are (1) comparing one mean with a known mean, (2) testing whether two means are distinct, (3) testing whether the means from matched pairs are equal. Also called Student's t test (equal variances) or Welch's t test (unequal variances). Applicable for means from one sample, two independent samples, and paired samples. See analysis of variance for comparing more than two means. The t-test is also used in other contexts.
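A minimal sketch of the three uses with `scipy.stats`, on made-up samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, size=30)   # hypothetical sample 1
b = rng.normal(5.5, 1.2, size=30)   # hypothetical sample 2

# (1) one mean vs. a known mean
print(stats.ttest_1samp(a, popmean=5.0))

# (2) two independent means (Student's: equal_var=True, Welch's: equal_var=False)
print(stats.ttest_ind(a, b, equal_var=False))

# (3) matched pairs (e.g., before/after on the same subjects)
print(stats.ttest_rel(a, b))
```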
The (Pearson) correlation coefficient is a measure of the strength of the linear relationship between two interval or numeric variables. Other correlation coefficients exist to measure the relationship between two ordinal variables, such as Spearman's rank correlation coefficient. The strongest values of the correlation coefficient are 1 and -1 (perfect relationship); a value of 0 indicates no relationship. The t-test is used to test whether a sample Pearson correlation differs from 0.
The (Pearson) chi-square coefficient is primarily used with one or two categorical variables. The coefficient is a measure of difference between observed and expected scores.
One categorical variable: In the case of one categorical variable, the test measures whether the observed values can reasonably come from a known distribution (or model). In other words, the observed values are compared to the expected values from this known distribution. In such cases the test is primarily used for model testing.
Two categorical variables: In the case of two categorical variables, the expected values usually are the values under the null hypothesis that there is no relationship between the two variables. The chi-square coefficient of two variables is therefore a measure of relationship.
The chi-square coefficient is tested by comparing it with the chi-square distribution given the degrees of freedom. Other coefficients to measure the relationship between two variables in two-way contingency tables exist as well (for a list, see for instance the SPSS output with Crosstabs).
Note that if possible exact p-values are preferred over the standard (asymptotic) ones.
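A minimal sketch of the two-variable case with `scipy.stats.chi2_contingency`, on a hypothetical 2x2 table of observed counts; the last line shows the Fisher exact test as the exact alternative:

```python
import numpy as np
from scipy import stats

# observed counts: rows = passed/failed, columns = male/female (hypothetical)
observed = np.array([[30, 45],
                     [20, 25]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)
print(expected)   # expected counts under the null hypothesis of no relationship

# Fisher's exact test for 2x2 tables (an exact p-value)
print(stats.fisher_exact(observed))
```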
In several mostly elementary situations when the assumptions of parametric tests cannot be met, one may resort to non-parametric tests rather than parametric tests such as the t-test, the Pearson correlation test, analysis of variance, etc. In such situations the power of the non-parametric or distribution-free tests is often as good as the parametric ones or better. It often is a good idea to use both types of tests if they are available and compare the resulting p-values. If these values are roughly the same there is little to worry about, if they are different there is something to be sorted out.
Unfortunately, appropriate non-parametric techniques are not available for all comparable parametric techniques (see, however, the method selection charts for comparable tests).
The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed.
The correlation coefficient is a function of the covariance. The correlation coefficient is equal to the covariance divided by the product of the standard deviations of the variables. Therefore, a positive covariance always results in a positive correlation and a negative covariance always results in a negative correlation.
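A quick numeric check of that identity with numpy, on two made-up variables:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)   # hypothetical linearly related variable

cov_xy = np.cov(x, y)[0, 1]                              # sample covariance
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)   # the two values should agree
```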
https://lindeloev.github.io/tests-as-linear/
Parametric vs NonParametric Tests
| Parametric tests (means) | Nonparametric tests (medians) |
|---|---|
| 1-sample t test | 1-sample Sign, 1-sample Wilcoxon |
| 2-sample t test | Mann-Whitney test |
| One-Way ANOVA | Kruskal-Wallis, Mood's median test |
| Factorial DOE with one factor and one blocking variable | Friedman test |
| BASIS FOR COMPARISON | T-TEST | F-TEST |
|---|---|---|
| Meaning | The t-test is a univariate hypothesis test applied when the standard deviation is not known and the sample size is small. | The F-test is a statistical test that determines the equality of the variances of two normal populations. |
| Test statistic | The t-statistic follows Student's t-distribution under the null hypothesis. | The F-statistic follows Snedecor's F-distribution under the null hypothesis. |
| Application | Comparing the means of two populations. | Comparing two population variances. |
| BASIS FOR COMPARISON | ANOVA | ANCOVA |
|---|---|---|
| Meaning | ANOVA is a process of examining the differences among the means of multiple groups of data for homogeneity. | ANCOVA is a technique that removes the impact of one or more metric-scaled undesirable variables from the dependent variable before undertaking research. |
| Uses | Both linear and non-linear models are used. | Only the linear model is used. |
| Includes | Categorical variables. | Categorical and interval variables. |
| Covariate | Ignored. | Considered. |
| BG variation | Attributes Between Group (BG) variation to treatment. | Divides Between Group (BG) variation into treatment and covariate. |
| WG variation | Attributes Within Group (WG) variation to individual differences. | Divides Within Group (WG) variation into individual differences and covariate. |
| BASIS FOR COMPARISON | T-TEST | ANOVA |
|---|---|---|
| Meaning | The t-test is a hypothesis test used to compare the means of two populations. | ANOVA is a statistical technique used to compare the means of more than two populations. |
| Test statistic | (x̄ − µ) / (s / √n) | Between-sample variance / within-sample variance |
ONE WAY ANOVA VS TWO WAY ANOVA
| BASIS FOR COMPARISON | ONE WAY ANOVA | TWO WAY ANOVA |
|---|---|---|
| Meaning | One-way ANOVA is a hypothesis test used to test the equality of three or more population means simultaneously using variance. | Two-way ANOVA is a statistical technique in which the interaction between two factors influencing a variable can be studied. |
| Independent variable | One | Two |
| Compares | Three or more levels of one factor. | The effect of multiple levels of two factors. |
| Number of observations | Need not be the same in each group. | Need to be equal in each group. |
| Design of experiments | Needs to satisfy only two principles. | All three principles need to be satisfied. |
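Building on the ANOVA comparisons above, a minimal one-way ANOVA sketch with scipy on three hypothetical groups, with the Kruskal-Wallis test as the nonparametric counterpart listed earlier:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(10.0, 2.0, size=20)   # three hypothetical groups
g2 = rng.normal(11.0, 2.0, size=20)
g3 = rng.normal(12.5, 2.0, size=20)

# one-way ANOVA: F = between-group variance / within-group variance
print(stats.f_oneway(g1, g2, g3))

# nonparametric alternative on the same groups
print(stats.kruskal(g1, g2, g3))
```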
Compare Distribution Tables
Compared to probability calculators (e.g., the one included in STATISTICA), the traditional format of distribution tables has the advantage of showing many values simultaneously and thus enables the user to quickly explore ranges of probabilities.
Standard Normal (Z) Table
The Standard Normal distribution is used in various hypothesis tests including tests on single means, the difference between two means, and tests on proportions.
Student's t Table
The shape of the Student's t distribution is determined by the degrees of freedom; its shape changes as the degrees of freedom increase. For more information on how this distribution is used in hypothesis testing, see the t-test for independent samples and the t-test for dependent samples in Basic Statistics and Tables.
Chi-Square Table
Like the Student's t-distribution, the Chi-square distribution's shape is determined by its degrees of freedom; its shape changes as the degrees of freedom increase (e.g., 1, 2, 5, 10, 25, 50). For examples of hypothesis tests that use the Chi-square distribution, see statistics in crosstabulation tables in Basic Statistics and Tables as well as Nonlinear Estimation. See also the Chi-square distribution.
F Distribution Tables
The F distribution is a right-skewed distribution used most commonly in Analysis of Variance (see ANOVA/MANOVA). The F distribution is a ratio of two Chi-square distributions, and a specific F distribution is denoted by the degrees of freedom for the numerator Chi-square and the degrees of freedom for the denominator Chi-square, e.g., F(10, 10).
Gamma, exponential, binomial, Bernoulli, Poisson
3 Pearson and Spearman correlation
The Spearman rank correlation is a Pearson correlation on rank-transformed x and y: rank(y) = β0 + β1·rank(x), H0: β1 = 0 (the reported value is 'rho').
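A quick numeric check of that equivalence with scipy, on made-up monotonic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = x ** 3 + rng.normal(scale=0.5, size=50)   # monotonic but nonlinear (hypothetical)

rho, p = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(rho, r_on_ranks)   # identical up to floating-point error
```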
Although some correlations are fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations but not know which are the strongest.
While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (r squared) is equal to the percent of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point: an r of .5 means 25% of the variation is related (.5 squared = .25); an r of .7 means 49% of the variance is related (.7 squared = .49).
A correlation report can also show a second result for each test: statistical significance. The significance level tells you how likely it is that the correlations reported are due to chance in the form of random sampling error. If you are working with small sample sizes, choose a report format that includes the significance level; this format also reports the sample size.
https://www.medcalc.org/manual/correlation.php The P-value is the probability that you would have found the current result if the correlation coefficient were in fact zero (null hypothesis). If this probability is lower than the conventional 5% (P < 0.05), the correlation coefficient is called statistically significant. The 95% confidence interval (CI) for the correlation coefficient is the range of values that contains the 'true' correlation coefficient with 95% confidence.
4 One mean
4.1 One-sample t-test and Wilcoxon signed-rank test. Model: a single number (intercept) predicts y: y = mx + b where x = 0. Null hypothesis: b = 0. The Wilcoxon signed-rank test is the same, just with the signed ranks of y instead of y itself: signed_rank(y) = β0.
4.2 Paired samples t-test and Wilcoxon matched pairs test. Model: a single number (intercept) predicts the pairwise differences: y2 − y1 = mx + b where x = 0. Null hypothesis: b = 0.
This means that there is just one value, y = y2 − y1, to predict, and it becomes a one-sample t-test on the pairwise differences. The visualization is therefore also the same as for the one-sample t-test. At the risk of overcomplicating a simple subtraction, you can think of these pairwise differences as slopes, which can also be represented as y-offsets. Similarly, the Wilcoxon matched pairs test only differs from the Wilcoxon signed-rank test in that it tests the signed ranks of the pairwise differences.
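A quick check of that equivalence with scipy, on hypothetical before/after scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y1 = rng.normal(50.0, 5.0, size=25)          # hypothetical "before" scores
y2 = y1 + rng.normal(1.0, 2.0, size=25)      # hypothetical "after" scores

# paired t-test vs. one-sample t-test on the pairwise differences
print(stats.ttest_rel(y2, y1))
print(stats.ttest_1samp(y2 - y1, popmean=0.0))   # same statistic and p-value

# the Wilcoxon matched-pairs test works on the same differences
print(stats.wilcoxon(y2, y1))
```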
5 Two means
5.1 Independent t-test and Mann-Whitney U. Independent t-test model: two means predict y.
y_i = β0 + β1·x_i, H0: β1 = 0, where x_i is an indicator (0 or 1) saying whether data point i was sampled from one or the other group. Indicator variables (also called "dummy coding") underlie a lot of linear models, and we'll take an aside to see how it works in a minute. The Mann-Whitney U test (also known as the Wilcoxon rank-sum test for two independent groups) is the same model to a very close approximation, just on the ranks of x and y instead of the actual values: rank(y_i) = β0 + β1·x_i.
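A sketch of the dummy-coding idea with scipy, assuming two made-up groups: regressing y on a 0/1 group indicator with `scipy.stats.linregress` reproduces the equal-variance independent t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(10.0, 2.0, size=30)   # hypothetical group 0
group_b = rng.normal(12.0, 2.0, size=30)   # hypothetical group 1

y = np.concatenate([group_a, group_b])
x = np.concatenate([np.zeros(30), np.ones(30)])   # dummy-coded group indicator

# slope = difference in group means; its p-value matches the Student's t-test
lm = stats.linregress(x, y)
print(lm.slope, lm.pvalue)
print(stats.ttest_ind(group_a, group_b, equal_var=True))
```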
5.2 Welch's t-test
This is identical to the (Student's) independent t-test above except that Student's assumes identical variances and Welch's t-test does not.
6 Three or more means
ANOVAs are linear models with (only) categorical predictors. They simply extend everything we did above, relying heavily on dummy coding.
Multivariate
Relationship Between Variables
- Correlation Coefficient
- Correlation Coefficient by Variable Combination
- Correlation Plot of Numerical Variables
Target based Analysis
Grouped Descriptive Statistics
Grouped Numerical Variables
Grouped Categorical Variables
Grouped Relationship Between Variables
Grouped Correlation Coefficient
Grouped Correlation Plot of Numerical Variables
Anova
ancova
Support
Confidence
Lift
Multivariate
- Correlation Coefficient
- Correlation Coefficient by Variable Combination
- Correlation Plot of Numerical Variables
- Target based Analysis
- Grouped Descriptive Statistics
- Grouped Numerical Variables
- Grouped Categorical Variables
- Grouped Relationship Between Variables
- Grouped Correlation Coefficient
- Grouped Correlation Plot of Numerical Variables
Major columns
Anova ancova
Support Confidence Lift
Bivariate distribution plots help us study the relationship between two variables by analyzing the scatter plot of their bivariate distribution.
If the denominators used to calculate the two percentages represent the same people, we use a one-sample t-test between percents to compare the two percents. If the denominators represent different people, we use the two-sample t-test between percents.
In a box plot, the median's distance from the quartiles represents skew; the whiskers represent spread (variance).
Data Assumptions: Parametric vs Nonparametric
http://www.statsoft.com/Textbook/Nonparametric-Statistics
Relations between variables - magnitude, reliability, P-value
Small relations can only be demonstrated reliably in large samples
Parametric
- Differences between independent groups -> T-Test for Independent Samples
- Differences between dependent groups -> T-Test for Dependent Samples, If there are more than two variables that were measured in the same sample, then we would customarily use repeated measures ANOVA.
- Relationships between variables -> correlation coefficient
Nonparametric
- Differences between independent groups -> Mann-Whitney U test, the Kolmogorov-Smirnov two-sample test, and the Wald-Wolfowitz runs test
- Differences between dependent groups -> Sign test and Wilcoxon's matched pairs test. If the variables of interest are dichotomous in nature (e.g., "pass" vs. "no pass"), then McNemar's Chi-square test is appropriate. If there are more than two variables that were measured in the same sample, then we would use Friedman's two-way analysis of variance and the Cochran Q test (if the variable was measured in terms of categories, e.g., "passed" vs. "failed"). Cochran Q is particularly useful for measuring changes in frequencies (proportions) across time.
- Relationships between variables -> Spearman R, Kendall Tau, and coefficient Gamma. If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. "female") appropriate nonparametric statistics for testing the relationship between the two variables are the Chi-square test, the Phi coefficient, and the Fisher exact test. In addition, a simultaneous test for relationships between multiple cases is available: Kendall coefficient of concordance. This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli
- http://www.statsoft.com/Textbook/Nonparametric-Statistics#correlations
Common Tests
Basic Statistics: Descriptive Statistics, Correlations, t-tests, frequency tables, cross tabulation
measures of location (mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.)
Parametric (Means) vs Nonparametric (Medians)
- 1-sample t test VS 1-sample Sign, 1-sample Wilcoxon
- 2-sample t test VS Mann-Whitney test
- One-Way ANOVA VS Kruskal-Wallis, Mood’s median test
T-Test
- Meaning: The t-test is a univariate hypothesis test applied when the standard deviation is not known and the sample size is small; it is used to compare the means of two populations.
- Test statistic: (x̄ − µ) / (s / √n), which follows Student's t-distribution under the null hypothesis.
- Application: Comparing the means of two populations.
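A sketch computing the one-sample t-statistic from the formula above and checking it against scipy (made-up sample; the hypothesized mean of 100 is an assumption for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(103.0, 15.0, size=25)   # hypothetical sample
mu = 100.0                              # hypothesized population mean

t_manual = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))
t_scipy, p = stats.ttest_1samp(x, popmean=mu)

print(t_manual, t_scipy, p)   # the two t values should agree
```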
F-Test
- Meaning: The F-test is a statistical test that determines the equality of the variances of two normal populations.
- Test statistic: The F-statistic follows Snedecor's F-distribution under the null hypothesis.
- Application: Comparing two population variances.
Anova
- Meaning - ANOVA is a process of examining the differences among the means of multiple groups of data for homogeneity; it is used to compare the means of more than two populations.
- Uses - Both linear and non-linear models are used.
- Includes - Categorical variables.
- Covariate - Ignored.
- BG variation - Attributes Between Group (BG) variation to treatment.
- WG variation - Attributes Within Group (WG) variation to individual differences.
- Test statistic - Between-sample variance / within-sample variance
Ancova
- Meaning - ANCOVA is a technique that removes the impact of one or more metric-scaled undesirable variables from the dependent variable before undertaking research.
- Uses - Only the linear model is used.
- Includes - Categorical and interval variables.
- Covariate - Considered.
- BG variation - Divides Between Group (BG) variation into treatment and covariate.
- WG variation - Divides Within Group (WG) variation into individual differences and covariate.
ONE WAY ANOVA
- Meaning - One-way ANOVA is a hypothesis test used to test the equality of three or more population means simultaneously using variance.
- Independent variable - One
- Compares - Three or more levels of one factor.
- Number of observations - Need not be the same in each group.
- Design of experiments - Needs to satisfy only two principles.
TWO WAY ANOVA
- Meaning - Two-way ANOVA is a statistical technique in which the interaction between two factors influencing a variable can be studied.
- Independent variable - Two
- Compares - The effect of multiple levels of two factors.
- Number of observations - Need to be equal in each group.
- Design of experiments - All three principles need to be satisfied.
ANOVA, ANCOVA, SARIMA, ARIMA, ARMA, spatial, temporal
ANOVA Requirements
- Normal Distribution
- Independent Samples/Groups
- The independent samples t-test requires the assumption of homogeneity of variance; run a test for homogeneity of variance, called Levene's test, whenever you run an independent samples t-test.
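A minimal sketch of that workflow with scipy (the two groups below are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
g1 = rng.normal(10.0, 2.0, size=40)   # hypothetical group 1
g2 = rng.normal(10.5, 4.0, size=40)   # hypothetical group 2, larger spread

# Levene's test for homogeneity of variance
stat, p = stats.levene(g1, g2)
print(stat, p)

# if Levene's p is small, variances look unequal -> prefer Welch's t-test
equal_var = p >= 0.05
print(stats.ttest_ind(g1, g2, equal_var=equal_var))
```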
Test Errors
Most researchers test a null hypothesis with alpha at .05.
Type 1 Error - erroneously rejecting the null hypothesis with a statistical analysis when the null hypothesis is in fact true in the population.
Single analysis: test the null hypothesis of equal mean IQs between adult males and adult females. This is done by testing the null hypothesis with an independent samples t-test; if the t-test p-value is less than .05, reject the null hypothesis of equal means.
The Bonferroni Correction is an adjustment applied to p-values that is 'supposed to be' applied when two or more statistical analyses have been performed on the same sample of data.
The family-wise error rate is the chance of erroneously rejecting the null hypothesis at least once amongst the family of analyses:
alpha (family-wise error rate) = 1 - (1 - .05)^(number of tests)
Approach 1: divide the per-analysis alpha by the number of statistical analyses performed (.05 / 3 = .017); any observed p-value less than the corrected alpha (.017) is declared statistically significant.
This can sometimes obliterate all statistically significant results.
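A small sketch of the family-wise error rate formula and the corrected threshold, using a hypothetical set of three p-values:

```python
n_tests = 3
alpha = 0.05

# chance of at least one false rejection across the family of tests
fwer = 1 - (1 - alpha) ** n_tests
print(fwer)               # ~0.143 for three tests at alpha = .05

# Approach 1: Bonferroni-corrected per-test alpha
alpha_corrected = alpha / n_tests
p_values = [0.004, 0.030, 0.020]   # hypothetical observed p-values
print([p < alpha_corrected for p in p_values])
```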
Common Error Measures:
- (root) Mean Squared Error - continuous data, sensitivity to outliers
- median absolute deviation - continuous data, often more robust
- sensitivity (Recall) - if you want few missed positives
- specificity - if you want few negatives called positives
- accuracy - weights false positives / negatives equally
- Concordance - one example is kappa
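A minimal sketch computing several of these measures with numpy, on made-up predictions:

```python
import numpy as np

# hypothetical continuous predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.8, 5.4, 2.0, 8.1, 3.9])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))      # sensitive to outliers
mad = np.median(np.abs(y_true - y_pred))             # more robust
print(rmse, mad)

# hypothetical binary classifications (1 = positive)
t = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([1, 0, 0, 1, 0, 1, 1, 0])

sensitivity = np.sum((p == 1) & (t == 1)) / np.sum(t == 1)   # recall
specificity = np.sum((p == 0) & (t == 0)) / np.sum(t == 0)
accuracy = np.mean(p == t)
print(sensitivity, specificity, accuracy)
```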
Challenges With Data
- Some of this was covered in the prior sections and will not be repeated here.
- Options tree with Pessimistic, Nominal & Optimistic ...
- Performance vs Risk vs Design analysis
Key issues: accuracy, overfitting, interpretability, computational speed.
Pay attention to: confounding variables, complicated interactions, skewness, outliers, nonlinear patterns, variance changes, units/scale issues, overloading, regression, correlation vs. causation
measures of effectiveness of the classifier:
- Predictive accuracy: How well does it predict the categories for new observations?
- Speed: What is the computational cost of using the classifier?
- Robustness: How well do the models created perform if data quality is low?
- Scalability: Does the classifier function efficiently with large amounts of data?
- Interpretability: Are the results understandable to users?
Confounder: a variable that is correlated with both the outcome and covariates
- confounders can change the regression line.
- detected with exploration
Descriptive statistics: location (mean, median, mode); spread (standard deviation, variance, range, IQR, absolute deviation, mean absolute difference, distance standard deviation, coefficient of variation, Gini coefficient); shape (skewness, kurtosis, distance skewness); dependence (Pearson, Spearman).
workhorse estimation technique of frequentist statistics: maximum likelihood estimation.
Changing variance - what can you do
- Box-Cox transform
- variance-stabilizing transform
- weighted least squares
- Huber-White standard errors
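A minimal sketch of the first option, the Box-Cox transform, with scipy (the skewed sample below is made up; Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # skewed, positive-valued (hypothetical)

# lambda is estimated by maximum likelihood when not supplied
x_transformed, lam = stats.boxcox(x)
print(lam)
print(x.std(), x_transformed.std())   # spread before and after the transform
```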
MISC
- P-values require knowing how many records are in the database.
- Outliers profoundly influence the slope of the regression line and the correlation coefficient.
- The correlation coefficient alone is not enough for decision making (scatterplots are always recommended).
- Preferred data format: yyyy-mm-dd'T'hh:mm:ss.mmm
Bias:
- Quantitative Approach to Outliers.
- Correlations in Non-homogeneous Groups
- Nonlinear relations between variables (Pearson r measures only linear relationships)
- Exploratory Examination of Correlation Matrices
- Casewise vs. Pairwise Deletion of Missing Data
- Overfitting: Pruning/ Cross Validation
- Breakdown Analysis
- Frequency Tables
- Cross Tabulation
- Marginal Frequencies
- Association Rules