Data Assumptions: Parametric vs Nonparametric | Nonparametric | Common Tests | Test Errors

Nonparametric-Statistics

Correlation Research

Test for Normality

Differences between independent groups: with normal distributions, use the t-test for independent samples.

The nonparametric alternative to the t-test for independent samples is the Mann-Whitney U test.

Partition of variance

Differences between dependent groups: with normal distributions, use the t-test for dependent (paired) samples.

Nonparametric alternatives to this test are the sign test and the Wilcoxon matched pairs test.

If there are more than two variables that were measured in the same sample, the usual parametric method is repeated measures ANOVA.

Nonparametric alternatives to this method are Friedman's two-way analysis of variance and Cochran's Q test.

Relationships between variables.

To express a relationship between two variables one usually computes the correlation coefficient.

Nonparametric equivalents to the standard (Pearson) correlation coefficient are the Spearman R, Kendall Tau, and Gamma coefficients.

If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. "female"), appropriate nonparametric statistics for testing the relationship between the two variables are the Chi-square test, the Phi coefficient, and the Fisher exact test.

In addition, a simultaneous test for relationships between multiple cases is available: the Kendall coefficient of concordance.

This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli.


Nonparametric Correlations

The following are the three most commonly used nonparametric correlation coefficients: Spearman R, Kendall Tau, and the Gamma coefficient. Violating the normality assumption has less grave consequences than previously thought, although this can only be demonstrated on a case-by-case basis.

Correlation

* stats.zscore(df)
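A minimal sketch of that call, assuming a hypothetical numeric DataFrame `df` with made-up columns; z-scoring rescales each column but leaves correlations unchanged.

```python
# Hedged sketch: standardize columns to z-scores before inspecting correlations.
# `df` and its columns are hypothetical example data.
import pandas as pd
from scipy import stats

df = pd.DataFrame({"height": [160, 172, 181, 168, 175],
                   "weight": [55, 70, 82, 64, 74]})

z = df.apply(stats.zscore)   # column-wise (x - mean) / std
print(z.round(2))
print(z.corr().round(2))     # Pearson correlations are unchanged by z-scoring
```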

Two major types of statistics: inferential and descriptive. Inferential statistics use more complex calculations to infer trends and make predictions; a descriptive analysis should be performed first, before moving on to inference.

First and foremost, show the descriptives: confidence intervals on scatter plots, Pearson coefficients, linear regressions.

Group by group: total X's in Y.

The Chi-Square test helps you determine if two discrete variables are associated

Too much overlapping in the x-axis labels renders the whole plot useless.

https://blog.socialcops.com/academy/resources/cross-tabulation-how-why/

"Also known as contingency tables or cross tabs, cross tabulation groups variables to understand the correlation between different variables. It also shows how correlations change from one variable grouping to another. It is usually used in statistical analysis to find patterns, trends, and probabilities within raw data.""Cross tabulation is usually performed on categorical data — data that can be divided into mutually exclusive groups."

http://blog.minitab.com/blog/understanding-statistics/using-cross-tabulation-and-chi-square-the-survey-says

"The Chi-Square test helps you determine if two discrete variables are associated. If there's an association, the distribution of one variable will differ depending on the value of the second variable. But if the two variables are independent, the distribution of the first variable will be similar for all values of the second variable."


Cross tabulation:

  1. Preprocess the data if needed.
  2. Decide which two features you want to cross-tabulate.
  3. Create a column for every category of the second feature.
  4. Determine how you will aggregate (sum, count, avg, max, min, product).
  5. Create a record for every unique value of the first feature, with a column holding that unique label's name.
  6. Aggregate into the category columns for each record.
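A minimal pandas sketch of the recipe above; the `department` and `passed` columns are hypothetical example data, and `pd.crosstab` performs steps 3 to 6 in one call.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["sales", "sales", "ops", "ops", "ops", "hr"],
    "passed":     ["yes",   "no",    "yes", "yes", "no",  "yes"],
})

# One row per unique value of the first feature, one column per category of the
# second feature, aggregated by count (use `values=`/`aggfunc=` for sum, mean, ...).
table = pd.crosstab(df["department"], df["passed"], margins=True)
print(table)
```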

Comparison of means: t-test

The t-test is used in many ways in statistics. The more common uses are (1) comparing one mean with a known mean, (2) testing whether two means are distinct, (3) testing whether the means from matched pairs are equal. Also called Student's t test (equal variances) or Welch's t test (unequal variances). Applicable for means from one sample, two independent samples, and paired samples. See analysis of variance for comparing more than two means. The t-test is also used in other contexts.
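A short sketch of the three uses listed above with `scipy.stats`; the arrays are made-up example data, not anything from this page.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, scale=1.0, size=30)
b = rng.normal(loc=5.5, scale=1.0, size=30)

print(stats.ttest_1samp(a, popmean=5.0))        # (1) one mean vs a known mean
print(stats.ttest_ind(a, b, equal_var=True))    # (2) two independent means (Student's)
print(stats.ttest_ind(a, b, equal_var=False))   #     Welch's version (unequal variances)
print(stats.ttest_rel(a, b))                    # (3) matched pairs (e.g. before/after)
```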

Correlation

The (Pearson) correlation coefficient is a measure of the strength of the linear relationship between two interval or numeric variables. Other correlation coefficients exist to measure the relationship between two ordinal variables, such as Spearman's rank correlation coefficient. The strongest values of the correlation coefficient are 1 and -1 (perfect relationship); the weakest is 0 (no relationship). The t-test is used to test whether a sample Pearson correlation differs from 0.
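A hedged sketch of both coefficients with `scipy.stats` on illustrative data; `pearsonr` also returns the p-value for the test that the correlation is 0.

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 9.0)
y = x ** 2 + np.array([0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3])

r, p_r = stats.pearsonr(x, y)        # linear association between numeric variables
rho, p_rho = stats.spearmanr(x, y)   # rank-based, suited to ordinal/monotonic data
print(f"Pearson r={r:.3f} (p={p_r:.3g}), Spearman rho={rho:.3f} (p={p_rho:.3g})")
```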

Chi-square tests

The (Pearson) chi-square coefficient is primarily used with one or two categorical variables. The coefficient is a measure of difference between observed and expected scores.

One categorical variable: In case of one categorical variable the test measures whether the observed values can reasonably come from a known distribution (or model). In other words the observed values are compared to the expected values from this known distribution. In such cases the test is primarily used for model testing.

Two categorical variables: In the case of two categorical variables, the expected values usually are the values under the null hypothesis that there is no relationship between the two variables. Therefore the chi-square coefficient of two variables is a measure of relationship.

The chi-square coefficient is tested by comparing it with the chi-square distribution given the degrees of freedom. Other coefficients to measure the relationship between two variables in two-way contingency tables exist as well (for a list, see for instance the SPSS output with Crosstabs).
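A minimal sketch of the two-variable case with `scipy.stats.chi2_contingency`; the 2x2 counts (passed/failed by male/female) are made up for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 10],    # male:   passed, failed
                     [25, 15]])   # female: passed, failed

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.3f}, dof={dof}, p={p:.3f}")
print("expected counts under independence:\n", expected.round(1))
```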

Note that, if possible, exact p-values are preferred over the standard (asymptotic) ones.

Non-parametric statistics

In several mostly elementary situations when the assumptions of parametric tests cannot be met, one may resort to non-parametric tests rather than parametric tests such as the t-test, the Pearson correlation test, analysis of variance, etc. In such situations the power of the non-parametric or distribution-free tests is often as good as the parametric ones or better. It often is a good idea to use both types of tests if they are available and compare the resulting p-values. If these values are roughly the same there is little to worry about, if they are different there is something to be sorted out.

Unfortunately, appropriate non-parametric techniques are not available for all comparable parametric techniques (see, however, the method selection charts for comparable tests).

Amazing Graphics https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/a-comparison-of-the-pearson-and-spearman-correlation-methods/

The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.

Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed.

The correlation coefficient is a function of the covariance: the correlation coefficient is equal to the covariance divided by the product of the standard deviations of the variables. Therefore, a positive covariance always results in a positive correlation and a negative covariance always results in a negative correlation.
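A quick sketch verifying that relation, r = cov(x, y) / (sd(x) * sd(y)), on made-up data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)

cov_xy = np.cov(x, y)[0, 1]                                   # sample covariance (ddof=1)
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # divide by the two std devs
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 6), round(r_numpy, 6))                  # identical up to rounding
```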

https://lindeloev.github.io/tests-as-linear/

Parametric vs NonParametric Tests

| Parametric tests (means) | Nonparametric tests (medians) |
| --- | --- |
| 1-sample t test | 1-sample Sign, 1-sample Wilcoxon |
| 2-sample t test | Mann-Whitney test |
| One-Way ANOVA | Kruskal-Wallis, Mood's median test |
| Factorial DOE with one factor and one blocking variable | Friedman test |
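A hedged sketch mapping the nonparametric column of this table to `scipy.stats` calls; the three groups are illustrative data, and the Friedman call simply reuses them as if they were repeated measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1, g2, g3 = (rng.normal(loc=m, scale=1.0, size=20) for m in (0.0, 0.3, 0.6))

print(stats.wilcoxon(g1))                   # 1-sample Wilcoxon (median vs 0)
print(stats.mannwhitneyu(g1, g2))           # nonparametric 2-sample comparison
print(stats.kruskal(g1, g2, g3))            # nonparametric one-way ANOVA
print(stats.friedmanchisquare(g1, g2, g3))  # blocked / repeated-measures design
```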

t-test vs f-test

| Basis for comparison | t-test | F-test |
| --- | --- | --- |
| Meaning | A univariate hypothesis test, applied when the standard deviation is not known and the sample size is small. | A statistical test that determines the equality of the variances of two normal populations. |
| Test statistic | The t-statistic follows Student's t-distribution under the null hypothesis. | The F-statistic follows Snedecor's F-distribution under the null hypothesis. |
| Application | Comparing the means of two populations. | Comparing two population variances. |

anova and ancova

| Basis for comparison | ANOVA | ANCOVA |
| --- | --- | --- |
| Meaning | A process of examining the differences among the means of multiple groups of data for homogeneity. | A technique that removes the effect of one or more metric-scaled undesirable variables from the dependent variable before undertaking the analysis. |
| Uses | Both linear and non-linear models are used. | Only a linear model is used. |
| Includes | Categorical variables. | Categorical and interval variables. |
| Covariate | Ignored. | Considered. |
| BG variation | Attributes between-group (BG) variation to treatment. | Divides between-group (BG) variation into treatment and covariate. |
| WG variation | Attributes within-group (WG) variation to individual differences. | Divides within-group (WG) variation into individual differences and covariate. |

T-Test V Anova

| Basis for comparison | t-test | ANOVA |
| --- | --- | --- |
| Meaning | A hypothesis test used to compare the means of two populations. | A statistical technique used to compare the means of more than two populations. |
| Test statistic | t = (x̄ - µ) / (s / √n) | F = between-sample variance / within-sample variance |

One-way ANOVA vs Two-way ANOVA

| Basis for comparison | One-way ANOVA | Two-way ANOVA |
| --- | --- | --- |
| Meaning | A hypothesis test used to test the equality of three or more population means simultaneously using variance. | A statistical technique in which the interaction between two factors influencing a variable can be studied. |
| Independent variables | One | Two |
| Compares | Three or more levels of one factor. | The effects of multiple levels of two factors. |
| Number of observations | Need not be the same in each group. | Needs to be equal in each group. |
| Design of experiments | Only two principles need to be satisfied. | All three principles need to be satisfied. |

Compare Distribution Tables

Compared to probability calculators (e.g., the one included in STATISTICA), the traditional format of distribution tables has the advantage of showing many values simultaneously, enabling the user to examine and quickly explore ranges of probabilities.

Standard Normal (Z) Table

The Standard Normal distribution is used in various hypothesis tests including tests on single means, the difference between two means, and tests on proportions.

Student's t Table

The shape of the Student's t distribution is determined by the degrees of freedom; it approaches the standard normal as the degrees of freedom increase. For more information on how this distribution is used in hypothesis testing, see the t-test for independent samples and the t-test for dependent samples in Basic Statistics and Tables.

Chi-Square Table

Like the Student's t distribution, the Chi-square distribution's shape is determined by its degrees of freedom (e.g., 1, 2, 5, 10, 25, and 50). For examples of hypothesis tests that use the Chi-square distribution, see Statistics in crosstabulation tables in Basic Statistics and Tables as well as Nonlinear Estimation. See also the Chi-square Distribution.

F Distribution Tables

The F distribution is a right-skewed distribution used most commonly in Analysis of Variance (see ANOVA/MANOVA). The F distribution is a ratio of two Chi-square distributions, and a specific F distribution is denoted by the degrees of freedom for the numerator Chi-square and the degrees of freedom for the denominator Chi-square (e.g., F(10,10)).

Gamma, exponential, binomial, Bernoulli, Poisson

3 Pearson and Spearman correlation

The Spearman rank correlation is a Pearson correlation on rank-transformed x and y: rank(y) = β0 + β1·rank(x), with H0: β1 = 0. The reported value is 'rho'.
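A small check of that claim on made-up data: computing a Pearson correlation on the ranks reproduces Spearman's rho.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = np.exp(x) + rng.normal(scale=0.1, size=50)   # monotone but nonlinear relationship

rho, _ = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(round(rho, 6), round(r_on_ranks, 6))       # the two values match
```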

Although this correlation is fairly obvious your data may contain unsuspected correlations. You may also suspect there are correlations, but don't know which are the strongest.

While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (r squared) is equal to the percent of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point: an r of .5 means 25% of the variation is related (.5 squared = .25), and an r of .7 means 49% of the variance is related (.7 squared = .49).

A correlation report can also show a second result for each test: statistical significance. In this case, the significance level tells you how likely it is that the correlations reported may be due to chance in the form of random sampling error. If you are working with small sample sizes, choose a report format that includes the significance level; this format also reports the sample size.

https://www.medcalc.org/manual/correlation.php

The P-value is the probability that you would have found the current result if the correlation coefficient were in fact zero (null hypothesis). If this probability is lower than the conventional 5% (P < 0.05) the correlation coefficient is called statistically significant. The 95% confidence interval (CI) for the correlation coefficient is the range of values that contains the 'true' correlation coefficient with 95% confidence.

4 One mean

4.1 One-sample t-test and Wilcoxon signed-rank. Model: a single number (the intercept) predicts y, i.e. y = β0 + β1·x with x fixed at 0, so y = β0. Null hypothesis: β0 = 0. The Wilcoxon signed-rank test is the same model, just with the signed ranks of y instead of y itself: signed_rank(y) = β0.

4.2 Paired samples t-test and Wilcoxon matched pairs. Model: a single number (the intercept) predicts the pairwise differences, y2 - y1 = β0. Null hypothesis: β0 = 0.

This means there is just one value, y = y2 - y1, to predict, and it becomes a one-sample t-test on the pairwise differences; the visualization is therefore the same as for the one-sample t-test. At the risk of overcomplicating a simple subtraction, you can think of these pairwise differences as slopes, which can also be represented as y-offsets. Similarly, the Wilcoxon matched pairs test only differs from the Wilcoxon signed-rank test in that it tests the signed ranks of the pairwise differences.
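A sketch of the equivalence on made-up paired data: the paired t-test gives the same statistic and p-value as a one-sample t-test on the pairwise differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(loc=10.0, scale=1.0, size=25)
after = before + rng.normal(loc=0.4, scale=1.0, size=25)

print(stats.ttest_rel(after, before))          # paired t-test
print(stats.ttest_1samp(after - before, 0.0))  # same result on the differences
```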

5 Two means

5.1 Independent t-test and Mann-Whitney U. Independent t-test model: two means predict y.

y_i = β0 + β1·x_i, with H0: β1 = 0, where x_i is an indicator (0 or 1) saying whether data point i was sampled from one group or the other. Indicator variables (also called "dummy coding") underlie a lot of linear models (see the dummy-coded sketch below). Mann-Whitney U (also known as the Wilcoxon rank-sum test for two independent groups) is the same model to a very close approximation, just on the ranks of x and y instead of the actual values: rank(y_i) = β0 + β1·x_i.
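A sketch of the dummy-coding idea with illustrative data, using `statsmodels` (an assumption, not something this page uses): regressing y on a 0/1 group indicator reproduces the equal-variance independent t-test.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
y0 = rng.normal(loc=5.0, scale=1.0, size=40)   # group coded x = 0
y1 = rng.normal(loc=5.6, scale=1.0, size=40)   # group coded x = 1

y = np.concatenate([y0, y1])
x = np.concatenate([np.zeros(40), np.ones(40)])

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)      # beta_0 ~ mean of group 0, beta_1 ~ difference in means
print(fit.pvalues[1])  # matches the Student's t-test p-value below
print(stats.ttest_ind(y0, y1, equal_var=True).pvalue)
```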

5.2 Welch's t-test

This is identical to the (Student's) independent t-test above except that Student's assumes identical variances and Welch's t-test does not.

6 Three or more means

ANOVAs are linear models with (only) categorical predictors. They simply extend everything we did above, relying heavily on dummy coding.
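A hedged sketch of a one-way ANOVA as a dummy-coded linear model, using a made-up data frame and `statsmodels` formulas; `scipy.stats.f_oneway` gives the same F and p.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from scipy import stats

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "y": np.concatenate([rng.normal(m, 1.0, 30) for m in (0.0, 0.3, 0.8)]),
})

fit = smf.ols("y ~ C(group)", data=df).fit()   # dummy-coded categorical predictor
print(anova_lm(fit))                           # F-test from the linear model
groups = [g["y"].to_numpy() for _, g in df.groupby("group")]
print(stats.f_oneway(*groups))                 # same F statistic and p-value
```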

Multivariate

Relationship Between Variables

Target based Analysis

Multivariate

Major columns

Anova ancova

Support Confidence Lift

Bivariate distribution plots help us study the relationship between two variables by analyzing the scatter plot of the bivariate distributions.

If the denominators used to calculate the two percentages represent the same people, we use a one-sample t-test between percents to compare the two percents. If the denominators represent different people, we use the two-sample t-test between percents.

In a box plot, the median's offset from the quartiles represents skew; the whiskers represent spread.


Data Assumptions: Parametric vs Nonparametric

http://www.statsoft.com/Textbook/Nonparametric-Statistics

Relations between variables: magnitude, reliability, p-value.

Small relations can only be reliably demonstrated in large samples.

Parametric

Nonparametric

Common Tests

Basic Statistics: Descriptive Statistics, Correlations, t-tests, frequency tables, cross tabulation

measures of location (mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.)

Parametric (Means) vs Nonparametric (Medians)

T-Test

F-Test

Anova

Ancova

ONE WAY ANOVA

TWO WAY ANOVA

ANOVA, ANCOVA, SARIMA, ARIMA, ARMA, spatial, temporal

ANOVA Requirements

Test Errors

Most researchers test a null hypothesis with alpha at .05.

Type 1 Error - Erroneously rejecting the null hypothesis with a statistical analysis, when the null hypothesis is in fact true in the population.

Single analysis: test the null hypothesis of equal mean IQs between adult males and adult females. This is done by testing the null hypothesis with an independent-samples t-test; if the t-test p-value is less than .05, reject the null hypothesis of equal means.

The Bonferroni Correction is an adjustment to p-values that is 'supposed to be' applied when two or more statistical analyses have been performed on the same sample of data.

It specifies that the chance of erroneously rejecting the null hypothesis at least once amongst the family of analyses is equal to X.

ALPHA(fam-wise-error-rate) = 1 - (1-.05)^#tests

Approach 1. Divide the per-analysis alpha rate by the number of statistical analyses performed (.05/3 = .017); any observed p-value less than the corrected alpha (.017) is declared to be statistically significant.

This can sometimes obliterate all statistically significant results.
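A tiny sketch of the family-wise error rate formula above and the Bonferroni-corrected threshold for three tests.

```python
n_tests = 3
alpha = 0.05

fwer_uncorrected = 1 - (1 - alpha) ** n_tests            # ~0.143: chance of >= 1 false rejection
bonferroni_alpha = alpha / n_tests                       # ~0.017: per-test threshold
fwer_corrected = 1 - (1 - bonferroni_alpha) ** n_tests   # back near 0.05

print(round(fwer_uncorrected, 3), round(bonferroni_alpha, 3), round(fwer_corrected, 3))
```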

Common Error Measures:

Challenges With Data

Key issues: accuracy, overfitting, interpretability, computational speed.

Pay attention to - confounding variables, complicated interactions, skewness, outliers, nonlinear patterns, variance changes, units/scale issues, overloading, regression, correlation and causation

measures of effectiveness of the classifier:

Confounder: a variable that is correlated with both the outcome and covariates

Descriptive statistics: location (mean, median, mode); spread (standard deviation, variance, range, IQR, absolute deviation, mean absolute difference, distance standard deviation, coefficient of variation, Gini coefficient); shape (skewness, kurtosis, distance skewness); dependence (Pearson, Spearman).

Preferred data format: yyyy-mm-dd'T'hh:mm:ss.mmm

The workhorse estimation technique of frequentist statistics: maximum likelihood estimation.

Changing variance - what can you do

MISC

Bias: