
Association

http://users.stat.umn.edu/~helwig/notes/npind-Notes.pdf

Kendall’s Rank Correlation: τ = (C − D) / (n(n − 1)/2), where C and D are the numbers of concordant and discordant pairs of observations.

Spearman’s Rank Correlation: the Pearson correlation between the ranks of the Xi and the ranks of the Yi; with no ties this equals ρ = 1 − 6 Σ di² / (n(n² − 1)), where di is the difference between the ranks of Xi and Yi.

Suppose we have n bivariate observations

(X1, Y1), . . . , (Xn, Yn), where (Xi, Yi) is the i-th subject’s data

We want to make inferences about association between X and Y

Let FX,Y denote the joint distribution of X and Y

Let FX and FY denote the marginal distributions of X and Y

The null hypothesis is statistical independence:

FX,Y(x, y) = FX(x) FY(y) for all (x, y)

Independence assumption: the pairs (X1, Y1), . . . , (Xn, Yn) are iid from some bivariate population.

Continuity assumption: FX,Y is a continuous distribution.
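As a quick illustration, here is a minimal sketch of testing for association with both rank correlations, assuming SciPy/NumPy and made-up toy data (neither is prescribed by the notes):

```python
# Minimal sketch: rank-correlation tests of independence between X and Y.
# SciPy/NumPy and the simulated data are assumptions, not from the notes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)   # y depends on x by construction

tau, p_tau = stats.kendalltau(x, y)   # Kendall's tau and its p-value
rho, p_rho = stats.spearmanr(x, y)    # Spearman's rho and its p-value

# Small p-values are evidence against H0: FX,Y(x, y) = FX(x) FY(y)
print(f"Kendall tau = {tau:.3f}, p = {p_tau:.4f}")
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.4f}")
```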

Some use cases for graph algorithms:

There are three main categories of graph algorithms currently supported in most frameworks (networkx in Python, or Neo4j, for example): pathfinding, centrality, and community detection.
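As one illustration, here is a minimal sketch with networkx, running one algorithm per category (the karate club toy graph and the particular algorithms chosen are assumptions for demonstration):

```python
# Minimal sketch: one graph algorithm from each category, using networkx's
# built-in Zachary karate club graph as toy data (an assumption, not from
# the notes).
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# Pathfinding: shortest path between two nodes
path = nx.shortest_path(G, source=0, target=33)

# Centrality: rank nodes by PageRank score
pagerank = nx.pagerank(G)
top_nodes = sorted(pagerank, key=pagerank.get, reverse=True)[:5]

# Community detection: greedy modularity maximization
communities = community.greedy_modularity_communities(G)

print("Shortest path 0 -> 33:", path)
print("Top 5 nodes by PageRank:", top_nodes)
print("Number of communities found:", len(communities))
```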

Clustering

http://users.stat.umn.edu/~helwig/notes/cluster-Notes.pdf

Let x = (x1, . . . , xp)′ and y = (y1, . . . , yp)′ denote two arbitrary vectors. Problem: we want some rule that measures the “closeness” or “similarity” between x and y.

How we define closeness (or similarity) will determine how we group the objects into clusters.

Rule 1: Pearson correlation between x and y

Rule 2: the Minkowski metric (Manhattan, Euclidean, Chebyshev) distance between x and y

Rule 3: Number of matches, i.e., Σ_{j=1}^p 1{xj = yj}
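A minimal sketch of the three rules, assuming NumPy/SciPy and two made-up vectors:

```python
# Minimal sketch: three ways to measure closeness between vectors x and y.
# NumPy/SciPy and the example vectors are assumptions, not from the notes.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.5, 2.0, 4.0])

# Rule 1: Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Rule 2: Minkowski-type metrics
manhattan = distance.minkowski(x, y, p=1)   # L1 (Manhattan)
euclidean = distance.minkowski(x, y, p=2)   # L2 (Euclidean)
chebyshev = distance.chebyshev(x, y)        # L-infinity (Chebyshev)

# Rule 3: number of matching coordinates, sum over j of 1{xj == yj}
matches = int(np.sum(x == y))

print(r, manhattan, euclidean, chebyshev, matches)
```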

Hierarchical clustering uses a series of successive mergers or divisions to group N objects based on some distance.

Agglomerative Hierarchical Clustering (bottom up):
1. Begin with N clusters (each object is its own cluster)
2. Merge the two most similar clusters
3. Repeat step 2 until all objects are in the same cluster

Divisive Hierarchical Clustering (top down):
1. Begin with 1 cluster (all objects together)
2. Split the most dissimilar cluster
3. Repeat step 2 until each object is in its own cluster

We know how to define dissimilarity between objects (i.e., duv ), but how do we define dissimilarity between clusters of objects?

To quantify the distance between two clusters, we could use single linkage (minimum distance between members), complete linkage (maximum distance), average linkage, the distance between cluster centroids, or Ward’s method (minimum increase in within-cluster variance).
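A minimal sketch of agglomerative clustering with SciPy, comparing two of these linkage definitions on made-up 2-D data (the data and the choice of K = 2 are assumptions):

```python
# Minimal sketch: agglomerative hierarchical clustering with SciPy,
# comparing single (min) and complete (max) linkage on toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated groups of points in 2-D
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(5, 0.5, size=(10, 2))])

# Series of successive mergers under each linkage rule
Z_single = linkage(X, method="single")
Z_complete = linkage(X, method="complete")

# Cut each tree into K = 2 clusters
labels_single = fcluster(Z_single, t=2, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=2, criterion="maxclust")

print(labels_single)
print(labels_complete)
```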

Non-hierarchical clustering partitions a set of N objects into K distinct groups based on some distance (or dissimilarity). The number of clusters K can be known a priori or can be estimated as a part of the procedure.
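A minimal sketch of one such partitioning method, K-means, assuming scikit-learn and made-up data (the notes do not name a particular algorithm or library):

```python
# Minimal sketch: non-hierarchical (partitioning) clustering with K-means.
# scikit-learn and the simulated data are assumptions, not from the notes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal((0, 0), 0.5, size=(20, 2)),
               rng.normal((4, 0), 0.5, size=(20, 2)),
               rng.normal((0, 4), 0.5, size=(20, 2))])

# K is fixed a priori here; it could also be estimated (e.g., elbow method
# or silhouette scores) as part of the procedure.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster assignment for each of the N objects
print(km.cluster_centers_)  # the K group centroids
```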

Why are permutations useful for statistics? The classic statistical paradigm is: (1) collect some data, (2) form a null hypothesis H0, (3) design a test statistic, and (4) derive the sampling distribution of the test statistic under H0. In many cases, the null hypothesis is the nil hypothesis, i.e., no effect. Under the nil hypothesis, all possible outcomes (permutations) are equally likely, so permutations relate to sampling distributions.
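A minimal sketch of a permutation test built on this idea, assuming NumPy and made-up data, with the Pearson correlation between X and Y as the test statistic:

```python
# Minimal sketch: permutation test under the nil hypothesis of no association.
# Shuffling y breaks any X-Y relationship, so every permutation is equally
# likely under H0. NumPy and the toy data are assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)

def stat(a, b):
    """Test statistic: Pearson correlation between a and b."""
    return np.corrcoef(a, b)[0, 1]

observed = stat(x, y)

# Permutation distribution of the statistic under H0
n_perm = 5000
perm_stats = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])

# Two-sided p-value: fraction of permutations at least as extreme as observed
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"observed r = {observed:.3f}, permutation p = {p_value:.4f}")
```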

Principal Components Analysis (PCA) finds linear combinations of variables that best explain the covariation structure of the variables. There are two typical purposes of PCA: (1) data reduction: explain the covariation between p variables using r < p linear combinations; (2) data interpretation: find features (i.e., components) that are important for explaining covariation.

The data matrix refers to the array of numbers

X = [ x11 … x1p ; x21 … x2p ; … ; xn1 … xnp ]

where xij is the j-th variable collected from the i-th item (e.g., subject). Items/subjects are rows and variables are columns, so X is a data matrix of order n × p (# items by # variables).
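A minimal sketch of PCA on such an n × p data matrix via the eigendecomposition of its sample covariance matrix (NumPy and the simulated data are assumptions; a library routine such as scikit-learn's PCA should give the same component scores up to sign):

```python
# Minimal sketch: PCA by eigendecomposition of the sample covariance matrix
# of an n x p data matrix X (rows = items/subjects, columns = variables).
# NumPy and the simulated data are assumptions, not from the notes.
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 100, 5, 2            # n items, p variables, keep r < p components

# Toy data matrix with correlated columns
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)               # center each variable (column)
S = np.cov(Xc, rowvar=False)          # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]     # sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs[:, :r]          # n x r matrix of component scores
explained = eigvals[:r] / eigvals.sum()
print("proportion of variance explained by first r components:", explained)
```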