http://users.stat.umn.edu/~helwig/notes/npind-Notes.pdf
Kendall’s Rank Correlation:
Spearman’s Rank Correlation:
Suppose we have n bivariate observations
(X1, Y1), . . . , (Xn, Yn), where (Xi, Yi) is the i-th subject's data
We want to make inferences about association between X and Y
Let FX,Y denote joint distribution of X and Y
Let FX and FY denote marginal distributions of X and Y
Null hypothesis is statistical independence:
FX,Y(x, y) = FX(x) FY(y) for all (x, y)
Independence assumption
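As a minimal sketch of testing this null hypothesis with rank correlations, the snippet below computes Kendall's tau and Spearman's rho together with their p-values; it assumes scipy.stats (not named in the notes) and uses simulated data, so the numbers are purely illustrative.

```python
# Minimal sketch: rank-correlation tests of association between X and Y.
# Assumes scipy is available; the data below are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # y depends on x, so H0 should tend to be rejected

tau, p_tau = stats.kendalltau(x, y)   # Kendall's rank correlation
rho, p_rho = stats.spearmanr(x, y)    # Spearman's rank correlation

print(f"Kendall tau = {tau:.3f}, p-value = {p_tau:.4f}")
print(f"Spearman rho = {rho:.3f}, p-value = {p_rho:.4f}")
```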
Some use cases for graph algorithms:
- real-time fraud detection
- real-time recommendations
- streamlined regulatory compliance
- management and monitoring of complex networks
- identity and access management
- social applications/features
There are three main categories of graph algorithms currently supported in most frameworks (networkx in Python or Neo4J, for example); a short networkx sketch follows this list:
- Pathfinding: identify the optimal path, based on availability and quality, for example. We'll also include search algorithms in this category. This can be used to find the quickest route or to route traffic.
- Centrality: determine the importance of nodes in the network. This can be used, for example, to identify influencers on social media or potential attack targets in a network.
- Community detection: evaluate how a group is clustered. This can be used, for example, to segment customers or to detect fraud.
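The sketch referenced above: one illustrative algorithm from each category, run with networkx. The built-in karate-club graph and the specific choices (shortest path, PageRank, greedy modularity) are assumptions for the example, not prescriptions.

```python
# Minimal sketch of the three categories using networkx.
# The graph and algorithm choices are illustrative only.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # small built-in social network

# Pathfinding: shortest path between two members
path = nx.shortest_path(G, source=0, target=33)

# Centrality: rank nodes by PageRank to find potential "influencers"
pr = nx.pagerank(G)
top3 = sorted(pr, key=pr.get, reverse=True)[:3]

# Community detection: greedy modularity clustering
communities = greedy_modularity_communities(G)

print("shortest path 0 -> 33:", path)
print("top-3 nodes by PageRank:", top3)
print("number of communities:", len(communities))
```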
Clustering
http://users.stat.umn.edu/~helwig/notes/cluster-Notes.pdf
Let x = (x1, . . . , xp)' and y = (y1, . . . , yp)' denote two arbitrary vectors. Problem: We want some rule that measures the “closeness” or “similarity” between x and y.
How we define closeness (or similarity) will determine how we group the objects into clusters; the three rules below are illustrated in a short code sketch after the list.
Rule 1: Pearson correlation between x and y
Rule 2: the Minkowski metric (Manhattan, Euclidean, Chebyshev) distance between x and y
Rule 3: Number of matches, i.e., sum_{j=1}^p 1{xj = yj}
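A short sketch of the three rules on two toy vectors; it assumes numpy and scipy (neither is named in the notes), and the vectors are made up for illustration.

```python
# Sketch: three ways to measure closeness/similarity between vectors x and y.
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.5, 2.0, 4.0])

# Rule 1: Pearson correlation
r, _ = pearsonr(x, y)

# Rule 2: Minkowski-type distances
manhattan = distance.minkowski(x, y, p=1)   # Manhattan (p = 1)
euclidean = distance.minkowski(x, y, p=2)   # Euclidean (p = 2)
chebyshev = distance.chebyshev(x, y)        # Chebyshev (p -> infinity)

# Rule 3: number of matches, sum_j 1{xj = yj}
matches = int(np.sum(x == y))

print(r, manhattan, euclidean, chebyshev, matches)
```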
Hierarchical clustering uses a series of successive mergers or divisions to group N objects based on some distance.
Agglomerative Hierarchical Clustering (bottom up):
1. Begin with N clusters (each object is its own cluster)
2. Merge the most similar objects
3. Repeat 2 until all objects are in the same cluster
Divisive Hierarchical Clustering (top down):
1. Begin with 1 cluster (all objects together)
2. Split the most dissimilar objects
3. Repeat 2 until all objects are in their own cluster
We know how to define dissimilarity between objects (i.e., duv ), but how do we define dissimilarity between clusters of objects?
To quantify the distance between two clusters, we could use one of the following (a code sketch follows the list):
- Single Linkage: minimum (or nearest neighbor) distance
- Complete Linkage: maximum (or furthest neighbor) distance
- Average Linkage: average (across all pairs) distance
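The sketch referenced above: agglomerative clustering run with each of the three linkage rules, assuming scipy.cluster.hierarchy (not named in the notes) and random toy data.

```python
# Sketch: agglomerative hierarchical clustering with single, complete, and
# average linkage, then cutting the tree into 2 clusters. Toy data only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)),   # one cloud of points
               rng.normal(5, 1, (10, 2))])  # a second, well-separated cloud

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                     # series of successive mergers
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram into 2 clusters
    print(method, labels)
```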
Non-hierarchical clustering partitions a set of N objects into K distinct groups based on some distance (or dissimilarity). The number of clusters K can be known a priori or can be estimated as a part of the procedure.
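A minimal non-hierarchical example is K-means with K chosen a priori; the sketch below assumes scikit-learn (not mentioned in the notes) and made-up data.

```python
# Sketch: partitioning N objects into K = 2 groups with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (15, 2)),
               rng.normal(4, 1, (15, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_)            # cluster assignment for each object
```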
Why are Permutations Useful for Statistics? The classic statistical paradigm is:
- collect some data
- form a null hypothesis H0
- design a test statistic
- derive the sampling distribution of the test statistic under H0
In many cases, the null hypothesis is the nil hypothesis, i.e., no effect. Under the nil hypothesis, all possible outcomes (permutations) are equally likely, so permutations relate to sampling distributions.
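To make the connection concrete, here is a minimal two-sample permutation test of the nil hypothesis (no group effect), using a difference in means as the test statistic; the data and the choice of 10,000 permutations are illustrative assumptions.

```python
# Sketch: permutation test of the nil hypothesis "no difference between groups".
import numpy as np

rng = np.random.default_rng(3)
group_a = rng.normal(0.0, 1.0, 20)
group_b = rng.normal(0.8, 1.0, 20)

observed = group_a.mean() - group_b.mean()   # test statistic on observed labels
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)                 # under H0, all relabelings are equally likely
    stat = perm[:n_a].mean() - perm[n_a:].mean()
    if abs(stat) >= abs(observed):
        count += 1

p_value = (count + 1) / (n_perm + 1)               # permutation p-value
print(f"observed diff = {observed:.3f}, p-value = {p_value:.4f}")
```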
Principal Components Analysis (PCA) finds linear combinations of variables that best explain the covariation structure of the variables. There are two typical purposes of PCA:
1. Data reduction: explain covariation between p variables using r < p linear combinations
2. Data interpretation: find features (i.e., components) that are important for explaining covariation
The data matrix refers to the array of numbers
X = [ x11 ... x1p
      x21 ... x2p
      ...
      xn1 ... xnp ]
where xij is the j-th variable collected from the i-th item (e.g., subject). Items/subjects are rows; variables are columns. X is a data matrix of order n × p (# items by # variables).
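As a sketch of PCA on such an n × p data matrix, the snippet below centers the columns, eigendecomposes the sample covariance matrix, and keeps r < p components; it assumes numpy and uses random data, so it only illustrates the mechanics.

```python
# Sketch: PCA via eigendecomposition of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))        # n = 100 items (rows), p = 5 variables (columns)

Xc = X - X.mean(axis=0)              # center each variable
S = np.cov(Xc, rowvar=False)         # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S) # eigenvalues (ascending) and eigenvectors

order = np.argsort(eigvals)[::-1]    # sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

r = 2                                # data reduction: keep r < p components
scores = Xc @ eigvecs[:, :r]         # n x r matrix of component scores
explained = eigvals[:r].sum() / eigvals.sum()
print(f"proportion of variance explained by {r} components: {explained:.3f}")
```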