http://users.stat.umn.edu/~helwig/notes/npind-Notes.pdf
Kendall’s Rank Correlation:
Spearman’s Rank Correlation:
Suppose we have n bivariate observations
(X1, Y1), . . . , (Xn, Yn), where (Xi, Yi) is the i-th subject's data
We want to make inferences about association between X and Y
Let FX,Y denote joint distribution of X and Y
Let FX and FY denote marginal distributions of X and Y
Null hypothesis is statistical independence:
FX,Y(x, y) = FX(x) FY(y) for all (x, y)
Independence assumption
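As a minimal sketch of testing this null hypothesis with rank correlations, the snippet below computes Kendall's tau and Spearman's rho together with their p-values; it assumes scipy.stats (not named in the notes) and uses simulated data, so the numbers are purely illustrative.

```python
# Minimal sketch: rank-correlation tests of association between X and Y.
# Assumes scipy is available; the data below are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # y depends on x, so H0 should tend to be rejected

tau, p_tau = stats.kendalltau(x, y)   # Kendall's rank correlation
rho, p_rho = stats.spearmanr(x, y)    # Spearman's rank correlation

print(f"Kendall tau = {tau:.3f}, p-value = {p_tau:.4f}")
print(f"Spearman rho = {rho:.3f}, p-value = {p_rho:.4f}")
```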
Some use cases for graph algorithms:
- real-time fraud detection
- real-time recommendations
- streamlined regulatory compliance
- management and monitoring of complex networks
- identity and access management
- social applications/features
There are three main categories of graph algorithms currently supported in most frameworks (networkx in Python or Neo4J, for example); a short networkx sketch follows this list:
- Pathfinding: identify the optimal path, based on availability and quality, for example. We'll also include search algorithms in this category. This can be used to find the quickest route or to route traffic.
- Centrality: determine the importance of nodes in the network. This can be used, for example, to identify influencers on social media or potential attack targets in a network.
- Community detection: evaluate how a group is clustered. This can be used, for example, to segment customers or to detect fraud.
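The sketch referenced above: one illustrative algorithm from each category, run with networkx. The built-in karate-club graph and the specific choices (shortest path, PageRank, greedy modularity) are assumptions for the example, not prescriptions.

```python
# Minimal sketch of the three categories using networkx.
# The graph and algorithm choices are illustrative only.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # small built-in social network

# Pathfinding: shortest path between two members
path = nx.shortest_path(G, source=0, target=33)

# Centrality: rank nodes by PageRank to find potential "influencers"
pr = nx.pagerank(G)
top3 = sorted(pr, key=pr.get, reverse=True)[:3]

# Community detection: greedy modularity clustering
communities = greedy_modularity_communities(G)

print("shortest path 0 -> 33:", path)
print("top-3 nodes by PageRank:", top3)
print("number of communities:", len(communities))
```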
Clustering
http://users.stat.umn.edu/~helwig/notes/cluster-Notes.pdf
Let x = (x1, . . . , xp)' and y = (y1, . . . , yp)' denote two arbitrary vectors. Problem: We want some rule that measures the “closeness” or “similarity” between x and y.
How we define closeness (or similarity) will determine how we group the objects into clusters; the three rules below are illustrated in a short code sketch after the list.
Rule 1: Pearson correlation between x and y
Rule 2: the Minkowski metric (Manhattan, Euclidean, Chebyshev) distance between x and y
Rule 3: Number of matches, i.e., sum_{j=1}^p 1{xj = yj}
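A short sketch of the three rules on two toy vectors; it assumes numpy and scipy (neither is named in the notes), and the vectors are made up for illustration.

```python
# Sketch: three ways to measure closeness/similarity between vectors x and y.
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.5, 2.0, 4.0])

# Rule 1: Pearson correlation
r, _ = pearsonr(x, y)

# Rule 2: Minkowski-type distances
manhattan = distance.minkowski(x, y, p=1)   # Manhattan (p = 1)
euclidean = distance.minkowski(x, y, p=2)   # Euclidean (p = 2)
chebyshev = distance.chebyshev(x, y)        # Chebyshev (p -> infinity)

# Rule 3: number of matches, sum_j 1{xj = yj}
matches = int(np.sum(x == y))

print(r, manhattan, euclidean, chebyshev, matches)
```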
Hierarchical clustering uses a series of successive mergers or divisions to group N objects based on some distance.
Agglomerative Hierarchical Clustering (bottom up):
1. Begin with N clusters (each object is its own cluster)
2. Merge the most similar objects
3. Repeat 2 until all objects are in the same cluster
Divisive Hierarchical Clustering (top down):
1. Begin with 1 cluster (all objects together)
2. Split the most dissimilar objects
3. Repeat 2 until all objects are in their own cluster
We know how to define dissimilarity between objects (i.e., duv ), but how do we define dissimilarity between clusters of objects?
To quantify the distance between two clusters, we could use one of the following (a code sketch follows the list):
- Single Linkage: minimum (or nearest neighbor) distance
- Complete Linkage: maximum (or furthest neighbor) distance
- Average Linkage: average (across all pairs) distance
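The sketch referenced above: agglomerative clustering run with each of the three linkage rules, assuming scipy.cluster.hierarchy (not named in the notes) and random toy data.

```python
# Sketch: agglomerative hierarchical clustering with single, complete, and
# average linkage, then cutting the tree into 2 clusters. Toy data only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)),   # one cloud of points
               rng.normal(5, 1, (10, 2))])  # a second, well-separated cloud

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                     # series of successive mergers
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut dendrogram into 2 clusters
    print(method, labels)
```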
Non-hierarchical clustering partitions a set of N objects into K distinct groups based on some distance (or dissimilarity). The number of clusters K can be known a priori or can be estimated as a part of the procedure.
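A minimal non-hierarchical example is K-means with K chosen a priori; the sketch below assumes scikit-learn (not mentioned in the notes) and made-up data.

```python
# Sketch: partitioning N objects into K = 2 groups with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (15, 2)),
               rng.normal(4, 1, (15, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_)            # cluster assignment for each object
```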
Why are Permutations Useful for Statistics? The classic statistical paradigm is:
- collect some data
- form a null hypothesis H0
- design a test statistic
- derive the sampling distribution of the test statistic under H0
In many cases, the null hypothesis is the nil hypothesis, i.e., no effect. Under the nil hypothesis, all possible outcomes (permutations) are equally likely, so permutations relate to sampling distributions.
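To make the connection concrete, here is a minimal two-sample permutation test of the nil hypothesis (no group effect), using a difference in means as the test statistic; the data and the choice of 10,000 permutations are illustrative assumptions.

```python
# Sketch: permutation test of the nil hypothesis "no difference between groups".
import numpy as np

rng = np.random.default_rng(3)
group_a = rng.normal(0.0, 1.0, 20)
group_b = rng.normal(0.8, 1.0, 20)

observed = group_a.mean() - group_b.mean()   # test statistic on observed labels
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)                 # under H0, all relabelings are equally likely
    stat = perm[:n_a].mean() - perm[n_a:].mean()
    if abs(stat) >= abs(observed):
        count += 1

p_value = (count + 1) / (n_perm + 1)               # permutation p-value
print(f"observed diff = {observed:.3f}, p-value = {p_value:.4f}")
```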
Principal Components Analysis (PCA) finds linear combinations of variables that best explain the covariation structure of the variables. There are two typical purposes of PCA:
1. Data reduction: explain covariation between p variables using r < p linear combinations
2. Data interpretation: find features (i.e., components) that are important for explaining covariation
The data matrix refers to the array of numbers
X = [ x11 ... x1p
      x21 ... x2p
      ...
      xn1 ... xnp ]
where xij is the j-th variable collected from the i-th item (e.g., subject). Items/subjects are rows; variables are columns. X is a data matrix of order n × p (# items by # variables).
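As a sketch of PCA on such an n × p data matrix, the snippet below centers the columns, eigendecomposes the sample covariance matrix, and keeps r < p components; it assumes numpy and uses random data, so it only illustrates the mechanics.

```python
# Sketch: PCA via eigendecomposition of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))        # n = 100 items (rows), p = 5 variables (columns)

Xc = X - X.mean(axis=0)              # center each variable
S = np.cov(Xc, rowvar=False)         # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S) # eigenvalues (ascending) and eigenvectors

order = np.argsort(eigvals)[::-1]    # sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

r = 2                                # data reduction: keep r < p components
scores = Xc @ eigvecs[:, :r]         # n x r matrix of component scores
explained = eigvals[:r].sum() / eigvals.sum()
print(f"proportion of variance explained by {r} components: {explained:.3f}")
```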