
Non-standard predictors

Classification

Identifying the category to which an object belongs.

  1. Applications: spam detection, image recognition.
  2. Algorithms: e.g., SVM, nearest neighbors, random forest.
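A minimal scikit-learn sketch of the classification workflow (the dataset and the random-forest choice here are illustrative, not prescribed by these notes):

```python
# Classification sketch: fit a classifier, score it on held-out data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
```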

Regression

Predicting a continuous-valued attribute associated with an object.

  1. Applications: drug response, stock prices.
  2. Algorithms: e.g., SVR, ridge regression, Lasso.
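The same pattern for regression, sketched with ridge regression on synthetic data (all settings illustrative):

```python
# Regression sketch: predict a continuous target, report R^2.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = Ridge(alpha=1.0).fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 on the held-out split
```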

Clustering

Automatic grouping of similar objects into sets.

  1. Applications: customer segmentation, grouping experiment outcomes.
  2. Algorithms: e.g., k-means, spectral clustering, mean-shift.
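A clustering sketch with k-means on synthetic blobs (note there is no target variable; the grouping is automatic):

```python
# Clustering sketch: k-means assigns each sample to one of k groups.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])      # cluster assignment per sample
print(km.cluster_centers_)  # learned centroids
```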

Dimensionality reduction

Reducing the number of random variables to consider.

  1. Applications: visualization, increased efficiency.
  2. Algorithms: e.g., PCA, feature selection, non-negative matrix factorization.
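A dimensionality-reduction sketch with PCA (the component count is arbitrary here):

```python
# Dimensionality reduction sketch: project 4 features down to 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```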

Model selection

Comparing, validating, and choosing parameters and models.

  1. Goal: improved accuracy via parameter tuning.
  2. Modules: e.g., grid search, cross-validation, metrics.
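A model-selection sketch: grid search over hyperparameters with cross-validation (the grid values are arbitrary):

```python
# Model selection sketch: pick SVM hyperparameters by cross-validated search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```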

Preprocessing

Feature extraction and normalization.

  1. Application: transforming input data such as text for use with machine learning algorithms.
  2. Modules: e.g., preprocessing, feature extraction.
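A preprocessing sketch covering both halves of that description: normalizing numeric features and extracting features from text (toy data):

```python
# Preprocessing sketch: scale numbers, vectorize text into token counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

X_num = [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]]
print(StandardScaler().fit_transform(X_num))  # zero mean, unit variance per column

docs = ["spam spam ham", "ham eggs"]
vec = CountVectorizer()
print(vec.fit_transform(docs).toarray())  # one row of token counts per document
print(vec.get_feature_names_out())        # the learned vocabulary
```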

ML

CLASSIFICATION AND REGRESSION PROBLEMS

There are numerous algorithms for predicting continuous variables or categorical variables from a set of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear Models) and GRM (General Regression Models), we can specify a linear combination (design) of continuous predictors and categorical factor effects (e.g., with two-way and three-way interaction effects) to predict a continuous dependent variable. In GDA (General Discriminant Function Analysis), we can specify such designs for predicting categorical variables, i.e., to solve classification problems.

Regression-type problems. Regression-type problems are generally those where we attempt to predict the values of a continuous variable from one or more continuous and/or categorical predictor variables.

Classification-type problems. Classification-type problems are generally those where we attempt to predict values of a categorical dependent variable (class, group membership, etc.) from one or more continuous and/or categorical predictor variables. There are a number of methods for analyzing classification-type problems and computing predicted classifications, either from simple continuous predictors (e.g., binomial or multinomial logit regression in GLZ), from categorical predictors (e.g., log-linear analysis of multi-way frequency tables), or from both (e.g., via ANCOVA-like designs in GLZ or GDA).

Tree methods are nonparametric and nonlinear, and their results are simple to interpret. The cost is that you must specify criteria for predictive accuracy, a method for selecting splits, and a rule for when to stop splitting.
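In scikit-learn terms, those choices map directly onto tree hyperparameters; a sketch (parameter values are arbitrary):

```python
# Decision-tree sketch: the split criterion and stopping rules are explicit knobs.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    criterion="gini",    # how candidate splits are scored
    max_depth=3,         # stop splitting at this depth...
    min_samples_leaf=5,  # ...or when leaves would get too small
    random_state=0,
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```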


Learning models: we can think of a model as a template. When data is processed through a learning model, what comes out the other end is insight. The model is nothing more than a set of operations performed on the data. Models are typically built in a static environment by drilling/rolling, pivoting, and slicing/dicing through the data, and may involve integrating multiple mining functions (e.g., classifying, then clustering).

Data classification

According to Golfarelli and Rizzi, these are the measures of effectiveness of the classifier:

Types of Models

Hierarchical

Generative

Discriminative

Descriptive

MLE is the workhorse estimation technique of frequentist statistics.

Latent variables:

-- principal component analysis

-- singular value decomposition
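The two are closely related: the principal components of a dataset can be read off the SVD of the centered data matrix. A small NumPy sketch (random data, for illustration):

```python
# PCA via SVD: principal directions are the right singular vectors
# of the centered data matrix; variances come from the singular values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)              # center each feature

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                      # principal directions, one per row
explained_var = S**2 / (len(X) - 1)  # variance along each direction
scores = Xc @ Vt.T                   # data projected onto the components
print(explained_var)
```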

Properties of estimators: bias, consistency, efficiency, sufficiency, robustness.

Testing: Type I and II errors, power, likelihood ratios

Methodology of probabilistic process models:

latent variables:

Feature Reduction:

concepts:

accuracy: (tp+tn)/(p+n)

precision: tp/(tp+fp)

specificity: tn/(fp+tn)

sensitivity: tp/(tp+fn)
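These definitions in code, using made-up counts from a binary confusion matrix:

```python
# Confusion-matrix metrics for a binary classifier.
# The tp/fp/tn/fn counts below are invented for illustration.
tp, fp, tn, fn = 40, 10, 35, 15
p, n = tp + fn, tn + fp  # actual positives and negatives

accuracy = (tp + tn) / (p + n)
precision = tp / (tp + fp)
specificity = tn / (fp + tn)
sensitivity = tp / (tp + fn)  # a.k.a. recall
print(accuracy, precision, specificity, sensitivity)
```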

Regression:

Ensemble learning - bagging, boosting, stacking, additive regression.
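A sketch of two of these in scikit-learn, bagging and boosting over decision trees (estimator counts and data are illustrative):

```python
# Ensemble sketch: bagging and boosting, compared by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())    # bagging: averages many trees
print(cross_val_score(boost, X, y, cv=5).mean())  # boosting: reweights hard cases
```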

Neural nets (supervised/unsupervised):

Clustering:

RNN

Feedforward:

Online learning:

t-distribution - visualize high-density data.

Risk: redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance-based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.

Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, support vector machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
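A sketch of the redundancy problem and the regularization fix: with two nearly identical features, ordinary least squares can return huge offsetting coefficients, while ridge keeps them stable (synthetic data, illustrative only):

```python
# Redundant (highly correlated) features: OLS coefficients can blow up;
# ridge regularization keeps them stable.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-4, size=200)  # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)

print(LinearRegression().fit(X, y).coef_)  # large, mutually cancelling values
print(Ridge(alpha=1.0).fit(X, y).coef_)    # shrunk toward ~0.5 each
```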


Critical review: configuration and risk, rationale for complexity, weighted parameters, weighted performance metrics, risk assessments and mitigations, technologies, roadmaps.

Common Error Measures:

Key issues: accuracy, overfitting, interpretability, computational speed. Pay attention to: confounding variables, complicated interactions, skewness, outliers, nonlinear patterns, variance changes, units/scale issues, overloading regression, correlation vs. causation. Confounder: a variable that is correlated with both the outcome and the covariates.

Hierarchical clustering - distance or similarity? Continuous data: Euclidean or correlation distance; binary data: Manhattan distance (see the sketch below).

Graphs - help understand properties, find patterns, suggest future models, debug, and communicate.

Bagging and boosting - combine classifiers to improve accuracy, but make the result harder to interpret.

Predictive
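A sketch of that distance choice in hierarchical clustering, using SciPy (synthetic data; metric names follow scipy.spatial.distance, where Manhattan is "cityblock"):

```python
# Hierarchical clustering: the distance metric is a modeling choice.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

for metric in ("euclidean", "correlation", "cityblock"):
    Z = linkage(pdist(X, metric=metric), method="average")
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(metric, labels)
```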