- Background
- Business Understanding
- Meta-Data
- Data Types - Text, Number, Date/Time, other
- Data Attributes - Nominal, Symmetry, Skew, modality, kurtosis, etc
- Data Understanding
- Data Exploration
- Graphics
- Data Preparation
- Data Modeling
- Evaluation
0. Background
What is Data Science
Principally, a data scientist's job is to apply scientific methods to data in a process often referred to as data mining. The purpose of this mining is to find informative patterns in the data and then to turn that information into meaningful, actionable knowledge. Often the exploration of the data and the development of something actionable is accompanied by machine learning algorithms. While contentious, statistics is generally understood as more focused on quantitative descriptions of data, and data science grew out of statistics as an applied branch of the field with an emphasis on qualitative data and the actionability of its output. source
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s)... Data mining is the process of applying [machine learning] methods with the intention of uncovering hidden patterns[16] in large data sets. It bridges the gap between applied statistics and artificial intelligence.
Data Mining Tasks
- Anomaly Detection
- Association Rule Learning
- Clustering
- Classification
- Regression
- Summarization
Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the same topic (KDD-1989), and this term became more popular in the AI and machine learning community. However, the term data mining became more popular in the business and press communities.[14] Currently, the terms data mining and knowledge discovery are used interchangeably. data-mining
The simplest conceptual model of the data scientist's job is as follows:
- Pre-process
- Data Mine
- Results Validation
The Knowledge Discovery in Databases (KDD) process uses these 5 steps:
- Selection
- Pre-processing
- Transformation
- Data mining
- Interpretation/evaluation
And a more modern, competing approach is CRISP-DM, which splits the data scientist's responsibilities along these 6 themes:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
I like to break out the data science steps like so:
- Acquire - (Selection & Business Understanding) Understanding what's needed, retrieving data, and ingesting it through preliminary exploration and storage.
- Explore - (Data Understanding) Find extremes, ranges, distributions, anomalies, etc. Performed after any data processing is done.
- Pre-process - Cleaning the data for further analytical activities and exploration by filtering or performing non-critical prep work on the data, like removing noise or missing values, or meeting size/time constraints. Typically only done once.
- Process - Once the raw data is cleaned and generically prepped, additional transformational processes may be required to make the data usable for a specific scientific model. Augmentations may include sorting, filtering and arranging, and aggregating and integrating. Many different models and processes may be used, each deserving its own exploration.
- Model - Train, test, predict: regressions, clusterings, etc. Analysis is performed by applying models to the data and recording the output; this includes finding correlations and contextualizing any discoveries.
- Act - Interpret the data, draw evaluative conclusions, and communicate them in the form of something presentable/actionable.
Where this way of thinking differentiates itself from the former two models is that the first step is ideally performed only once, whereas the second step is performed after any data is changed. The third step is conditionally applied from CRISP-DM in that it breaks out the pre-processing and integration/transformation steps, as they are sometimes optional and a big enough topic to deserve their own section. This article has been sectioned along CRISP-DM standards.
At one point in time I also described it like so:
Process - Data - raw/ processed data.
Figures - exploratory/ final figures.
Code - raw scripts, final scripts.
Text - readme / analysis.
Steps - define the question, define the ideal data, obtain data, clean data, exploratory data analysis, statistical prediction/modeling, interpret results, challenge results, synthesize/write up results. Create reproducible code.
Language (describe, correlate/associate, lead to/cause, predict). Interpret and explain.
Challenge all steps - question, data, processing, analysis, conclusion.
Synthesize/write up results -> lead with the question, summarize the analysis, order analyses as a story rather than chronologically. Include pretty figures.
Geo-Spatial-Time-Series
To work with this data, data scientists can use these tools:
- Model
- Process
- Visual
- Exploratory spatial analysis
- spatial autocorrelation
- spatial regression
- interpolation
- grid based stats
- point based stats
- spatial network analysis
- spatial clustering.
Free Software
Free software for data analysis includes:
DevInfo – a database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
ELKI – data mining framework in Java with data mining oriented visualization functions.
KNIME – the Konstanz Information Miner, a user friendly and comprehensive data analytics framework.
Orange – A visual programming tool featuring interactive data visualization and methods for statistical data analysis, data mining, and machine learning.
Pandas – Python library for data analysis
PAW – FORTRAN/C data analysis framework developed at CERN
R – a programming language and software environment for statistical computing and graphics.
ROOT – C++ data analysis framework developed at CERN
SciPy – Python library for data analysis
Vega-Lite – { description, name, title, data, mark, encoding, transform }
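The keys listed for Vega-Lite are the top-level properties of a Vega-Lite JSON specification. A minimal sketch of such a spec, written as a Python dict (the field names and data values are made up for illustration):

```python
import json

# Minimal Vega-Lite specification sketch using the top-level keys listed above.
# The "category"/"amount" fields and their values are hypothetical.
spec = {
    "description": "A simple bar chart",
    "name": "example_bars",
    "title": "Example Bar Chart",
    "data": {"values": [{"category": "A", "amount": 28},
                        {"category": "B", "amount": 55}]},
    "transform": [{"filter": "datum.amount > 0"}],
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "amount", "type": "quantitative"},
    },
}

print(json.dumps(spec, indent=2))  # paste into a Vega-Lite editor to render
```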
Existing Services
- Northstar
- automl
- h2o / autokeras
- autosklearn
- googlecloudml
- tpot
- dive
- orange
- xl
- spss
- vega_2
Existing Service Features
Available attributes -> Principal-Axis Visualization, Secondary/Tertiary Axes
Sources | sheet1| + | + | + | + ..
Hierarchy Maps -> Dimensions
Marks(map.bar.etc) colors, label, tooltip, size, detail
Filter(range), show only relevant, include nulls
Colors -> by frequency or what?
Specific style rules for each column/row
transformation/prediction, nlp/queries
1. Business Understanding
Systems = Infrastructure => Technical (hardware, people, processes), integrate
Software = Apps => Business
Data provenance
Features Geometry = Coordinates/ Rings
Challenges
80% of the time spent on client requests goes to [access, format, connect, and fix]; often the solutions to these requests are client-specific artifacts or heuristic/classification requirements.
Administrative data is heavily Biased and Dependent. Survey data less so
Dependent data does not lend itself to Bayesian methods.
Depending on Collection Methodology
- I need to know (which/if any) of these are not applicable for our purposes
- Census (descriptive)
- Observational study (inferential)
- Convenience sample (all types - may be biased)
- Randomized trial (causal)
Other types:
- Prediction study (prediction)
- Studies over time [cross sectional (inferential), longitudinal (inferential, predictive)]
- Retrospective (inferential)
Conceptual Challenges:
AutoEda Project Challenges
Notable Free Software
What's Novel?
philosophy - Semantic -> no ideal, structure/AI
data and methodologies reflect goals
complexity - Basic, many ways
Visually - Basic, many ways, using the right tool
meaning - All relative
Resource utilization - Synergy capture
Conflict of Interest statements
Data provenance, results, purpose, assumptions
What's important to encode and why?
Types of systems
Central Limit Theorem applications
Options When Cleaning Unclean Data
- Fill Drop Replace
- isnan, isfinite, defaultVal
- Col Info / Processing Outline
- GisHandler()
- readFile()
- mergeBounds()
- filterBounds()
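A minimal pandas/NumPy sketch of the fill/drop/replace options above (the GisHandler/readFile/mergeBounds/filterBounds names are project-specific helpers and are not shown; the frame and default value here are made up):

```python
import numpy as np
import pandas as pd

# Toy frame with missing and non-finite values.
df = pd.DataFrame({"speed": [1.2, np.nan, np.inf, 3.4],
                   "label": ["a", None, "b", "b"]})

# Turn non-finite numbers into NaN so a single missing-value policy applies.
df["speed"] = df["speed"].where(np.isfinite(df["speed"]))

DEFAULT_VAL = 0.0                                                  # defaultVal
filled   = df.fillna({"speed": DEFAULT_VAL, "label": "unknown"})   # Fill
dropped  = df.dropna()                                             # Drop
replaced = df.replace({"label": {"b": "B"}})                       # Replace

print(filled, dropped, replaced, sep="\n\n")
```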
Ideal Data:
- Descriptive - a whole population
- Exploratory - a random sample with many variables
- Inferential - the right population, randomly sampled
- Predictive - a training and test data set from the same population
- Causal - data from a randomized study
- Mechanistic - data about all components of the system.
Critical Questions
- Dimensionality
- Table Multi-Indices
- Grouping Identifiers
- ARIMA/Kriging
- ANOVA/ANCOVA
- Temporal/Physical bounds
- Analysis of averages of averages
- Data provenance
- Features Geometry = Coordinates/Rings
- Is this an act of classifying or categorizing?
Classes are categorized ideas.
- Which and what to apply? Does it truly matter in the end?
Yes, because of Shannon encoding. An approximation of the truth is not the truth, but it can sometimes be useful.
2. Data Understanding
Meta Data
We won't get too deep into this, but: data is like an ogre in that they both have layers. Data about data is called meta-data. Universal law: any and all 'observable' data has meta-data, which is then also 'observable'. There are thought to be 6 meta-physical cascading layers of meta-data that expand onto themselves indefinitely. When humans perform a superficial observation of the world around them, at that 'instant' all the data and its accompanying meta-data is 'metabolized' by our sensory processors solely at the 'instance' level of this hierarchy. Elicitation of further understanding of the data may be derived by critically introspecting on the observation. The deepest most people feel comfortable going without getting too abstract is the structural level (3), i.e.: "the relations between the observable properties of the original observation". Lower levels get into abstract concepts like observing the relations between words and language itself.
- Rules - cuts across other layers. buried in code at all levels
- 0A. Reaction & Transformation Rules -> Derivation Rules -> Facts / Queries -> Integration Constraints
- 0B. Example: From the Peano axioms in math we may derive further rules. From these rules...
- Domain - express local in terms of global; the sphere of all things a local app should know
- Referent (understand linkages between data models -> xsl/t, topic maps, data driven) languages, abstractions
- Structural (relational, object, hierarchical)
- Syntactic (type, language, msg length, source, bitrate, encryption level),
- Instance (actual data)
Data Types
Text Types
- CHAR( ) A fixed section from 0 to 255 characters long.
- VARCHAR( ) A variable section from 0 to 255 characters long.
- TINYTEXT A string with a maximum length of 255 characters.
- TEXT A string with a maximum length of 65535 characters.
- BLOB A string with a maximum length of 65535 characters.
- MEDIUMTEXT A string with a maximum length of 16777215 characters.
- MEDIUMBLOB A string with a maximum length of 16777215 characters.
- LONGTEXT A string with a maximum length of 4294967295 characters.
- LONGBLOB A string with a maximum length of 4294967295 characters.
- ENUM(x,y,z,etc.)
- SET A string object that can hold zero or more values from a defined list. (A logical/BOOLEAN field can be displayed as Yes/No, True/False, or On/Off.)
Number Types
- TINYINT -128 to 127 normal 0 to 255
- SMALLINT -32768 to 32767 normal 0 to 65535
- MEDIUMINT -8388608 to 8388607 normal 0 to 16777215
- INT -2147483648 to 2147483647 normal 0 to 4294967295
- BIGINT -9223372036854775808 to 9223372036854775807 normal 0 to 18446744073709551615
- FLOAT DOUBLE - A large number with a floating decimal point.
- DECIMAL - A DOUBLE stored as a string, allowing for a fixed decimal point.
- Money
- Real
Date/Time Types
Preferred Data Format: yyyy-mm-dd'T'hh:mm:ss.mmm
- DATE: YYYY-MM-DD
- DATETIME: YYYY-MM-DD HH:MM:SS
- TIMESTAMP: YYYYMMDDHHMMSS
- TIME: HH:MM:SS
- YEAR: YYYY
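A quick sketch of parsing the preferred timestamp format with pandas and rendering it back out in the layouts above (the sample value is made up):

```python
import pandas as pd

# Parse the preferred layout (yyyy-mm-ddTHH:MM:SS.mmm).
ts = pd.to_datetime("2021-03-14T09:26:53.589")

# Derive the common date/time parts used later in the Univariate section.
print(ts.year, ts.quarter, ts.month, ts.dayofweek, ts.day, ts.hour, ts.minute)

# Render back out in each of the layouts listed above.
print(ts.strftime("%Y-%m-%d"))           # DATE
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # DATETIME
print(ts.strftime("%Y%m%d%H%M%S"))       # TIMESTAMP
print(ts.strftime("%H:%M:%S"))           # TIME
print(ts.strftime("%Y"))                 # YEAR
```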
Other Types of Data :
- Records
- Objects
- Graph/Network
- Text
- Multimedia
- Relational/ Transactional
- PYTHON (Integer, Boolean, Float, Object)
- GIS( Point, Line, Polygon )( Spatial Projections )
Data Attributes
Collection Methodology attributes
- Census (descriptive),
- Observational study (inferential),
- convenience sample (all types-may be biased),
- randomized trial (causal).
- prediction study (prediction)
- studies over time [cross sectional(inferential)
- longitudinal (inferential, predictive)
- retrospective ( inferential)
Basics attributes
- Categorical Qualitative (Binomial, Nominal, Ordinal)
- Numerical Quantitative (Discrete or Continuous)(Interval/ Ratio)
- Binary- 1 or 0
- Nominal - (Hair color, Gender, Favorite Ice Cream) (Frequencies, Proportions, Percentages) (Transform to Numerical using One Hot Encoding; a short sketch follows this list) (Display using Pie or Bar Chart)
- Ordinal - Ordered without a known Magnitude
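The one-hot sketch referenced above, assuming pandas; the hair_color column is made up:

```python
import pandas as pd

# Hypothetical nominal attribute.
df = pd.DataFrame({"hair_color": ["brown", "black", "blonde", "brown"]})

# One-hot encoding turns the nominal column into indicator columns.
one_hot = pd.get_dummies(df["hair_color"], prefix="hair")
print(one_hot)

# Frequencies and proportions for the same nominal attribute.
print(df["hair_color"].value_counts())
print(df["hair_color"].value_counts(normalize=True))
```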
DESCRIPTIVE STATISTICS
What are Basic Statistics? Descriptive Statistics, Correlations, t-tests, frequency tables, cross tabulation
One basic and straightforward method for analyzing data is via crosstabulation. Log-linear analysis provides a more "sophisticated" way of looking at crosstabulation tables. Specifically, you can test the different factors that are used in the crosstabulation (e.g., gender, region, etc.) and their interactions for statistical significance. Fitting marginal frequencies: we could ask ourselves what the frequencies would look like if there were no relationship between the variables (the null hypothesis). Without going into details, intuitively one could expect that the frequencies in each cell would proportionately reflect the marginal frequencies (totals).
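A small sketch of that idea with pandas and SciPy: build the crosstabulation, then compare the observed cell counts against the expected frequencies implied by the marginal totals (the survey data here is made up):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey data: does response depend on region?
df = pd.DataFrame({"region":   ["north", "north", "south", "south", "south", "north"],
                   "response": ["yes",   "no",    "yes",   "yes",   "no",    "yes"]})

observed = pd.crosstab(df["region"], df["response"], margins=True)  # with marginal totals
print(observed)

# Expected frequencies under the null hypothesis of no relationship,
# i.e. cells proportional to the marginal totals, plus a chi-square test.
chi2, p, dof, expected = chi2_contingency(pd.crosstab(df["region"], df["response"]))
print(chi2, p)
print(expected)
```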
Meta-Data
- Population
- Statistical power
- Sample size
- Treating Missing data
Central Tendency
- Mean
- arithmetic
- geometric
- harmonic
- Median
- Mode
Dispersion
- Variance
- Standard deviation
- Percentile
- Range
- Interquartile range
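A minimal NumPy/SciPy sketch computing the central-tendency and dispersion measures listed above, on a made-up sample:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 13.0])   # made-up sample

# Central tendency
print(np.mean(x))               # arithmetic mean
print(stats.gmean(x))           # geometric mean
print(stats.hmean(x))           # harmonic mean
print(np.median(x))             # median
print(stats.mode(x).mode)       # mode

# Dispersion
print(np.var(x, ddof=1), np.std(x, ddof=1))   # sample variance / standard deviation
q1, q3 = np.percentile(x, [25, 75])
print(np.ptp(x), q3 - q1)                     # range and interquartile range
```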
Advanced attributes
- Dispersion - Median, Min, Max, Quartiles, Box Plots, Iqr, Normal distribution
- Bias - Expanded further below
- Variance - Standard deviation, Square root means
- Symmetry - Values are equally weighted; folding a histogram in half.
- Skew - The mean is smaller/larger than the median.
- Kurtosis - Fat/thin tails
- Normality - z-score
- Linearity - Alternative (splines)
- Heteroscedasticity - The variability of a variable is unequal across the range of values of a second variable that predicts it.
- Monotonicity - The direction of change (+ or -) never reverses.
Analytical attributes
Summary statistics (https://en.wikipedia.org/wiki/Summary_statistics)
- https://en.wikipedia.org/wiki/Sufficient_statistic
- https://en.wikipedia.org/wiki/Descriptive_statistics
- Location - Common measures of location, or central tendency, are the arithmetic mean, median, mode, and interquartile mean.[2][3]
- Spread - Common measures of statistical dispersion are the standard deviation, variance, range, interquartile range, absolute deviation, mean absolute difference and the distance standard deviation. Measures that assess spread in comparison to the typical size of data values include the coefficient of variation. The Gini coefficient was originally developed to measure income inequality and is equivalent to one of the L-moments. A simple summary of a dataset is sometimes given by quoting particular order statistics as approximations to selected percentiles of a distribution.
- Shape - Common measures of the shape of a distribution are skewness or kurtosis, while alternatives can be based on L-moments. A different measure is the distance skewness, for which a value of zero implies central symmetry.
- Dependence - The common measure of dependence between paired random variables is the Pearson product-moment correlation coefficient, while a common alternative summary statistic is Spearman's rank correlation coefficient. A value of zero for the distance correlation implies independence.
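A short SciPy sketch of the dependence measures named above, on made-up paired variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)      # made-up paired variables

r, p_r = stats.pearsonr(x, y)         # Pearson product-moment correlation
rho, p_rho = stats.spearmanr(x, y)    # Spearman rank correlation
print(r, rho)
```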
Attribute set operators
- Boolean: and, or, not
- Location: adjacent, contains, intersects, distance, within, touches, crosses, overlaps
- Analysis: distance (inner/outer); mean, avg, std; first and second order effects; centrography
- analyze global and local densities, quadrat/kernel.
- semantic web - W3C data integrity - domain ontology to solve syntax/structure
- Inventory systems show
- Query systems reveal
- Analysis systems explore
- Decision systems support
- Modeling systems process
- Monitoring systems are time centric
- GIS analysis - image processing, classification, surface analysis, visibility, gradient, aspect, network flows
- Voronoi for kNN/IDW
- mean_center: calculate the mean center of the unmarked point pattern.
- weighted_mean_center: calculate the weighted mean center of the marked point pattern.
- manhattan_median: calculate the manhattan median
- euclidean_median: calculate the Euclidean median Dispersion and Orientation
- std_distance: calculate the standard distance
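These names resemble the centrography helpers in PySAL's pointpats package (an assumption; the notes don't name the library). A plain NumPy approximation of the mean center, weighted mean center, and standard distance, on a made-up point pattern:

```python
import numpy as np

# Made-up point pattern: n x 2 array of (x, y) coordinates, with optional weights (marks).
pts = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [5.0, 3.0]])
w = np.array([1.0, 2.0, 1.0, 4.0])

mean_center = pts.mean(axis=0)                                    # mean_center
weighted_mean_center = (pts * w[:, None]).sum(axis=0) / w.sum()   # weighted_mean_center

# Standard distance: RMS Euclidean distance of points from the mean center.
dists = np.linalg.norm(pts - mean_center, axis=1)
std_distance = np.sqrt((dists ** 2).mean())

print(mean_center, weighted_mean_center, std_distance)
```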
Descriptive statistics: when one's data are not normally distributed, and the measurements at best contain rank order information, then computing the standard descriptive statistics (e.g., mean, standard deviation) is sometimes not the most informative way to summarize the data. Nonparametrics and Distributions will compute a wide variety of measures of location (mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.) to provide the "complete picture" of one's data.
Dataset
- Size -> Number of variables
- Shape -> Number of observations
- Dimension
- total missing, memory, avg size
- count of columns by type
- important info/ warnings
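A minimal pandas sketch of these dataset-level summaries, on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan], "b": ["x", "y", "y"], "c": [0.1, 0.2, 0.3]})  # toy data

n_rows, n_cols = df.shape                 # number of observations / variables
print(n_rows, n_cols)
print(df.dtypes.value_counts())           # count of columns by type
print(df.isna().sum().sum())              # total missing
print(df.memory_usage(deep=True).sum())   # memory footprint in bytes
```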
Univariate
setDtype
- getUnique/Count/Percent/Rate
- getMissing/Count/Percent/Rate
getDtype
  - Categorical Qualitative (Binomial, Nominal, Ordinal)
  - Numerical Quantitative (Discrete or Continuous)(Interval/ Ratio)
  - Binary - 1 or 0
  - Nominal - (Hair color, Gender, Favorite Ice Cream) (Frequencies, Proportions, Percentages)(Transform to Numerical using One Hot Encoding)(Display using Pie or Bar Chart)
  - Ordinal - Ordered without a known Magnitude
- Boolean
- Numeric -- Skew -- kurtosis -- Normality Test -- Mean -- stdev -- median -- mode -- min -- max -- Visualization of (Sample) Data
- Dispersion - Median, Min, Max, Quartiles, Box Plots, Iqr, Normal distribution
- Bias - Expanded further below
- Variance - Standard deviation, Square root means
- Symmetry - Values are equally weighted; folding a histogram in half.
- Skew - The mean is smaller/larger than the median.
- Kurtosis - Fat/thin tails
- Normality - z-score
- Linearity - Alternative (splines)
- Heteroscedasticity - The variability of a variable is unequal across the range of values of a second variable that predicts it.
- Monotonicity - The direction of change (+ or -) never reverses.
- Big-Data, Structure(Semi/Un), Time-Stamped, Spatial, Spatio-Temporal, Ordered, Stream, Dimensionality,
- Primary Keys, Unique Values, Index, Spatial, Auto Increment, Default Values, Null Values
- noise
- Categorical
- Geospatial (country,countrycode, city, metro, etc...)
- String
- Location - Common measures of location, or central tendency, are the arithmetic mean, median, mode, and interquartile mean.
- Spread - Common measures of statistical dispersion are the standard deviation, variance, range, interquartile range, absolute deviation, mean absolute difference and the distance standard deviation. Measures that assess spread in comparison to the typical size of data values include the coefficient of variation. The Gini coefficient was originally developed to measure income inequality and is equivalent to one of the L-moments. A simple summary of a dataset is sometimes given by quoting particular order statistics as approximations to selected percentiles of a distribution.
- Shape - Common measures of the shape of a distribution are skewness or kurtosis, while alternatives can be based on L-moments. A different measure is the distance skewness, for which a value of zero implies central symmetry.
- Dependence - The common measure of dependence between paired random variables is the Pearson product-moment correlation coefficient, while a common alternative summary statistic is Spearman's rank correlation coefficient. A value of zero for the distance correlation implies independence. The Chi-Square test helps you determine if two discrete variables are associated
- Outlier detection:
- BoxPlot -> IQR
- Q3+1.5IQR
- Count, Mean, Std, Min, 25%, 50%, 75%, Max
- Z score -> z = (x - μ)/σ, where μ is the mean and σ is the standard deviation (see the sketch after this list).
- stats.zscore(df)
- Date/Time Types (quarter, month, dayofweek, dayofmonth, hour, minute)
- DATE YYYY-MM-DD
- DATETIME YYYY-MM-DD HH:MM:SS
- TIMESTAMP YYYYMMDDHHMMSS
- TIME HH:MM:SS
- YEAR YYYY
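The outlier-detection sketch referenced above, assuming pandas and SciPy; the series is made up and the cutoffs are common rules of thumb, not fixed standards:

```python
import numpy as np
import pandas as pd
from scipy import stats

s = pd.Series([9.0, 10.0, 10.5, 11.0, 9.5, 30.0])   # made-up data with one outlier

# IQR rule: flag anything beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: z = (x - mean) / std; common cutoffs are |z| > 2 or 3.
z = stats.zscore(s)
z_outliers = s[np.abs(z) > 2]

print(s.describe())          # Count, Mean, Std, Min, 25%, 50%, 75%, Max
print(iqr_outliers, z_outliers)
```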
3. Data Exploration
Refer to methods_tutorial.ipynb, Viz Notes Examples.ipynb, and DistributionsAndTest.ipynb for more.
In a box plot, the median's offset from the quartiles represents skew, and the whisker length reflects the spread of the data.
4. Data Preparation
Data Prep
Data preparation is the act of manipulating (or pre-processing) raw data (which may come from disparate data sources) into a form that can readily and accurately be analysed, e.g. for business purposes... Data preparation is the first step in data analytics projects and can include many discrete tasks such as loading data or data ingestion, data fusion, data cleaning, data augmentation, and data delivery... The issues to be dealt with fall into two main categories: systematic errors involving large numbers of data records, probably because they have come from different sources; individual errors affecting small numbers of data records, probably due to errors in the original data entry.
Data preprocessing has the objective of filling in missing values, aggregating information, labeling data with categories (data binning), and smoothing a trajectory.
Tasks of data pre-processing
- Data cleansing - replacing, modifying, or deleting incomplete, incorrect, inaccurate or irrelevant parts of the data
- Data editing - For Quality Control
- Data reduction
- Data wrangling - Transforming the data format for better handling
OUT OF ALL THESE LINKS PLEASE READ THE WIKI PAGES ON Data cleansing and Data Quality.
They cover the set of quality criteria that high-quality data needs to pass and the process involved.
Data processing may involve various processes, including:
- Validation – Ensuring that supplied data is correct and relevant.
- Sorting – "arranging items in some sequence and/or in different sets."
- Summarization – reducing detail data to its main points.
- Aggregation – combining multiple pieces of data.
- Analysis – the "collection, organization, analysis, interpretation and presentation of data."
- Reporting – list detail or summary data or computed information.
- Classification – separation of data into various categories.
Data processing system Data_processing_system
A Data processing system may involve some combination of:
- Conversion – converting data to another form or language.
- Reporting – list detail or summary data or computed information.
Data Processing Systems by service type
- Transaction processing systems
- Information storage and retrieval systems
- Command and control systems
- Computing service systems
- Process control systems
- Message switching systems
Validation
Validation Types Data_validation
- Range and constraint validation;
- Code and Cross-reference validation;
- Structured validation
Validation methods
- Allowed character checks
- Batch totals
- Cardinality check
- Check digits
- Consistency checks
- Control totals
- Cross-system consistency checks
- Data type checks
- File existence check
- Format or picture check
- Hash totals
- Limit check
- Logic check
- Presence check
- Range check
- Referential integrity
- Spelling and grammar check
- Uniqueness check
- Table lookup check
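A small pandas sketch of a few of the validation methods above (data type, range, uniqueness, table lookup, and presence checks); the records and lookup table are made up:

```python
import pandas as pd

records = pd.DataFrame({"id": [1, 2, 2, 4],
                        "age": [34, -5, 61, 29],
                        "country": ["US", "DE", "XX", "FR"]})   # made-up records
valid_countries = {"US", "DE", "FR"}                            # made-up lookup table

checks = {
    "data type check":    pd.api.types.is_integer_dtype(records["age"]),
    "range check":        records["age"].between(0, 120).all(),
    "uniqueness check":   records["id"].is_unique,
    "table lookup check": records["country"].isin(valid_countries).all(),
    "presence check":     records.notna().all().all(),
}
print(checks)   # failed checks would feed an enforcement or advisory action
```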
Post-Validation Actions
- Enforcement Action
- Advisory Action
- Verification Action
- Log of validation
Data Transformation
May include: Projecting data, Transforming Multiple Classes to Binary ones, Calibrating Class Probabilities, Cleaning the data, and even sampling it. Typically it follows this path:
- Data discovery is the first step in the data transformation process. Typically the data is profiled using profiling tools or sometimes using manually written profiling scripts to better understand the structure and characteristics of the data and decide how it needs to be transformed.
- Data mapping is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated etc. to produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they work in the specific technologies to define the transformation rules (e.g. visual ETL tools,[3] transformation languages).
- Code generation is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions) that will transform the data based on the desired and defined data mapping rules.[4] Typically, the data transformation technologies generate this code[5] based on the definitions or metadata defined by the developers.
- Code execution is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.
- Data review is the final step in the process, which focuses on ensuring the output data meets the transformation requirements. It is typically the business user or final end-user of the data that performs this step. Any anomalies or errors in the data that are found are communicated back to the developer or data analyst as new requirements to be implemented in the transformation process.
5. Modeling
Statistical Challenges With Data
- Options tree to show Pessimistic, Nominal, and Optimistic versions
- Performance vs Risk vs Design analysis
Changing variance (heteroscedasticity) - what can you do? (see the sketch after this list)
- Box-Cox transform
- variance-stabilizing transform
- weighted least squares
- Huber-White standard errors
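The sketch referenced above: a Box-Cox transform and a simple square-root variance-stabilizing transform using SciPy, on a made-up right-skewed sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # made-up right-skewed, positive response

# Box-Cox requires strictly positive data; lambda is estimated by maximum likelihood.
y_bc, lam = stats.boxcox(y)
print(lam)

# A simple variance-stabilizing alternative for count-like data is the square root.
y_sqrt = np.sqrt(y)
print(y.var(), y_bc.var(), y_sqrt.var())
```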
ANOVA Requirements
- Normal Distribution
- Independent Samples/Groups
- The independent samples t-test requires the assumption of homogeneity of variance.
- Run a test for the homogeneity of variance, such as Levene's test, whenever you run an independent samples t-test (see the sketch after this list).
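The sketch referenced above: Levene's test for homogeneity of variance followed by an independent samples t-test, using SciPy on made-up groups; if Levene's test rejects equal variances, Welch's version (equal_var=False) is the usual fallback:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)   # made-up independent samples
group_b = rng.normal(loc=5.5, scale=2.0, size=40)

# Levene's test for homogeneity of variance (the assumption named above).
w, p_levene = stats.levene(group_a, group_b)

# If variances differ, use Welch's t-test instead of the pooled version.
equal_var = p_levene > 0.05
t, p_t = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(p_levene, p_t)
```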
MISC
- P-values require knowing how many records exist in the database.
- Outliers profoundly influence the slope of the regression line and the correlation coefficient.
- The correlation coefficient alone is not enough for decision making (i.e., scatterplots are always recommended).
- Preferred Data Format: yyyy-mm-dd'T'hh:mm:ss.mmm
Bias:
- Quantitative Approach to Outliers.
- Correlations in Non-homogeneous Groups
- Nonlinear Relations between Variables. - Pearson R measures linearity
- Exploratory Examination of Correlation Matrices
- Casewise vs. Pairwise Deletion of Missing Data
- Overfitting: Pruning/ Cross Validation
- Breakdown Analysis
- Frequency Tables
- Cross Tabulation
- Marginal Frequencies
- Association Rules
ML and ADE
ML is a helpful tool that can aid our data exploration, and more and more, these tools can even be used to make predictions!
Services usually satisfy one or more of these steps:
- processing
- Partitions
- Parallelizations
- Map Reduce
- visualizing
- exploring
- feature engineering
- model validation
- predictive modeling
- Train, Test, Apply
- Feature engineering
- architecture search/transfer learning
- parameter tuning
- model selection
- model ensembling
- model distillation
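A minimal end-to-end sketch of the train/test/predict and model-validation steps above, assuming scikit-learn and its bundled iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)                   # stand-in dataset

# Train / test split, fit, predict — the "Train, Test, Apply" step above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                  # held-out accuracy

# Model validation via cross-validation (one simple form of model selection).
print(cross_val_score(model, X_train, y_train, cv=5).mean())
```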