- Background
- Business Understanding
- Meta-Data
- Data Types - Text, Number, Date/Time, other
- Data Attributes - Nominal, Symmetry, Skew, modality, kurtosis, etc
- Data Understanding
- Data Exploration
- Graphics
- Data Preparation
- Data Modeling
- Evaluation
0. Background
What is Data Science
Principally, a data scientist's job is to apply scientific methods to data in a process often referred to as data mining. The purpose of this mining is to find informative patterns in the data and then to turn that information into meaningful, actionable knowledge. Often the exploration of the data and the development of something actionable is accompanied by machine learning algorithms. While contentious, statistics is generally understood as more focused on quantitative descriptions of data, and data science grew out of statistics as an applied branch of the field with an emphasis on qualitative data and the actionability of its output. source
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s)... Data mining is the process of applying [machine learning] methods with the intention of uncovering hidden patterns[16] in large data sets. It bridges the gap between applied statistics and artificial intelligence.
Data Mining Tasks
- Anomaly Detection
- Association Rule Learning
- Clustering
- Classification
- Regression
- Summarization
Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the same topic (KDD-1989), and this term became more popular in the AI and machine learning community. However, the term data mining became more popular in the business and press communities.[14] Currently, the terms data mining and knowledge discovery are used interchangeably. data-mining
The simplest conceptual model of the data scientist's job is as follows:
- Pre-process
- Data Mine
- Results Validation
The Knowledge Discovery in Databases (KDD) process uses these 5 steps:
- Selection
- Pre-processing
- Transformation
- Data mining
- Interpretation/evaluation
And a more modern, competing approach is CRISP-DM, which splits the data scientist's responsibilities along these 6 themes:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
I like to break out the data science steps like so:
- Acquire - (Selection & Business Understanding) Understanding what's needed, retrieving data, and ingesting it through preliminary exploration and storage.
- Explore - (Data Understanding) Find extremes, ranges, distributions, anomalies, etc. Performed after any data processing is done.
- Pre-process - Cleaning the data for further analytical activities and exploration by filtering or performing non-critical prep work on the data, like removing noise or missing values, or meeting size/time constraints. Typically only done once.
- Process - Once the raw data is cleaned and generically prepped, additional transformational processes may be required to make the data usable for a specific scientific model. Augmentations may include sorting, filtering and arranging, and aggregating and integrating. Many different models and processes may be used, each deserving its own exploration.
- Model - Train, test, predict: regressions, clusterings, etc. Analysis is performed by applying models to the data and recording the output; this includes finding correlations and contextualizing any discoveries.
- Act - Interpret the data, draw evaluative conclusions, and communicate them in the form of something presentable/actionable.
Where this way of thinking differentiates itself from the former two models is that the first step is ideally performed only once, whereas the second step is performed after any data is changed. The third step is conditionally applied from CRISP-DM in that it breaks out the pre-processing and integration/transformation steps, as they are sometimes optional and a big enough topic to deserve their own section. This article has been sectioned along CRISP-DM standards.
At one point in time I also described it like so:
Process - Data - raw/ processed data.
Figures - exploratory/ final figures.
Code - raw scripts, final scripts.
Text - readme / analysis.
Steps - define the question, define the ideal data, obtain data, clean data, exploratory data analysis, statistical prediction/modeling, interpret results, challenge results, synthesize/write up results. Create reproducible code.
Language (describe, correlate/associate, lead to/cause, predict). Interpret and explain.
Challenge all steps - question, data, processing, analysis, conclusion.
Synthesize/write up results -> lead with the question, summarize the analysis, order analyses as a story rather than chronologically. Include pretty figures.
Geo-Spatial-Time-Series
To work with this data, data scientists can use these tools:
- Model
- Process
- Visual
- Exploratory spatial analysis
- spatial autocorrelation
- spatial regression
- interpolation
- grid based stats
- point based stats
- spatial network analysis
- spatial clustering.
Free Software
Free software for data analysis includes:
DevInfo – a database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
ELKI – data mining framework in Java with data mining oriented visualization functions.
KNIME – the Konstanz Information Miner, a user friendly and comprehensive data analytics framework.
Orange – A visual programming tool featuring interactive data visualization and methods for statistical data analysis, data mining, and machine learning.
Pandas – Python library for data analysis
PAW – FORTRAN/C data analysis framework developed at CERN
R – a programming language and software environment for statistical computing and graphics.
ROOT – C++ data analysis framework developed at CERN
SciPy – Python library for data analysis
Vega-Lite – { description, name, title, data, mark, encoding, transform }
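The keys listed for Vega-Lite are the top-level properties of a Vega-Lite JSON specification. A minimal sketch of such a spec, written as a Python dict (the field names and data values are made up for illustration):

```python
import json

# Minimal Vega-Lite specification sketch using the top-level keys listed above.
# The "category"/"amount" fields and their values are hypothetical.
spec = {
    "description": "A simple bar chart",
    "name": "example_bars",
    "title": "Example Bar Chart",
    "data": {"values": [{"category": "A", "amount": 28},
                        {"category": "B", "amount": 55}]},
    "transform": [{"filter": "datum.amount > 0"}],
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "amount", "type": "quantitative"},
    },
}

print(json.dumps(spec, indent=2))  # paste into a Vega-Lite editor to render
```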
Existing Services
- Northstar
- automl
- h2o / autokeras
- autosklearn
- googlecloudml
- tpot
- dive
- orange
- xl
- spss
- vega_2
Existing Service Features
Available attributes -> Principal-Axis Visualization, Secondary/Tertiary Axes
Sources | sheet1| + | + | + | + ..
Hierarchy Maps -> Dimensions
Marks(map.bar.etc) colors, label, tooltip, size, detail
Filter(range), show only relevant, include nulls
Colors -> by frequency or what?
Specific style rules for each column/row
transformation/prediction, nlp/queries
1. Business Understanding
Systems = Infrastructure => Technical (hardware, people, processes), integrate
Software = Apps => Business
Data provenance
Features Geometry = Coordinates/ Rings
Challenges
80% of the time spent on client requests goes to [access, format, connect, and fix]; often the solutions to these requests are client-specific artifacts or heuristic/classification requirements.
Administrative data is heavily Biased and Dependent. Survey data less so
Dependent data does not lend itself to Bayesian methods.
Depending on Collection Methodology
- I need to know (which/if any) of these are not applicable for our purposes
- Census (descriptive)
- Observational study (inferential)
- Convenience sample (all types - may be biased)
- Randomized trial (causal)
Other types:
- Prediction study (prediction)
- Studies over time [cross sectional (inferential), longitudinal (inferential, predictive)]
- Retrospective (inferential)
Conceptual Challenges:
AutoEda Project Challenges
Notable Free Software
What's Novel?
philosophy - Semantic -> no ideal, structure/AI
data and methodologies reflect goals
complexity - Basic, many ways
Visually - Basic, many ways, using the right tool
meaning - All relative
Resource utilization - Synergy capture
Conflict of Interest statements
Data provenance, results, purpose, assumptions
What's important to encode and why?
Types of systems
Central Limit Theorem applications
Options When Cleaning Unclean Data
- Fill Drop Replace
- isnan, isfinite, defaultVal
- Col Info / Processing Outline
- GisHandler()
- readFile()
- mergeBounds()
- filterBounds()
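A minimal pandas/NumPy sketch of the fill/drop/replace options above (the GisHandler/readFile/mergeBounds/filterBounds names are project-specific helpers and are not shown; the frame and default value here are made up):

```python
import numpy as np
import pandas as pd

# Toy frame with missing and non-finite values.
df = pd.DataFrame({"speed": [1.2, np.nan, np.inf, 3.4],
                   "label": ["a", None, "b", "b"]})

# Turn non-finite numbers into NaN so a single missing-value policy applies.
df["speed"] = df["speed"].where(np.isfinite(df["speed"]))

DEFAULT_VAL = 0.0                                                  # defaultVal
filled   = df.fillna({"speed": DEFAULT_VAL, "label": "unknown"})   # Fill
dropped  = df.dropna()                                             # Drop
replaced = df.replace({"label": {"b": "B"}})                       # Replace

print(filled, dropped, replaced, sep="\n\n")
```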
Ideal Data:
- Descriptive - a whole population
- Exploratory - a random sample with many variables
- Inferential - the right population, randomly sampled
- Predictive - a training and test data set from the same population
- Causal - data from a randomized study
- Mechanistic - data about all components of the system.
Critical Questions
- Dimensionality
- Table Multi-Indices
- Grouping Identifiers
- ARIMA/Kriging
- ANOVA/ANCOVA
- Temporal/Physical bounds
- Analysis of averages of averages
- Data provenance
- Features Geometry = Coordinates/Rings
- Is this an act of classifying or categorizing?
Classes are categorized ideas.
- Which and what to apply? Does it truly matter in the end?
Yes, because of Shannon encoding. An approximation of the truth is not the truth, but it can sometimes be useful.
2. Data Understanding
Meta Data
We won't get too deep into this, but: data is like an ogre in that they both have layers. Data about data is called meta-data. Universal law: any and all 'observable' data has meta-data, which is then also 'observable'. There are thought to be 6 meta-physical cascading layers of meta-data that expand onto themselves indefinitely. When humans perform a superficial observation of the world around them, at that 'instant' all the data and its accompanying meta-data is 'metabolized' by our sensory processors solely at the 'instance' level of this hierarchy. Elicitation of further understanding of the data may be derived by critically introspecting on the observation. The deepest most people feel comfortable going without getting too abstract is the structural level (3), i.e.: "the relations between the observable properties of the original observation". Lower levels get into abstract concepts like observing the relations between words and language itself.
- Rules - cuts across other layers. buried in code at all levels
- 0A. Reaction & Transformation Rules -> Derivation Rules -> Facts / Queries -> Integration Constraints
- 0B. Example: From the Peano axioms in math we may derive further rules. From these rules...
- Domain - express local in terms of global; the sphere of all things a local app should know
- Referent (understand linkages between data models -> xsl/t, topic maps, data driven) languages, abstractions
- Structural (relational, object, hierarchical)
- Syntactic (type, language, msg length, source, bitrate, encryption level),
- Instance (actual data)
Data Types
Text Types
- CHAR( ) A fixed section from 0 to 255 characters long.
- VARCHAR( ) A variable section from 0 to 255 characters long.
- TINYTEXT A string with a maximum length of 255 characters.
- TEXT A string with a maximum length of 65535 characters.
- BLOB A string with a maximum length of 65535 characters.
- MEDIUMTEXT A string with a maximum length of 16777215 characters.
- MEDIUMBLOB A string with a maximum length of 16777215 characters.
- LONGTEXT A string with a maximum length of 4294967295 characters.
- LONGBLOB A string with a maximum length of 4294967295 characters.
- ENUM(x,y,z,etc.)
- SET A string object that can hold zero or more values from a defined list. (A logical/BOOLEAN field can be displayed as Yes/No, True/False, or On/Off.)
Number Types
- TINYINT -128 to 127 normal 0 to 255
- SMALLINT -32768 to 32767 normal 0 to 65535
- MEDIUMINT -8388608 to 8388607 normal 0 to 16777215
- INT -2147483648 to 2147483647 normal 0 to 4294967295
- BIGINT -9223372036854775808 to 9223372036854775807 normal 0 to 18446744073709551615
- FLOAT DOUBLE - A large number with a floating decimal point.
- DECIMAL - A DOUBLE stored as a string, allowing for a fixed decimal point.
- Money
- Real
Date/Time Types
Preferred Data Format: yyyy-mm-dd'T'hh:mm:ss.mmm
- DATE: YYYY-MM-DD
- DATETIME: YYYY-MM-DD HH:MM:SS
- TIMESTAMP: YYYYMMDDHHMMSS
- TIME: HH:MM:SS
- YEAR: YYYY
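A quick sketch of parsing the preferred timestamp format with pandas and rendering it back out in the layouts above (the sample value is made up):

```python
import pandas as pd

# Parse the preferred layout (yyyy-mm-ddTHH:MM:SS.mmm).
ts = pd.to_datetime("2021-03-14T09:26:53.589")

# Derive the common date/time parts used later in the Univariate section.
print(ts.year, ts.quarter, ts.month, ts.dayofweek, ts.day, ts.hour, ts.minute)

# Render back out in each of the layouts listed above.
print(ts.strftime("%Y-%m-%d"))           # DATE
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # DATETIME
print(ts.strftime("%Y%m%d%H%M%S"))       # TIMESTAMP
print(ts.strftime("%H:%M:%S"))           # TIME
print(ts.strftime("%Y"))                 # YEAR
```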
Other Types of Data :
- Records
- Objects
- Graph/Network
- Text
- Multimedia
- Relational/ Transactional
- PYTHON (Integer, Boolean, Float, Object)
- GIS( Point, Line, Polygon )( Spatial Projections )
Data Attributes
Collection Methodology attributes
- Census (descriptive),
- Observational study (inferential),
- convenience sample (all types-may be biased),
- randomized trial (causal).
- prediction study (prediction)
- studies over time [cross sectional(inferential)
- longitudinal (inferential, predictive)
- retrospective ( inferential)
Basics attributes
- Categorical Qualitative (Binomial, Nominal, Ordinal)
- Numerical Quantitative (Discrete or Continuous)(Interval/ Ratio)
- Binary- 1 or 0
- Nominal - (Hair color, Gender, Favorite Ice Cream) (Frequencies, Proportions, Percentages) (Transform to Numerical using One Hot Encoding; a short sketch follows this list) (Display using Pie or Bar Chart)
- Ordinal - Ordered without a known Magnitude
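The one-hot sketch referenced above, assuming pandas; the hair_color column is made up:

```python
import pandas as pd

# Hypothetical nominal attribute.
df = pd.DataFrame({"hair_color": ["brown", "black", "blonde", "brown"]})

# One-hot encoding turns the nominal column into indicator columns.
one_hot = pd.get_dummies(df["hair_color"], prefix="hair")
print(one_hot)

# Frequencies and proportions for the same nominal attribute.
print(df["hair_color"].value_counts())
print(df["hair_color"].value_counts(normalize=True))
```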
DESCRIPTIVE STATISTICS
What are Basic Statistics? Descriptive Statistics, Correlations, t-tests, frequency tables, cross tabulation
One basic and straightforward method for analyzing data is via crosstabulation. Log-linear analysis provides a more "sophisticated" way of looking at crosstabulation tables. Specifically, you can test the different factors that are used in the crosstabulation (e.g., gender, region, etc.) and their interactions for statistical significance. Fitting marginal frequencies: we could ask ourselves what the frequencies would look like if there were no relationship between the variables (the null hypothesis). Without going into details, intuitively one could expect that the frequencies in each cell would proportionately reflect the marginal frequencies (totals).
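A small sketch of that idea with pandas and SciPy: build the crosstabulation, then compare the observed cell counts against the expected frequencies implied by the marginal totals (the survey data here is made up):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey data: does response depend on region?
df = pd.DataFrame({"region":   ["north", "north", "south", "south", "south", "north"],
                   "response": ["yes",   "no",    "yes",   "yes",   "no",    "yes"]})

observed = pd.crosstab(df["region"], df["response"], margins=True)  # with marginal totals
print(observed)

# Expected frequencies under the null hypothesis of no relationship,
# i.e. cells proportional to the marginal totals, plus a chi-square test.
chi2, p, dof, expected = chi2_contingency(pd.crosstab(df["region"], df["response"]))
print(chi2, p)
print(expected)
```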
Meta-Data
- Population
- Statistical power
- Sample size
- Treating Missing data
Central Tendency
- Mean
- arithmetic
- geometric
- harmonic
- Median
- Mode
Dispersion
- Variance
- Standard deviation
- Percentile
- Range
- Interquartile range
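A minimal NumPy/SciPy sketch computing the central-tendency and dispersion measures listed above, on a made-up sample:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 13.0])   # made-up sample

# Central tendency
print(np.mean(x))               # arithmetic mean
print(stats.gmean(x))           # geometric mean
print(stats.hmean(x))           # harmonic mean
print(np.median(x))             # median
print(stats.mode(x).mode)       # mode

# Dispersion
print(np.var(x, ddof=1), np.std(x, ddof=1))   # sample variance / standard deviation
q1, q3 = np.percentile(x, [25, 75])
print(np.ptp(x), q3 - q1)                     # range and interquartile range
```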
Advanced attributes
- Dispersion - Median, Min, Max, Quartiles, Box Plots, Iqr, Normal distribution
- Bias - Expanded further below
- Variance - Standard deviation, Square root means
- Symmetry - Values are equally weighted; folding a histogram in half.
- Skew - The mean is smaller/larger than the median.
- Kurtosis - Fat/thin tails
- Normality - z-score
- Linearity - Alternative (splines)
- Heteroscedasticity - The variability of a variable is unequal across the range of values of a second variable that predicts it.
- Monotonicity - The direction of change (+ or -) never reverses.
Analytical attributes
Summary statistics (https://en.wikipedia.org/wiki/Summary_statistics)
- https://en.wikipedia.org/wiki/Sufficient_statistic
- https://en.wikipedia.org/wiki/Descriptive_statistics
- Location - Common measures of location, or central tendency, are the arithmetic mean, median, mode, and interquartile mean.[2][3]
- Spread - Common measures of statistical dispersion are the standard deviation, variance, range, interquartile range, absolute deviation, mean absolute difference and the distance standard deviation. Measures that assess spread in comparison to the typical size of data values include the coefficient of variation. The Gini coefficient was originally developed to measure income inequality and is equivalent to one of the L-moments. A simple summary of a dataset is sometimes given by quoting particular order statistics as approximations to selected percentiles of a distribution.
- Shape - Common measures of the shape of a distribution are skewness or kurtosis, while alternatives can be based on L-moments. A different measure is the distance skewness, for which a value of zero implies central symmetry.
- Dependence - The common measure of dependence between paired random variables is the Pearson product-moment correlation coefficient, while a common alternative summary statistic is Spearman's rank correlation coefficient. A value of zero for the distance correlation implies independence.
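A short SciPy sketch of the dependence measures named above, on made-up paired variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)      # made-up paired variables

r, p_r = stats.pearsonr(x, y)         # Pearson product-moment correlation
rho, p_rho = stats.spearmanr(x, y)    # Spearman rank correlation
print(r, rho)
```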
Attribute set operators
- Boolean: and, or, not
- Location: adjacent, contains, intersects, distance, within, touches, crosses, overlaps
- Analysis: distance (inner/outer); mean, avg, std; first and second order effects; centrography
- analyze global and local densities, quadrat/kernel.
- semantic web - W3C data integrity - domain ontology to solve syntax/structure
- Inventory systems show
- Query systems reveal
- Analysis systems explore
- Decision systems support
- Modeling systems process
- Monitoring systems are time centric
- GIS analysis - image processing, classification, surface analysis, visibility, gradient, aspect, network flows
- Voronoi for kNN/IDW
- mean_center: calculate the mean center of the unmarked point pattern.
- weighted_mean_center: calculate the weighted mean center of the marked point pattern.
- manhattan_median: calculate the manhattan median
- euclidean_median: calculate the Euclidean median Dispersion and Orientation
- std_distance: calculate the standard distance
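These names resemble the centrography helpers in PySAL's pointpats package (an assumption; the notes don't name the library). A plain NumPy approximation of the mean center, weighted mean center, and standard distance, on a made-up point pattern:

```python
import numpy as np

# Made-up point pattern: n x 2 array of (x, y) coordinates, with optional weights (marks).
pts = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [5.0, 3.0]])
w = np.array([1.0, 2.0, 1.0, 4.0])

mean_center = pts.mean(axis=0)                                    # mean_center
weighted_mean_center = (pts * w[:, None]).sum(axis=0) / w.sum()   # weighted_mean_center

# Standard distance: RMS Euclidean distance of points from the mean center.
dists = np.linalg.norm(pts - mean_center, axis=1)
std_distance = np.sqrt((dists ** 2).mean())

print(mean_center, weighted_mean_center, std_distance)
```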
Descriptive statistics: when one's data are not normally distributed, and the measurements at best contain rank order information, then computing the standard descriptive statistics (e.g., mean, standard deviation) is sometimes not the most informative way to summarize the data. Nonparametrics and Distributions will compute a wide variety of measures of location (mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.) to provide the "complete picture" of one's data.
Dataset
- Size -> Number of variables
- Shape -> Number of observations
- Dimension
- total missing, memory, avg size
- count of columns by type
- important info/ warnings
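A minimal pandas sketch of these dataset-level summaries, on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan], "b": ["x", "y", "y"], "c": [0.1, 0.2, 0.3]})  # toy data

n_rows, n_cols = df.shape                 # number of observations / variables
print(n_rows, n_cols)
print(df.dtypes.value_counts())           # count of columns by type
print(df.isna().sum().sum())              # total missing
print(df.memory_usage(deep=True).sum())   # memory footprint in bytes
```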
Univariate
setDtype
- getUnique/Count/Percent/Rate
- getMissing/Count/Percent/Rate
getDtype
  - Categorical Qualitative (Binomial, Nominal, Ordinal)
  - Numerical Quantitative (Discrete or Continuous)(Interval/ Ratio)
  - Binary - 1 or 0
  - Nominal - (Hair color, Gender, Favorite Ice Cream) (Frequencies, Proportions, Percentages)(Transform to Numerical using One Hot Encoding)(Display using Pie or Bar Chart)
  - Ordinal - Ordered without a known Magnitude
- Boolean
- Numeric -- Skew -- kurtosis -- Normality Test -- Mean -- stdev -- median -- mode -- min -- max -- Visualization of (Sample) Data
- Dispersion - Median, Min, Max, Quartiles, Box Plots, Iqr, Normal distribution
- Bias - Expanded further below
- Variance - Standard deviation, Square root means
- Symmetry - Values are equally weighted; folding a histogram in half.
- Skew - The mean is smaller/larger than the median.
- Kurtosis - Fat/thin tails
- Normality - z-score
- Linearity - Alternative (splines)
- Heteroscedasticity - The variability of a variable is unequal across the range of values of a second variable that predicts it.
- Monotonicity - The direction of change (+ or -) never reverses.
- Big-Data, Structure(Semi/Un), Time-Stamped, Spatial, Spatio-Temporal, Ordered, Stream, Dimensionality,
- Primary Keys, Unique Values, Index, Spatial, Auto Increment, Default Values, Null Values
- noise
- Categorical
- Geospatial (country,countrycode, city, metro, etc...)
- String
- Location - Common measures of location, or central tendency, are the arithmetic mean, median, mode, and interquartile mean.
- Spread - Common measures of statistical dispersion are the standard deviation, variance, range, interquartile range, absolute deviation, mean absolute difference and the distance standard deviation. Measures that assess spread in comparison to the typical size of data values include the coefficient of variation. The Gini coefficient was originally developed to measure income inequality and is equivalent to one of the L-moments. A simple summary of a dataset is sometimes given by quoting particular order statistics as approximations to selected percentiles of a distribution.
- Shape - Common measures of the shape of a distribution are skewness or kurtosis, while alternatives can be based on L-moments. A different measure is the distance skewness, for which a value of zero implies central symmetry.
- Dependence - The common measure of dependence between paired random variables is the Pearson product-moment correlation coefficient, while a common alternative summary statistic is Spearman's rank correlation coefficient. A value of zero for the distance correlation implies independence. The Chi-Square test helps you determine if two discrete variables are associated
- Outlier detection:
- BoxPlot -> IQR
- Q3+1.5IQR
- Count, Mean, Std, Min, 25%, 50%, 75%, Max
- Z score -> z = (x - μ)/σ, where μ is the mean and σ is the standard deviation (see the sketch after this list).
- stats.zscore(df)
- Date/Time Types (quarter, month, dayofweek, dayofmonth, hour, minute)
- DATE YYYY-MM-DD
- DATETIME YYYY-MM-DD HH:MM:SS
- TIMESTAMP YYYYMMDDHHMMSS
- TIME HH:MM:SS
- YEAR YYYY
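The outlier-detection sketch referenced above, assuming pandas and SciPy; the series is made up and the cutoffs are common rules of thumb, not fixed standards:

```python
import numpy as np
import pandas as pd
from scipy import stats

s = pd.Series([9.0, 10.0, 10.5, 11.0, 9.5, 30.0])   # made-up data with one outlier

# IQR rule: flag anything beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: z = (x - mean) / std; common cutoffs are |z| > 2 or 3.
z = stats.zscore(s)
z_outliers = s[np.abs(z) > 2]

print(s.describe())          # Count, Mean, Std, Min, 25%, 50%, 75%, Max
print(iqr_outliers, z_outliers)
```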
3. Data Exploration
Refer to methods_tutorial.ipynb, Viz Notes Examples.ipynb, and DistributionsAndTest.ipynb for more.
In a box plot, the median's offset from the quartiles represents skew, and the whisker length reflects the spread of the data.
4. Data Preparation
Data Prep
Data preparation is the act of manipulating (or pre-processing) raw data (which may come from disparate data sources) into a form that can readily and accurately be analysed, e.g. for business purposes... Data preparation is the first step in data analytics projects and can include many discrete tasks such as loading data or data ingestion, data fusion, data cleaning, data augmentation, and data delivery... The issues to be dealt with fall into two main categories: systematic errors involving large numbers of data records, probably because they have come from different sources; individual errors affecting small numbers of data records, probably due to errors in the original data entry.
Data preprocessing has the objective of filling in missing values, aggregating information, labeling data with categories (data binning), and smoothing a trajectory.
Tasks of data pre-processing
- Data cleansing - replacing, modifying, or deleting incomplete, incorrect, inaccurate or irrelevant parts of the data
- Data editing - For Quality Control
- Data reduction
- Data wrangling - Transforming the data format for better handling
OUT OF ALL THESE LINKS PLEASE READ THE WIKI PAGES ON Data cleansing and Data Quality.
They cover the set of quality criteria that high-quality data needs to pass and the process involved.
Data processing may involve various processes, including:
- Validation – Ensuring that supplied data is correct and relevant.
- Sorting – "arranging items in some sequence and/or in different sets."
- Summarization – reducing detail data to its main points.
- Aggregation – combining multiple pieces of data.
- Analysis – the "collection, organization, analysis, interpretation and presentation of data."
- Reporting – list detail or summary data or computed information.
- Classification – separation of data into various categories.
Data processing system Data_processing_system
A Data processing system may involve some combination of:
- Conversion – converting data to another form or language.
- Reporting – list detail or summary data or computed information.
Data Processing Systems by service type
- Transaction processing systems
- Information storage and retrieval systems
- Command and control systems
- Computing service systems
- Process control systems
- Message switching systems
Validation
Validation Types Data_validation
- Range and constraint validation;
- Code and Cross-reference validation;
- Structured validation
Validation methods
- Allowed character checks
- Batch totals
- Cardinality check
- Check digits
- Consistency checks
- Control totals
- Cross-system consistency checks
- Data type checks
- File existence check
- Format or picture check
- Hash totals
- Limit check
- Logic check
- Presence check
- Range check
- Referential integrity
- Spelling and grammar check
- Uniqueness check
- Table lookup check
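A small pandas sketch of a few of the validation methods above (data type, range, uniqueness, table lookup, and presence checks); the records and lookup table are made up:

```python
import pandas as pd

records = pd.DataFrame({"id": [1, 2, 2, 4],
                        "age": [34, -5, 61, 29],
                        "country": ["US", "DE", "XX", "FR"]})   # made-up records
valid_countries = {"US", "DE", "FR"}                            # made-up lookup table

checks = {
    "data type check":    pd.api.types.is_integer_dtype(records["age"]),
    "range check":        records["age"].between(0, 120).all(),
    "uniqueness check":   records["id"].is_unique,
    "table lookup check": records["country"].isin(valid_countries).all(),
    "presence check":     records.notna().all().all(),
}
print(checks)   # failed checks would feed an enforcement or advisory action
```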
Post-Validation Actions
- Enforcement Action
- Advisory Action
- Verification Action
- Log of validation
Data Transformation
May include: Projecting data, Transforming Multiple Classes to Binary ones, Calibrating Class Probabilities, Cleaning the data, and even sampling it. Typically it follows this path:
- Data discovery is the first step in the data transformation process. Typically the data is profiled using profiling tools or sometimes using manually written profiling scripts to better understand the structure and characteristics of the data and decide how it needs to be transformed.
- Data mapping is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated etc. to produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they work in the specific technologies to define the transformation rules (e.g. visual ETL tools,[3] transformation languages).
- Code generation is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions) that will transform the data based on the desired and defined data mapping rules.[4] Typically, the data transformation technologies generate this code[5] based on the definitions or metadata defined by the developers.
- Code execution is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.
- Data review is the final step in the process, which focuses on ensuring the output data meets the transformation requirements. It is typically the business user or final end-user of the data that performs this step. Any anomalies or errors in the data that are found are communicated back to the developer or data analyst as new requirements to be implemented in the transformation process.
5. Modeling
Statistical Challenges With Data
- Options tree to show Pessimistic, Nominal, and Optimistic versions
- Performance vs Risk vs Design analysis
Changing variance (heteroscedasticity) - what can you do? (see the sketch after this list)
- Box-Cox transform
- variance-stabilizing transform
- weighted least squares
- Huber-White standard errors
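The sketch referenced above: a Box-Cox transform and a simple square-root variance-stabilizing transform using SciPy, on a made-up right-skewed sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # made-up right-skewed, positive response

# Box-Cox requires strictly positive data; lambda is estimated by maximum likelihood.
y_bc, lam = stats.boxcox(y)
print(lam)

# A simple variance-stabilizing alternative for count-like data is the square root.
y_sqrt = np.sqrt(y)
print(y.var(), y_bc.var(), y_sqrt.var())
```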
ANOVA Requirements
- Normal Distribution
- Independent Samples/Groups
- The independent samples t-test requires the assumption of homogeneity of variance.
- Run a test for the homogeneity of variance, such as Levene's test, whenever you run an independent samples t-test (see the sketch after this list).
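The sketch referenced above: Levene's test for homogeneity of variance followed by an independent samples t-test, using SciPy on made-up groups; if Levene's test rejects equal variances, Welch's version (equal_var=False) is the usual fallback:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)   # made-up independent samples
group_b = rng.normal(loc=5.5, scale=2.0, size=40)

# Levene's test for homogeneity of variance (the assumption named above).
w, p_levene = stats.levene(group_a, group_b)

# If variances differ, use Welch's t-test instead of the pooled version.
equal_var = p_levene > 0.05
t, p_t = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(p_levene, p_t)
```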
MISC
- P-values require knowing how many records exist in the database.
- Outliers profoundly influence the slope of the regression line and the correlation coefficient.
- The correlation coefficient alone is not enough for decision making (i.e., scatterplots are always recommended).
- Preferred Data Format: yyyy-mm-dd'T'hh:mm:ss.mmm
Bias:
- Quantitative Approach to Outliers.
- Correlations in Non-homogeneous Groups
- Nonlinear Relations between Variables. - Pearson R measures linearity
- Exploratory Examination of Correlation Matrices
- Casewise vs. Pairwise Deletion of Missing Data
- Overfitting: Pruning/ Cross Validation
- Breakdown Analysis
- Frequency Tables
- Cross Tabulation
- Marginal Frequencies
- Association Rules
ML and ADE
ML is a helpful tool that can aid our data exploration, and more and more, these tools can even be used to make predictions!
Services usually satisfy one or more of these steps:
- processing
- Partitions
- Parallelizations
- Map Reduce
- visualizing
- exploring
- feature engineering
- model validation
- predictive modeling
- Train, Test, Apply
- Feature engineering
- architecture search/transfer learning
- parameter tuning
- model selection
- model ensembling
- model distillation
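A minimal end-to-end sketch of the train/test/predict and model-validation steps above, assuming scikit-learn and its bundled iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)                   # stand-in dataset

# Train / test split, fit, predict — the "Train, Test, Apply" step above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                  # held-out accuracy

# Model validation via cross-validation (one simple form of model selection).
print(cross_val_score(model, X_train, y_train, cv=5).mean())
```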