Data Science
  0. Background
  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation

0. Background

What is Data Science

Principally, a data scientist's job is to apply scientific methods to data in a process often referred to as data mining. The purpose of this mining is to find informative patterns in the data and then to turn that information into meaningful, actionable knowledge. Often the exploration of the data and the development of something actionable is accompanied by machine learning algorithms. While contentious, statistics is generally understood as more focused on quantitative descriptions of data, and data science grew out of statistics as an applied branch of the field with an emphasis on qualitative data and the actionability of its output. source

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s)... Data mining is the process of applying [machine learning] methods with the intention of uncovering hidden patterns[16] in large data sets. It bridges the gap from applied statistics and artificial intelligence.

Data Mining Tasks

Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the same topic (KDD-1989), and this term became more popular in the AI and machine learning community. However, the term data mining became more popular in the business and press communities.[14] Currently, the terms data mining and knowledge discovery are used interchangeably. data-mining

The simplest conceptual model of the data scientist's job is as follows:

  1. Pre-process
  2. Data mine
  3. Validate results

The Knowledge Discovery in Databases (KDD) process uses these 5 steps:

  1. Selection
  2. Pre-processing
  3. Transformation
  4. Data mining
  5. Interpretation/evaluation

A more modern, competing approach is CRISP-DM, which splits the data scientist's responsibilities along these 6 themes:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

I like to break out data science steps as so:

  1. Acquire - (Selection & Business Understanding) Understand what's needed, then retrieve the data and ingest it through preliminary exploration and storage.
  2. Explore - (Data Understanding) Find extremes, ranges, distributions, anomalies, etc. Performed again after any data processing is done.
  3. Pre-process - Clean the data for further analytical activities and exploration by filtering or performing non-critical prep work, like removing noise or missing values, or meeting size/time constraints. Typically done only once.
  4. Process - Once the raw data is cleaned and generically prepped, additional transformations may be required to make the data usable for a specific scientific model. Augmentations may include sorting, filtering, arranging, aggregating, and integration. Many different models and processes may be used, each deserving its own exploration.
  5. Model - Test, train, predict: regressions, clusterings, etc. Analysis is performed by applying models to the data and recording the output; this includes finding correlations and contextualizing any discoveries.
  6. Act - Interpret the data, draw evaluative conclusions, and communicate them in the form of something presentable/actionable.

This way of thinking differentiates itself from the former two models in that the first step is ideally performed only once, whereas the second step is performed after any data is changed. The third step is conditionally applied from CRISP-DM in that it breaks out the pre-processing and integration/transformation steps, as they are sometimes optional and a big enough topic to deserve their own section. This article has been sectioned along CRISP-DM standards.
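As a minimal sketch, the six steps might look like this in Python with pandas and scikit-learn (the file name `data.csv`, the feature columns, and the model choice are all hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. Acquire - ingest the raw data (file name is hypothetical)
raw = pd.read_csv("data.csv")

# 2. Explore - ranges, distributions, anomalies
print(raw.describe())

# 3. Pre-process - generic, one-time cleaning (here: drop missing values)
clean = raw.dropna()

# 4. Process - model-specific transformation (column names are hypothetical)
X = clean[["feature_a", "feature_b"]]
y = clean["target"]

# 5. Model - test/train split, fit, predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)

# 6. Act - evaluate and communicate something actionable
print("R^2 on held-out data:", model.score(X_test, y_test))
```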

At one point in time I also described it like so:

Process - Data - raw/ processed data.

Figures - exploratory/ final figures.

Code - raw scripts, final scripts.

Text - readme / analysis.

Steps - define the question, define the ideal data, obtain data, clean data, exploratory data analysis, statistical prediction/modeling, interpret results, challenge results, synthesize/write up results. Create reproducible code.

Language (describes, correlates/associates with, leads to/causes, predicts). Interpret and explain.

Challenge all steps - question, data, processing, analysis, conclusion

synthesize/write up results -> lead with question, summarize analysis, order analyses as story rather than chronologically. Include pretty figures.

Geo-Spatial-Time-Series

To work with this data, data scientists can use these tools:

Free Software

Free software for data analysis

Existing Services

Existing Service Features

1. Business Understanding

Systems = Infrastructure => Technical (hardware, people, processes), integrate

Software = Apps => Business

Data provenance

Features Geometry = Coordinates/ Rings

Challenges

80% of the time spent on client requests goes to [access, format, connect, and fix]; often the solutions to these requests are client-specific artifacts or heuristic/classification requirements.

Administrative data is heavily biased and dependent. Survey data less so.

Dependent data does not lend itself to Bayesian methods.

Depending on Collection Methodology

Conceptual Challenges:

AutoEda Project Challenges

Options When Cleaning Unclean Data

Ideal Data:

Critical Questions

Classes are categorized ideas.

Yes, because of Shannon encoding: an approximation of the truth is not the truth, but it can sometimes be useful.

2. Data Understanding

Meta Data

We won't get too deep into this, but: data is like an ogre in that they both have layers. Data about data is called meta-data. Universal law: any and all 'observable' data has meta-data, which then too is also 'observable'. There are thought to be 6 meta-physical cascading layers of meta-data that expand onto themselves indefinitely. When humans perform their superficial observation of the world around them, at that 'instant' all the data and its accompanying meta-data is 'metabolized' by our sensory processors solely at the 'instance' level of this hierarchy. Elicitation of further understanding of the data may be achieved by critically introspecting on the observation. The deepest most people feel comfortable going without getting too abstract is the structural level (3), i.e.: "the relations between the observable properties of the original observation". Lower levels get into abstract concepts like observing the relations between words and language itself.

  * Rules - cuts across the other layers; buried in code at all levels
  1. Domain - expresses local in terms of global; the sphere of all things a local app should know
  2. Referent - (understands linkages between data models -> xsl/t, topic maps, data-driven) languages, abstractions
  3. Structural - (relational, object, hierarchical)
  4. Syntactic - (type, language, msg length, source, bitrate, encryption level)
  5. Instance - (actual data)

Data Types

Text Types

Number Types

Date/Time Types

Preferred Data Format: yyyy-mm-dd'T'hh:mm:ss.mmm
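In Python, this format (ISO 8601 with millisecond precision) can be produced and parsed like so; a minimal sketch (the sample timestamp is arbitrary):

```python
from datetime import datetime

# Format a timestamp as yyyy-mm-dd'T'hh:mm:ss.mmm (milliseconds, not microseconds)
now = datetime(2021, 3, 14, 15, 9, 26, 535898)
stamp = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}"
print(stamp)  # 2021-03-14T15:09:26.535

# Parse it back; %f accepts the 3-digit millisecond field
parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
```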

Other Types of Data :

Data Attributes

Collection Methodology attributes

Basics attributes

DESCRIPTIVE STATISTICS

What are Basic Statistics? Descriptive Statistics, Correlations, t-tests, frequency tables, cross tabulation

One basic and straightforward method for analyzing data is via crosstabulation. Log-linear analysis provides a more "sophisticated" way of looking at crosstabulation tables. Specifically, you can test the different factors that are used in the crosstabulation (e.g., gender, region, etc.) and their interactions for statistical significance. Fitting marginal frequencies: let us now turn to the analysis of our example table. We could ask ourselves what the frequencies would look like if there were no relationship between variables (the null hypothesis). Without going into details, intuitively one could expect that the frequencies in each cell would proportionately reflect the marginal frequencies (totals).
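For example, a minimal sketch in Python (the gender/region records are hypothetical): `scipy.stats.chi2_contingency` computes the expected cell frequencies from the marginal totals and tests the observed table against them.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey records
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "region": ["N", "S", "N", "N", "S", "S", "N", "S"],
})

# Crosstabulation of observed frequencies
table = pd.crosstab(df["gender"], df["region"])

# Expected frequencies proportionately reflect the marginal totals (null hypothesis)
chi2, p, dof, expected = chi2_contingency(table)
print(table, expected, f"p = {p:.3f}", sep="\n")
```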

Meta-Data

Central Tendency

Dispersion

Advanced attributes

Analytical attributes

Statistics Summary_statistics

Attribute set operators

Central Tendency

*Descriptive statistics.* When one's data are not normally distributed, and the measurements at best contain rank-order information, then computing the standard descriptive statistics may not be the most informative approach. Nonparametrics and Distributions will compute a wide variety of measures of location (mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.) to provide the "complete picture" of one's data.
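A sketch of those location and dispersion measures with pandas and SciPy (the sample values are hypothetical):

```python
import pandas as pd
from scipy import stats

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical measurements

# Measures of location
print("mean:", x.mean(), "median:", x.median(), "mode:", x.mode().tolist())

# Measures of dispersion
print("variance:", x.var())
print("average (mean absolute) deviation:", (x - x.mean()).abs().mean())
print("quartile range (IQR):", stats.iqr(x))
```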

Dataset

Univariate

setDtype

getDtype

  * Categorical / Qualitative (Binomial, Nominal, Ordinal)
  * Numerical / Quantitative (Discrete or Continuous) (Interval/Ratio)
  * Binary - 1 or 0
  * Nominal - (hair color, gender, favorite ice cream); summarized with frequencies, proportions, percentages; transformed to numerical using one-hot encoding; displayed using a pie or bar chart
  * Ordinal - ordered, without a known magnitude
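For example, a nominal attribute can be one-hot encoded with pandas; a minimal sketch (the `hair_color` column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["brown", "blond", "black", "brown"]})

# One-hot encoding: one binary (1/0) column per category
encoded = pd.get_dummies(df, columns=["hair_color"], dtype=int)
print(encoded)

# Frequencies/proportions for the nominal attribute (suitable for a pie or bar chart)
print(df["hair_color"].value_counts(normalize=True))
```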

Data Exploration

Refer to methods_tutorial.ipynb, Viz Notes Examples.ipynb, and DistributionsAndTest.ipynb for more.

In a box plot, the median's offset from the quartiles indicates skew, and the whiskers indicate the spread of the data.
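A small matplotlib sketch of reading a box plot this way (the right-skewed sample is synthetic):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)  # right-skewed, synthetic

# Median off-center within the box => skew; whisker length => spread
plt.boxplot(data, vert=False)
plt.title("Box plot: median offset shows skew, whiskers show spread")
plt.show()
```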

3. Data Preparation

Data Prep

Data prep

Data preparation is the act of manipulating (or pre-processing) raw data (which may come from disparate data sources) into a form that can readily and accurately be analysed, e.g. for business purposes... Data preparation is the first step in data analytics projects and can include many discrete tasks such as loading data or data ingestion, data fusion, data cleaning, data augmentation, and data delivery... The issues to be dealt with fall into two main categories: systematic errors involving large numbers of data records, probably because they have come from different sources; individual errors affecting small numbers of data records, probably due to errors in the original data entry.

pre-processing

Data preprocessing has the objectives of adding missing values, aggregating information, labeling data with categories (data binning), and smoothing a trajectory.
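Each of those four objectives maps onto a common pandas operation; a minimal sketch (the `speed` column and bin labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"speed": [10.0, None, 12.0, 11.0, 50.0, 13.0]})  # hypothetical trajectory

# Add missing values (imputation)
df["speed"] = df["speed"].fillna(df["speed"].median())

# Aggregate information
print("mean speed:", df["speed"].mean())

# Label data with categories (data binning)
df["speed_bin"] = pd.cut(df["speed"], bins=3, labels=["slow", "medium", "fast"])

# Smooth the trajectory (rolling average)
df["speed_smooth"] = df["speed"].rolling(window=3, min_periods=1).mean()
print(df)
```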

Tasks of data pre-processing

OUT OF ALL THESE LINKS PLEASE READ THE WIKI PAGE ON Data cleansing and Data Quality.

It covers the set of quality criteria that high-quality data needs to pass and the process involved therein.

Data_processing

Data processing may involve various processes, including:

Data processing system Data_processing_system

A Data processing system may involve some combination of:

Data Processing Systems by service type

Validation

Validation Types Data_validation

Validation methods
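As an illustration, a few common validation methods (type, range, and format checks) expressed in plain pandas; the rules and column names here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, -2, 51], "email": ["a@x.com", "b@x.com", "not-an-email"]})

# Type check: the column must be numeric
assert pd.api.types.is_numeric_dtype(df["age"])

# Range check: ages must fall within plausible bounds
invalid_age = df[~df["age"].between(0, 120)]

# Format check: a crude email pattern (hypothetical rule)
invalid_email = df[~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")]

print(invalid_age, invalid_email, sep="\n")
```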

Post-Validation Actions

Data Transformation

May include: projecting data, transforming multiple classes to binary ones, calibrating class probabilities, cleaning the data, and even sampling it; these typically follow a set path. Two of these transformations are sketched below.
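A hedged sketch with scikit-learn (synthetic data, not a prescribed recipe): `LabelBinarizer` turns multiple classes into binary indicator columns, and `CalibratedClassifierCV` calibrates the class probabilities of a classifier that lacks them by default.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import LinearSVC

# Transforming multiple classes to binary ones (one-vs-rest indicator columns)
lb = LabelBinarizer()
print(lb.fit_transform(["red", "green", "blue", "green"]))

# Calibrating class probabilities (synthetic data)
X, y = make_classification(n_samples=200, random_state=0)
clf = CalibratedClassifierCV(LinearSVC(), cv=3)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```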

4. Modeling

Statistical Challenges With Data

Changing variance (heteroscedasticity) - what can you do?

ANOVA Requirements

MISC

Bias:

ML and ADE

ML is a helpful tool for data exploration, and more and more, ML tools can even be used to make predictions!
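For instance, the test/train/predict loop can be exercised in a few lines of scikit-learn; a minimal sketch on synthetic data (the blob dataset and cluster count are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for an exploratory dataset
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Clustering as exploration: discover group structure in the data
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.labels_[:10])

# ...and as prediction: assign new observations to the learned clusters
print(model.predict([[0.0, 0.0]]))
```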

Services usually satisfy one or more of these steps:

5. Evaluation