Welcome
This notebook collects functions that help handle common data-wrangling tasks
Functions
- fillna / dropna / replace
- custom lambda, e.g. lambda x: x + x % 2
- df.groupby().transform(custom_lambda), or a "sum" aggregate
- pd.melt() turns columns into rows
- dummy encoding (pd.get_dummies)
- df.stack() / df.unstack()
- Infer data types: isfinite (inf/nan), isnan, first char is a symbol, default value.
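The helpers listed above can be sketched in a few lines of pandas. The sample frame is invented, and reading the custom lambda as "round odd integers up to even" is an interpretation of the note, not something the source states:

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame for exercising the helpers listed above.
df = pd.DataFrame({
    "grp": ["a", "a", "b", "b"],
    "val": [1.0, np.nan, 3.0, 4.0],
})

filled = df.fillna({"val": -1})                        # FillNa with a default
dropped = df.dropna(subset=["val"])                    # DropNa
custom_lambda = lambda x: x + x % 2                    # rounds odd ints up to even
group_sum = df.groupby("grp")["val"].transform("sum")  # sum agg broadcast per group
long_form = df.melt(id_vars="grp")                     # melt: columns -> rows
dummies = pd.get_dummies(df["grp"])                    # dummy (one-hot) encoding
stacked = df.set_index("grp").stack()                  # stack: columns -> index level
```

Note that transform broadcasts the per-group sum back to every row, unlike agg, which returns one row per group.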
Common Python Data Manipulations
https://datascience.stackexchange.com/questions/37878/difference-between-isna-and-isnull-in-pandas
.isna(), .isnull() (aliases), .fillna()
.dropna(how='any')
.fillna(method='ffill', inplace=True), .fillna(value=0, inplace=True)
.duplicated(), .unique(), .drop_duplicates()
.replace()
groupby()
contains(), within() for geospatial data.
Common Python Cleaning operations:
- Check the data types of all columns in the data-frame: .dtypes
- Create a new data-frame excluding all the 'object'-type columns: .select_dtypes(exclude=['object'])
- Select elements from each column that lie within 3 units of Z-score: stats.zscore(df)
- .cut() will bin your data
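A sketch of the cleaning steps above (check dtypes, drop object columns, 3-σ z-score filter, binning). The frame and the outlier value are invented for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical frame: 20 small values plus one obvious outlier.
df = pd.DataFrame({
    "city": pd.Series([f"c{i}" for i in range(21)], dtype="object"),
    "pop": [float(i) for i in range(20)] + [1000.0],
})

print(df.dtypes)                                   # check column data types
numeric = df.select_dtypes(exclude=["object"])     # drop 'object' columns
z = np.abs(stats.zscore(numeric))                  # column-wise z-scores
within3 = numeric[(z < 3).all(axis=1)]             # keep rows within 3 z-score units
binned = pd.cut(df["pop"], bins=3)                 # .cut() bins the data
```

One caveat worth knowing: with only a handful of rows, no point can exceed a z-score of 3 (the maximum is bounded by √(n−1)), so this filter needs a reasonable sample size to catch outliers.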
FILTERING
- DataFrame.isna() Detect missing values.
- DataFrame.any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
- DataFrame.all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
- DataFrame.filter([items, like, regex, axis]) Subset rows or columns of the DataFrame according to labels in the specified index.
- DataFrame.dropna([axis, how, thresh, …]) Remove missing values.
- DataFrame.fillna([value, method, axis, …]) Fill NA/NaN values using the specified method.
- DataFrame.replace([to_replace, value, …]) Replace values given in to_replace with value.
- DataFrame.interpolate([method, axis, limit, …]) Interpolate values according to different methods.
- DataFrame.nlargest(n, columns[, keep]) Return the first n rows ordered by columns in descending order.
- DataFrame.nsmallest(n, columns[, keep]) Return the first n rows ordered by columns in ascending order.
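The filtering entries above, exercised on a small invented frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values.
df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, 80.0],
    "price_old": [90.0, 110.0, 240.0, np.nan],
    "name": ["a", "b", "c", "d"],
})

has_na = df.isna().any(axis=1)        # rows containing any missing value
price_cols = df.filter(like="price")  # subset columns whose label contains 'price'
clean = df.dropna(how="any")          # remove rows with missing values
filled = df.fillna(value=0)           # fill NA/NaN (df.interpolate() also works)
top2 = df.nlargest(2, "price")        # first 2 rows by price, descending
```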
GROUPING / Aggregating / Manipulating
- DataFrame.pivot([index, columns, values]) Return reshaped DataFrame organized by given index / column values.
- df.agg("mean", axis="columns")  # axis: {0 or 'index', 1 or 'columns'}, default 0
- DataFrame.compound(axis=None, skipna=None, level=None)  (deprecated and since removed)
- DataFrame.count(axis=0, level=None, numeric_only=False)
- df.groupby(['1_tpop']).mean()
- DataFrame.insert(loc, column, value[, …]) Insert column into DataFrame at specified location.
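A sketch of the grouping/aggregating entries, reusing the `1_tpop` column name from the note above; the data itself is invented:

```python
import pandas as pd

df = pd.DataFrame({
    "1_tpop": ["x", "x", "y"],
    "year": [2000, 2001, 2000],
    "val": [1.0, 3.0, 5.0],
})

means = df.groupby("1_tpop")["val"].mean()                     # per-group mean
wide = df.pivot(index="1_tpop", columns="year", values="val")  # long -> wide
counts = df.count(axis=0)                                      # non-NA values per column
df.insert(1, "flag", [True, False, True])                      # insert column at position 1
```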
The biggest data-cleaning task: missing values.
Pandas recognizes both empty cells and "NA" as missing values; anything else has to be specified on import (e.g. read_csv's na_values parameter).
In the code we loop through each entry in the "Owner Occupied" column, which is expected to hold Y/N flags. We attempt int(row): if the conversion succeeds, the entry is numeric and therefore invalid, so we replace it with NumPy's np.nan; if it raises ValueError, we pass and keep the entry. The .loc method is the preferred Pandas way to modify entries in place. https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.loc.html
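The loop described above can be reconstructed roughly as follows. The column values are invented, and the premise (numeric entries in a Y/N column are invalid) follows the surrounding text:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Owner Occupied" should hold Y/N flags,
# so an entry that parses as an integer is invalid.
df = pd.DataFrame({"Owner Occupied": ["Y", "N", "12", "Y"]})

for cnt, row in enumerate(df["Owner Occupied"]):
    try:
        int(row)                                # succeeds only for numeric entries
        df.loc[cnt, "Owner Occupied"] = np.nan  # .loc modifies the entry in place
    except ValueError:
        pass                                    # valid Y/N entries are kept
```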
Read In Data
dashboards notes reduced
The functions that transform notebooks into a library
Basic Text
TODO:
- Present the user with import options:
- Ask for a file, default = False
- Ask for the delimiter, default = ,
- Ask for the string delimiter, default = "
- Ask if the first row represents the header, default = False
- Ask if the column names are correct
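One way the import-options TODO above could look as code. The function name and prompt wording are hypothetical; the defaults mirror the list:

```python
import pandas as pd

def import_with_options(path, ask_file=False, delimiter=",",
                        string_delimiter='"', first_row_is_header=False):
    """Hypothetical import helper; defaults follow the TODO list above."""
    if ask_file:
        path = input("File to import: ")  # interactive prompt for the file
    return pd.read_csv(
        path,
        sep=delimiter,
        quotechar=string_delimiter,
        header=0 if first_row_is_header else None,
    )
```

Checking whether the column names are correct could then be a simple `print(df.columns)` followed by a confirmation prompt.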
FillNA = -1 or the average
FillNA, THEN coerce types
TODO:
- Interactive inputs let the user perform simple queries
- Fixed dictionary [ distinct, not, like, avg, min, max, mean, median, mode ]
- The query replaces the imported dataset
- Repeat until the user specifies otherwise
- Template: SELECT FROM WHERE GROUP BY HAVING
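The SELECT/FROM/WHERE/GROUP BY/HAVING template maps onto pandas roughly like this (data invented; `query` covers WHERE and HAVING, `groupby` + `agg` the rest):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "pay": [10, 20, 30]})

# SELECT dept, AVG(pay) FROM df WHERE pay > 5 GROUP BY dept HAVING AVG(pay) > 20
result = (
    df.query("pay > 5")                  # WHERE
      .groupby("dept", as_index=False)   # GROUP BY
      .agg(avg_pay=("pay", "mean"))      # SELECT with aggregate
      .query("avg_pay > 20")             # HAVING
)
```

Reassigning `df = result` would implement "the query replaces the imported dataset", and the whole thing can sit in a loop until the user opts out.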
MISC
Import
Parse The DataTypes
NOTES
Plot Histograms
Basic ops
Categorical Analysis:
Count, Unique, Top, Frequency
Numeric Analysis:
Geo
Future Self Service Tool
Data analytics
- Self Service
- Recurrent Reports
- Embedded Analytics.
GisHandler()
- Check Columns
- Check If Operations will work as expected
- perform operations
- tidy up
- save
- return
Main( Check For Missing Values, Perform Operation)
readFile() - csv/postgis -> df; reverseGeocode? ColumnToCords? -> GeoDF
GeoDataFrame - toCrs, saveGeoDataFrame
MergeBounds()
FilterBounds()
FilterPoints() Bounds Points
PointsInPoly()
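A dependency-free sketch of what PointsInPoly() could do, using the standard ray-casting test; in practice shapely's contains()/within() or a geopandas spatial join would be the usual route, matching the contains()/within() note earlier:

```python
def point_in_poly(x, y, poly):
    """Ray casting: poly is a list of (x, y) vertices in order."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the ray's horizontal level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:       # crossing lies to the right of the point
                inside = not inside
    return inside

def points_in_poly(points, poly):
    """Keep only the points that fall inside the polygon."""
    return [p for p in points if point_in_poly(p[0], p[1], poly)]
```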
Applied Spatial Statistics
- Prior/posterior distributions
- Hierarchical models
- Markov chain Monte Carlo
- Kernel methods
- Dynamic state-space modeling
- Multiple linear regression
- Spatial models (CAR, SAR), kriging
- Time-series models: AR, ARMA
- Dynamic linear models
- Multilevel models - causal inference - meta-analysis
- Multi-agent decision making
- Variable transformations
- Eigenvalues
Exploratory spatial analysis, spatial autocorrelation, spatial regression, interpolation, grid based stats, point based stats, spatial network analysis, spatial clustering.
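Spatial autocorrelation above is usually summarized with Moran's I. A minimal from-the-formula version, with a toy line-of-cells weights matrix in the test; PySAL's esda.Moran is the standard implementation:

```python
import numpy as np

def morans_i(y, w):
    """Global Moran's I: I = (n / S0) * (z' W z) / (z' z)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    z = y - y.mean()   # deviations from the mean
    s0 = w.sum()       # S0: sum of all spatial weights
    return len(y) / s0 * (z @ w @ z) / (z @ z)
```

Positive values indicate that similar values cluster in space; values near zero are consistent with spatial randomness.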
Big-Data, Structure(Semi/Un), Time-Stamped, Spatial, Spatio-Temporal, Ordered, Stream, Dimensionality, Primary Keys, Unique Values, Index, Spatial, Auto Increment, Default Values, Null Values
Geographic Inquiry:
Describe real-world phenomena
Study of the spatial arrangement of features
Patterns arise as a result of processes operating within space
Measure, compare, generate
Size, distribution, pattern, contiguity, shape, community, scale, orientation, relation
How to compare? How to describe and analyze? How to predict?
Entry, conversion, storage, query, manipulation, analysis, presentation
Requirements, process, clean, explore, model …
Hot-spot analysis -> cluster points
Line of sight / visibility analysis -> network, overlay, proximity, risk
Heat maps
Geocoding
Distance decay
Clip analysis
Post analysis
Land-use analysis
Voronoi polygons cropped by the bounds of another dataset
Buffering - a radius around a point
Map coverage, spatial resource allocation,
impact assessment,
pollutant reduction,
decision support,
facility management (e.g. water plant management),
operations management,
site selection - where to do xyz,
business/marketing
Python Spatial Analysis library (PySAL): https://pysal.org/notebooks/intro
http://pysal.org/notebooks/explore/esda/Spatial_Autocorrelation_for_Areal_Unit_Data.html
Shape analysis - hull: calculate the convex hull of the point pattern; mbr: calculate the minimum bounding box (rectangle). The Python file centrography.py contains several functions with which we can conduct centrography analysis.
Random point patterns are the outcome of CSR (complete spatial randomness): https://en.wikipedia.org/wiki/Complete_spatial_randomness CSR has two major characteristics: Uniform - each location has an equal probability of getting a point (where an event happens); Independent - the locations of event points are independent of one another. CSR usually serves as the null hypothesis in testing whether a point pattern is the outcome of a random process. Clark and Evans (1954) demonstrated that the mean nearest-neighbour distance statistic is normally distributed under the null hypothesis that the underlying spatial process is CSR, so this test statistic can be used to determine whether a point pattern is the outcome of CSR.
There are two possible objectives in a discriminant analysis:
- finding a predictive equation for classifying new individuals
- interpreting the predictive equation to better understand the relationships that may exist among the variables.
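The Clark and Evans (1954) idea above can be sketched directly: under CSR with density λ, the expected mean nearest-neighbour distance is 1 / (2√λ), so the observed/expected ratio indicates clustering (well below 1) or dispersion (well above 1). pointpats implements the full test; this is a bare-bones version:

```python
import numpy as np

def clark_evans_ratio(points, area):
    """Observed mean nearest-neighbour distance over its CSR expectation."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)       # ignore each point's distance to itself
    observed = d.min(axis=1).mean()   # mean nearest-neighbour distance
    expected = 1.0 / (2.0 * np.sqrt(n / area))  # CSR expectation for density n/area
    return observed / expected
```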
Misc
https://www.gnu.org/philosophy/open-source-misses-the-point.html
It seems to me that the chief difference between the MIT license and the GPL is that MIT doesn't require modifications to be open-sourced, whereas the GPL does.
You don't have to open-source your changes under the GPL either: you can modify GPL code and use it for your own purposes, as long as you don't distribute it.
BUT...
if you DO distribute it, then your entire project that uses the GPL code also becomes GPL automatically. That means it must be open-sourced, and the recipient gets all the same rights as you - they can turn around and distribute it, modify it, sell it, etc.
And that would include your proprietary code, which would then no longer be proprietary - it becomes open source.
With MIT, even if you distribute your proprietary code that uses the MIT-licensed code, you don't have to make your code open source: you can distribute it as a closed app where the code is encrypted or is a binary.
Even the included MIT-licensed code can be shipped encrypted, as long as it carries the MIT license notice.
- File -> unclean data -> toCsvFormat(filename, data)
- processCsv -> unclean data
- IndexedDB
- URL -> browser or server? callServer(url)
- json/geojson/xlsx/csv -> toCsvFormat: isCsv -> string replace; isJson -> Papa.unparse; isXlsx -> read sheet [0] -> toCsv; isGeojson -> json -> Papa.unparse
- JSON.parse at runtime is faster than inlining the data as an object literal once the payload exceeds ~10 KB
- Code caching kicks in when an inline script is larger than 1 KB
- V8 reduced parse/compile time by ~40% by parsing on worker threads
- V8 raw JS parse speed has doubled since Chrome 60
Clear IndexedDB -> readFile. Insert into IndexedDB v1.0
- JPL SWEET ontology
- W3C Geospatial Incubator Group
- RDF, SPARQL, GML, KML