Welcome
This notebook collects functions that help handle common data-wrangling tasks
Functions
- fillna / dropna / replace
- custom lambda, e.g. lambda x: x + x % 2
- df.groupby().transform(custom_lambda), or a "sum" aggregate
- pd.melt() turns columns into rows
- dummy encoding (pd.get_dummies)
- df.stack() / df.unstack()
- Infer data types: isfinite (inf/nan), isnan, first char is a symbol, default value.
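The helpers listed above can be sketched in a few lines of pandas. The sample frame is invented, and reading the custom lambda as "round odd integers up to even" is an interpretation of the note, not something the source states:

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame for exercising the helpers listed above.
df = pd.DataFrame({
    "grp": ["a", "a", "b", "b"],
    "val": [1.0, np.nan, 3.0, 4.0],
})

filled = df.fillna({"val": -1})                        # FillNa with a default
dropped = df.dropna(subset=["val"])                    # DropNa
custom_lambda = lambda x: x + x % 2                    # rounds odd ints up to even
group_sum = df.groupby("grp")["val"].transform("sum")  # sum agg broadcast per group
long_form = df.melt(id_vars="grp")                     # melt: columns -> rows
dummies = pd.get_dummies(df["grp"])                    # dummy (one-hot) encoding
stacked = df.set_index("grp").stack()                  # stack: columns -> index level
```

Note that transform broadcasts the per-group sum back to every row, unlike agg, which returns one row per group.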
Common Python Data Manipulations
https://datascience.stackexchange.com/questions/37878/difference-between-isna-and-isnull-in-pandas
.isna(), .isnull() (aliases), .fillna()
.dropna(how='any')
.fillna(method='ffill', inplace=True), .fillna(value=0, inplace=True)
.duplicated(), .unique(), .drop_duplicates()
.replace()
groupby()
contains(), within() for geospatial data.
Common Python Cleaning operations:
- Check the data types of all columns in the data-frame: .dtypes
- Create a new data-frame excluding all the 'object'-type columns: .select_dtypes(exclude=['object'])
- Select elements from each column that lie within 3 units of Z-score: stats.zscore(df)
- .cut() will bin your data
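A sketch of the cleaning steps above (check dtypes, drop object columns, 3-σ z-score filter, binning). The frame and the outlier value are invented for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical frame: 20 small values plus one obvious outlier.
df = pd.DataFrame({
    "city": pd.Series([f"c{i}" for i in range(21)], dtype="object"),
    "pop": [float(i) for i in range(20)] + [1000.0],
})

print(df.dtypes)                                   # check column data types
numeric = df.select_dtypes(exclude=["object"])     # drop 'object' columns
z = np.abs(stats.zscore(numeric))                  # column-wise z-scores
within3 = numeric[(z < 3).all(axis=1)]             # keep rows within 3 z-score units
binned = pd.cut(df["pop"], bins=3)                 # .cut() bins the data
```

One caveat worth knowing: with only a handful of rows, no point can exceed a z-score of 3 (the maximum is bounded by √(n−1)), so this filter needs a reasonable sample size to catch outliers.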
FILTERING
- DataFrame.isna() Detect missing values.
- DataFrame.any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
- DataFrame.all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
- DataFrame.filter([items, like, regex, axis]) Subset rows or columns of the DataFrame according to labels in the specified index.
- DataFrame.dropna([axis, how, thresh, …]) Remove missing values.
- DataFrame.fillna([value, method, axis, …]) Fill NA/NaN values using the specified method.
- DataFrame.replace([to_replace, value, …]) Replace values given in to_replace with value.
- DataFrame.interpolate([method, axis, limit, …]) Interpolate values according to different methods.
- DataFrame.nlargest(n, columns[, keep]) Return the first n rows ordered by columns in descending order.
- DataFrame.nsmallest(n, columns[, keep]) Return the first n rows ordered by columns in ascending order.
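The filtering entries above, exercised on a small invented frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values.
df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, 80.0],
    "price_old": [90.0, 110.0, 240.0, np.nan],
    "name": ["a", "b", "c", "d"],
})

has_na = df.isna().any(axis=1)        # rows containing any missing value
price_cols = df.filter(like="price")  # subset columns whose label contains 'price'
clean = df.dropna(how="any")          # remove rows with missing values
filled = df.fillna(value=0)           # fill NA/NaN (df.interpolate() also works)
top2 = df.nlargest(2, "price")        # first 2 rows by price, descending
```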
GROUPING / Aggregating / Manipulating
- DataFrame.pivot([index, columns, values]) Return reshaped DataFrame organized by given index / column values.
- df.agg("mean", axis="columns")  # axis: {0 or 'index', 1 or 'columns'}, default 0
- DataFrame.compound(axis=None, skipna=None, level=None)  (deprecated and since removed)
- DataFrame.count(axis=0, level=None, numeric_only=False)
- df.groupby(['1_tpop']).mean()
- DataFrame.insert(loc, column, value[, …]) Insert column into DataFrame at specified location.
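A sketch of the grouping/aggregating entries, reusing the `1_tpop` column name from the note above; the data itself is invented:

```python
import pandas as pd

df = pd.DataFrame({
    "1_tpop": ["x", "x", "y"],
    "year": [2000, 2001, 2000],
    "val": [1.0, 3.0, 5.0],
})

means = df.groupby("1_tpop")["val"].mean()                     # per-group mean
wide = df.pivot(index="1_tpop", columns="year", values="val")  # long -> wide
counts = df.count(axis=0)                                      # non-NA values per column
df.insert(1, "flag", [True, False, True])                      # insert column at position 1
```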
The biggest data-cleaning task: missing values.
Pandas recognizes both empty cells and "NA" as missing values; anything else has to be specified on import (e.g. read_csv's na_values parameter).
In the code we loop through each entry in the "Owner Occupied" column, which is expected to hold Y/N flags. We attempt int(row): if the conversion succeeds, the entry is numeric and therefore invalid, so we replace it with NumPy's np.nan; if it raises ValueError, we pass and keep the entry. The .loc method is the preferred Pandas way to modify entries in place. https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.loc.html
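The loop described above can be reconstructed roughly as follows. The column values are invented, and the premise (numeric entries in a Y/N column are invalid) follows the surrounding text:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Owner Occupied" should hold Y/N flags,
# so an entry that parses as an integer is invalid.
df = pd.DataFrame({"Owner Occupied": ["Y", "N", "12", "Y"]})

for cnt, row in enumerate(df["Owner Occupied"]):
    try:
        int(row)                                # succeeds only for numeric entries
        df.loc[cnt, "Owner Occupied"] = np.nan  # .loc modifies the entry in place
    except ValueError:
        pass                                    # valid Y/N entries are kept
```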
Read In Data
dashboards notes reduced
The functions that transform notebooks into a library
Basic Text
TODO:
- Present the user with import options:
- Ask for a file, default = False
- Ask for the delimiter, default = ,
- Ask for the string delimiter, default = "
- Ask if the first row represents the header, default = False
- Ask if the column names are correct
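One way the import-options TODO above could look as code. The function name and prompt wording are hypothetical; the defaults mirror the list:

```python
import pandas as pd

def import_with_options(path, ask_file=False, delimiter=",",
                        string_delimiter='"', first_row_is_header=False):
    """Hypothetical import helper; defaults follow the TODO list above."""
    if ask_file:
        path = input("File to import: ")  # interactive prompt for the file
    return pd.read_csv(
        path,
        sep=delimiter,
        quotechar=string_delimiter,
        header=0 if first_row_is_header else None,
    )
```

Checking whether the column names are correct could then be a simple `print(df.columns)` followed by a confirmation prompt.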
FillNA = -1 or the average
FillNA, THEN coerce types
TODO:
- Interactive inputs let the user perform simple queries
- Fixed dictionary [ distinct, not, like, avg, min, max, mean, median, mode ]
- The query replaces the imported dataset
- Repeat until the user specifies otherwise
- Template: SELECT FROM WHERE GROUP BY HAVING
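The SELECT/FROM/WHERE/GROUP BY/HAVING template maps onto pandas roughly like this (data invented; `query` covers WHERE and HAVING, `groupby` + `agg` the rest):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "pay": [10, 20, 30]})

# SELECT dept, AVG(pay) FROM df WHERE pay > 5 GROUP BY dept HAVING AVG(pay) > 20
result = (
    df.query("pay > 5")                  # WHERE
      .groupby("dept", as_index=False)   # GROUP BY
      .agg(avg_pay=("pay", "mean"))      # SELECT with aggregate
      .query("avg_pay > 20")             # HAVING
)
```

Reassigning `df = result` would implement "the query replaces the imported dataset", and the whole thing can sit in a loop until the user opts out.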
MISC
Import
Parse The DataTypes
NOTES
Plot Histograms
Basic ops
Categorical Analysis:
Count, Unique, Top, Frequency
Numeric Analysis:
Geo
Future Self Service Tool
Data analytics
- Self Service
- Recurrent Reports
- Embedded Analytics.
GisHandler()
- Check Columns
- Check If Operations will work as expected
- perform operations
- tidy up
- save
- return
Main( Check For Missing Values, Perform Operation)
readFile() - csv/postgis -> df; reverseGeocode? ColumnToCords? -> GeoDF
GeoDataFrame - toCrs, saveGeoDataFrame
MergeBounds()
FilterBounds()
FilterPoints() Bounds Points
PointsInPoly()
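A dependency-free sketch of what PointsInPoly() could do, using the standard ray-casting test; in practice shapely's contains()/within() or a geopandas spatial join would be the usual route, matching the contains()/within() note earlier:

```python
def point_in_poly(x, y, poly):
    """Ray casting: poly is a list of (x, y) vertices in order."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the ray's horizontal level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:       # crossing lies to the right of the point
                inside = not inside
    return inside

def points_in_poly(points, poly):
    """Keep only the points that fall inside the polygon."""
    return [p for p in points if point_in_poly(p[0], p[1], poly)]
```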
Applied Spatial Statistics
- Prior/posterior distributions
- Hierarchical models
- Markov chain Monte Carlo
- Kernel methods
- Dynamic state-space modeling
- Multiple linear regression
- Spatial models (CAR, SAR), kriging
- Time-series models: AR, ARMA
- Dynamic linear models
- Multilevel models - causal inference - meta-analysis
- Multi-agent decision making
- Variable transformations
- Eigenvalues
Exploratory spatial analysis, spatial autocorrelation, spatial regression, interpolation, grid based stats, point based stats, spatial network analysis, spatial clustering.
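Spatial autocorrelation above is usually summarized with Moran's I. A minimal from-the-formula version, with a toy line-of-cells weights matrix in the test; PySAL's esda.Moran is the standard implementation:

```python
import numpy as np

def morans_i(y, w):
    """Global Moran's I: I = (n / S0) * (z' W z) / (z' z)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    z = y - y.mean()   # deviations from the mean
    s0 = w.sum()       # S0: sum of all spatial weights
    return len(y) / s0 * (z @ w @ z) / (z @ z)
```

Positive values indicate that similar values cluster in space; values near zero are consistent with spatial randomness.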
Big-Data, Structure(Semi/Un), Time-Stamped, Spatial, Spatio-Temporal, Ordered, Stream, Dimensionality, Primary Keys, Unique Values, Index, Spatial, Auto Increment, Default Values, Null Values
Geographic Inquiry:
Describe real-world phenomena
Study of the spatial arrangement of features
Patterns arise as a result of processes operating within space
Measure, compare, generate
Size, distribution, pattern, contiguity, shape, community, scale, orientation, relation
How to compare? How to describe and analyze? How to predict?
Entry, conversion, storage, query, manipulation, analysis, presentation
Requirements, process, clean, explore, model …
Hot-spot analysis -> cluster points
Line of sight / visibility analysis -> network, overlay, proximity, risk
Heat maps
Geocoding
Distance decay
Clip analysis
Post analysis
Land-use analysis
Voronoi polygons cropped by the bounds of another dataset
Buffering - a radius around a point
Map coverage, spatial resource allocation,
impact assessment,
pollutant reduction,
decision support,
facility management (e.g. water plant management),
operations management,
site selection - where to do xyz,
business/marketing
Python Spatial Analysis library (PySAL): https://pysal.org/notebooks/intro
http://pysal.org/notebooks/explore/esda/Spatial_Autocorrelation_for_Areal_Unit_Data.html
Shape analysis - hull: calculate the convex hull of the point pattern; mbr: calculate the minimum bounding box (rectangle). The Python file centrography.py contains several functions with which we can conduct centrography analysis.
Random point patterns are the outcome of CSR (complete spatial randomness): https://en.wikipedia.org/wiki/Complete_spatial_randomness CSR has two major characteristics: Uniform - each location has an equal probability of getting a point (where an event happens); Independent - the locations of event points are independent of one another. CSR usually serves as the null hypothesis in testing whether a point pattern is the outcome of a random process. Clark and Evans (1954) demonstrated that the mean nearest-neighbour distance statistic is normally distributed under the null hypothesis that the underlying spatial process is CSR, so this test statistic can be used to determine whether a point pattern is the outcome of CSR.
There are two possible objectives in a discriminant analysis:
- finding a predictive equation for classifying new individuals
- interpreting the predictive equation to better understand the relationships that may exist among the variables.
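The Clark and Evans (1954) idea above can be sketched directly: under CSR with density λ, the expected mean nearest-neighbour distance is 1 / (2√λ), so the observed/expected ratio indicates clustering (well below 1) or dispersion (well above 1). pointpats implements the full test; this is a bare-bones version:

```python
import numpy as np

def clark_evans_ratio(points, area):
    """Observed mean nearest-neighbour distance over its CSR expectation."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)       # ignore each point's distance to itself
    observed = d.min(axis=1).mean()   # mean nearest-neighbour distance
    expected = 1.0 / (2.0 * np.sqrt(n / area))  # CSR expectation for density n/area
    return observed / expected
```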
Misc
https://www.gnu.org/philosophy/open-source-misses-the-point.html
It seems to me that the chief difference between the MIT license and the GPL is that MIT doesn't require modifications to be open-sourced, whereas the GPL does.
You don't have to open-source your changes under the GPL either: you can modify GPL code and use it for your own purposes, as long as you don't distribute it.
BUT...
if you DO distribute it, then your entire project that uses the GPL code also becomes GPL automatically. That means it must be open-sourced, and the recipient gets all the same rights as you - they can turn around and distribute it, modify it, sell it, etc.
And that would include your proprietary code, which would then no longer be proprietary - it becomes open source.
With MIT, even if you distribute your proprietary code that uses the MIT-licensed code, you don't have to make your code open source: you can distribute it as a closed app where the code is encrypted or is a binary.
Even the included MIT-licensed code can be shipped encrypted, as long as it carries the MIT license notice.
- File -> unclean data -> toCsvFormat(filename, data)
- processCsv -> unclean data
- IndexedDB
- URL -> browser or server? callServer(url)
- json/geojson/xlsx/csv -> toCsvFormat: isCsv -> string replace; isJson -> Papa.unparse; isXlsx -> read sheet [0] -> toCsv; isGeojson -> json -> Papa.unparse
- JSON.parse at runtime is faster than inlining the data as an object literal once the payload exceeds ~10 KB
- Code caching kicks in when an inline script is larger than 1 KB
- V8 reduced parse/compile time by ~40% by parsing on worker threads
- V8 raw JS parse speed has doubled since Chrome 60
Clear IndexedDB -> readFile. Insert into IndexedDB v1.0
- JPL SWEET ontology
- W3C Geospatial Incubator Group
- RDF, SPARQL, GML, KML