About this Tutorial:
In this notebook, the basics of working with geographic data are introduced.
- Reading in data (points/ geoms)
  - Convert lat/lng columns to point coordinates
  - Geocoding address to coordinates
  - Changing coordinate reference systems
  - Connecting to PostGisDB's
- Basic Operations
  - Saving shape data
  - Get Polygon Centroids
- Working with Points and Polygons
  - Map Points and Polygons
  - Get Points in Polygons
  - Create Choropleths
  - Create Heatmaps (KDE?)
Objectives
By the end of this tutorial users should have an understanding of:
How to read in and process geo-data as a geo-dataframe.
The Coordinate Reference System and Coordinate Encoding
Basic geo-visualization strategies
Background
An expansive discourse on programming and cartography can be found here
Datatypes and Geo-data
Geographic data must be encoded properly in order to realize the full potential of its spatial nature.
If you have read in a dataset using pandas, its data type will be a Dataframe.
It may be converted into a Geo-Dataframe using Geopandas, as demonstrated in the sections below.
You can check a variable's type at any time using the dtype command:
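For instance (a toy dataframe standing in for your own data; the column names are made up):

```python
import pandas as pd

# A toy table standing in for data read via pd.read_csv(...)
df = pd.DataFrame({"CSA2010": ["Allendale/Irvington/S. Hilton"],
                   "hhchpov18": [35.27]})

print(type(df))    # the whole table's type: a plain pandas DataFrame
print(df.dtypes)   # each column's dtype
```
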
Coordinate Reference Systems (CRS)
Make sure the appropriate spatial Coordinate Reference System (CRS) is used when reading in your data!
Per Wikipedia:
A spatial reference system (SRS) or coordinate reference system (CRS) is a coordinate-based local, regional or global system used to locate geographical entities
CRS 4326 is the CRS most people are familiar with when referring to latitudes and longitudes.
In CRS 4326, Baltimore is located at roughly (39.2, -76.6).
BNIA uses CRS 2248 internally
Additional Information: https://docs.qgis.org/testing/en/docs/gentle_gis_introduction/coordinate_reference_systems.html
Ensure all of your geodataframes' coordinates use the same CRS. You can check with the geopandas crs attribute and convert with the to_crs() method:
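As a sketch (a made-up one-point geodataframe; assumes geopandas and pyproj are installed):

```python
import geopandas as gpd
from shapely.geometry import Point

# A one-point GeoDataFrame located in Baltimore, in lat/lng (EPSG:4326)
gdf = gpd.GeoDataFrame({"name": ["Baltimore"]},
                       geometry=[Point(-76.6, 39.2)],
                       crs="EPSG:4326")
print(gdf.crs)

# Reproject to BNIA's internal CRS (EPSG:2248, Maryland State Plane, feet)
gdf_2248 = gdf.to_crs(epsg=2248)
print(gdf_2248.crs)
```

Note how the coordinate values change scale entirely after reprojection: State Plane coordinates are in feet, not degrees.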
Coordinate Encoding
When first receiving a spatial dataset, the spatial column may need to be encoded to convert its 'text' values into 'coordinate' data types before it can be understood and processed accordingly.
Namely, there are two ways to encode text into coordinates:
df[geom] = df[geom].apply(lambda x: loads( str(x) ))
df[geom] = [Point(xy) for xy in zip(df.x, df.y)]
The first approach works for text of the form "POINT (-76 39)" and will decode the text into coordinates. The second approach is useful when creating points from two columns containing lat/lng information; it creates Point coordinates from the two columns.
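Here is a small self-contained demo of both approaches, using made-up coordinates:

```python
import pandas as pd
from shapely.geometry import Point
from shapely.wkt import loads

# Approach 1: a column of WKT text like "POINT (-76.6 39.2)"
df1 = pd.DataFrame({"geometry": ["POINT (-76.6 39.2)", "POINT (-76.5 39.3)"]})
df1["geometry"] = df1["geometry"].apply(lambda x: loads(str(x)))

# Approach 2: separate x (lng) and y (lat) columns
df2 = pd.DataFrame({"x": [-76.6, -76.5], "y": [39.2, 39.3]})
df2["geometry"] = [Point(xy) for xy in zip(df2.x, df2.y)]

print(df1["geometry"].iloc[0].x)   # -76.6
```
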
Raster Vs Vector Data
There exist two types of geospatial data, raster and vector, each with its own file formats.
This lab will only cover vector data.
Vector Data
Vector Data: individual points stored as (x,y) coordinate pairs. These points can be joined to create lines or polygons.
Formats of Vector Data
Esri Shapefile – .shp, .dbf, .shx. Description: industry standard, most widely used. The three files listed above are needed to make a shapefile; additional file formats may be included.
Geographic JavaScript Object Notation – .geojson, .json. Description: second most popular; GeoJSON is typically used in web-based mapping, storing the coordinates as JSON.
Geography Markup Language – .gml. Description: similar to GeoJSON, but GML files are larger for the same amount of information.
Google Keyhole Markup Language – .kml, .kmz. Description: XML-based and predominantly used for Google Earth. KMZ is the newer, zipped version of KML.
Raster Data
Raster Data: cell-based data where each cell represents geographic information. An aerial photograph is one such example, where each pixel has a color value.
Raster data files: GeoTIFF – .tif, .tiff, .ovr; ERDAS Imagine – .img; IDRISI Raster – .rst, .rdc
Information Sourced From: https://towardsdatascience.com/getting-started-with-geospatial-works-1f7b47955438
Vector Data: Census Geographic Data:
Geographic coordinate data is provided by the Census and complements their census geographies.
https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.2010.html
https://www.census.gov/programs-surveys/acs/geography-acs/geography-boundaries-by-year.html
BNIA creates and freely provides geographic boundary data that complement these CSAs.
Retrieve GIS Data
As mentioned earlier:
When you use a pandas function to 'read-in' a dataset, the returned value is of a datatype called a 'Dataframe'.
We need a 'Geo-Dataframe', however, to effectively work with spatial data.
While Pandas does not support Geo-Dataframes, Geo-pandas does.
Geopandas has everything you love about pandas, but with added support for geo-spatial data.
Principle benefits of using Geopandas over Pandas when working with spatial data:
The geopandas plot function will now render a map by default using your 'spatial-geometries' column.
Libraries exist for spatial operations and interactive mapping.
There are many ways to read spatial data into a geo-dataframe using geo-pandas.
Namely, it means reading in Geo-Spatial-data from a:
(.geojson or .shp) file directly using Geo-pandas
(.csv, .json) file using Pandas and convert it to Geo-Pandas
using a prepared 'geometry' column
by transforming latitude and longitude columns into a 'geometry' column.
acquiring coordinates from an address
mapping your non-spatial-data to data-with-space
Connecting to a DB
We will review each one below
Approach 1: Reading in Data Directly
If you are using Geopandas, direct imports only work with GeoJSON and shapefiles.
Spatial coordinate data is properly encoded with these file types, which makes them particularly easy to use.
You can perform this using geopandas' read_file() function.
As you can see, the resultant variable is of type GeoDataFrame.
geopandas.geodataframe.GeoDataFrame

GeoDataFrames are only possible when one of the columns is of a 'geometry' datatype.

OBJECTID int64
CSA2010 object
hhchpov14 float64
hhchpov15 float64
hhchpov16 float64
hhchpov17 float64
hhchpov18 float64
hhchpov19 float64
CSA2020 object
hhchpov20 float64
hhchpov21 float64
Shape__Area float64
Shape__Length float64
geometry geometry
dtype: object

Awesome. So that means, now you can plot maps all pretty like:
And now let's take a peek at the raw data:
OBJECTID | CSA2010 | hhchpov14 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | CSA2020 | hhchpov20 | hhchpov21 | Shape__Area | Shape__Length | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allendale/Irvington/S. Hilton | 41.55 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | Allendale/Irvington/S. Hilton | 21.42 | 21.42 | 6.38e+07 | 38770.17 | POLYGON ((-76.65726 39.27600, -76.65726 39.276... |
I'll show you more ways to save the data later, but for our example in the next section to work, we need a CSV.
We can make one by saving the geo-dataframe above using the to_csv() function.
The spatial data will be stored in an encoded form that will make it easy to re-open in the future.
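A sketch of that round trip, with a toy table standing in for the geo-dataframe above. Note how the geometry comes back as plain text:

```python
import pandas as pd
from shapely.geometry import Point

# A toy table whose geometry column holds shapely objects
gdf_like = pd.DataFrame({"CSA2010": ["Example CSA"],
                         "geometry": [Point(-76.6, 39.2)]})

# Saving to CSV stringifies each geometry as WKT-style text
gdf_like.to_csv("example.csv", index=False)

reloaded = pd.read_csv("example.csv")
print(reloaded["geometry"].iloc[0])   # plain text, e.g. "POINT (-76.6 39.2)"
print(reloaded["geometry"].dtype)     # object, no longer a geometry
```
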
Approach 2: Converting Pandas into Geopandas
Approach 2: Method 1: Convert using a pre-formatted 'geometry' column
This approach loads a map using a geometry column
In our previous example, we saved a geo-dataframe as a csv.
Now let's re-open it using pandas!
Great!
But now what?
Well, for starters, regardless of the project you are working on: It's always a good idea to inspect your data.
This is particularly important if you don't know what you're working with.
Unnamed: 0 | OBJECTID | CSA2010 | hhchpov14 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | CSA2020 | hhchpov20 | hhchpov21 | Shape__Area | Shape__Length | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | Allendale/Irvington/S. Hilton | 41.55 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | Allendale/Irvington/S. Hilton | 21.42 | 21.42 | 6.38e+07 | 38770.17 | POLYGON ((-76.657 39.276, -76.657 39.276, -76.... |
Take notice of how the geometry column has a special formatting.
All spatial data must take on a similar encoded form for it to be properly interpreted as a spatial data-type.
As far as I can tell, this is near-identical to the table I printed out in our last example.
BUT WAIT!
You'll notice that if I run the plot function, a pretty map will not appear by default.
Why is this? Because you're not working with a geo-dataframe, just a dataframe!
Take a look:
geopandas.geodataframe.GeoDataFrame

Okay... So that's not right..
What can we do about this?
Well for one, our spatial data (in the geometry-column) is not of the right data-type even though it takes on the right form.
Unnamed: 0 int64
OBJECTID int64
CSA2010 object
hhchpov14 float64
hhchpov15 float64
hhchpov16 float64
hhchpov17 float64
hhchpov18 float64
hhchpov19 float64
CSA2020 object
hhchpov20 float64
hhchpov21 float64
Shape__Area float64
Shape__Length float64
geometry geometry
dtype: object

Ok. So how do we change it? Well, since it's already been properly encoded...
You can convert a column's data-type from an object (or whatever else) to a 'geometry' using shapely's loads function.
In the example below, we convert the datatype for all records in the 'geometry' column.
That's all! Now let's check the geometry column's data-type and the entire table's data-type.
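A minimal sketch of that decoding step, on a made-up WKT column:

```python
import pandas as pd
from shapely.wkt import loads

# A dataframe whose geometry column is still plain text (as after pd.read_csv)
df = pd.DataFrame({"geometry": ["POINT (-76.6 39.2)", "POINT (-76.5 39.3)"]})
print(df["geometry"].dtype)   # object: just strings

# Decode every WKT string into a shapely geometry object
df["geometry"] = df["geometry"].apply(lambda x: loads(str(x)))
print(type(df["geometry"].iloc[0]).__name__)   # Point
```
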
Unnamed: 0 int64
OBJECTID int64
CSA2010 object
hhchpov14 float64
hhchpov15 float64
hhchpov16 float64
hhchpov17 float64
hhchpov18 float64
hhchpov19 float64
CSA2020 object
hhchpov20 float64
hhchpov21 float64
Shape__Area float64
Shape__Length float64
geometry geometry
dtype: object

geopandas.geodataframe.GeoDataFrame

As you can see, we have a geometry column of the right datatype, but our table is still only just a dataframe.
But now, you are ready to convert your entire pandas dataframe into a geo-dataframe.
You can do that by running the following function:
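A sketch of the promotion step (toy data; the CRS is assumed to be 4326):

```python
import pandas as pd
import geopandas as gpd
from shapely.wkt import loads

df = pd.DataFrame({"CSA2010": ["Example CSA"],
                   "geometry": ["POINT (-76.6 39.2)"]})
df["geometry"] = df["geometry"].apply(loads)   # text -> geometry first

# Promote the plain DataFrame into a GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
print(type(gdf).__name__)   # GeoDataFrame
```
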
Aaaand BOOM.
goes the dy-no-mite
geopandas.geodataframe.GeoDataFrame

Approach 2: Method 2: Convert Column(s) to Coordinates
This is the generic example but it will not work since no URL is given.
"\nexe_df['x'] = pd.to_numeric(exe_df['x'], errors='coerce')\nexe_df['y'] = pd.to_numeric(exe_df['y'], errors='coerce')\n# exe_df = exe_df.replace(np.nan, 0, regex=True)\n \n# An example of loading in an internal BNIA file\ngeometry=[Point(xy) for xy in zip(exe_df.x, exe_df.y)]\nexe_gdf = gpd.GeoDataFrame( exe_df.drop(['x', 'y'], axis=1), crs=crs, geometry=geometry)\n"Approach 2: Method 2: Example: Geoloom
Since I do not readily have a dataset with lat and lng columns, I will have to make one.
We can split the coordinates from a geodataframe like so...
The first thing you will want to do when given a dataset with a coordinates column is ensure its datatype.
Unnamed: 0 | OBJECTID | Data_type | Attach | ProjNm | Descript | Location | URL | Name | PhEmail | Comments | GlobalID | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | 5 | Artists & Resources | NaN | Open Works | Maker Space | 1400 Greenmount Ave, Baltimore, MD, 21202, USA | http://www.openworksbmore.com | Alyce Myatt | alycemyattconsulting@gmail.com | One of Jane Brown's projects! | 140e7db7-33f1-49cd-8133-b6f75dba5851 | POINT (-76.608 39.306) |
Here's a neat trick to make it more presentable, because those points mean nothing to me.
Approach 2: Method 3: Using a Crosswalk (Need Crosswalk on Esri)
When you want to merge two datasets that do not share a common column, it is often useful to create a 'crosswalk' file that 'maps' records between the two datasets. We can do this to append spatial data when a direct merge is not readily evident.
Check out this next example, where we pull ACS Census data, take its 'tract' column, and map each tract to a community. We can then aggregate the points by the communities they belong to and map them in a choropleth!
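The crosswalk logic can be sketched with plain pandas merges (toy data; the CSA labels and values here are hypothetical):

```python
import pandas as pd

# Left: tract-level data; Crosswalk: tract -> CSA; Right: CSA-level data
left = pd.DataFrame({"tract": [271002, 280500], "income": [38358, 30000]})
crosswalk = pd.DataFrame({"TRACTCE10": [271002, 280500],
                          "CSA2010": ["CSA A", "CSA B"]})
right = pd.DataFrame({"CSA2010": ["CSA A", "CSA B"],
                      "hhchpov18": [30.0, 55.0]})

# Step 1: attach CSA labels to the tract records via the crosswalk
merged = left.merge(crosswalk, left_on="tract", right_on="TRACTCE10")
# Step 2: merge with the CSA-level dataset on the now-shared label
merged = merged.merge(right, on="CSA2010", how="outer")
print(merged[["tract", "CSA2010", "income", "hhchpov18"]])
```
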
We will set up our ACS query variables right here for easy changing
And now we will call the function with those variables and check out the result
B19049_001E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Total | B19049_002E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_under_25_years | B19049_003E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_25_to_44_years | B19049_004E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_45_to_64_years | B19049_005E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_65_years_and_over | state | county | tract | |
---|---|---|---|---|---|---|---|---|
NAME | ||||||||
Census Tract 2710.02 | 38358 | -666666666 | 34219 | 40972 | 37143 | 24 | 510 | 271002 |
This contains the CSA labels we will map our tracts to. This terminal command will download it
Here
TRACTCE10 | GEOID10 | CSA2010 | |
---|---|---|---|
199 | 280500 | 24510280500 | Oldtown/Middle East |
CSA2010 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | geometry | |
---|---|---|---|---|---|---|
0 | Allendale/Irvington/S. Hilton | 38.93 | 34.73 | 32.77 | 35.27 | POLYGON ((-76.65726 39.27600, -76.65726 39.276... |
A simple example of how this would work
---Handling Left Dataset Options---
Left column: tract
---Handling Right Dataset Options---
Right column: tract
---Ensuring Compatability Between merge_how (val: 'outer') and the Right Dataset---
Column or ['inner','left','right','outer'] value: {merge_how}
---Checking Crosswalk Dataset Options---
crosswalk_left_col TRACTCE10
---Casting Datatypes from-to: Left->Crosswalk Datasets---
Before Casting:
-> Column One: tract int64
-> Column Two: TRACTCE10 int64
After Casting:
-> Column One: tract int64
-> Column Two: TRACTCE10 int64
---Casting Datatypes from-to: Right->Crosswalk Datasets---
Before Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
After Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
---All checks complete. Status: True ---
---PERFORMING MERGE : LEFT->CROSSWALK ---
Column One : tract int64
How: CSA2010
Column Two : TRACTCE10 int64
Local Column Values Not Matched [10000] 2
Crosswalk Unique Column Values
[ 10100 10200 10300 10400 10500 20100 20200 20300 30100 30200 40100 40200 60100 60200 60300 60400 70100 70200 70300 70400 80101 80102 80200 80301 80302 80400 80500 80600 80700 80800 90100 90200 90300 90400 90500 90600 90700 90800 90900 100100 100200 100300 110100 110200 120100 120201 120202 120300 120400 120500 120600 120700 130100 130200 130300 130400 130600 130700 130803 130804 130805 130806 140100 140200 140300 150100 150200 150300 150400 150500 150600 150701 150702 150800 150900 151000 151100 151200 151300 160100 160200 160300 160400 160500 160600 160700 160801 160802 170100 170200 170300 180100 180200 180300 190100 190200 190300 200100 200200 200300 200400 200500 200600 200701 200702 200800 210100 210200 220100 230100 230200 230300 240100 240200 240300 240400 250101 250102 250103 250203 250204 250205 250206 250207 250301 250303 250401 250402 250500 250600 260101 260102 260201 260202 260203 260301 260302 260303 260401 260402 260403 260404 260501 260604 260605 260700 260800 260900 261000 261100 270101 270102 270200
270301 270302 270401 270402 270501 270502 270600 270701 270702 270703 270801 270802 270803 270804 270805 270901 270902 270903 271001 271002 271101 271102 271200 271300 271400 271501 271503 271600 271700 271801 271802 271900 272003 272004 272005 272006 272007 280101 280102 280200 280301 280302 280401 280402 280403 280404 280500]
---PERFORMING MERGE : LEFT->RIGHT ---
Column One : CSA2010 object
How: outer
Column Two : CSA2010 object

NAME object
B19049_001E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Total int64
B19049_002E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_under_25_years int64
B19049_003E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_25_to_44_years int64
B19049_004E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_45_to_64_years int64
B19049_005E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_65_years_and_over int64
state int64
county int64
tract int64
CSA2010 object
hhchpov15 float64
hhchpov16 float64
hhchpov17 float64
hhchpov18 float64
geometry geometry
dtype: object

Approach 2: Method 4: Geocoding Addresses and Landmarks to Coordinates
Sometimes (usually) we just don't have the coordinates of a place, but we do know its address or that it is an established landmark.
In such cases we can attempt to 'geo-code' these points in an automated manner.
While convenient, this process is error-prone, so be sure to check its work!
For this next example to take place, we need a dataset that has a bunch of addresses.
We can use the geoloom dataset from before in this example. We'll just drop the geospatial data.
OBJECTID | Data_type | Attach | ProjNm | Descript | Location | URL | Name | PhEmail | Comments | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Artists & Resources | NaN | Joe | Test | 123 Market Pl, Baltimore, MD, 21202, USA |
But if for whatever reason the link is down, you can use this example dataframe mapping just a few well-known places in Baltimore.
Location | Address | |
---|---|---|
0 | 100 N. Holliday St, Baltimore, MD 21202 | Baltimore City Council |
1 | 200 E Pratt St, Baltimore, MD | The Gallery at Harborplace |
2 | 2401 Liberty Heights Ave, Baltimore, MD | Mondawmin Mall |
3 | 201 E Pratt St, Baltimore, MD | Harborplace |
4 | 3501 Boston St, Baltimore, MD | The Shops at Canton Crossing |
You can use either the Location or Address column to perform the geo-coding on.
This function takes a while; the fewer columns/records, the faster it executes.
Awesome! Now convert the dataframe into a geodataframe and map it!
A little later on, we'll see how to make this even more interactive.
Approach 3: Connecting to a PostGIS database
The following example pulls point geodata from a Postgres database.
We will pull the postgres point data in two manners.
An SQL query that uses ST_Transform(the_geom, 4326) to transform the_geom's CRS from the database's binary encoding into standard lat/lng coordinates.
A plain SQL query, pulling the data in as CRS 2248 via gpd.io.sql.read_postgis() and converting the CRS using .to_crs(epsg=4326).
These examples will not work in Colab, as there is no local database to connect to; they have been commented out for that reason.
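As a commented sketch of the two styles (the table and column names are hypothetical, and no connection is actually opened here):

```python
# Style 1: let PostGIS reproject to lat/lng (EPSG:4326) inside the query.
# "points_table" and "the_geom" are hypothetical names.
sql_transform = ("SELECT name, ST_Transform(the_geom, 4326) AS geom "
                 "FROM points_table;")

# Style 2: pull the geometry as stored (EPSG:2248) and reproject client-side.
sql_plain = "SELECT name, the_geom AS geom FROM points_table;"

# With a live connection `con`, either query would be read like:
# gdf = gpd.io.sql.read_postgis(sql_plain, con, geom_col="geom")
# gdf = gdf.to_crs(epsg=4326)   # only needed for style 2
print(sql_transform)
```
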
Basic Operations
Inspection
geopandas.geodataframe.GeoDataFrame

Advanced
Create Geospatial Functions
Operations:
- Reading in data (points/ geoms)
  - Convert lat/lng columns to point coordinates
  - Geocoding address to coordinates
  - Changing coordinate reference systems
  - Connecting to PostGisDB's
- Basic Operations
  - Saving shape data
  - Get Polygon Centroids
- Working with Points and Polygons
  - Map Points and Polygons
  - Get Points in Polygons
Input(s):
Dataset (points/ bounds) url
Points/ bounds geometry column(s)
Points/ bounds crs's
Points/ bounds mapping color(s)
New filename
Output: File
This function will handle common geo-spatial exploratory methods. It covers everything discussed in the basic operations and more!
OBJECTID | Data_type | Attach | ProjNm | Descript | Location | URL | Name | PhEmail | Comments | GlobalID | geometry | POINT_X | POINT_Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Artists & Resources | NaN | Joe | Test | 123 Market Pl, Baltimore, MD, 21202, USA | e59b4931-e0c8-4d6b-b781-1e672bf8545a | POINT (-76.60661 39.28746) | -76.61 | 39.29 |
Processing geometry is tedious enough to merit its own handler.
As you can see we have a lot of points. Lets see if there is any better way to visualize this.
Example: Using the advanced Functions
Playing with Points: Geoloom
Points In Polygons
The red dots from when we mapped the geoloom points above were a bit too noisy.
Let's create a choropleth instead!
We can do this by aggregating by CSA.
To do this, start off by finding which points are inside of which polygons!
Since the geoloom data does not have a CSA column, we will need to merge it with a dataset that does!
Let's use the childhood poverty link from example one and load it up, because it contains the geometry data and the CSA labels.
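The point-in-polygon lookup can be sketched with geopandas' spatial join (toy polygons and points; the CSA labels are hypothetical):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two toy "CSA" rectangles side by side (coordinates are made up)
csas = gpd.GeoDataFrame(
    {"CSA2010": ["CSA A", "CSA B"]},
    geometry=[Polygon([(-76.7, 39.2), (-76.6, 39.2), (-76.6, 39.3), (-76.7, 39.3)]),
              Polygon([(-76.6, 39.2), (-76.5, 39.2), (-76.5, 39.3), (-76.6, 39.3)])],
    crs="EPSG:4326")

# Three toy points to assign to a CSA
points = gpd.GeoDataFrame(
    {"ProjNm": ["p1", "p2", "p3"]},
    geometry=[Point(-76.65, 39.25), Point(-76.55, 39.25), Point(-76.65, 39.26)],
    crs="EPSG:4326")

# Spatial join: each point picks up the CSA2010 label of its containing polygon
joined = gpd.sjoin(points, csas, how="left", predicate="within")
print(joined[["ProjNm", "CSA2010"]])
```

This is the same idea the custom 'ponp' helper implements; sjoin's `predicate` argument is named `op` in older geopandas versions.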
OBJECTID | CSA2010 | hhchpov14 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | CSA2020 | hhchpov20 | hhchpov21 | Shape__Area | Shape__Length | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allendale/Irvington/S. Hilton | 41.55 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | Allendale/Irvington/S. Hilton | 21.42 | 21.42 | 6.38e+07 | 38770.17 | POLYGON ((-76.65726 39.27600, -76.65726 39.276... |
And now let's pull in our geoloom data. But to be safe, drop the empty geometry columns, or the function directly below will not work.
OBJECTID | Data_type | Attach | ProjNm | Descript | Location | URL | Name | PhEmail | Comments | POINT_X | POINT_Y | GlobalID | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Artists & Resources | NaN | Joe | Test | 123 Market Pl, Baltimore, MD, 21202, USA | -8.53e+06 | 4.76e+06 | e59b4931-e0c8-4d6b-b781-1e672bf8545a | POINT (-76.60661 39.28746) |
And now use a point in polygon method 'ponp' to get the CSA2010 column from our CSA dataset added as a column to each geoloom record.
OBJECTID | CSA2010 | hhchpov14 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | CSA2020 | hhchpov20 | hhchpov21 | Shape__Area | Shape__Length | geometry | pointsinpolygon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allendale/Irvington/S. Hilton | 41.55 | 38.93 | 34.73 | 32.77 | 35.27 | 32.60 | Allendale/Irvington/S. Hilton | 21.42 | 21.42 | 6.38e+07 | 38770.17 | POLYGON ((-76.65726 39.27600, -76.65726 39.276... | 0 |
1 | 2 | Beechfield/Ten Hills/West Hills | 22.31 | 19.42 | 21.22 | 23.92 | 21.90 | 15.38 | Beechfield/Ten Hills/West Hills | 14.77 | 14.77 | 4.79e+07 | 37524.95 | POLYGON ((-76.69479 39.30201, -76.69465 39.301... | 0 |
You'll see you have a 'pointsinpolygon' column now.
OBJECTID | CSA2010 | hhchpov14 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | CSA2020 | hhchpov20 | hhchpov21 | Shape__Area | Shape__Length | geometry | pointsinpolygon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allendale/Irvington/S. Hilton | 41.55 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | Allendale/Irvington/S. Hilton | 21.42 | 21.42 | 6.38e+07 | 38770.17 | POLYGON ((-76.65726 39.27600, -76.65726 39.276... | 0 |
Polygons in Points
Alternately, you can run the ponp function and have the geoloom dataset returned.
We can count the totals per CSA using a group-by count.
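A sketch of the per-CSA count with a plain group-by (toy records, hypothetical labels):

```python
import pandas as pd

# Toy version: each geoloom record already labeled with its CSA
labeled = pd.DataFrame({"CSA2010": ["CSA A", "CSA A", "CSA B"],
                        "ProjNm": ["p1", "p2", "p3"]})

# Count records per CSA
totals = labeled.groupby("CSA2010").size().reset_index(name="count")
print(totals)
```
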
Alternately, we could map the centroids of boundaries within another boundary to find boundaries within boundaries.
Folium
But if that doesn't do it for you, we can also create heat maps and marker clusters
And Time Sliders
Choropleth Timeslider
https://github.com/python-visualization/folium/blob/master/examples/TimeSliderChoropleth.ipynb