Please read everything found on the main page before continuing; disclaimer and all.
About this Tutorial:
What's Inside?
In this notebook, the basics of how to perform a merge are introduced.
- We will merge two datasets
- We will merge two datasets using a crosswalk
Objectives
By the end of this tutorial, users should have an understanding of:
- How dataset merges are performed
- The different types of joins a merge can use
- The 'mergeDatasets' function, and how to use it in the future
SETUP
Load these libraries into the environment before continuing.
from dataplay.intaker import Intake
%load_ext autoreload
%autoreload 2
!nbdev_update_lib
# !nbdev_build_lib --fname './notebooks/02_Merge_Data.ipynb'
# @title Run: Import Modules
# These imports will handle everything
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)  # -1 is deprecated in newer pandas; None means no limit
pd.set_option('max_colwidth', 20)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
Retrieve Datasets
Our example will merge two simple datasets, pulling CSA names using tract IDs.
The first dataset will be obtained from the Census' ACS 5-year surveys.
The functions used to obtain this data come from Tutorial 0) ACS: Explore and Download.
The second dataset is pulled from a publicly accessible link.
Get the Principal dataset.
We will use the function we created in our last tutorial to download the data!
# Change these values in the cell below; using different geographic reference codes will change those parameters
tract = '*'
county = '510'
state = '24'

# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = False
# Use the download function from Tutorial 0 to retrieve the data.
# (The import path below is assumed from that tutorial; adjust it if your module layout differs.)
from dataplay.acsDownload import retrieve_acs_data
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()
Get the Secondary Dataset
Spatial data can be obtained using the 2010 Census Tract Shapefile Picking Tool, or by searching the Census website for TIGER/Line Shapefiles.
The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau's demographic data, available on data.census.gov. (census.gov)
For this example, we will simply pull a local dataset containing columns labeling tracts within Baltimore City and their corresponding CSA (Community Statistical Area). Typically, we use this dataset internally as a "crosswalk": after a successful merge on the tract column, the result can then be merged with a third dataset along its CSA column.
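Here is a toy sketch of that crosswalk pattern. All values and column names below are made up for illustration only; the real files and column names are used later in this notebook.

import pandas as pd

# Hypothetical principal dataset, keyed by tract
acs = pd.DataFrame({'tract': ['000100', '000200'], 'median_income': [41000, 62000]})
# Hypothetical crosswalk: tract -> CSA
cw = pd.DataFrame({'tract': ['000100', '000200'], 'CSA': ['Downtown', 'Midtown']})
# Hypothetical third dataset, keyed by CSA
pov = pd.DataFrame({'CSA': ['Downtown', 'Midtown'], 'hhpov': [23.1, 11.4]})

# The first merge attaches a CSA to each tract; the second merge then joins on that CSA.
step1 = pd.merge(acs, cw, on='tract', how='left')
final = pd.merge(step1, pov, on='CSA', how='left')
print(final)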
# !wget https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv
Or, alternatively:
# !curl https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv > CSA-to-Tract-2010.csv
# Our example dataset contains Polygon Geometry information.
# We want to merge this over to our principal dataset.
# We will grab it by matching on either CSA or Tract.
# The url listed below is public.
print('Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv')
from dataplay.intaker import Intake
crosswalk = Intake.getData('https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv')
crosswalk.head()
Perform Merge & Save
The following picture does nothing important but serves as a friendly reminder of the 4 basic join types. A small pandas sketch after the list makes them concrete.

- Left - Returns all left records; includes a right record only if it has a match
- Right - Returns all right records; includes a left record only if it has a match
- Full (Outer) - Returns all records regardless of whether the keys match
- Inner - Returns only records where the keys match
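To make the four join types concrete, here is a small self-contained pandas sketch ('outer' is pandas' name for the full join). The tract numbers and CSA names are made up purely for illustration.

import pandas as pd

left = pd.DataFrame({'tract': ['000100', '000200', '000300'],
                     'households': [120, 85, 97]})
right = pd.DataFrame({'tract': ['000200', '000300', '000400'],
                      'CSA': ['Downtown', 'Midtown', 'Cherry Hill']})

# Compare how many rows survive and where NaNs appear under each join type.
for how in ['left', 'right', 'outer', 'inner']:
    print(f'--- {how} join ---')
    print(pd.merge(left, right, on='tract', how=how), '\n')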
Get Columns from both datasets to match on
You can get these values from the column listings above. Our examples will work with the prompted default values.
left_on = input("Left on crosswalk column: ('tract') \n") or "tract"
print(' \n ')
print('Crosswalk Columns ' + str(crosswalk.columns) + '')
right_on = input("Right on crosswalk column: ('TRACTCE10') \n") or "TRACTCE10"
Specify how the merge will be performed
We will perform a left merge in this example.
It will return our Principal dataset with columns from the second dataset appended to records where their specified columns match.
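The original cell for this step is not shown above, so here is a minimal stand-in that simply fixes the join type described in the text.

# Hypothetical stand-in for the missing cell: choose the join type described above.
how = 'left'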
Actually perform the merge
# Merge the principal (df) and crosswalk datasets on the columns chosen above
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
merged_df.head()
As you can see, our Census data will now have a CSA appended to it.
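As an optional, hedged sanity check (assuming the left merge above, and that right_on still names the crosswalk's key column), you can count how many records found a match:

# With a left merge, records that found no crosswalk match carry NaN in the crosswalk key column.
print('Matched records:  ', merged_df[right_on].notna().sum())
print('Unmatched records:', merged_df[right_on].isna().sum())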
# outFile = input("Please enter the new Filename to save the data to ('acs_csa_merge_test'): ")
# merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL)
Final Result
import csv  # provides csv.QUOTE_ALL for the optional save step below

# 'flag' controls whether the left dataset comes from a file/URL or from the ACS download.
# The original cell defining it is not shown here, so this prompt is an assumed stand-in.
flag = input("Load the left dataset from a file/URL? ('Y' or 'N'): ") or 'N'

if (flag == 'y' or flag == 'Y'):
    left_df = Intake.getData( input("Please enter the location of your Left file: ") )
else:
    tract = input("Please enter tract id (*): ") or "*"
    county = input("Please enter county id (510): ") or "510"
    state = input("Please enter state id (24): ") or "24"
    tableId = input("Please enter acs table id (B19001): ") or "B19001"
    year = input("Please enter acs year (18): ") or "18"
    saveAcs = input("Save ACS? (Y/N): ") or "N"
    left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)

print('right_df Example: CSA-to-Tract-2010.csv')
right_df = Intake.getData(
    input("Please enter the location of your right_df file: ")
    or 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv'
)

print('Left Columns ' + str(left_df.columns))
print('\n ')
print('right_df Columns ' + str(right_df.columns) + '\n')

left_on = input("Left on: ") or 'tract'
right_on = input("Right on: ") or 'TRACTCE10'
how = input("How: ('left', 'right', 'outer', 'inner') ") or 'outer'

merged_df = pd.merge(left_df, right_df, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)

# Save the data
saveFile = input("Save File ('Y' or 'N'): ") or 'N'
if saveFile == 'Y' or saveFile == 'y':
    outFile = input("Saved Filename (Do not include the file extension ): ")
    merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL)
Advanced
For this next example to work, we will need to import hypothetical CSV files.
Intro
The following Python function is a bulked-out version of the previous notes.
- It contains everything from the tutorial plus more.
- It can be imported and used in future projects or stand alone.
Description: Adds columns of data from a foreign dataset into a primary dataset along set parameters.
Purpose: Makes merging datasets simple.
Services
- Merge two datasets without a crosswalk
- Merge two datasets with a crosswalk
#@title Run: Create mergeDatasets()
# Worried about infinite interactive-loops. Not an issue atm.
# Crosswalk needs to have exact same column names as left/right datasets

def mergeDatasets(left_ds=False, right_ds=False, crosswalk_ds=False,
                  left_col=False, right_col=False,
                  crosswalk_left_col=False, crosswalk_right_col=False,
                  merge_how=False,  # left, right, or a column name to retrieve
                  interactive=True):
    # Interactive will ask if use_crosswalk unless crosswalk_ds == 'no'

    # 1. Used on the Right Dataset in case merge_how is a column to pull. Returns False or a column name.
    def checkMergeHow(ds, how, interactive):
        inList = how in ['left', 'right', 'outer', 'inner']
        inDf = Intake.checkColumn(ds, how)
        if (inList or inDf): return how
        elif (not interactive): return False
        else:
            try:
                print('\n Invalid merge column given. \n Please select a value from either list')
                print("\n 1) Pull a single column from the right dataset: ", ds.columns)
                print("OR \n 2) Specify a type of join operation: ('left', 'right', 'outer', 'inner', columnName) ")
                return checkMergeHow(ds, input("Column Name: "), interactive)
            except:
                return False  # User probably trying to escape interactivity

    # 2i. Load data via url. Coerce the dtypes needed for the merge.
    def coerceForMerge(msg, first_ds, second_ds, first_col, second_col, interactive):
        if (interactive):
            print(f'\n---Casting Datatypes from-to: {msg} Datasets---')
            print('Before Casting: ')
            print('-> Column One: ', first_col, first_ds[first_col].dtype)
            print('-> Column Two: ', second_col, second_ds[second_col].dtype)
        second_ds, second_col = Intake.getAndCheck(second_ds, second_col, interactive)
        first_ds, second_ds, status = Intake.coerce(first_ds, second_ds, first_col, second_col, interactive)
        if (not status and interactive): print('\n There was a problem!')
        if (interactive):
            print('\n After Casting: ')
            print('-> Column One: ', first_col, first_ds[first_col].dtype)
            print('-> Column Two: ', second_col, second_ds[second_col].dtypes)
        return first_ds, second_ds, second_col, status

    # 2ii. Merge (or pull), then filter out records that did not match.
    def mergeAndFilter(msg, first_ds, second_ds, first_col, second_col, how, interactive):
        if interactive:
            print(f'---PERFORMING MERGE : {msg}---')
            print('Column One : ', first_col, first_ds[first_col].dtype)
            print('How: ', how)
            print('Column Two : ', second_col, second_ds[second_col].dtype)
        first_ds = mergeOrPull(first_ds, second_ds, first_col, second_col, how)
        return filterEmpties(first_ds, second_ds, first_col, second_col, how, interactive)

    # 2iii. Decide to perform a merge or commit a pull.
    def mergeOrPull(df, cw, left_on, right_on, how):

        def merge(df, cw, left_on, right_on, how):
            df = pd.merge(df, cw, left_on=left_on, right_on=right_on, how=how)
            # df.drop(left_on, axis=1)
            df[right_on] = df[right_on].fillna(value='empty')
            return df

        def pull(df, cw, left_on, right_on, how):
            crswlk = dict(zip(cw[right_on], cw[how]))
            dtype = df[left_on].dtype
            if dtype == 'object':
                df[how] = df.apply(lambda row: crswlk.get(str(row[left_on]), "empty"), axis=1)
            elif dtype == 'int64':
                df[how] = df.apply(lambda row: crswlk.get(int(row[left_on]), "empty"), axis=1)
            return df

        mergeType = how in ['left', 'right', 'outer', 'inner']
        if mergeType: return merge(df, cw, left_on, right_on, how)
        else: return pull(df, cw, left_on, right_on, how)

    # 2iiii. Filter between matched records and not.
    def filterEmpties(df, cw, left_on, right_on, how, interactive):
        if how in ['left', 'right', 'outer', 'inner']: how = right_on
        nomatch = df.loc[df[how] == 'empty']
        nomatch = nomatch.sort_values(by=left_on, ascending=True)
        if nomatch.shape[0] > 0:
            # Do the same thing with our foreign tracts
            if (interactive):
                print('\n Local Column Values Not Matched ')
                print(nomatch[left_on].unique())
                print(len(nomatch[left_on]))
                print('\n Crosswalk Unique Column Values')
                print(cw[right_on].unique())
        # Drop records whose value could not be mapped to a corresponding value from the crosswalk
        df[how].replace('empty', np.nan, inplace=True)
        df.dropna(subset=[how], inplace=True)
        # cw = cw.sort_values(by=how, ascending=True)
        return df

    # 0. Retrieve the left and right datasets.
    if (interactive): print('---Handling Left Dataset Options---')
    left_ds, left_col = Intake.getAndCheck(left_ds, left_col, interactive)
    if (interactive): print('Left column:', left_col)

    if (interactive): print('\n---Handling Right Dataset Options---')
    right_ds, right_col = Intake.getAndCheck(right_ds, right_col, interactive)
    if (interactive): print('Right column:', right_col)

    if (interactive): print(f"\n---Ensuring Compatibility Between merge_how (val: '{merge_how}') and the Right Dataset---")
    merge_how = checkMergeHow(right_ds, merge_how, interactive)
    if (interactive): print("Column or ['inner','left','right','outer'] value: ", merge_how)

    # 1. Retrieve the crosswalk dataset: check left-cw, right-cw. Try coercing.
    if (interactive): print(f'\n---Checking Crosswalk Dataset Options---')
    # Only handle crosswalk options if it is not already a DataFrame
    if (not Intake.isPandas(crosswalk_ds)):
        default = str(crosswalk_ds).lower() == 'false'
        # If the user left the default crosswalk value (False), ask whether they want to use one.
        if (default and interactive):
            crosswalk_ds = input("\nProvide a Crosswalk? ( URL/ PATH or 'NO'// 'FALSE' ) ") or False
        # Check if the user opted not to use a crosswalk
        use_crosswalk = not (str(crosswalk_ds).lower() in ["no", '', 'none', 'false'])
        if (use_crosswalk):
            crosswalk_ds, crosswalk_left_col = Intake.getAndCheck(crosswalk_ds, crosswalk_left_col, interactive)
            crosswalk_ds, crosswalk_right_col = Intake.getAndCheck(crosswalk_ds, crosswalk_right_col, interactive)

    # 3. Coerce all datasets for the merge.
    if (Intake.isPandas(crosswalk_ds)):
        print('crosswalk_left_col', crosswalk_left_col)
        left_ds, crosswalk_ds, crosswalk_left_col, status = coerceForMerge('Left->Crosswalk', left_ds, crosswalk_ds, left_col, crosswalk_left_col, interactive)
        right_ds, crosswalk_ds, crosswalk_right_col, status = coerceForMerge('Right->Crosswalk', right_ds, crosswalk_ds, right_col, crosswalk_right_col, interactive)
    else:
        left_ds, right_ds, right_col, status = coerceForMerge('Left->Right', left_ds, right_ds, left_col, right_col, interactive)

    if (interactive): print('\n---All checks complete. Status: ', status, '---\n')
    if (not status):
        if (interactive): print('Merge Incomplete. Thank you!')
        return False
    else:
        if (Intake.isPandas(crosswalk_ds)):
            left_ds = mergeAndFilter('LEFT->CROSSWALK', left_ds, crosswalk_ds, left_col, crosswalk_left_col, crosswalk_right_col, interactive)
            left_col = crosswalk_right_col
        left_ds = mergeAndFilter('LEFT->RIGHT', left_ds, right_ds, left_col, right_col, merge_how, interactive)
        return left_ds
Function Explanation
Input(s):
- Dataset url
- Crosswalk Url
- Right On
- Left On
- How
- New Filename
Output: the merged dataset (optionally saved to a CSV file)
How it works:
Read in datasets
Perform Merge
If the 'how' parameter is one of 'left', 'right', 'outer', or 'inner'
- then a merge will be performed.
If a column name is provided in the 'how' parameter
- then that single column will be pulled from the right dataset as a new column in the left_ds (see the sketch below).
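A short sketch of both modes, reusing the public crosswalk URL from earlier. It assumes the ACS DataFrame df from the "Get the Principal dataset" cells exists, and that the crosswalk's columns are named 'TRACTCE10' and 'CSA2010', as the later examples suggest.

cw_url = 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv'

# Mode 1: merge_how is a join type -> an ordinary merge; every crosswalk column comes along.
full_merge = mergeDatasets(left_ds=df, right_ds=cw_url, crosswalk_ds='no',
                           left_col='tract', right_col='TRACTCE10',
                           merge_how='left', interactive=False)

# Mode 2: merge_how is a column name -> only that column is pulled onto the left dataset.
just_csa = mergeDatasets(left_ds=df, right_ds=cw_url, crosswalk_ds='no',
                         left_col='tract', right_col='TRACTCE10',
                         merge_how='CSA2010', interactive=False)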
Function Diagrams
Diagram the mergeDatasets()
mergeDatasets Flow Chart
Gantt Chart mergeDatasets()
Sequence Diagram mergeDatasets()
Function Examples
# from dataplay.geoms import readInGeometryData
Interactive Example 1. Merge Esri Data
# Hhchpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
# Hhchpov = Hhchpov[['CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17', 'hhchpov18']]
left_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
left_col = 'CSA2010'

# Table: Household Poverty
# Hhpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
# Hhpov = Hhpov[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18']]
right_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
right_col = 'CSA2010'

merge_how = 'outer'
interactive = False
merged_df = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds='no',
                          left_col=left_col, right_col=right_col,
                          crosswalk_left_col=False, crosswalk_right_col=False,
                          merge_how=merge_how,  # left, right, or a column name to retrieve
                          interactive=interactive)
merged_df.head()
Example 2) Get CSA and Geometry with a Crosswalk using 3 links
# Change these values in the cell below; using different geographic reference codes will change those parameters
tract = '*'
county = '510'  # '059' # 153 '510'
state = '24'  # 51

# Specify the download parameters the function will receive here
tableId = 'B19049'  # 'B19001'
year = '17'
saveAcs = False
import IPython
# from IPython.core.display import HTML
IPython.core.display.HTML("")

# state, county, tract, tableId, year, saveOriginal, save
left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
left_df.head(1)
# Columns: Address(es), Census Tract
left_ds = left_df
left_col = 'tract'

# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
crosswalk_ds = 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv'
crosswalk_left_col = 'TRACTCE10'
crosswalk_right_col = 'CSA2010'

# Table: Household Poverty
right_ds = 'https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'
right_col = 'CSA2010'

interactive = False
merge_how = 'outer'
merged_df_geom = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=crosswalk_ds,
                               left_col=left_col, right_col=right_col,
                               crosswalk_left_col=crosswalk_left_col,
                               crosswalk_right_col=crosswalk_right_col,
                               merge_how=merge_how,  # left, right, or a column name to retrieve
                               interactive=interactive)
merged_df_geom.head()
Here we can save the data so that it may be used in later tutorials.
# merged_df.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)