Home 00 Dataplay 01 Download And Load 02 Merge Data
 About This T...
  Whats Inside
  Objectives
 SETUP
 Retrieve Dat...
  Get The Prin...
  Get The Seco...
 Perform Merg...
 Final Result
 Advanced
  Function Exp...
  Function Dia...
  Function Exa...
   Interactive ...
   Example Ge...
   Example Ran...
03 Map Basics Intake... 04 Nb 2 Html 05 Map Correlation N... 06 Timelapse Data Gi...

Don't Look! I'm changing!

URL Copied

Musical loops

Please read everything found on the mainpage before continuing; disclaimer and all.

BinderBinderBinderOpen Source Love svg3

NPM LicenseActivePython VersionsGitHub last commit

GitHub starsGitHub watchersGitHub forksGitHub followers

TweetTwitter Follow

About this Tutorial:

Whats Inside?

In this notebook, the basics of how to perform a merge are introduced.

Objectives

By the end of this tutorial users should have an understanding of:

SETUP

Install these libraries onto the virtual environment.

from dataplay.intaker import Intake
%autoreload 2
!nbdev_update_lib
# !nbdev_build_lib --fname './notebooks/02_Merge_Data.ipynb'
# @title Run: Import Modules
 
 # These imports will handle everything
 import numpy as np
 import pandas as pd
pd.set_option('display.max_colwidth', -1)
 pd.set_option('max_colwidth', 20)
 pd.set_option('display.expand_frame_repr', False)
 pd.set_option('display.precision', 2)

Retrieve Datasets

Our example will merge two simple datasets; pulling CSA names using tract ID's.

The First dataset will be obtained from the Census' ACS 5-year serveys.

Functions used to obtain this data were obtained from Tutorial 0) ACS: Explore and Download.

The Second dataset is from a publicly accessible link

Get the Principal dataset.

We will use the function we created in our last tutorial to download the data!

# Change these values in the cell below using different geographic reference codes will change those parameters
 tract = '*'
 county = '510'
 state = '24'
 
 # Specify the download parameters the function will receieve here
 tableId = 'B19001'
 year = '17'
 saveAcs = False
df.head()

Get the Secondary Dataset

Spatial data can be attained by using the 2010 Census Tract Shapefile Picking Tool or search their website for Tiger/Line Shapefiles

The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau’s demographic data, available on data.census.gov.-census.gov

For this example, we will simply pull a local dataset containing columns labeling tracts within Baltimore City and their corresponding CSA (Community Statistical Area). Typically, we use this dataset internally as a "crosswalk" where-upon a succesfull merge using the tract column, will be merged with a 3rd dataset along it's CSA column.

# !wget https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv

or, Alternately

# !curl https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv	> CSA-to-Tract-2010.csv
# Our Example dataset contains Polygon Geometry information. 
 # We want to merge this over to our principle dataset. 
 # we will grab it by matching on either CSA or Tract
 
 # The url listed below is public.
 
 print('Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv')
 
 from dataplay.intaker import Intake
 
 crosswalk = Intake.getData( 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv' ) 
 crosswalk.head()

Perform Merge & Save

The following picture does nothing important but serves as a friendly reminder of the 4 basic join types.

Get Columns from both datasets to match on

You can get these values from the column values above.

Our Examples will work with the prompted values

left_on = input("Left on crosswalk column: ('tract') \n" ) or "tract"
 print(' \n ');
 print( 'Crosswalk Columns ' + str(crosswalk.columns) + '')
 right_on = input("Right on crosswalk column: ('TRACTCE10') \n" ) or "TRACTCE10" 

Specify how the merge will be performed

We will perform a left merge in this example.

It will return our Principal dataset with columns from the second dataset appended to records where their specified columns match.

Actually perfrom the merge

merged_df = merged_df.drop(left_on, axis=1)
 merged_df.head()

As you can see, our Census data will now have a CSA appended to it.

# outFile = input("Please enter the new Filename to save the data to ('acs_csa_merge_test': " )
 # merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL) 

Final Result

if (flag == 'y' or flag == 'Y'):
   left_df = Intake.getData( input("Please enter the location of your Left file: " ) )
 else:
   tract = input("Please enter tract id (*): " ) or "*"
   county = input("Please enter county id (510): " ) or "510"
   state = input("Please enter state id (24): " ) or "24"
   tableId = input("Please enter acs table id (B19001): " ) or "B19001"
   year = input("Please enter acs year (18): " ) or "18"
   saveAcs = input("Save ACS? (Y/N): " ) or "N"
   left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
 
 print('right_df Example: CSA-to-Tract-2010.csv')
 
 right_df = Intake.getData( input("Please enter the location of your right_df file: " ) or 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv' )
 print( 'Left Columns ' + str(left_df.columns))
 print( '\n ')
 print( 'right_df Columns ' + str(right_df.columns) + '\n')
 
 left_on = input("Left on: " ) or 'tract'
 right_on = input("Right on: " ) or 'TRACTCE10'
 how = input("How: (β€˜left’, β€˜right’, β€˜outer’, β€˜inner’) " ) or 'outer'
 
 merged_df = pd.merge(left_df, right_df, left_on=left_on, right_on=right_on, how=how)
 merged_df = merged_df.drop(left_on, axis=1)
 
 # Save the data
 # Save the data
 saveFile = input("Save File ('Y' or 'N'): ") or 'N'
 if saveFile == 'Y' or saveFile == 'y':
   outFile = input("Saved Filename (Do not include the file extension ): ")
   merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL);

Advanced

For this next example to work, we will need to import hypothetical csv files

Intro

The following Python function is a bulked out version of the previous notes.

Description: add columns of data from a foreign dataset into a primary dataset along set parameters.

Purpose: Makes Merging datasets simple

Services

#@ title Run: Create mergeDatasets()
 
 # Worried about infinit interactive-loops. not an issue atm.
 # Crosswalk needs to have exact same column names as left/right datasets
 def mergeDatasets(left_ds=False, right_ds=False, crosswalk_ds=False,
                   left_col=False, right_col=False,
                   crosswalk_left_col = False, crosswalk_right_col = False,
                   merge_how=False, # left right or columnname to retrieve
                   interactive=True):
   # Interactive will ask if use_crosswalk unless crosswalk_ds == 'no'
 
   # 1. Used on Right Dataset in case merge_how is a column to pull. Returns False or Col
   def checkMergeHow(ds, how, interactive):
     inList = how in ['left', 'right', 'outer', 'inner']
     inDf = Intake.checkColumn(ds, how)
     if ( inList or inDf ): return how
     elif ( not interactive ): return False
     else:
       try:
         print('\n Invalid merge column given. \n Please select a value from either list');
         print("\n 1) Pull A single column from the right dataset: ", ds.columns)
         print("OR \n 2) Specify a type of join operation: (β€˜left’, β€˜right’, β€˜outer’, β€˜inner’, columnName) " )
         return checkMergeHow(ds, input("Column Name: " ), interactive);
       except: return False # User probably trying to escape interactivity
 
   # 2i. Load data via url. Coerce Dtype needed for merge.
   def coerceForMerge( msg, first_ds, second_ds, first_col, second_col, interactive ):
       if (interactive):
         print(f'\n---Casting Datatypes from-to: {msg} Datasets---');
         print('Before Casting: ');
         print('-> Column One: ', first_col, first_ds[first_col].dtype)
         print('-> Column Two: ', second_col, second_ds[second_col].dtype)
       second_ds, second_col = Intake.getAndCheck(second_ds, second_col, interactive)
       first_ds, second_ds, status = Intake.coerce(first_ds, second_ds, first_col, second_col, interactive);
       if (not status and interactive): print('\n There was a problem!');
       if (interactive):
         print('\n After Casting: ');
         print('-> Column One: ', first_col, first_ds[first_col].dtype)
         print('-> Column Two: ', second_col, second_ds[second_col].dtypes)
       return first_ds, second_ds, second_col, status
   # 2ii.
   def mergeAndFilter(msg, first_ds, second_ds, first_col, second_col, how, interactive):
       if interactive:
         print(f'---PERFORMING MERGE : {msg}---');
         print('Column One : ', first_col, first_ds[first_col].dtype)
         print('How: ', how)
         print('Column Two : ', second_col, second_ds[second_col].dtype)
       first_ds = mergeOrPull(first_ds, second_ds, first_col, second_col, how)
       return filterEmpties(first_ds, second_ds, first_col, second_col, how, interactive)
 
   # Decide to perform a merge or commit a pull
   def mergeOrPull(df, cw, left_on, right_on, how):
 
     def merge(df, cw, left_on, right_on, how):
       df = pd.merge(df, cw, left_on=left_on, right_on=right_on, how=how)
       # df.drop(left_on, axis=1)
       df[right_on] = df[right_on].fillna(value='empty')
       return df
 
     def pull(df, cw, left_on, right_on, how):
       crswlk = dict(zip(cw[right_on], cw[how]  ) )
       dtype = df[left_on].dtype
       if dtype =='object':  df[how] = df.apply(lambda row: crswlk.get(str(row[left_on]), "empty"), axis=1)
       elif dtype == 'int64':
         df[how] = df.apply(lambda row: crswlk.get(int(row[left_on]), "empty"), axis=1)
       return df
 
     mergeType = how in ['left', 'right', 'outer', 'inner']
     if mergeType: return merge(df, cw, left_on, right_on, how)
     else: return pull(df, cw, left_on, right_on, how)
 
   # 2iiii. Filter between matched records and not.
   def filterEmpties(df, cw, left_on, right_on, how, interactive):
     if how in ['left', 'right', 'outer', 'inner']: how = right_on
     nomatch = df.loc[df[how] == 'empty']
     nomatch = nomatch.sort_values(by=left_on, ascending=True)
 
     if nomatch.shape[0] > 0:
       # Do the same thing with our foreign tracts
       if(interactive):
         print('\n Local Column Values Not Matched ')
         print(nomatch[left_on].unique() )
         print(len(nomatch[left_on]))
         print('\n Crosswalk Unique Column Values')
         print(cw[right_on].unique() )
 
     # Create a new column with the tracts value mapped to its corresponding value from the crossswalk
     df[how].replace('empty', np.nan, inplace=True)
     df.dropna(subset=[how], inplace=True)
     # cw = cw.sort_values(by=how, ascending=True)
     return df
 
   # 0. Retrieve the left and right dataset.
   if (interactive): print('---Handling Left Dataset Options---');
   left_ds, left_col = Intake.getAndCheck(left_ds, left_col, interactive)
   if (interactive): print('Left column:', left_col)
 
   if (interactive): print('\n---Handling Right Dataset Options---');
   right_ds, right_col  = Intake.getAndCheck(right_ds, right_col, interactive)
   if (interactive): print('Right column:', left_col)
 
   if (interactive): print(f"\n---Ensuring Compatability Between merge_how (val: '{merge_how}') and the Right Dataset---");
   merge_how = checkMergeHow(right_ds, merge_how, interactive)
   if (interactive): print("Column or ['inner','left','right','outer'] value: ", merge_how)
 
   # 1. Retrieve the crosswalk dataset: check left-cw, right-cw. try coercing.
   if (interactive): print(f'\n---Checking Crosswalk Dataset Options---')
   # if its a df
   if (not Intake.isPandas(crosswalk_ds)):
     default = str(crosswalk_ds).lower() == 'false'
     # If the user used the the default crosswalk value 'False' as them if they want to use one.
     if (default and interactive ): crosswalk_ds = input("\nProvide a Crosswalk? ( URL/ PATH or  'NO'/ / 'FALSE' ) ") or  False
     # Check if user opted to not use a crosswalk
     use_crosswalk = not ((str(crosswalk_ds).lower() in ["no", '', 'none', 'false']))
     if (use_crosswalk):
       crosswalk_ds, crosswalk_left_col = Intake.getAndCheck(crosswalk_ds, crosswalk_left_col, interactive)
       crosswalk_ds, crosswalk_right_col = Intake.getAndCheck(crosswalk_ds, crosswalk_right_col, interactive)
 
   # 3. Coerce all datasets for Merge.
   if ( Intake.isPandas(crosswalk_ds) ):
     print('crosswalk_left_col',crosswalk_left_col)
     left_ds, crosswalk_ds, crosswalk_left_col, status = coerceForMerge( 'Left->Crosswalk', left_ds, crosswalk_ds, left_col, crosswalk_left_col, interactive )
     right_ds, crosswalk_ds, crosswalk_right_col, status = coerceForMerge( 'Right->Crosswalk',right_ds, crosswalk_ds, right_col, crosswalk_right_col, interactive )
   else:
     left_ds, right_ds, right_col, status = coerceForMerge('Left->Right', left_ds, right_ds, left_col, right_col, interactive )
 
   if (interactive): print('\n---All checks complete. Status: ', status, '---\n');
   if ( not status ):
     if (interactive):print('Merge Incomplete. Thank you!');
     return False;
   else:
     if (Intake.isPandas(crosswalk_ds)):
       left_ds = mergeAndFilter('LEFT->CROSSWALK', left_ds, crosswalk_ds, left_col, crosswalk_left_col, crosswalk_right_col, interactive)
       left_col = crosswalk_right_col
     left_ds = mergeAndFilter('LEFT->RIGHT', left_ds, right_ds, left_col, right_col, merge_how, interactive)
   return left_ds

Function Explanation

Input(s):

  • Dataset url
  • Crosswalk Url
  • Right On
  • Left On
  • How
  • New Filename

Output: File

How it works:

  • Read in datasets

  • Perform Merge

  • If the 'how' parameter is equal to ['left', 'right', 'outer', 'inner']

    • then a merge will be performed.
  • If a column name is provided in the 'how' parameter

    • then that single column will be pulled from the right dataset as a new column in the left_ds.

Function Diagrams

Diagram the mergeDatasets()

mergeDatasets Flow Chart

Gannt Chart mergeDatasets()

Sequence Diagram mergeDatasets()

Function Examples

# from dataplay.geoms import readInGeometryData 

Interactive Example 1. Merge Esri Data

# Hhchpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
 # Hhchpov = Hhchpov[['CSA2010', 'hhchpov15',	'hhchpov16',	'hhchpov17',	'hhchpov18']] 
 left_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
 left_col = 'CSA2010'
 
 # Table: Household Poverty 
 # Hhpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
 # Hhpov = Hhpov[['CSA2010', 'hhpov15',	'hhpov16',	'hhpov17',	'hhpov18']] 
 right_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
 right_col='CSA2010'
 
 merge_how = 'outer'
 interactive = False
 
 merged_df = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds='no',
                   left_col=left_col, right_col=right_col,
                   crosswalk_left_col = False, crosswalk_right_col = False,
                   merge_how=merge_how, # left right or columnname to retrieve
                   interactive=interactive)
 merged_df.head()

Example 2 ) Get CSA and Geometry with a Crosswalk using 3 links

# Change these values in the cell below using different geographic reference codes will change those parameters
 tract = '*'
 county = '510' # '059' # 153 '510'
 state = '24' #51
  
 # Specify the download parameters the function will receieve here
 tableId = 'B19049' # 'B19001'
 year = '17'
 saveAcs = False 
import IPython 
 # from IPython.core.display import HTML
 IPython.core.display.HTML("")
 # state, county, tract, tableId, year, saveOriginal, save 
 left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
 left_df.head(1) 
# Columns: Address(es), Census Tract
 left_ds = left_df
 left_col = 'tract'
 
 # Table: Crosswalk Census Communities
 # 'TRACT2010', 'GEOID2010', 'CSA2010'
 crosswalk_ds = 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv'
 crosswalk_left_col = 'TRACTCE10'
 crosswalk_right_col = 'CSA2010'
 
 # Table: 
 right_ds = 'https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'
 right_col = 'CSA2010'
 
 interactive = False
 merge_how = 'outer'
 
 merged_df_geom = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=crosswalk_ds,
                   left_col=left_col, right_col=right_col,
                   crosswalk_left_col = crosswalk_left_col, crosswalk_right_col = crosswalk_right_col,
                   merge_how=merge_how, # left right or columnname to retrieve
                   interactive=interactive)
 
 merged_df_geom.head()

Here we can save the data so that it may be used in later tutorials.

# merged_df.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)

Example 3: Ran Alone