A free book covering Python data science with notebooks may be found here. It uses Jupyter Notebook, which Google Colab is built on.
Information for this section was pulled from a variety of resources. Click on the links to learn more!
The Colab Environment
Before we get into gritty details, take a moment to explore the Colab environment.
Setup & Configuration:
Begin by visiting https://colab.research.google.com
Click 'NEW PYTHON 3 NOTEBOOK'
For the most part, that is all it takes!
Many modules come pre-installed in the virtual environment.
The following articles can help get you started. Excerpts have been selected and shown in block quotes.
Welcome to Colaboratory
Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:
Zero configuration required
Free access to GPUs
Easy sharing
The document you are reading is not a static web page, but an interactive environment called a Colab notebook that lets you write and execute code.
To execute the code... use the keyboard shortcut "Command/Ctrl+Enter".
# The hash symbol at the front of this line means it's a comment.
# Comments show up in green and will not be interpreted upon code execution.
# In this example, we will perform a simple computation to see its output below.
1 + 1
# Notice how the output is now stored in the 'madeUpVariable' and 'evenMoreMadeUpVariable' variables, so no output is shown below.
madeUpVariable = 1 + 1
evenMoreMadeUpVariable = 13.5
# Variable values persist across blocks (both above and below), so always make sure your variables use the correct values!
# Take note of the following. Output is hidden unless it is either placed on the last line or wrapped in a 'print' function, like so.
print(evenMoreMadeUpVariable)
# Though none of this will show, the program will still run.
evenMoreMadeUpVariable * evenMoreMadeUpVariable
# This will show up since it is on the last line.
madeUpVariable
Colab notebooks allow you to combine markup, executable code, and text into a single document, along with images, HTML, LaTeX and more. When you create your own Colab notebooks, they are stored in your Google Drive account. You can easily share your Colab notebooks with co-workers or friends, allowing them to comment on your notebooks or even edit them. To learn more, see Overview of Colab. To create a new Colab notebook you can use the File menu above, or use the following link: create a new Colab notebook.
Colab notebooks are Jupyter notebooks that are hosted by Colab. To learn more about the Jupyter project, see jupyter.org.
All blockquotes in the section above were pulled from the header's link.
Colab Menu Bar
Everything you need can be found in your menu bar.
Follow the brief outline below:

File (accessible on the left hand drawer)
Locate in drive
New, Open, Upload, Save, Download
Save to Github or Drive
Edit
Undo
Select all, Cut, Copy, Paste, Delete
Find, Replace
Show/Hide all code
Clear all code outputs
View
Table of Contents (accessible on the left hand drawer)
Executed Code History
Diff Notebooks
Collapse Sections
Insert
Code/Text Cell
Section Header
Code Snippet (accessible on the left hand drawer)
Runtime
Run - This action can be used to execute all cells, or at least anything before, after, or in a selected cell.
Interrupt Execution - Just in case the code is caught in an eternal loop or is hanging.
Restart (and optionally re-run all) - Installed modules are kept but must be re-imported.
Factory reset runtime - Must re-install all modules
Tools
Command Palette - Clickable menu of shortcuts
Settings
Site - Set theming
Editor - Set indentation, fontsize, line width
Misc - Enable 'Corgi' and/or 'Kitty' mode.
Keyboard Shortcuts
Help
Overview of Colaboratory Features
Features in the header link's article are accessible from the Menu Bar.
Colaboratory "magics" are shorthand annotations that change how a cell's text is executed.
Much more on this is covered below. For now, observe what you can do with it:
With magics, you can execute terminal commands straight from a code block!
Preface your terminal command with a ! (or use a % magic such as %cd) so the interpreter knows the text is not Python.
!ls
sample_data/
Warning: use cd or %cd to change directories; !cd will not work as expected, because each ! command runs in its own temporary shell.
This means a change-directory command issued with ! won't persist.
!cd sample_data
ls
sample_data/
Unless you use %:
%cd sample_data/
/content/sample_data
ls
anscombe.json*  mnist_test.csv  california_housing_test.csv  mnist_train_small.csv  california_housing_train.csv  README.md*
Python variables can take output from a terminal command.
pythonVariable = !cat README.md
pythonVariable[0]
Terminal commands can take Python variables using {python variable}.
!echo {pythonVariable[0]}
cd ../
/content
ls
sample_data/
The output of a terminal command can even be stored in a Python variable!
cow = !ls cow
When you change directories, use %cd ./filepath/
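The reason !cd does not persist is that each ! command runs in its own short-lived child shell, while %cd changes the directory of the notebook process itself. The following is a minimal plain-Python sketch of that idea, assuming a POSIX shell is available (as on Colab). It uses subprocess and os.chdir as stand-ins for ! and %cd; it is not how Colab implements them, and the temporary directory and file are made up for the demo.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()  # hypothetical directory used only for this demo

# Like `!cd target_dir`: the change happens inside a child shell and is lost when it exits.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
after_shell_cd = os.getcwd()

# Like `%cd target_dir`: the current process itself changes directory.
os.chdir(target_dir)
after_chdir = os.getcwd()

# Like `lines = !ls`: capture a command's output as a list of lines.
with open(os.path.join(target_dir, "README.md"), "w") as handle:
    handle.write("demo file\n")
result = subprocess.run("ls", shell=True, capture_output=True, text=True, check=True)
lines = result.stdout.splitlines()

os.chdir(start_dir)  # restore the original working directory
print(after_shell_cd == start_dir)
print(lines)
```

The first print shows that the working directory was unchanged after the shell-level cd, and the second shows the captured listing as an ordinary Python list.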
More Tricks
Other advanced code tricks include the following:
Hosting notebooks online using GitHub and myBinder.
Notebooks can also be collaboratively edited by sharing a link on Google Drive.
Colabs can connect to and run on your local machine.
Markup
In computer text processing, a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text, meaning when the document is processed for display, the markup language is not shown, and is only used to format the text. - Wikipedia
Markdown Guide
A) Markdown is the name of a lightweight markup language used to turn plain text into rich text.
B) Text cells (not code-cells) in Google Colab will automatically understand Markdown and display it appropriately.
C) Within Colab, many HTML elements can readily be rendered within Markdown cells, like the enriched text in this sentence.
This is not a given in other Markdown viewers, and it can be prevented by wrapping the HTML in backticks, as in `<u>with backticks</u>`.
Badges
Badges are (typically) action-enabled icons used to draw the reader's attention. These are often displayed using HTML or Markdown.
Pick a template and create your own badge from shields.io to get started!
More on Markdown:
Flags: Magics and Comments
A. 'Flags' are a special form of shorthand annotation that changes how code blocks are executed.
B. These annotations augment the interpreter's handling of a cell or line.
C. Flags are placed on the first line or on a per-line basis, depending on intent.
D. There are two types of flags: comments and magics.
Magics are often identified by two %'s at the top of the cell, followed by the intended magic effect.
Comments use a single # and are less favored since the # symbol is already overloaded.
In everyday writing, a # prefaces a numeral, in Python it starts a comment, and Markdown uses #'s to denote a header element.
Common Uses:
A) Create section titles from within a code block using #@title <TITLENAME>.
B) Suppress cell output using %%capture.
C) Execute terminal commands in a cell by prefacing them with the ! prefix.
D) Comment-ify a line in your code using the # prefix.
E) Render the cell as %%html or %%javascript, or a single line with #@markdown.
F) Create input forms by placing the #@param {type:"DATA-TYPE"} annotation at the end of a variable declaration.
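%%capture only works inside a notebook cell. For a sense of what "suppressing output" amounts to, here is a rough plain-Python analogue built on the standard library's contextlib.redirect_stdout. It is an illustration of the idea, not how the magic is implemented, and noisy_step is a made-up function.

```python
import contextlib
import io


def noisy_step():
    """Hypothetical function that prints a progress message and returns a value."""
    print("loading data...")
    return 42


# Anything printed inside the `with` block goes into the buffer instead of the screen.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    value = noisy_step()

captured_text = buffer.getvalue()
print(value)
```

The return value is still available afterwards, and the suppressed text can be inspected through captured_text if it is needed later.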
The Python Environment
Click on the following link for a quick overview of notable features.
These links are the official documentation and tutorial.
The w3schools website provides great introductory tutorials with examples.
This Python Wiki Beginners Guide provides a ton of helpful guides for programmers and nonprogrammers.
Now that we understand a bit more about Colab, we can address the following questions.
What is Python?
From the Docs
(emphasis my own)
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.
What makes Python high-level?
Python is high-level because you do not write assembly or raw ones and zeroes; details such as memory management are handled automatically.
What makes Python Object-Oriented?
Basically, everything in Python is an object?! We will get back to this later. But for now, here's a peek.
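As a small sketch of what "everything is an object" means in practice, the snippet below checks a handful of values, including a function, a class, and None. The greet function and the sample values are made up for illustration.

```python
def greet(name):
    """Return a short greeting."""
    return f"Hello, {name}!"


examples = [1, 13.5, "text", [1, 2], {"key": "value"}, greet, int, None]

# Every value, including functions, classes, and None, is an instance of `object`.
all_objects = all(isinstance(item, object) for item in examples)

# Because they are objects, even plain numbers and functions carry attributes and methods.
bits_needed = (5).bit_length()  # 5 is 0b101 in binary
function_name = greet.__name__
doc_text = greet.__doc__
type_names = [type(item).__name__ for item in examples]

print(all_objects, bits_needed, function_name)
print(type_names)
```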
More information on JSON:
What makes Python interpreted?
Machines run on machine code, so Python needs some way to be translated into instructions the machine can carry out.
When you execute a line of Python code, the interpreter translates (compiles) it into bytecode and runs that bytecode right away, at run time.
While every language must be translated at some point, this translation during execution, rather than in a separate build step beforehand, is why Python is called an interpreted as opposed to a compiled language.
'Installing python' is really just the process of installing an interpreter.
Colab comes with a built-in interpreter that runs every time a cell runs.
Use this guide to learn more about local installation.
Python files can be imported for use in other scripts or interpreted directly using a Python terminal command.
python ./path/to/file/nameOfFile.py
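The translate-then-run process described in this section can be observed from within Python itself. This is a minimal sketch, assuming the standard CPython interpreter (the one Colab uses), where the translation target is bytecode for Python's virtual machine rather than native machine code. The source string and the "<example>" file name are made up.

```python
source = "answer = 6 * 7"

# Step 1: translate the source text into a code object that holds bytecode.
code_object = compile(source, "<example>", "exec")

# Step 2: the interpreter executes that bytecode; names land in the given namespace.
namespace = {}
exec(code_object, namespace)

print(type(code_object.co_code))
print(namespace["answer"])
```

For a human-readable listing of the bytecode instructions, the standard library's dis module can be applied to the same code object with dis.dis(code_object); the exact listing differs between Python versions.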
What is the difference between Python 2 and Python 3?
The difference should not matter!
It used to, but Python 2 is now deprecated. Everyone should be using Python 3.
If your computer comes with Python built-in, chances are it came with Python 2. Finagling with two versions of Python can be a pain since they use different notations.
With Colab, this is simply not a problem because every session starts in a brand-new virtualized environment.
What are modules?
A module is a Python object with arbitrarily named attributes that you can bind and reference. Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.... You can use any Python source file as a module by executing an import statement in some other Python source file. - TutorialsPoint
'Package' is a term often used to describe a suite of modules.
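To make the idea of "a module is a file of Python code" concrete, here is a small sketch that writes a file and then imports it. The module name greetings, its contents, and the temporary folder are all made up for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny module to a temporary folder; 'greetings' is a made-up module name.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "greetings.py").write_text(
    "DEFAULT_NAME = 'world'\n"
    "\n"
    "def hello(name=DEFAULT_NAME):\n"
    "    return f'Hello, {name}!'\n"
)

# Make the folder importable, then import the file like any other module.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
greetings = importlib.import_module("greetings")

# The module's variables and functions are reached through its namespace.
print(greetings.DEFAULT_NAME)
print(greetings.hello("Colab"))
```

In everyday code the last import step would simply be written as import greetings; importlib.import_module is used here only so the whole example fits in one runnable block.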
What are PIP and PyPI?
PIP is a de facto standard package-management system used to install and manage software packages written in Python. Many packages can be found in the default source for packages and their dependencies: Python Package Index (PyPI). Most distributions of Python come with PIP preinstalled. Python 2.7.9 and later (on the Python 2 series), and Python 3.4 and later include PIP (PIP3 for Python 3) by default. - Wikipedia
If you find Python code you like on GitHub, see if it can be found on PyPI.
If so, type pip install package into the terminal to install the module.
Once installed, you can now 'import' the package in your Python code.
For more information on PIP, check out this cool guide
To import a library that is not in Colaboratory by default, you can use !pip install or !apt-get install. - Snippets: Importing Libraries
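Before running an install command, it can be handy to check whether a package is already present. The sketch below uses only the standard library (Python 3.8 or later); is_installed and installed_version are helper names invented for this example, and the "surely-not-a-real" names are deliberately nonexistent.

```python
import importlib.metadata
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module `module_name` can be imported here."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return importlib.metadata.version(distribution_name)
    except importlib.metadata.PackageNotFoundError:
        return None


print(is_installed("json"))  # json ships with Python itself
print(is_installed("surely_not_a_real_module_name"))
print(installed_version("surely-not-a-real-package-name"))
```

Note that the import name and the name used with pip are not always the same, which is why the two helpers take separate arguments.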
Pandas
Colab comes with pandas pre-installed, but it can also be installed using 'pip install pandas'.
!pip install pandas
Congratulations! You've installed Pandas.
# Now that pandas has been installed on the virtual environment, import it as a module into your code's memory!
# This looks a bit redundant, but in this instance we are assigning the pandas module to the variable 'pd'.
import pandas as pd
Pandas provides tools for data analysis. As an example, let's import some JSON data!
# To use the pandas module, we refer to it by its namespace.
# In this example, we use the pandas 'read_json' function to prepare our JSON for data play.
# Recent pandas versions expect a file-like object, so the JSON text is wrapped in StringIO.
from io import StringIO
pd.read_json(StringIO('{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'), orient='index')
You can do awesome things with data when it is being interpreted as a 'dataframe'. Take a look!
from io import StringIO
newlyCreatedDataframeVariable = pd.read_json(StringIO('{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'), orient='index')
# Show the first rows (head() returns up to five by default)
newlyCreatedDataframeVariable.head()
# Make a copy of the dataset
variable2 = newlyCreatedDataframeVariable.copy()
# Show the last two rows
variable2.tail(2)
variable2['col 1']
# Passing a file path to to_csv would save the data as a CSV wherever the virtual environment is mounted.
# This may be the temporary mount-point, or Google Drive / local hard drive.
# Without a path, to_csv returns the CSV text instead of writing a file.
#variable2.to_csv(index=False)
Pandas works with a bunch of great utilities like Dexplot and Geopandas for enhanced visualizations.
A more thorough introduction to pandas on colabs can be found here.
Learning Objectives:
Gain an introduction to the DataFrame and Series data structures of the pandas library
Access and manipulate data within a DataFrame and Series
Import CSV data into a pandas DataFrame
Reindex a DataFrame to shuffle data
Be sure to take a look at its online library, provided to help you along the way!
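The learning objectives listed above can be previewed in a few lines. This is a minimal sketch with made-up CSV text standing in for a real file; the column names and values are invented, and the fixed seed is only there to make the shuffle repeatable.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Made-up CSV text standing in for a real file.
csv_text = "name,score\nalpha,10\nbeta,20\ngamma,30\n"

# Import CSV data into a DataFrame (read_csv also accepts a file path or URL).
frame = pd.read_csv(StringIO(csv_text))

# Each column of a DataFrame is a Series.
scores = frame["score"]

# Manipulate the data: derive a new column from an existing one.
frame["double_score"] = scores * 2

# Shuffle the rows by reindexing with a random permutation of the index.
rng = np.random.default_rng(seed=0)
shuffled = frame.reindex(rng.permutation(frame.index.to_numpy()))

print(frame)
print(shuffled)
```

Reindexing keeps each row's original index label, so a shuffled row can still be looked up by the label it had before the shuffle.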
External Data
The simplest way to access your data is by mounting Google Drive to your virtual environment.
# Run this.
# Click the link that shows itself.
# Give permission.
# Copy the link and paste it back here.
from google.colab import drive
drive.mount('/content/drive')
cd ./drive/'My Drive'/colabs/DATA
ls
You can store a user's input as a value, like so:
left_on = input("Left on: " )
A neat trick to get form values can be done like this:
#@title Example form fields
#@markdown Forms support many types of fields.
filename = 'concrete.csv' #@param
displayColumn = 'Cement' #@param {type: "string"}
multiplyer2 = 100 #@param {type: "slider", min: 100, max: 200}
multiplyer1 = 102 #@param {type: "number"}
variable5 = '2010-11-05' #@param {type: "date"}
variable6 = "monday" #@param ['monday', 'tuesday', 'wednesday', 'thursday']
displayColumn2 = "Strength" #@param ["Strength", "bananas", "oranges"] {allow-input: true}
#@markdown ---
Just be sure to re-run the cell block to update the variable values.
concreteDataframe = pd.read_csv(filename)
concreteDataframe.head()
concreteDataframe['NewAttribute'] = (
    (concreteDataframe[displayColumn].head() * multiplyer2)
    - (concreteDataframe[displayColumn].head() * multiplyer1)
)
concreteDataframe.head()
Putting it Together
dataguide is a package I am working on to help work with data. It provides tools and tutorials for data manipulation.
With this package, you can download ACS (American Community Survey) data with relative ease.
! pip install dataguide geopandas
from dataguide.acsDownload import retrieve_acs_data
# Define our download parameters.
# More information on these parameters can be found in the tutorials!
tract = '*'
county = '510'
state = '24'
tableId = 'B19001'
year = '17'
saveAcs = False
retrieve_acs_data(state, county, tract, tableId, year, saveAcs).head(2)
# Get the second dataset.
# Our example dataset contains polygon geometry information.
# We want to merge this over to our principal dataset. We will grab it by matching on either CSA or Tract.
# The URLs listed below are public.
print('Crosswalk Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv')
print('Boundaries Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv')
inFile = input("\n Please enter the location of your file : \n" )
crosswalk = pd.read_csv( inFile )
crosswalk.head()
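# Assumption: mergeDatasets is defined in dataguide.mergeData, as the call below suggests.
from dataguide.mergeData import mergeDatasets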
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSHFrRSHva1f82ZQ7Uxwf3A1phqljj1oa2duGlZDM1vLtrm1GI5yHmpVX2ilTfMHQ/pub?gid=601362340&single=true&output=csv'
left_col = 'Census Tract'
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'
right_col='TRACT2010'
merge_how = 'outer'
interactive = True
use_crosswalk = True
merged_df = mergeDatasets( left_ds=left_ds, left_col=left_col, right_ds=right_ds, right_col=right_col, merge_how='left', interactive =True, use_crosswalk=use_crosswalk )
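For readers who want to see what a left merge on a tract column does without the package, here is a small sketch using plain pandas. It is a stand-in for illustration only, not the dataguide implementation; the two tiny tables and their values are made up, with column names borrowed from the cell above.

```python
import pandas as pd

# Made-up stand-ins for the two tables; only the column names follow the example above.
banks = pd.DataFrame({
    "Bank Name": ["Bank A", "Bank B", "Bank C"],
    "Census Tract": [101, 102, 999],
})
crosswalk_table = pd.DataFrame({
    "TRACT2010": [101, 102, 103],
    "CSA2010": ["Community X", "Community Y", "Community Z"],
})

# A left merge keeps every bank, even when its tract has no match in the crosswalk.
merged = banks.merge(
    crosswalk_table, left_on="Census Tract", right_on="TRACT2010", how="left"
)

print(merged[["Bank Name", "Census Tract", "CSA2010"]])
```

Rows without a match (tract 999 here) are kept with missing values in the crosswalk columns, which is the main difference from an inner merge.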
dir(mergeDatasets)
mergeDatasets
dir(retrieve_acs_data)