Version 1.1b
This problem is a data cleaning and analysis task that exercises basic pandas, NumPy, and the graph ranking and analysis content of Notebook 11. It consists of five (5) exercises, numbered 0 through 4, worth a total of 11 points.
All exercises are independent, so if you get stuck on one, try moving on to the next one. However, in such cases do look for notes labeled, "In case Exercise XXX isn't working", as you may need to run some code cells that load pre-computed results that will allow you to continue with any subsequent exercises.
Pro-tips.
- If you need a fresh, original copy of this notebook, use Actions $\rightarrow$ Reset Assignment. (Resetting will wipe out any answers you've written so far, so be sure to stash those somewhere safe if you intend to keep or reuse them!)
- If there is too much output (e.g., from a print statement) that causes the notebook to load slowly or not at all, use Actions $\rightarrow$ Clear Notebook Output to get a clean copy. The clean copy will retain your code but remove any generated output. However, it will also rename the notebook to clean.xxx.ipynb. Since the autograder expects a notebook file with the original name, you'll need to rename the clean notebook accordingly.

Good luck!
One major factor in the spread of infectious diseases like COVID-19 is the connectivity of our transportation networks. Therefore, let's ask the following question in this problem: to what extent does the connectivity of the airport network help explain in which regions we have seen the most confirmed cases of COVID-19?
We'll focus on the United States network (recall Notebook 11) and analyze data at the level of US states (e.g., Washington state, California, New York state). Our analysis will have three main steps.
Before starting, run the code cell below to load some useful functions and packages.
import sys
print(sys.version)
# Needed for loading data:
import pandas as pd
print(f"Pandas version: {pd.__version__}")
# Some problem-specific helper functions:
import problem_utils
from problem_utils import get_path, assert_tibbles_are_equivalent
# For visualization:
from matplotlib.pyplot import figure, plot, semilogy, grid, legend
%matplotlib inline
Researchers at Johns Hopkins University have been tallying the number of confirmed cases of COVID-19 over time. Let's start by assembling the raw data for analysis.
Provenance of these data. JHU made these data available in this repo on GitHub, but for this problem, we'll use a pre-downloaded copy.
Location of the data. The data are stored in files, one for each day since January 22, 2020. We can use pandas's read_csv()
to load them into a DataFrame
object. For example, here is some code to do that for January 22, March 11, and March 22. Take a moment to read this code and observe the output:
print("Location of data files:", get_path('covid19/'))
print("Location of Jan 22 data:", get_path('covid19/01-22-2020.csv'))
print("Loading...")
df0 = pd.read_csv(get_path('covid19/01-22-2020.csv'))
print("Done loading. The first 5 rows:")
df0.head(5)
df1 = pd.read_csv(get_path('covid19/03-11-2020.csv'))
df1.head(5)
df2 = pd.read_csv(get_path('covid19/03-22-2020.csv'))
df2.head(5)
Columns. Observe that the column conventions are changing over time, which will make working with this data quite messy if we don't deal with it.
In this problem, we will only be interested in the following four columns:
- "Province/State" or "Province_State". If your code encounters the latter ("Province_State"), rename it to the former ("Province/State").
- "Country/Region" or "Country_Region". Again, rename instances of the latter to the former.
- "Last Update" or "Last_Update". Again, rename the latter to the former.
- "Confirmed". This column is named consistently for all the example data.

Missing values. Observe that there may be missing values, which read_csv() converts by default to "not-a-number" (NaN) values. Recall that these are special floating-point values. As a by-product of using NaN values in columns that otherwise contain integers, those integers are also converted to floating-point.
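To see this dtype quirk concretely, here is a tiny sketch using made-up values (not the JHU data):
import pandas as pd
s = pd.Series([1, 2, None])            # one missing value...
print(s.dtype)                         # ...forces float64, since NaN is a floating-point value
print(s.fillna(0).astype(int).dtype)   # int64, after replacing NaN and converting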
Timestamps. Observe that each dataframe has a column named "Last Update"
, which contains date and time values stored as strings. Moreover, these values appear to use different formats. Later, we'll want to standardize them, and for that purpose, we'll use pandas's to_datetime()
to convert these into Python datetime
objects. That makes them easier to compare (in code) and do simple arithmetic on them (e.g., calculate the number of days in-between). The following code cells demonstrate these features.
print(type(df1['Last Update'].loc[0])) # Confirm that these values are strings
# Example: Convert a column to use `datetime` values:
df0['Timestamp'] = pd.to_datetime(df0['Last Update'])
df1['Timestamp'] = pd.to_datetime(df1['Last Update'])
df1.head(5)
# Example: Calculate the difference, in days, between two timestamps
timestamp_0 = df1['Timestamp'].iloc[0]
timestamp_1 = df1['Timestamp'].iloc[1]
delta_t = timestamp_1 - timestamp_0
print(f"* {timestamp_0} ==> type: {type(timestamp_0)}")
print(f"* {timestamp_1} ==> type: {type(timestamp_1)}")
print(f"* Difference: ({timestamp_1}) - ({timestamp_0}) == {delta_t}\n ==> type: {type(delta_t)})")
You won't need to do date-time arithmetic directly, but standardizing in this way will facilitate things like sorting the data by timestamp.
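For example, now that df1 has a proper Timestamp column (assigned above), sorting by time is a one-liner:
df1.sort_values(by="Timestamp").head(3)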
Getting a list of available data files. Lastly, here is a function to get a list of available daily data files by filename. You don't need to read this code, but do observe the results of the demo call to see how it is useful.
def get_covid19_daily_filenames(root=get_path("covid19/")):
"""
Returns a list of file paths corresponding to JHU's
daily tallies of COVID-19 cases.
"""
from os import listdir
from os.path import isfile
from re import match
def covid19_filepath(filebase, root):
return f"{root}{filebase}"
def is_covid19_daily_file(filebase, root):
file_path = covid19_filepath(filebase, root)
        return isfile(file_path) and match(r'^\d\d-\d\d-2020\.csv$', filebase)
filenames = []
for b in listdir(root):
if is_covid19_daily_file(b, root):
filenames.append(covid19_filepath(b, root))
return sorted(filenames)
# Demo:
print(repr(get_covid19_daily_filenames()))
Given filenames, a list of filenames such as might be generated by get_covid19_daily_filenames() above, complete the function load_covid19_daily_data(filenames) below so that it reads all of this data and combines it into a single tibble (as a pandas DataFrame) containing only the following columns:
- "Province/State": Same contents as the original data frames.
- "Country/Region": Same contents as the original data frames.
- "Confirmed": Same contents as the original data frames.
- "Timestamp": The values from the "Last Update" columns, but converted to datetime objects per the demonstration discussed previously.

In addition, your code should do the following:
- Recall that the columns "Province/State", "Country/Region", and "Last Update" are written differently in some of the files, so be sure to handle those cases.
- In the "Confirmed" column, any missing values should be replaced by zero (0). Also, this column should be converted to have an integer type. (Hint: Consider Series.fillna() and Series.astype().)
- Remove any duplicate rows (see Hint 2 below).
- Your function should depend only on the columns "Province/State", "Country/Region", "Confirmed", and "Last Update". It should also not depend on any particular ordering of the columns.

Hint 0. Per the preceding examples, use pd.read_csv() to read the contents of each file into a data frame. However, the filenames list will already include a valid path, so you do not need to use get_path().

Hint 1. Recall that you can use pd.concat() to concatenate data frames; one tweak here is to use its ignore_index=True parameter to get a clean tibble-like index.

Hint 2. To easily drop duplicate rows, look for a relevant pandas built-in function.
def load_covid19_daily_data(filenames):
### BEGIN SOLUTION
from pandas import read_csv, concat, to_datetime
df_list = []
for filename in filenames:
df = read_csv(filename).rename(columns={"Province_State": "Province/State",
"Country_Region": "Country/Region",
"Last_Update": "Last Update"})
df = df[["Province/State", "Country/Region", "Confirmed", "Last Update"]]
df["Last Update"] = to_datetime(df["Last Update"])
df['Confirmed'] = df['Confirmed'].fillna(0).astype(int)
df_list.append(df)
df_combined = concat(df_list)
df_combined.rename(columns={"Last Update": "Timestamp"}, inplace=True)
df_combined.drop_duplicates(inplace=True)
return df_combined.reset_index(drop=True)
### END SOLUTION
# Demo of your function:
df = load_covid19_daily_data(get_covid19_daily_filenames())
print(f"There are {len(df)} rows in your data frame.")
print("The first five are:")
display(df.head(5))
print("A random sample of five additional rows:")
df.sample(5).sort_index()
# Test cell: `ex0__load_covid19_daily_data` (3 points)
### BEGIN HIDDEN TESTS
def ex0_soln(filenames):
from pandas import read_csv, concat, to_datetime
df_combined = None
for filename in filenames:
df = read_csv(filename).rename(columns={"Province_State": "Province/State",
"Country_Region": "Country/Region",
"Last_Update": "Last Update"})
df = df[["Province/State", "Country/Region", "Confirmed", "Last Update"]]
df["Timestamp"] = to_datetime(df["Last Update"])
del df["Last Update"]
df['Confirmed'] = df['Confirmed'].fillna(0).astype(int)
if df_combined is None:
df_combined = df
else:
df_combined = concat([df_combined, df], ignore_index=True)
df_combined.drop_duplicates(inplace=True)
return df_combined.reset_index(drop=True)
def ex0_gen_soln(soln_file=get_path('covid19/ex0_soln.csv'), force=False):
from os.path import isfile
if isfile(soln_file) and not force:
print(f"Solution file, '{soln_file}', already exists. NOT regenerating...")
else:
print(f"Generating solution file, '{soln_file}'...")
df = ex0_soln(get_covid19_daily_filenames())
df.to_csv(soln_file, index=False)
print("==> Done!")
def ex0_gen_locales(force=False):
from os.path import isfile
from json import dump
from collections import defaultdict
from pandas import isna
outfilename = get_path("locales.json")
if isfile(outfilename) and not force:
print(f"Locales file, '{outfilename}', exists; skipping generation...")
return
print(f"Generating locales file, '{outfilename}'...")
df = load_covid19_daily_data(get_covid19_daily_filenames())
locales = defaultdict(set)
for k, row in df.iterrows():
country = row["Country/Region"]
province = row["Province/State"]
if not (isna(country) or isna(province)):
locales[country] |= {province}
for country in locales:
locales[country] = list(locales[country])
with open(get_path("locales.json"), "wt") as fp:
dump(locales, fp, indent=2, sort_keys=True)
print("==> Done generating locales file.")
ex0_gen_locales()
ex0_gen_soln(force=False)
### END HIDDEN TESTS
def ex0_random_value():
from random import random, randint, choice
from numpy import nan
from problem_utils import ex0_random_date, ex0_random_string
    options = [randint(-100, 100),                  # int
               ex0_random_string(randint(1, 10)),   # string
               ex0_random_date(),                   # date
               '',                                  # implicit NaN
               nan]                                 # explicit NaN
return choice(options)
def ex0_get_locales(filename=get_path("locales.json")):
from json import load
with open(filename, "rt") as fp:
locales = load(fp)
return locales
def ex0_gen_row(locales, num_dummies=0):
from datetime import datetime
from random import choice, random, randint
from numpy import nan
from problem_utils import ex0_random_date
country = choice(list(locales.keys()))
province = nan if random() <= 0.1 else choice(locales[country])
confirmed = 0 if random() <= 0.1 else randint(1, 100000)
last_updated = ex0_random_date()
if num_dummies:
dummy_vals = tuple([ex0_random_value() for _ in range(num_dummies)])
else:
dummy_vals = ()
return (country, province, confirmed, last_updated, *dummy_vals)
def ex0_gen_df():
from random import randint, random
from pandas import DataFrame
from problem_utils import ex0_random_string
locales = ex0_get_locales()
# Generate random columns, which the student should ignore
num_dummy_cols = randint(1, 4)
dummy_cols = []
while len(dummy_cols) != num_dummy_cols:
dummy_cols = list({ex0_random_string(5) for _ in range(num_dummy_cols)})
# Generate a bunch of random rows
num_trials = randint(10, 50)
rows = [ex0_gen_row(locales, num_dummy_cols) for _ in range(num_trials)]
# Remove any initial duplicates
rows = sorted(rows, key=lambda x: repr(x))
rows_soln = [rows[0]]
for r in rows[1:]:
if repr(r) != repr(rows_soln[-1]):
rows_soln.append(r)
# Construct the solution tibble
cols_in = ["Country/Region" if random() < 0.75 else "Country_Region",
"Province/State" if random() < 0.75 else "Province_State",
"Confirmed",
"Last Update" if random() < 0.75 else "Last_Update"]
cols_out = ["Country/Region", "Province/State", "Confirmed", "Last Update"]
df_soln = DataFrame(rows_soln, columns=cols_out + dummy_cols)[cols_out] \
.rename(columns={"Last Update": "Timestamp"})
# Generate a corresponding input tibble
rows_in = []
for r in rows_soln:
s = list(r)
if s[2] == 0:
s[2] = '' # NaN counts
r_in = tuple(s)
rows_in.append(r_in)
if random() <= 0.15: # Random duplicates
for _ in range(randint(1, 4)):
rows_in.append(r_in)
df_in = DataFrame(rows_in, columns=cols_in + dummy_cols)
return df_in, df_soln
def ex0_split_df(df, max_splits=5):
from random import randint
from numpy import arange, sort, append
from numpy.random import shuffle, choice
# Shuffle the rows
df = df.sample(frac=1).reset_index(drop=True)
# Split the rows
df_split = []
num_splits = min(randint(0, max_splits), len(df))
if num_splits > 0:
split_inds = sort(choice(arange(len(df)), size=num_splits, replace=False))
if split_inds[0] > 0:
split_inds = append(0, split_inds)
if split_inds[-1] < len(df):
split_inds = append(split_inds, len(df))
for i, j in zip(split_inds[:-1], split_inds[1:]):
df_ij = df.iloc[i:j].reset_index(drop=True)
df_split.append(df_ij)
return df_split if num_splits else [df]
def ex0_certify_metadata(df):
df_cols = set(df.columns)
true_cols = {"Province/State", "Country/Region", "Confirmed", "Timestamp"}
too_many_cols = df_cols - true_cols
assert not too_many_cols, f"*** You have too many columns, including {too_many_cols}. ***"
missing_cols = true_cols - df_cols
assert not missing_cols, f"*** You are missing some columns, namely, {missing_cols}. ***"
from pandas.api.types import is_integer_dtype
assert is_integer_dtype(df["Confirmed"]), \
        f'*** `"Confirmed"` column has a non-integer type ({df["Confirmed"].dtype}). ***'
from numpy import datetime64
from pandas import Index
assert df.select_dtypes(include=datetime64).columns == Index(["Timestamp"]), \
'*** Your data frame must have a "Timestamp" column containing `datetime` values.'
def ex0_check():
from problem_utils import canonicalize_tibble, tibbles_left_matches_right
from os import remove
from os.path import isfile
print("Generating synthetic input files...")
df_in, df_soln = ex0_gen_df()
df_split = ex0_split_df(df_in)
filenames = []
for k, df_k in enumerate(df_split):
filenames.append(f'./ex0_df{k}.csv')
print(f"- {filenames[-1]}")
df_k.to_csv(filenames[-1], index=False)
try:
print("Testing your solution...")
df = load_covid19_daily_data(filenames)
ex0_certify_metadata(df)
assert tibbles_left_matches_right(df, df_soln, verbose=True), \
"*** Your computed solution does not match ours. ***"
except:
print("\n=== Expected solution ===")
display(canonicalize_tibble(df_soln, remove_index=True))
print("\n=== Your computed solution ===")
display(canonicalize_tibble(df, remove_index=True))
print(f"\nNOTE: To see the original input files, inspect {filenames}.")
raise
else:
print("Cleaning up input files...")
for f in filenames:
if isfile(f):
print(f"- {f}")
remove(f)
for trial in range(5):
print(f"========== Trial #{trial} ==========")
ex0_check()
print("\n(Passed.)")
Whether you solved Exercise 0 or not, we have prepared a file of pre-cleaned and combined COVID-19 data. Below, the variable df_covid19
holds these data. You will need it in the subsequent exercises, so be sure to run this cell and do not modify the variable!
df_covid19 = pd.read_csv(get_path('covid19/ex0_soln.csv'), parse_dates=["Timestamp"])
df_covid19 = df_covid19.groupby(["Province/State", "Country/Region", "Timestamp"], as_index=False).sum()
# ^^^ Above `.groupby()` needed because of a change in US reporting on March 22, 2020
df_covid19.sample(5)
The dataset includes confirmed cases in the US. For instance, run this cell to see a sample of the US rows.
is_us = (df_covid19["Country/Region"] == "US")
df_covid19[is_us].sample(5)
You should see some cases where the "Province/State"
field is exactly the name of a US state, like "Georgia"
or "California"
, and other cases where you might see a more or less detailed location (e.g., a city name and a state, like "Middlesex County, MA"
).
For subsequent analysis, we will only be interested in the rows containing state names. For instance, here are all the rows associated with "Georgia"
.
is_georgia = (df_covid19["Province/State"] == "Georgia")
df_covid19[is_us & is_georgia]
Given these data, we can order by timestamp and plot confirmed cases over time.
df_covid19[is_us & is_georgia] \
.sort_values(by="Timestamp") \
.plot(x="Timestamp", y="Confirmed", figsize=(16, 9), style='*--')
grid()
Complete the function, get_us_states(df), below, where:
- df is a data frame structured like the combined COVID-19 data frame (df_covid19), having the columns "Province/State", "Country/Region", "Confirmed", and "Timestamp";
- your function should return only those rows of df that are from the United States and where the "Province/State" field is exactly the name of any one of the US states.

Regarding the second requirement, the returned object should include a row where the "Province/State" field is "Georgia", but it should not include a row where this field is, say, "Atlanta, GA". (Put differently, we will assume the state-level accounts already include city-level counts.)

The tibble returned by your function should only have these three columns:
- "Confirmed": The number of confirmed cases, taken from the input df.
- "Timestamp": The timestamp taken from the input df.
- "ST": The two-letter abbreviation for the state's name.

Pay attention to item (3): your returned tibble should not have the state's full name, but rather, its two-letter postal code abbreviation (e.g., "GA" instead of "Georgia"). To help you out, here is a code cell that defines a data frame called STATE_NAMES that holds both a list of state names and their two-letter abbreviations.
Note: The test cell for this exercise reuses functions defined in the test cell for Exercise 0. So even if you skipped Exercise 0, please run its test cell before running the one below.
STATE_NAMES = pd.read_csv(get_path('us_states.csv'))
print(f"There are {len(STATE_NAMES)} US states. The first and last three, along with their two-letter postal code abbreviations, are as follows (in alphabetical order):")
display(STATE_NAMES.head(3))
print("...")
display(STATE_NAMES.tail(3))
def get_us_states(df):
### BEGIN SOLUTION
return get_us_states__3(df)
# Solution 0: `.merge()`
def get_us_states__0(df):
df_state_names = STATE_NAMES.rename(columns={"Name": "Province/State"})
df_states = df_state_names.merge(df)
del df_states["Province/State"]
del df_states["Country/Region"]
return df_states.rename(columns={"Abbrv": "ST"})
# Solution 1: `.isin() / .str.replace()`
def get_us_states__1(df):
is_us = (df["Country/Region"] == "US")
    is_state = df["Province/State"].isin(STATE_NAMES['Name'])  # mask on the full frame so the boolean indices align
df_state = df[is_us & is_state]
df_st = df_state.rename(columns={"Province/State": "ST"})
for _, row in STATE_NAMES.iterrows():
pattern = f'^{row["Name"]}$'
replacement = row["Abbrv"]
        df_st["ST"] = df_st["ST"].str.replace(pattern, replacement, regex=True)
del df_st["Country/Region"]
return df_st
# Solution 2: `.isin()` / `.map()`
def get_us_states__2(df):
is_us = df['Country/Region'] == 'US'
names = STATE_NAMES["Name"]
is_us_state = is_us & df['Province/State'].isin(names)
abbrvs = STATE_NAMES["Abbrv"]
name2abbrv = {name: st for name, st in zip(names, abbrvs)}
df_us = df[is_us_state].copy()
df_us['ST'] = df_us['Province/State'].map(name2abbrv)
del df_us["Province/State"]
del df_us["Country/Region"]
return df_us
# Solution 3: `.isin()` / `.join()`
def get_us_states__3(df):
US = df[df["Country/Region"] == 'US']
states = US[US["Province/State"].isin(STATE_NAMES.Name)]
states = states.join(STATE_NAMES.set_index('Name'), on='Province/State')
states['ST'] = states.Abbrv
return states[['Timestamp', 'Confirmed', 'ST']]
### END SOLUTION
# Test cell: `ex1__get_us_states` (2 points)
### BEGIN HIDDEN TESTS
def ex1_gen_soln(force=False):
def ex1_soln(df):
df_state_names = STATE_NAMES.rename(columns={"Name": "Province/State"})
df_states = df_state_names.merge(df)
del df_states["Province/State"]
del df_states["Country/Region"]
return df_states.rename(columns={"Abbrv": "ST"})
from os.path import isfile
soln_file = get_path('covid19/ex1_soln.csv')
if isfile(soln_file) and not force:
print(f"Solution file, '{soln_file}', already exists. NOT regenerating...")
else:
print(f"Generating solution file, '{soln_file}'...")
df = ex1_soln(df_covid19)
df.to_csv(soln_file, index=False)
print("==> Done!")
ex1_gen_soln(force=False)
### END HIDDEN TESTS
def ex1_gen_row(states):
from datetime import datetime
from random import random, randint, choice
from problem_utils import ex0_random_date, ex0_random_string
def rand_str(): return ex0_random_string(randint(1, 10))
confirmed = randint(1, 10000)
timestamp = ex0_random_date()
# Choose a province
locales = ex0_get_locales()
p = random()
if p < 0.5: # Non US country
country = choice(list(set(locales.keys()) - {"US"}))
province = choice(locales[country])
is_state = False
else:
country = "US"
if p < 0.75:
non_states = set(locales["US"]) - set(states["Name"])
province = choice(list(non_states))
is_state = False
else:
province = choice(states["Name"])
is_state = True
return timestamp, confirmed, country, province, is_state
def ex1_gen_df(max_rows, states):
from random import randint
from pandas import DataFrame, concat
st_lookup = states.set_index("Name")
num_rows = randint(1, max_rows)
df_list = []
cols_df = ["Timestamp", "Confirmed", "Country/Region", "Province/State"]
df_soln_list = []
cols_df_soln = ["Timestamp", "Confirmed", "ST"]
for _ in range(num_rows):
ts, conf, country, province, is_state = ex1_gen_row(states)
df0 = DataFrame([[ts, conf, country, province]], columns=cols_df)
df_list.append(df0)
if is_state:
st = st_lookup.loc[province]["Abbrv"]
df0_soln = DataFrame([[ts, conf, st]], columns=cols_df_soln)
df_soln_list.append(df0_soln)
assert len(df_list) > 0, "*** Problem with the test cell! ***"
df = concat(df_list, ignore_index=True).sample(frac=1).reset_index(drop=True)
if len(df_soln_list) == 0:
df_soln = DataFrame(columns=cols_df_soln)
else:
df_soln = concat(df_soln_list, ignore_index=True).sample(frac=1).reset_index(drop=True)
return df, df_soln
def ex1_check():
df, df_soln = ex1_gen_df(20, STATE_NAMES)
try:
df_your_soln = get_us_states(df)
assert_tibbles_are_equivalent(df_soln, df_your_soln)
except:
print("\n*** ERROR DETECTED ***")
print("Input data frame:")
display(df)
print("Expected solution:")
display(df_soln)
print("Your solution:")
display(df_your_soln)
raise
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex1_check()
print("\n(Passed.)")
Whether your Exercise 1 is working or not, please run the following code cell. It loads a pre-generated data frame containing just the state-level COVID-19 confirmed-cases data into a variable named df_covid19_us. You will need it in the subsequent exercises, so do not modify it!
df_covid19_us = pd.read_csv(get_path('covid19/ex1_soln.csv'), parse_dates=["Timestamp"])
df_covid19_us.sample(5).sort_values(by=["ST", "Timestamp"])
Let df
be a data frame like df_covid19_us
, which would be produced by a correctly functioning get_us_states()
(Exercise 1). Complete the function rank_states_by_cases(df)
so that it returns a Python list
of states (identified by their two-letter abbreviations) in decreasing order of the maximum number of confirmed cases observed in each state.
def rank_states_by_cases(df):
### BEGIN SOLUTION
return rank_states_by_cases__0(df)
# Method 0
def rank_states_by_cases__0(df):
return df.groupby("ST").max().sort_values(by="Confirmed", ascending=False).index.tolist()
# Method 1
def rank_states_by_cases__1(df):
max_values = []
for st in STATE_NAMES["Abbrv"]:
df_st = df[df["ST"] == st]
v = df_st["Confirmed"].max()
max_values.append((st, v))
return [st for st, v in sorted(max_values, key=lambda x: x[1], reverse=True)]
### END SOLUTION
your_covid19_rankings = rank_states_by_cases(df_covid19_us)
assert isinstance(your_covid19_rankings, list), "Did you return a Python `list` as instructed?"
print(f"Your computed ranking:\n==> {repr(your_covid19_rankings)}\n")
df_covid19_us.head()
# Test cell: `ex2__rank_states_by_cases` (1 point)
### BEGIN HIDDEN TESTS
def ex2_gen_soln(force=False):
def ex2_soln(df):
return df.groupby("ST").max().sort_values(by="Confirmed", ascending=False).index.tolist()
from os.path import isfile
soln_file = get_path('covid19/ex2_soln.txt')
if isfile(soln_file) and not force:
print(f"Solution file, '{soln_file}', already exists. NOT regenerating...")
else:
print(f"Generating solution file, '{soln_file}'...")
results = ex2_soln(df_covid19_us)
with open(soln_file, "wt") as fp:
for st in results:
fp.write(f"{st}\n")
print("==> Done!")
ex2_gen_soln(force=False)
### END HIDDEN TESTS
def ex2_gen_df(st):
from problem_utils import ex0_random_date
from random import randint
from pandas import DataFrame
num_rows = randint(1, 5)
confs = []
tss = []
max_conf = -1
for k in range(num_rows):
confs.append(randint(1, 1000))
if confs[-1] > max_conf: max_conf = confs[-1]
tss.append(ex0_random_date())
df_st = DataFrame({"ST": [st] * num_rows, "Confirmed": confs, "Timestamp": tss})
return df_st, max_conf
def ex2_check():
from random import randint, sample
from pandas import concat
num_states = randint(1, 5)
states = sample(list(STATE_NAMES["Abbrv"]), num_states)
vals = []
df_list = []
for st in states:
df_st, max_conf = ex2_gen_df(st)
df_list.append(df_st)
vals.append((st, max_conf))
df = concat(df_list, ignore_index=True).sort_values(by="Timestamp").reset_index(drop=True)
soln = [s for s, v in sorted(vals, key=lambda x: x[1], reverse=True)]
try:
your_soln = rank_states_by_cases(df)
assert len(soln) == len(your_soln), \
f"*** Your solution has {len(your_soln)} entries instead of {len(soln)} ***"
assert all([a == b for a, b in zip(soln, your_soln)]), \
f"*** Solutions do not match ***"
except:
print("\n*** ERROR CASE ***\n")
print("Input:")
display(df)
print("Expected solution:")
display(soln)
print("Your solution:")
display(your_soln)
raise
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex2_check()
print("\n(Passed.)")
In case you can't get a working solution to Exercise 2, we have prepared a ranked list of states by confirmed cases. The code cell below reads this list and stores it in the variable, covid19_rankings
. You will need it in the subsequent exercises, so do not modify it!
with open(get_path('covid19/ex2_soln.txt'), "rt") as fp:
covid19_rankings = [s.strip() for s in fp.readlines()]
print(repr(covid19_rankings))
Let's plot the TOP_K=15
states by number of confirmed cases. The y-axis uses a logarithmic scale in this plot.
To disable the logarithmic y-axis, add logy=False to any call to viz_by_state().
def viz_by_state(col, df, states, figsize=(16, 9), logy=False):
from matplotlib.pyplot import figure, plot, semilogy, legend, grid
figure(figsize=figsize)
plotter = plot if not logy else semilogy
for s in states:
df0 = df[df["ST"] == s].sort_values(by="Timestamp")
plotter(df0["Timestamp"], df0[col], "o:")
legend(states)
grid()
TOP_K = 15
# You can modify this cell if you want to play around with the visualization.
viz_by_state("Confirmed", df_covid19_us, covid19_rankings[:TOP_K], logy=True)
Observe that this data is irregularly sampled and noisy. For instance, the updates do not occur every day in every state, and there are spikes due to reporting errors. Therefore, it would be useful to "smooth out" the data before plotting it, to help discern the overall trends better. That is your next task.
We'll do a first cleaning step for you: filling-in (or imputing) missing daily values, so that we have at least one value per day. To see the issue more clearly, consider the data for the state of Georgia:
df_covid19_us[df_covid19_us["ST"] == "GA"].sort_values(by="Timestamp")
There are two observations on March 11 and no observations on March 13. Suppose we want one value per day for every state. Our approach will be to resample the values using pandas's built-in resampler, a standard cleaning method when dealing with irregularly sampled time-series data. There are many subtle options, so we will perform one method of resampling for you. The function below implements it, storing the results in a data frame called df_us_daily
. You do not need to understand this code right now, but do run it so you can see what it will do. It will print some example results for the state of Georgia data.
def resample_daily(df):
# This implementation is a bit weird, due to a known issue: https://github.com/pandas-dev/pandas/issues/28313
df_r = df.sort_values(by=["ST", "Timestamp"]) \
.set_index("Timestamp") \
.groupby("ST", group_keys=False) \
.resample("1D", closed="right") \
.ffill() \
.reset_index()
return df_r.sort_values(by=["ST", "Timestamp"]).reset_index(drop=True)
df_us_daily = resample_daily(df_covid19_us)
df_us_daily[df_us_daily["ST"] == "GA"]
Observe how there are now samples on every consecutive day beginning on March 11.
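In case you are curious how the resampling fills those gaps, here is a tiny sketch on made-up values (the real resample_daily() above additionally groups by state and passes closed="right" to work around the pandas issue noted in its comment):
import pandas as pd
s = pd.Series([17, 66], index=pd.to_datetime(["2020-03-11", "2020-03-14"]))
print(s.resample("1D").ffill())  # 3/12 and 3/13 receive forward-filled copies of 17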
Armed with regularly sampled data, you can now complete the next step, which is to smooth out the data using windowed daily averages, defined as follows.
Let $c_t$ denote the number of confirmed cases on day $t$, and let $d$ be a positive integer. Then the $d$-day windowed daily average on day $t$, denoted $\bar{c}_t$, is the mean number of confirmed cases in the $d$ days up to and including day $t$. Mathematically,
$$ \bar{c}_t = \frac{c_{t-(d-1)} + c_{t-(d-2)} + \cdots + c_{t-1} + c_t}{d}. $$

We'll refer to the values in the numerator as the window for day $t$.
For example, suppose $c = [0, 0, 1, 2, 2, 3, 3, 4, 6, 10]$, where the first and last values are $c_0=0$ and $c_9=10$. Now suppose $d=3$ days. Then the windowed daily average on day $t$ is the average of confirmed cases on days $t-2$, $t-1$, and $t$:
$$ \bar{c}_4 = \frac{c_2 + c_3 + c_4}{3} = \frac{1 + 2 + 2}{3} = \frac{5}{3} = 1.666\ldots. $$

In this example, there aren't 3 days' worth of observations for days 0 and 1. Let's treat these cases as undefined, meaning there is no average computable for those days. Therefore, the final result in this example would be
$$ \bar{c} = [\mbox{nan}, \mbox{nan}, 0.333\ldots, 1.0, 1.666\ldots, 2.333\ldots, 2.666\ldots, 3.333\ldots, 4.333\ldots, 6.666\ldots], $$

where $\mbox{nan}$ is a floating-point not-a-number value, which we will use as a stand-in for an undefined average.
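As a quick check of the arithmetic above, pandas's built-in rolling mean reproduces this result exactly (shown here for illustration only, not as the required solution):
import pandas as pd
c = pd.Series([0, 0, 1, 2, 2, 3, 3, 4, 6, 10])
print(c.rolling(3).mean().tolist())
# [nan, nan, 0.333..., 1.0, 1.666..., 2.333..., 2.666..., 3.333..., 4.333..., 6.666...]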
Suppose you are given a data frame df
like df_us_daily
, which resample_daily()
computed. That is, you may assume df
has three columns named "Timestamp"
, "ST"
, and "Confirmed"
. However, daily observations may appear in any order within df
. (That is, do not assume they are grouped by state or sorted by timestamp a priori.)
Please complete the function daily_windowed_avg(df, days)
so that it calculates the windowed daily average using windows of size days
. Your function should return a copy of df
with a new column named Avg
containing this average. For days with no defined average, your function should simply omit those days from the output.
Note. Although the example below shows data only for
"GA"
, the input df
may have more than one state's worth of data in it. Therefore, your function will need to handle that case.
For example, suppose the rows in df
with Georgia data are as follows:
Timestamp | ST | Confirmed |
---|---|---|
2020-03-12 | GA | 42 |
2020-03-17 | GA | 121 |
2020-03-11 | GA | 17 |
2020-03-15 | GA | 66 |
2020-03-18 | GA | 146 |
2020-03-16 | GA | 99 |
2020-03-13 | GA | 31 |
2020-03-14 | GA | 31 |
Observe that the rows are not necessarily in timestamp order, so you'll need to deal with that. Among these rows, the first date is March 11 and the last is March 18.
Now, suppose we use days=3
and call your function on the full dataset (with all states). If we then look at just the Georgia rows, we should see
Timestamp | ST | Confirmed | Avg |
---|---|---|---|
2020-03-13 | GA | 31 | 30.000000 |
2020-03-14 | GA | 31 | 34.6666... |
2020-03-15 | GA | 66 | 42.6666... |
2020-03-16 | GA | 99 | 65.3333... |
2020-03-17 | GA | 121 | 95.3333... |
2020-03-18 | GA | 146 | 122.0000... |
You can confirm that the value on the first day of this result, March 13, 2020, is 30, which is the average of March 11-13 (17, 42, and 31 cases, respectively). The value on the last day, March 18, is 122, the average of March 16-18 (99, 121, and 146 cases). March 11 and 12 do not appear because they do not have three days' worth of observations.
Note 0. There are many approaches to this problem. If you have good mastery of pandas, you should be able to quickly assimilate and apply its built-in
.rolling()
technique. Otherwise, it should also be straightforward to apply other techniques you already know.

Note 1. To pass the autograder, you'll need to ensure that your data frame has exactly the columns shown in the above example. (We use tibble-equivalency checks so column and row ordering does not matter.)
Note 2. The
.dtype
of columns"Timestamp"
,"ST"
, and"Confirmed"
should match those of the input; the new column"Avg"
contains floating-point values, and so should have a floating-point.dtype
.Note 3. Our tester already does approximate checking for floating-point values. Therefore, if the test code reports a mismatch, you are definitely miscalculating the averages by much more than the amount allowed by roundoff error, and you will have to keep debugging.
def daily_windowed_avg(df, days):
### BEGIN SOLUTION
return daily_windowed_avg__2(df, days)
# === Version 0: Use groupby/apply paradigm ===
def daily_window_one_df(df, days):
from numpy import nan
df_new = df.sort_values(by="Timestamp")
df_new["Sums"] = df_new["Confirmed"]
for k in range(1, days):
df_new["Sums"].iloc[k:] += df_new["Confirmed"].iloc[:-k].values
df_new["Sums"] /= days
df_new.rename(columns={"Sums": "Avg"}, inplace=True)
return df_new.iloc[days-1:]
def daily_windowed_avg__0(df, days):
return df.groupby("ST").apply(lambda x: daily_window_one_df(x, days)).reset_index(drop=True)
# === Version 1: Somewhat better style ===
def daily_windowed_avg__1(df, days):
df_avg = df.sort_values(by="Timestamp") \
.set_index("Timestamp") \
.groupby("ST") \
.rolling(days) \
.mean() \
.reset_index() \
.rename(columns={"Confirmed": "Avg"}) \
.dropna()
return df_avg.merge(df, on=["ST", "Timestamp"])
# === Version 2: Variation on a theme ===
def daily_windowed_avg__2(df, days):
df = df.copy()
df.sort_values(by=['ST', 'Timestamp'], inplace=True)
averages = df[['ST', 'Confirmed']].groupby('ST') \
.rolling(days) \
.mean()
df['Avg'] = averages['Confirmed'].values
    return df[df['Avg'].notnull()]
# === Version 3: Naive loop-based ===
def daily_windowed_avg__3(df, days):
    df = df.sort_values(by=["ST", "Timestamp"])
    states_list = df["ST"].unique().tolist()
    new_rows = []
    for st in states_list:
        df_st = df[df["ST"] == st]
        window = [0] * days  # circular buffer of the `days` most recent counts
        for k, row in enumerate(df_st.itertuples()):
            window[k % days] = row.Confirmed
            if k < days-1: continue  # not enough observations for a full window yet
            avg = sum(window) / days
            new_rows.append({"ST": row.ST, "Timestamp": row.Timestamp,
                             "Confirmed": row.Confirmed, "Avg": avg})
    # Build the result in one shot (`DataFrame.append` was removed in pandas 2.0)
    df_new = pd.DataFrame(new_rows)
    df_new["Confirmed"] = df_new["Confirmed"].astype(int)
    return df_new
### END SOLUTION
# Demo of your function:
print('=== Two states: "AK" and "GA" ===')
is_ak_ga_before = df_us_daily["ST"].isin(["AK", "GA"])
display(df_us_daily[is_ak_ga_before])
print('=== Your results (days=3) ===')
df_us_daily_avg = daily_windowed_avg(df_us_daily, 3)
is_ak_ga_after = df_us_daily_avg["ST"].isin(["AK", "GA"])
display(df_us_daily_avg[is_ak_ga_after])
# Test cell: `ex3__daily_windowed_avg` (3 points)
### BEGIN HIDDEN TESTS
def ex3_gen_soln(force=False):
def ex3_soln(df, days):
df_avg = df.sort_values(by="Timestamp") \
.set_index("Timestamp") \
.groupby("ST") \
.rolling(days) \
.mean() \
.reset_index() \
.rename(columns={"Confirmed": "Avg"}) \
.dropna()
return df_avg.merge(df, on=["ST", "Timestamp"])
from os.path import isfile
soln_file = get_path('covid19/ex3_soln.csv')
if isfile(soln_file) and not force:
print(f"Solution file, '{soln_file}', already exists. NOT regenerating...")
else:
print(f"Generating solution file, '{soln_file}'...")
df = ex3_soln(df_us_daily, 3)
df.to_csv(soln_file, index=False)
print("==> Done!")
ex3_gen_soln(force=False)
### END HIDDEN TESTS
def ex3_gen_state_df(st, days):
from random import randint, random
from pandas import DataFrame, concat
from problem_utils import ex0_random_date
def rand_day():
from datetime import datetime
date = ex0_random_date()
return datetime(date.year, date.month, date.day)
def inc_date(date, days=1):
from datetime import timedelta
return date + timedelta(days=days)
dates = []
sts = []
confs = [randint(1, 10)]
avgs = []
r0 = 1 + random()
day = rand_day()
num_days = days + randint(1, 10)
for k in range(num_days):
dates.append(day)
sts.append(st)
confs.append(int(confs[0] * (r0**k)))
if k >= days-1: avgs.append(sum(confs[(-days):]) / days)
day = inc_date(day)
df = DataFrame({"Timestamp": dates,
"ST": sts,
"Confirmed": confs[1:]})
df_soln = DataFrame({"Timestamp": dates[(days-1):],
"ST": sts[(days-1):],
"Confirmed": confs[days:],
"Avg": avgs})
return df, df_soln
def ex3_gen_df():
from random import randint, sample
from pandas import concat
num_states = randint(1, 4)
days = randint(1, 4)
states = sample(STATE_NAMES["Abbrv"].tolist(), num_states)
df_list = []
df_soln_list = []
for st in states:
df_st, df_st_soln = ex3_gen_state_df(st, days)
df_list.append(df_st)
df_soln_list.append(df_st_soln)
df = concat(df_list, ignore_index=True).sample(frac=1).reset_index(drop=True)
df_soln = concat(df_soln_list, ignore_index=True).sort_values(by=["ST", "Timestamp"])
try:
df_your_soln = daily_windowed_avg(df, days)
assert_tibbles_are_equivalent(df_soln, df_your_soln)
except:
print("\n*** ERROR ***")
print("Input data frame:")
display(df)
print(f"Expected solution (days={days}):")
display(df_soln)
print("Your solution:")
display(df_your_soln)
raise
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex3_gen_df()
print("\n(Passed.)")
In case you can't get a working solution to Exercise 3, we have pre-computed the daily windowed averages. The code cell below reads this data and stores it in the variable, df_us_daily_avg
. You will need it in the subsequent exercises, so do not modify it!
df_us_daily_avg = pd.read_csv(get_path('covid19/ex3_soln.csv'), parse_dates=["Timestamp"])
df_us_daily_avg[df_us_daily_avg["ST"].isin(["AK", "GA"])]
Here is a visualization of the daily averages, which should appear smoother. As such, the trends should be a little clearer as well.
# You can modify this cell if you want to play around with the visualization.
viz_by_state("Avg", df_us_daily_avg, covid19_rankings[:TOP_K], logy=True)
Recall from Notebook 11 that you used a Markov chain-based model to "rank" airport networks by how likely a certain "random flyer" is to end up at each airport. In this final step of this problem, you'll apply a similar idea to rank states, and see how well it correlates with the state-by-state numbers of confirmed COVID-19 cases.
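To recall the core idea, here is a minimal sketch of that ranking computation on a hypothetical two-state transition matrix (the actual matrix for this problem is constructed later in this notebook):
import numpy as np
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # P[i, j]: probability of moving from state i to state j
x = np.array([1.0, 0.0])     # the random flyer starts in state 0
for _ in range(100):         # repeatedly apply the chain's update rule, x <- P^T x
    x = P.T @ x
print(x)                     # long-run visit probabilities, approximately [0.833, 0.167]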
Raw data. First, observe that our raw data differs slightly from that of Notebook 11. It consists of all flights from calendar year 2019 (the latest available from the original source, as no 2020 flights are present there), and we've added a column with each airport's two-letter state postal code. Let's load these flights into a DataFrame called flights
. (You don't need to understand this code in depth, but do pay attention to the format of the output sample from flights
.)
def load_flights(infile=get_path('us-flights/us-flights-2019--86633396_T_ONTIME_REPORTING.csv')):
keep_cols = ["FL_DATE", "ORIGIN_STATE_ABR", "DEST_STATE_ABR", "OP_UNIQUE_CARRIER", "OP_CARRIER_FL_NUM"]
flights = pd.read_csv(infile)[keep_cols]
us_sts = set(STATE_NAMES["Abbrv"])
origin_is_state = flights['ORIGIN_STATE_ABR'].isin(us_sts)
dest_is_state = flights['DEST_STATE_ABR'].isin(us_sts)
return flights.loc[origin_is_state & dest_is_state].copy()
flights = load_flights()
print(f"There are {len(flights):,} direct flight segments in the `flights` data frame.")
print("Here are the first few:")
flights.head()
Outdegrees. In Notebook 11, we calculated the outdegree of each airport $u$ to be the number of distinct endpoints (other airports) reachable from $u$.
For the analysis in this problem, we will use a different definition for the outdegree. In particular, we'll define the outdegree $d_u$ of state $u$ (e.g., the state of Georgia, the state of California) to be the total number of direct flight segments from state $u$ to all other states. In pandas, we can use a group-by-count aggregation to compute these outdegrees. Here is some code that does so, producing a data frame named outdegrees
with two columns, the origin state ("Origin"
) and outdegree value ("Outdegree"
), sorted in descending order of outdegree. (You should be able to understand this code, which may help you in the next exercise.)
def calc_outdegrees(flights):
outdegrees = flights[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']] \
.groupby(['ORIGIN_STATE_ABR']) \
.count() \
.reset_index() \
.rename(columns={'ORIGIN_STATE_ABR': 'Origin',
'DEST_STATE_ABR': 'Outdegree'}) \
.sort_values(by='Outdegree', ascending=False) \
.reset_index(drop=True)
return outdegrees
# Demo:
outdegrees = calc_outdegrees(flights)
print(f"There are {len(outdegrees)} states with a non-zero outdegree.")
print("Here are the first ten:")
outdegrees.head(10)
To run the ranking analysis, recall that we need to construct a probability transition matrix. For our state-to-state analysis, we therefore wish to estimate the probability of going from state $i$ to state $j$. Let's define that probability to be the number of direct flight segments from state $i$ to state $j$ divided by the outdegree of state $i$.
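For instance (with made-up numbers): if 40,000 of Georgia's 100,000 outgoing flight segments land in Florida, the estimated transition probability of going from Georgia to Florida would be:
count_ga_to_fl = 40_000       # hypothetical number of direct segments GA -> FL
outdegree_ga = 100_000        # hypothetical total segments leaving GA
print(count_ga_to_fl / outdegree_ga)  # 0.4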
Complete the function, calc_state_trans_probs(flights, outdegrees), to compute these state-to-state transition probabilities. Your function should accept two data frames like flights and outdegrees as defined above. In particular, you may assume the following:
- The flights data frame has three columns: "ORIGIN_STATE_ABR" (originating state, a two-letter abbreviation), "DEST_STATE_ABR" (destination state abbreviation), and "FL_DATE" (date of a direct flight).
- The outdegrees data frame has two columns: "Origin" (originating state, a two-letter abbreviation) and "Outdegree" (an integer).

Your function should return a new data frame with exactly these columns:
- "Origin": The origin state, i.e., state $i$, as a two-letter abbreviation.
- "Dest": The destination state, i.e., state $j$, as a two-letter abbreviation.
- "Count": The number of direct flight segments from state $i$ to state $j$.
- "TransProb": The transition probability of going from state $i$ to state $j$, i.e., the count divided by the outdegree.

def calc_state_trans_probs(flights, outdegrees):
### BEGIN SOLUTION
probs = flights[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'FL_DATE']] \
.groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']) \
.count() \
.reset_index() \
.rename(columns={'ORIGIN_STATE_ABR': 'Origin',
'DEST_STATE_ABR': 'Dest',
'FL_DATE': 'Count'}) \
.merge(outdegrees, on='Origin', how='inner')
probs['TransProb'] = probs['Count'] / probs['Outdegree']
del probs['Outdegree']
return probs
### END SOLUTION
# Demo, Part 0:
probs = calc_state_trans_probs(flights, outdegrees)
# print(f"There are {len(probs)} state-to-state transition probabilities in your result.")
# print("Here are ten with the largest transition probabilities:")
# display(probs.sort_values(by="TransProb", ascending=False).head(10))
# Demo, Part 1:
print("""
As a sanity check, let's see if the sum of the outgoing transition probabilities
for each state is (approximately) 1.0. If it isn't, meaning any of the rows of the
output below are `False`, use that information to help yourself debug.
""")
sanity = (probs[['Origin', 'TransProb']].groupby('Origin').sum() - 1.0).abs() < 1e-14
sanity
# Test cell: `ex4__calc_state_trans_probs` (2 points)
def ex4_gen_df_st(st):
from random import randint, random, choice
from collections import defaultdict
from problem_utils import ex0_random_date
from pandas import DataFrame
states = list(STATE_NAMES["Abbrv"])
num_unique_edges = randint(1, 4)
dates = []
dests = []
counts = defaultdict(int)
outdegree = 0
for _ in range(num_unique_edges):
if random() < 0.33:
num_reps = randint(2, 4)
else:
num_reps = 1
dest_st = choice(states)
dests += [dest_st] * num_reps
dates += [ex0_random_date() for _ in range(num_reps)]
counts[(st, dest_st)] += num_reps
outdegree += num_reps
flights = DataFrame({"FL_DATE": dates,
"ORIGIN_STATE_ABR": [st] * len(dates),
"DEST_STATE_ABR": dests})
dests_st = []
counts_st = []
probs_st = []
for (st, dest_st), c in counts.items():
dests_st.append(dest_st)
counts_st.append(c)
probs_st.append(c / outdegree)
sts = [st] * len(dests_st)
probs = DataFrame({"Origin": sts,
"Dest": dests_st,
"Count": counts_st,
"TransProb": probs_st})
return flights, probs, outdegree
def ex4_check_one():
from random import randint, sample
from pandas import DataFrame, concat
num_states = randint(1, 4)
states = list(STATE_NAMES["Abbrv"])
flights_list = []
probs_list = []
outdegrees_list = []
sts = sample(states, num_states)
for st in sts:
flights_st, probs_st, outdegree_st = ex4_gen_df_st(st)
flights_list.append(flights_st)
probs_list.append(probs_st)
outdegrees_list.append(outdegree_st)
flights = concat(flights_list, ignore_index=True) \
.sort_values(by="FL_DATE") \
.reset_index(drop=True)
probs = concat(probs_list, ignore_index=True) \
.sort_values(by="Origin") \
.reset_index(drop=True)
outdegrees = DataFrame({"Origin": sts,
"Outdegree": outdegrees_list})
try:
your_probs = calc_state_trans_probs(flights, outdegrees)
assert_tibbles_are_equivalent(probs, your_probs)
except:
print("\n*** ERROR ***\n")
print("`flights` input:")
display(flights)
print("`outdegrees` input:")
display(outdegrees)
print("Expected output:")
display(probs)
print("Your output:")
display(your_probs)
raise
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex4_check_one()
EXERCISE4_PASSED = True
print("\n(Passed.)")
The rest of this notebook completes the comparison between state-rankings by confirmed cases and those by the airport network. It does depend on a working Exercise 4. However, running it is for your edification only, as there are no additional exercises or test cells below. Nevertheless, if the autograder has trouble completing due to errors in the code below, you can try converting the code cells to Markdown (effectively disabling them) and see if that helps.
State rankings. The next code cell runs the PageRank-style algorithm on the state-to-state airport network and produces a ranking. It depends on a correct result for Exercise 4, so if yours is not working completely, it might not run to completion.
def spy(A, figsize=(6, 6), markersize=0.5):
"""Visualizes a sparse matrix."""
from matplotlib.pyplot import figure, spy, show
fig = figure(figsize=figsize)
spy(A, markersize=markersize)
show()
def display_vec_sparsely(x, name='x'):
from numpy import argwhere
from pandas import DataFrame
i_nz = argwhere(x).flatten()
df_x_nz = DataFrame({'i': i_nz, '{}[i] (non-zero only)'.format(name): x[i_nz]})
display(df_x_nz.head(5))
if len(df_x_nz) > 5:
print("...")
display(df_x_nz.tail(5))
def eval_markov_chain(P, x0, t_max):
x = x0
for t in range(t_max):
x = P.T.dot(x)
return x
def rank_states_by_air_network(probs, t_max=100, verbose=True):
from numpy import array, zeros, ones, argsort, arange
from scipy.sparse import coo_matrix
from pandas import DataFrame
# Create transition matrix
unique_origins = set(probs['Origin'])
unique_dests = set(probs['Dest'])
unique_states = array(sorted(unique_origins | unique_dests))
state_ids = {st: i for i, st in enumerate(unique_states)}
num_states = max(state_ids.values()) + 1
s2s = probs.copy()
s2s['OriginID'] = s2s['Origin'].map(state_ids)
s2s['DestID'] = s2s['Dest'].map(state_ids)
P = coo_matrix((s2s['TransProb'], (s2s['OriginID'], s2s['DestID'])),
shape=(num_states, num_states))
if verbose: spy(P)
# Run ranking algorithm
x0 = zeros(num_states)
x0[state_ids['WA']] = 1.0 # First state to report confirmed COVID-19 cases
if verbose:
print("Initial condition:")
display_vec_sparsely(x0, name='x0')
x = eval_markov_chain(P, x0, t_max)
if verbose:
print("Final probabilities:")
display_vec_sparsely(x)
# Produce a results table of rank-ordered states
ranks = argsort(-x)
df_ranks = DataFrame({'Rank': arange(1, len(ranks)+1),
'State': unique_states[ranks],
'x(t)': x[ranks]})
df_ranks['ID'] = df_ranks['State'].map(state_ids)
return df_ranks
if "EXERCISE4_PASSED" in dir() and EXERCISE4_PASSED:
print("Running the ranking algorithm...")
airnet_rankings = rank_states_by_air_network(probs, verbose=False)
print(f"==> Here are the top-{TOP_K} states:")
display(airnet_rankings.head(TOP_K))
else:
print("We did not detect that the Exercise 4 test cell passed, so we aren't running this cell.")
Comparing the two rankings. We now have a ranking of states by number of confirmed COVID-19 cases, as well as a separate ranking of states by air-network connectivity. To compare them, we'll use a measure called rank-biased overlap (RBO). Very roughly speaking, this measure estimates the probability that a reader comparing the top few entries of two rankings encounters the same items, so a value closer to 1 means the top entries of the two rankings are more similar.
Note 0. We say "top few" above because RBO is parameterized by a "patience" parameter, which is related to how many of the top entries the reader will inspect before stopping. The reason for this parameter originates in the motivation for RBO, which was to measure the similarity between search engine results. The code we are using to calculate RBO uses this implementation.
Note 1. This cell should only be run if Exercise 4 passes.
from rbo import rbo
if "EXERCISE4_PASSED" in dir() and EXERCISE4_PASSED:
compare_rankings = rbo(covid19_rankings, # ranking by confirmed COVID-19 cases
airnet_rankings['State'].values, # ranking by air-network connectivity
0.95) # "patience" parameter
print(f"Raw RBO result: {compare_rankings}\n\n==> RBO score is {compare_rankings.ext:.3}")
else:
print("We did not detect that the Exercise 4 test cell passed, so we aren't running this cell.")
If everything is correct, you'll see an RBO score of around 0.6, which suggests that the connectivity of the airport network may help explain the number of confirmed COVID-19 cases we are seeing in each state.
Fin! You've reached the end of this problem. Don't forget to restart and run all cells again to make sure everything works when run in sequence, and that your work passes the submission process. Good luck!