In this problem, you'll show off your SQL and Pandas chops on a dataset of Hillary Rodham Clinton's emails!
This problem has four (4) exercises (0-3) and is worth a total of ten (10) points.
This problem also involves SQLite. However, at present, the sqlite3 module is broken on Vocareum when using the Python 3.6 kernel; therefore, the kernel has been set to Python 3.5 for this notebook. If you are working on a local copy, be wary of potential differences in modules and versions.
import sys
print("=== Python version info ===\n{}".format(sys.version))
import sqlite3 as db
print("\n=== sqlite3 version info: {} ===".format(db.version))
We downloaded this database from Kaggle and are making it available on Vocareum only for use in this problem. If you wish to work on this problem in your local environment, you are on your own in figuring out how to obtain the data (or a subset of it).
Start by running the following setup code, which loads the modules you'll need for this problem.
from IPython.display import display
import pandas as pd
import numpy as np
def peek_table (db, name, num=5):
    """
    Given a database connection (`db`), prints both the number of
    records in the table as well as its first few entries.
    """
    count = '''select count (*) from {table}'''.format (table=name)
    peek = '''select * from {table} limit {limit}'''.format (table=name, limit=num)
    # Note: SQLite names a result column after the expression text,
    # so the count is retrieved below via the 'count (*)' label.
    print ("Total number of records:", pd.read_sql_query (count, db)['count (*)'].iloc[0], "\n")
    print ("First {} entries:".format (num))
    display (pd.read_sql_query (peek, db))
def list_tables (conn):
    """Return the names of all visible tables, given a database connection."""
    query = """select name from sqlite_master where type = 'table';"""
    c = conn.cursor ()
    c.execute (query)
    table_names = [t[0] for t in c.fetchall ()]
    return table_names
def tbc (X):
    """Canonicalizes a data frame for comparison: orders the columns
    lexicographically, sorts the rows, and resets the index."""
    var_names = sorted (X.columns)
    Y = X[var_names].copy ()
    Y.sort_values (by=var_names, inplace=True)
    Y.set_index ([list (range (0, len (Y)))], inplace=True)
    return Y

def tbeq (A, B):
    """Returns True if data frames `A` and `B` are equal, up to
    permutations of their rows and columns."""
    A_c = tbc (A)
    B_c = tbc (B)
    return A_c.eq (B_c).all ().all ()
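For example (a quick sanity check, not part of the grading code), two data frames that differ only in row and column order should compare equal under tbeq:

A_demo = pd.DataFrame ({'x': [1, 2], 'y': [3, 4]})
B_demo = pd.DataFrame ({'y': [4, 3], 'x': [2, 1]})  # same rows and columns, permuted
assert tbeq (A_demo, B_demo)  # passes: the canonical forms match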
DATA_PATH = "./resource/asnlib/publicdata/"
conn = db.connect ('{}hrc.db'.format(DATA_PATH))
print ("List of tables in the database:", list_tables (conn))
peek_table (conn, 'Emails')
peek_table (conn, 'EmailReceivers', num=3)
peek_table (conn, 'Persons')
Exercise 0 (1 point). Extract the Persons table from the database and store it as a Pandas data frame in a variable named Persons having two columns: Id and Name.
### BEGIN SOLUTION
Persons = pd.read_sql_query ('SELECT * FROM Persons', conn)
### END SOLUTION
assert 'Persons' in globals ()
assert type (Persons) is type (pd.DataFrame ())
assert len (Persons) == 513
print ("Five random people from the `Persons` table:")
display (Persons.iloc[np.random.choice (len (Persons), 5)])
print ("\n(Passed!)")
Exercise 1 (3 points). Query the database to determine how frequently particular pairs of people communicate. Store the results in a Pandas data frame named CommEdges having the following three columns:

- Sender: The ID of the sender (taken from the Emails table).
- Receiver: The ID of the receiver (taken from the EmailReceivers table).
- Frequency: The number of times this particular (Sender, Receiver) pair occurs.

Order the results in descending order of Frequency.

There is one corner case that you should also handle: sometimes the Sender field is empty (unknown). You can filter out these cases by checking that the sender ID is not the empty string.
# Alternative approach in pure Pandas: pull the two tables into data frames
Senders = pd.read_sql_query("select Id, SenderPersonId from Emails", conn)
Receivers = pd.read_sql_query("select EmailId, PersonId from EmailReceivers", conn)
print(len(Senders), "emails;", len(Receivers), "receiver records")
display(Senders.head())
display(Receivers.head())

# Filter out emails whose SenderPersonId is empty (unknown sender)
Senders = Senders[Senders.SenderPersonId != ""]

# Join on the email ID, then count occurrences of each (sender, receiver) pair
solution_df = pd.merge(Senders, Receivers, left_on="Id", right_on="EmailId") \
                .groupby(["SenderPersonId", "PersonId"]).size().reset_index()
solution_df.columns = ["Sender", "Receiver", "Frequency"]

# Sort in descending order of frequency
solution_df = solution_df.sort_values("Frequency", ascending=False)
solution_df.head()
### BEGIN SOLUTION
query = """
SELECT
    SenderPersonId AS Sender,
    PersonId AS Receiver,
    COUNT (*) AS Frequency
FROM
    Emails, EmailReceivers
WHERE
    Emails.Id = EmailReceivers.EmailId AND Sender <> ''
GROUP BY
    Sender, Receiver
ORDER BY
    Frequency DESC
"""
CommEdges = pd.read_sql_query (query, conn)
#CommEdges.to_csv ('CommEdges_soln.csv', index=False)
### END SOLUTION
# Read what we believe is the exact result (up to permutations)
CommEdges_soln = pd.read_csv ('{}CommEdges_soln.csv'.format(DATA_PATH))
# Check that we got a data frame of the expected shape:
assert 'CommEdges' in globals ()
assert type (CommEdges) is type (pd.DataFrame ())
assert len (CommEdges) == len (CommEdges_soln)
assert set (CommEdges.columns) == set (['Sender', 'Receiver', 'Frequency'])
# Check that the results are sorted:
non_increasing = (CommEdges['Frequency'].iloc[:-1].values >= CommEdges['Frequency'].iloc[1:].values)
assert non_increasing.all ()
print ("Top 5 communicating pairs:")
display (CommEdges.head ())
assert tbeq (CommEdges, CommEdges_soln)
print ("\n(Passed!)")
Exercise 2 (3 points). Consider any pair of people, $a$ and $b$. Suppose we don't care whether person $a$ sends and person $b$ receives or whether person $b$ sends and person $a$ receives. Rather, we only care that $\{a, b\}$ have exchanged messages.
That is, the previous exercise computed a directed graph, $G = \left(g_{a,b}\right)$, where $g_{a,b}$ is the number of times (or "frequency") that person $a$ was the sender and person $b$ was the receiver. Instead, suppose we wish to compute its symmetrized or undirected version, $H = G + G^T$.
Write some code that computes $H$ and stores it in a Pandas data frame named CommPairs with the columns A, B, and Frequency. Per the definition of $H$, the Frequency column should combine frequencies from $G$ and $G^T$ accordingly.
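For intuition, here is a tiny worked example (hypothetical numbers, and a dense matrix rather than the data frame representation this exercise uses):

G_demo = np.array ([[0, 2, 0],
                    [1, 0, 3],
                    [0, 0, 0]])  # hypothetical directed frequencies among 3 people
H_demo = G_demo + G_demo.T       # H_demo[a][b] == H_demo[b][a]: total messages exchanged
print (H_demo)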
### BEGIN SOLUTION
# `G` holds the directed (A, B) counts; `GT` swaps the sender and receiver roles
G = CommEdges.rename (columns={'Sender': 'A', 'Receiver': 'B'})
GT = CommEdges.rename (columns={'Sender': 'B', 'Receiver': 'A'})
# The (inner) merge pairs each directed edge with its reverse; summing gives H = G + G^T
H = pd.merge (G, GT, on=['A', 'B'], suffixes=('_G', '_GT'))
H['Frequency'] = H['Frequency_G'] + H['Frequency_GT']
del H['Frequency_G']
del H['Frequency_GT']
CommPairs = H
#CommPairs.to_csv ('CommPairs_soln.csv', index=False)
### END SOLUTION
CommPairs_soln = pd.read_csv ('{}CommPairs_soln.csv'.format(DATA_PATH))
assert 'CommPairs' in globals ()
assert type (CommPairs) is type (pd.DataFrame ())
assert len (CommPairs) == len (CommPairs_soln)
print ("Most frequently communicating pairs:")
display (CommPairs.sort_values (by='Frequency', ascending=False).head (10))
assert tbeq (CommPairs, CommPairs_soln)
print ("\n(Passed!)")
Exercise 3 (3 points). Starting with a copy of CommPairs, named CommPairsNamed, add two additional columns that contain the names of the communicators. Place these values in columns named A_name and B_name in CommPairsNamed.
CommPairsNamed = CommPairs.copy ()
### BEGIN SOLUTION
# Attach A's name by joining against `Persons`, then drop the redundant `Id` key
CommPairsNamed = pd.merge (CommPairsNamed, Persons, left_on=['A'], right_on=['Id'])
CommPairsNamed.rename (columns={'Name': 'A_name'}, inplace=True)
del CommPairsNamed['Id']
# Repeat for B's name
CommPairsNamed = pd.merge (CommPairsNamed, Persons, left_on=['B'], right_on=['Id'])
CommPairsNamed.rename (columns={'Name': 'B_name'}, inplace=True)
del CommPairsNamed['Id']
#CommPairsNamed.to_csv ('CommPairsNamed_soln.csv', index=False)
### END SOLUTION
CommPairsNamed_soln = pd.read_csv ('{}CommPairsNamed_soln.csv'.format(DATA_PATH))
assert 'CommPairsNamed' in globals ()
assert type (CommPairsNamed) is type (pd.DataFrame ())
assert set (CommPairsNamed.columns) == set (['A', 'A_name', 'B', 'B_name', 'Frequency'])
print ("Top few entries:")
CommPairsNamed.sort_values (by=['Frequency', 'A', 'B'], ascending=False, inplace=True)
display (CommPairsNamed.head (10))
assert tbeq (CommPairsNamed, CommPairsNamed_soln)
print ("\n(Passed!)")
When you are all done, it's good practice to close the database connection. The following will do that for you.
conn.close ()
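As an aside (an optional pattern, not required here), wrapping the connection in contextlib.closing guarantees it gets closed even if an error occurs; note that using a sqlite3 connection directly in a with statement manages transactions, not closing:

from contextlib import closing
with closing (db.connect ('{}hrc.db'.format (DATA_PATH))) as conn2:
    print (list_tables (conn2))  # `conn2` is closed automatically on exit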
Fin! If you've reached this point and all tests above pass, you are ready to submit your solution to this problem. Don't forget to save your work prior to submitting.