Problem 2: "But her emails..."

In this problem, you'll show your SQL and Pandas chops on the dataset consisting of Hilary Rodham Clinton's emails!

This problem has four (4) exercises (0-3) and is worth a total of ten (10) points.

This problem also involves SQLite. However, at present, the sqlite3 module is broken on Vocareum when using the Python 3.6 kernel; therefore, the kernel has been set to Python 3.5 for this notebook. If you are trying to work on a local copy, please be wary of potential differences in modules and versions.

In [1]:
import sys
print("=== Python version info ===\n{}".format(sys.version))

import sqlite3 as db
print("\n=== sqlite3 version info: {} ===".format(db.version))
=== Python version info ===
3.7.5 (default, Dec 18 2019, 06:24:58) 
[GCC 5.5.0 20171010]

=== sqlite3 version info: 2.6.0 ===

Setup

We downloaded this database from Kaggle.

We are only making it available on Vocareum for use in this problem. If you wish to work on this problem in your local environment, you are on your own in figuring out how to get the data (or a subset of the data) to be able to do so.

Start by running the following setup code, which will load the modules you'll need for this problem

In [2]:
from IPython.display import display
import pandas as pd
import numpy as np

def peek_table (db, name, num=5):
    """
    Given a database connection (`db`), prints both the number of
    records in the table as well as its first few entries.
    """
    count = '''select count (*) FROM {table}'''.format (table=name)
    peek = '''select * from {table} limit {limit}'''.format (table=name, limit=num)

    print ("Total number of records:", pd.read_sql_query (count, db)['count (*)'].iloc[0], "\n")

    print ("First {} entries:".format (num))
    display (pd.read_sql_query (peek, db))

def list_tables (conn):
    """Return the names of all visible tables, given a database connection."""
    query = """select name from sqlite_master where type = 'table';"""
    c = conn.cursor ()
    c.execute (query)
    table_names = [t[0] for t in c.fetchall ()]
    return table_names

def tbc (X):
    var_names = sorted (X.columns)
    Y = X[var_names].copy ()
    Y.sort_values (by=var_names, inplace=True)
    Y.set_index ([list (range (0, len (Y)))], inplace=True)
    return Y

def tbeq(A, B):
    A_c = tbc(A)
    B_c = tbc(B)
    return A_c.eq(B_c).all().all()
In [3]:
DATA_PATH = "./resource/asnlib/publicdata/"
conn = db.connect ('{}hrc.db'.format(DATA_PATH))

print ("List of tables in the database:", list_tables (conn))
List of tables in the database: ['Emails', 'Persons', 'Aliases', 'EmailReceivers']
In [4]:
peek_table (conn, 'Emails')
peek_table (conn, 'EmailReceivers', num=3)
peek_table (conn, 'Persons')
Total number of records: 7945 

First 5 entries:
Id DocNumber MetadataSubject MetadataTo MetadataFrom SenderPersonId MetadataDateSent MetadataDateReleased MetadataPdfLink MetadataCaseNumber ... ExtractedTo ExtractedFrom ExtractedCc ExtractedDateSent ExtractedCaseNumber ExtractedDocNumber ExtractedDateReleased ExtractedReleaseInPartOrFull ExtractedBodyText RawText
0 1 C05739545 WOW H Sullivan, Jacob J 87 2012-09-12T04:00:00+00:00 2015-05-22T04:00:00+00:00 DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545... F-2015-04841 ... Sullivan, Jacob J <Sullivan11@state.gov> Wednesday, September 12, 2012 10:16 AM F-2015-04841 C05739545 05/13/2015 RELEASE IN FULL UNCLASSIFIED\nU.S. Department of State\nCase N...
1 2 C05739546 H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR... H 2011-03-03T05:00:00+00:00 2015-05-22T04:00:00+00:00 DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546... F-2015-04841 ... F-2015-04841 C05739546 05/13/2015 RELEASE IN PART B6\nThursday, March 3, 2011 9:45 PM\nH: Latest... UNCLASSIFIED\nU.S. Department of State\nCase N...
2 3 C05739547 CHRIS STEVENS ;H Mills, Cheryl D 32 2012-09-12T04:00:00+00:00 2015-05-22T04:00:00+00:00 DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547... F-2015-04841 ... B6 Mills, Cheryl D <MillsCD@state.gov> Abedin, Huma Wednesday, September 12, 2012 11:52 AM F-2015-04841 C05739547 05/14/2015 RELEASE IN PART Thx UNCLASSIFIED\nU.S. Department of State\nCase N...
3 4 C05739550 CAIRO CONDEMNATION - FINAL H Mills, Cheryl D 32 2012-09-12T04:00:00+00:00 2015-05-22T04:00:00+00:00 DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550... F-2015-04841 ... Mills, Cheryl D <MillsCD@state.gov> Mitchell, Andrew B Wednesday, September 12,2012 12:44 PM F-2015-04841 C05739550 05/13/2015 RELEASE IN PART UNCLASSIFIED\nU.S. Department of State\nCase N...
4 5 C05739554 H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR... Abedin, Huma H 80 2011-03-11T05:00:00+00:00 2015-05-22T04:00:00+00:00 DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554... F-2015-04841 ... F-2015-04841 C05739554 05/13/2015 RELEASE IN PART H <hrod17@clintonemail.com>\nFriday, March 11,... B6\nUNCLASSIFIED\nU.S. Department of State\nCa...

5 rows × 22 columns

Total number of records: 9306 

First 3 entries:
Id EmailId PersonId
0 1 1 80
1 2 2 80
2 3 3 228
Total number of records: 513 

First 5 entries:
Id Name
0 1 111th Congress
1 2 AGNA USEMB Kabul Afghanistan
2 3 AP
3 4 ASUNCION
4 5 Alec

Exercise 0 (1 point). Extract the Persons table from the database and store it as a Pandas data frame in a variable named Persons having two columns: Id and Name.

In [5]:
### BEGIN SOLUTION
Persons = pd.read_sql_query ('SELECT * FROM Persons', conn)
### END SOLUTION
In [6]:
assert 'Persons' in globals ()
assert type (Persons) is type (pd.DataFrame ())
assert len (Persons) == 513

print ("Five random people from the `Persons` table:")
display (Persons.iloc[np.random.choice (len (Persons), 5)])

print ("\n(Passed!)")
Five random people from the `Persons` table:
Id Name
356 357 mtorrey1
500 501 abed inh@state.gov.
512 513 ha nleym r@state.gov
148 149 Mike
418 419 valmorou state. ov
(Passed!)

Exercise 1 (3 points). Query the database to determine how frequently particular pairs of people communicate. Store the results in a Pandas data frame named CommEdges having the following three columns:

  • Sender: The ID of the sender (taken from the Emails table).
  • Receiver: The ID of the receiver (taken from the EmailReceivers table).
  • Frequency: The number of times this particular (Sender, Receiver) pair occurs.

Order the results in descending order of Frequency.

There is one corner case that you should also handle: sometimes the Sender field is empty (unknown). You can filter these cases by checking that the sender ID is not the empty string.

In [7]:
# Pull 2 table into 2 dataframes
Senders = pd.read_sql_query("select Id, SenderPersonId from Emails",conn)
Receivers = pd.read_sql_query("select EmailId, PersonId from EmailReceivers",conn)

# Filter Senders table to exclude null values for SenderPersonId
Senders = Senders[Senders.SenderPersonId!=""]

# Merge 2 dataframe with pd.merge
solution_df = pd.merge(Senders,Receivers,left_on="Id",right_on="EmailId").groupby(["SenderPersonId","PersonId"]).size().reset_index()

#Rename Columns
solution_df.columns = ["Sender", "Receiver","Frequency"]

# Sorting
solution_df = solution_df.sort_values("Frequency",ascending=False)
solution_df
Out[7]:
Sender Receiver Frequency
424 81 80 1406
40 32 80 1262
490 87 80 857
168 80 81 529
152 80 32 372
... ... ... ...
322 80 327 1
321 80 326 1
320 80 325 1
317 80 322 1
738 226 80 1

739 rows × 3 columns

In [8]:
Senders.head()
Out[8]:
Id SenderPersonId
0 2348 1
1 2630 2
2 5265 3
3 368 5
4 5032 6
In [9]:
len(Senders), len(Receivers)
Out[9]:
(7788, 9306)
In [10]:
display(Senders.head())
display(Receivers.head())
Id SenderPersonId
0 2348 1
1 2630 2
2 5265 3
3 368 5
4 5032 6
EmailId PersonId
0 1 80
1 2 80
2 3 228
3 3 80
4 4 80
In [11]:
### BEGIN SOLUTION
query = """
  SELECT
      SenderPersonId AS Sender,
      PersonId AS Receiver,
      COUNT (*) AS Frequency
  FROM
      Emails, EmailReceivers
  WHERE
      Emails.Id = EmailReceivers.EmailId AND Sender <> ''
  GROUP BY
      Sender, Receiver
  ORDER BY
      -Frequency
"""
CommEdges = pd.read_sql_query (query, conn)

#CommEdges.to_csv ('CommEdges_soln.csv', index=False)
### END SOLUTION
# CommEdges = X
In [12]:
# Read what we believe is the exact result (up to permutations)
CommEdges_soln = pd.read_csv ('{}CommEdges_soln.csv'.format(DATA_PATH))

# Check that we got a data frame of the expected shape:
assert 'CommEdges' in globals ()
assert type (CommEdges) is type (pd.DataFrame ())
assert len (CommEdges) == len (CommEdges_soln)
assert set (CommEdges.columns) == set (['Sender', 'Receiver', 'Frequency'])

# Check that the results are sorted:
non_increasing = (CommEdges['Frequency'].iloc[:-1].values >= CommEdges['Frequency'].iloc[1:].values)
assert non_increasing.all ()

print ("Top 5 communicating pairs:")
display (CommEdges.head ())

assert tbeq (CommEdges, CommEdges_soln)
print ("\n(Passed!)")
Top 5 communicating pairs:
Sender Receiver Frequency
0 81 80 1406
1 32 80 1262
2 87 80 857
3 80 81 529
4 80 32 372
(Passed!)

Exercise 2 (3 points). Consider any pair of people, $a$ and $b$. Suppose we don't care whether person $a$ sends and person $b$ receives or whether person $b$ sends and person $a$ receives. Rather, we only care that $\{a, b\}$ have exchanged messages.

That is, the previous exercise computed a directed graph, $G = \left(g_{a,b}\right)$, where $g_{a,b}$ is the number of times (or "frequency") that person $a$ was the sender and person $b$ was the receiver. Instead, suppose we wish to compute its symmetrized or undirected version, $H = G + G^T$.

Write some code that computes $H$ and stores it in a Pandas data frame named CommPairs with the columns, A, B, and Frequency. Per the definition of $H$, the Frequency column should combine frequencies from $G$ and $G^T$ accordingly.

In [13]:
### BEGIN SOLUTION
G = CommEdges.rename (columns={'Sender': 'A', 'Receiver': 'B'})
GT = CommEdges.rename (columns={'Sender': 'B', 'Receiver': 'A'})
H = pd.merge (G, GT, on=['A', 'B'], suffixes=('_G', '_GT'))
H['Frequency'] = H['Frequency_G'] + H['Frequency_GT']
del H['Frequency_G']
del H['Frequency_GT']
CommPairs = H

#CommPairs.to_csv ('CommPairs_soln.csv', index=False)
### END SOLUTION
In [14]:
CommPairs_soln = pd.read_csv ('{}CommPairs_soln.csv'.format(DATA_PATH))

assert 'CommPairs' in globals ()
assert type (CommPairs) is type (pd.DataFrame ())
assert len (CommPairs) == len (CommPairs_soln)

print ("Most frequently communicating pairs:")
display (CommPairs.sort_values (by='Frequency', ascending=False).head (10))

assert tbeq (CommPairs, CommPairs_soln)
print ("\n(Passed!)")
Most frequently communicating pairs:
A B Frequency
0 81 80 1935
3 80 81 1935
4 80 32 1634
1 32 80 1634
2 87 80 1206
6 80 87 1206
7 116 80 580
8 80 116 580
5 194 80 413
19 80 194 413
(Passed!)

Exercise 3 (3 points). Starting with a copy of CommPairs, named CommPairsNamed, add two additional columns that contain the names of the communicators. Place these values in columns named A_name and B_name in CommPairsNamed.

In [15]:
CommPairsNamed = CommPairs.copy ()

### BEGIN SOLUTION
CommPairsNamed = pd.merge (CommPairsNamed, Persons, left_on=['A'], right_on=['Id'])
CommPairsNamed.rename (columns={'Name': 'A_name'}, inplace=True)
del CommPairsNamed['Id']

CommPairsNamed = pd.merge (CommPairsNamed, Persons, left_on=['B'], right_on=['Id'])
CommPairsNamed.rename (columns={'Name': 'B_name'}, inplace=True)
del CommPairsNamed['Id']

#CommPairsNamed.to_csv ('CommPairsNamed_soln.csv', index=False)
### END SOLUTION
In [16]:
CommPairsNamed_soln = pd.read_csv ('{}CommPairsNamed_soln.csv'.format(DATA_PATH))

assert 'CommPairsNamed' in globals ()
assert type (CommPairsNamed) is type (pd.DataFrame ())
assert set (CommPairsNamed.columns) == set (['A', 'A_name', 'B', 'B_name', 'Frequency'])

print ("Top few entries:")
CommPairsNamed.sort_values (by=['Frequency', 'A', 'B'], ascending=False, inplace=True)
display (CommPairsNamed.head (10))

assert tbeq (CommPairsNamed, CommPairsNamed_soln)
print ("\n(Passed!)")
Top few entries:
A B Frequency A_name B_name
0 81 80 1935 Huma Abedin Hillary Clinton
137 80 81 1935 Hillary Clinton Huma Abedin
75 80 32 1634 Hillary Clinton Cheryl Mills
1 32 80 1634 Cheryl Mills Hillary Clinton
2 87 80 1206 Jake Sullivan Hillary Clinton
61 80 87 1206 Hillary Clinton Jake Sullivan
4 116 80 580 Lauren Jiloty Hillary Clinton
95 80 116 580 Hillary Clinton Lauren Jiloty
3 194 80 413 Sidney Blumenthal Hillary Clinton
168 80 194 413 Hillary Clinton Sidney Blumenthal
(Passed!)

When you are all done, it's good practice to close the database. The following will do that for you.

In [17]:
conn.close ()

Fin! If you've reached this point and all tests above pass, you are ready to submit your solution to this problem. Don't forget to save you work prior to submitting.