In this problem, you'll show off your SQL and Pandas chops on a dataset of Hillary Rodham Clinton's emails!
This problem has four (4) exercises (0-3) and is worth a total of ten (10) points.
This problem also involves SQLite. However, at present, the sqlite3 module is broken on Vocareum when using the Python 3.6 kernel; therefore, the kernel has been set to Python 3.5 for this notebook. If you are working on a local copy, be wary of potential differences in modules and versions.
import sys
print("=== Python version info ===\n{}".format(sys.version))
import sqlite3 as db
print("\n=== sqlite3 version info: {} ===".format(db.version))
We downloaded this database from Kaggle and are making it available on Vocareum only for use in this problem. If you wish to work on this problem in your local environment, you are on your own in figuring out how to obtain the data (or a subset of it).
Start by running the following setup code, which loads the modules you'll need for this problem.
from IPython.display import display
import pandas as pd
import numpy as np
def peek_table (db, name, num=5):
    """
    Given a database connection (`db`), prints both the number of
    records in the table as well as its first few entries.
    """
    count = '''select count (*) from {table}'''.format (table=name)
    peek = '''select * from {table} limit {limit}'''.format (table=name, limit=num)
    # Note: SQLite names a result column after the expression text,
    # so the count is retrieved below via the 'count (*)' label.
    print ("Total number of records:", pd.read_sql_query (count, db)['count (*)'].iloc[0], "\n")
    print ("First {} entries:".format (num))
    display (pd.read_sql_query (peek, db))
def list_tables (conn):
    """Return the names of all visible tables, given a database connection."""
    query = """select name from sqlite_master where type = 'table';"""
    c = conn.cursor ()
    c.execute (query)
    table_names = [t[0] for t in c.fetchall ()]
    return table_names
def tbc (X):
    """Canonicalizes a data frame for comparison: orders the columns
    lexicographically, sorts the rows, and resets the index."""
    var_names = sorted (X.columns)
    Y = X[var_names].copy ()
    Y.sort_values (by=var_names, inplace=True)
    Y.set_index ([list (range (0, len (Y)))], inplace=True)
    return Y

def tbeq (A, B):
    """Returns True if data frames `A` and `B` are equal, up to
    permutations of their rows and columns."""
    A_c = tbc (A)
    B_c = tbc (B)
    return A_c.eq (B_c).all ().all ()
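For example (a quick sanity check, not part of the grading code), two data frames that differ only in row and column order should compare equal under tbeq:

A_demo = pd.DataFrame ({'x': [1, 2], 'y': [3, 4]})
B_demo = pd.DataFrame ({'y': [4, 3], 'x': [2, 1]})  # same rows and columns, permuted
assert tbeq (A_demo, B_demo)  # passes: the canonical forms match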
DATA_PATH = "./resource/asnlib/publicdata/"
conn = db.connect ('{}hrc.db'.format(DATA_PATH))
print ("List of tables in the database:", list_tables (conn))
peek_table (conn, 'Emails')
peek_table (conn, 'EmailReceivers', num=3)
peek_table (conn, 'Persons')
Exercise 0 (1 point). Extract the Persons table from the database and store it as a Pandas data frame in a variable named Persons having two columns: Id and Name.
### BEGIN SOLUTION
Persons = pd.read_sql_query ('SELECT * FROM Persons', conn)
### END SOLUTION
assert 'Persons' in globals ()
assert type (Persons) is type (pd.DataFrame ())
assert len (Persons) == 513
print ("Five random people from the `Persons` table:")
display (Persons.iloc[np.random.choice (len (Persons), 5)])
print ("\n(Passed!)")
Exercise 1 (3 points). Query the database to determine how frequently particular pairs of people communicate. Store the results in a Pandas data frame named CommEdges having the following three columns:

- Sender: The ID of the sender (taken from the Emails table).
- Receiver: The ID of the receiver (taken from the EmailReceivers table).
- Frequency: The number of times this particular (Sender, Receiver) pair occurs.

Order the results in descending order of Frequency.

There is one corner case that you should also handle: sometimes the Sender field is empty (unknown). You can filter out these cases by checking that the sender ID is not the empty string.
# Alternative approach in pure Pandas: pull the two tables into data frames
Senders = pd.read_sql_query("select Id, SenderPersonId from Emails", conn)
Receivers = pd.read_sql_query("select EmailId, PersonId from EmailReceivers", conn)
print(len(Senders), "emails;", len(Receivers), "receiver records")
display(Senders.head())
display(Receivers.head())

# Filter out emails whose SenderPersonId is empty (unknown sender)
Senders = Senders[Senders.SenderPersonId != ""]

# Join on the email ID, then count occurrences of each (sender, receiver) pair
solution_df = pd.merge(Senders, Receivers, left_on="Id", right_on="EmailId") \
                .groupby(["SenderPersonId", "PersonId"]).size().reset_index()
solution_df.columns = ["Sender", "Receiver", "Frequency"]

# Sort in descending order of frequency
solution_df = solution_df.sort_values("Frequency", ascending=False)
solution_df.head()
### BEGIN SOLUTION
query = """
SELECT
    SenderPersonId AS Sender,
    PersonId AS Receiver,
    COUNT (*) AS Frequency
FROM
    Emails, EmailReceivers
WHERE
    Emails.Id = EmailReceivers.EmailId AND Sender <> ''
GROUP BY
    Sender, Receiver
ORDER BY
    Frequency DESC
"""
CommEdges = pd.read_sql_query (query, conn)
#CommEdges.to_csv ('CommEdges_soln.csv', index=False)
### END SOLUTION
# Read what we believe is the exact result (up to permutations)
CommEdges_soln = pd.read_csv ('{}CommEdges_soln.csv'.format(DATA_PATH))
# Check that we got a data frame of the expected shape:
assert 'CommEdges' in globals ()
assert type (CommEdges) is type (pd.DataFrame ())
assert len (CommEdges) == len (CommEdges_soln)
assert set (CommEdges.columns) == set (['Sender', 'Receiver', 'Frequency'])
# Check that the results are sorted:
non_increasing = (CommEdges['Frequency'].iloc[:-1].values >= CommEdges['Frequency'].iloc[1:].values)
assert non_increasing.all ()
print ("Top 5 communicating pairs:")
display (CommEdges.head ())
assert tbeq (CommEdges, CommEdges_soln)
print ("\n(Passed!)")
Exercise 2 (3 points). Consider any pair of people, $a$ and $b$. Suppose we don't care whether person $a$ sends and person $b$ receives or whether person $b$ sends and person $a$ receives. Rather, we only care that $\{a, b\}$ have exchanged messages.
That is, the previous exercise computed a directed graph, $G = \left(g_{a,b}\right)$, where $g_{a,b}$ is the number of times (or "frequency") that person $a$ was the sender and person $b$ was the receiver. Instead, suppose we wish to compute its symmetrized or undirected version, $H = G + G^T$.
Write some code that computes $H$ and stores it in a Pandas data frame named CommPairs with the columns A, B, and Frequency. Per the definition of $H$, the Frequency column should combine frequencies from $G$ and $G^T$ accordingly.
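For intuition, here is a tiny worked example (hypothetical numbers, and a dense matrix rather than the data frame representation this exercise uses):

G_demo = np.array ([[0, 2, 0],
                    [1, 0, 3],
                    [0, 0, 0]])  # hypothetical directed frequencies among 3 people
H_demo = G_demo + G_demo.T       # H_demo[a][b] == H_demo[b][a]: total messages exchanged
print (H_demo)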
### BEGIN SOLUTION
# `G` holds the directed (A, B) counts; `GT` swaps the sender and receiver roles
G = CommEdges.rename (columns={'Sender': 'A', 'Receiver': 'B'})
GT = CommEdges.rename (columns={'Sender': 'B', 'Receiver': 'A'})
# The (inner) merge pairs each directed edge with its reverse; summing gives H = G + G^T
H = pd.merge (G, GT, on=['A', 'B'], suffixes=('_G', '_GT'))
H['Frequency'] = H['Frequency_G'] + H['Frequency_GT']
del H['Frequency_G']
del H['Frequency_GT']
CommPairs = H
#CommPairs.to_csv ('CommPairs_soln.csv', index=False)
### END SOLUTION
CommPairs_soln = pd.read_csv ('{}CommPairs_soln.csv'.format(DATA_PATH))
assert 'CommPairs' in globals ()
assert type (CommPairs) is type (pd.DataFrame ())
assert len (CommPairs) == len (CommPairs_soln)
print ("Most frequently communicating pairs:")
display (CommPairs.sort_values (by='Frequency', ascending=False).head (10))
assert tbeq (CommPairs, CommPairs_soln)
print ("\n(Passed!)")
Exercise 3 (3 points). Starting with a copy of CommPairs, named CommPairsNamed, add two additional columns that contain the names of the communicators. Place these values in columns named A_name and B_name in CommPairsNamed.
CommPairsNamed = CommPairs.copy ()
### BEGIN SOLUTION
# Attach A's name by joining against `Persons`, then drop the redundant `Id` key
CommPairsNamed = pd.merge (CommPairsNamed, Persons, left_on=['A'], right_on=['Id'])
CommPairsNamed.rename (columns={'Name': 'A_name'}, inplace=True)
del CommPairsNamed['Id']
# Repeat for B's name
CommPairsNamed = pd.merge (CommPairsNamed, Persons, left_on=['B'], right_on=['Id'])
CommPairsNamed.rename (columns={'Name': 'B_name'}, inplace=True)
del CommPairsNamed['Id']
#CommPairsNamed.to_csv ('CommPairsNamed_soln.csv', index=False)
### END SOLUTION
CommPairsNamed_soln = pd.read_csv ('{}CommPairsNamed_soln.csv'.format(DATA_PATH))
assert 'CommPairsNamed' in globals ()
assert type (CommPairsNamed) is type (pd.DataFrame ())
assert set (CommPairsNamed.columns) == set (['A', 'A_name', 'B', 'B_name', 'Frequency'])
print ("Top few entries:")
CommPairsNamed.sort_values (by=['Frequency', 'A', 'B'], ascending=False, inplace=True)
display (CommPairsNamed.head (10))
assert tbeq (CommPairsNamed, CommPairsNamed_soln)
print ("\n(Passed!)")
When you are all done, it's good practice to close the database connection. The following will do that for you.
conn.close ()
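As an aside (an optional pattern, not required here), wrapping the connection in contextlib.closing guarantees it gets closed even if an error occurs; note that using a sqlite3 connection directly in a with statement manages transactions, not closing:

from contextlib import closing
with closing (db.connect ('{}hrc.db'.format (DATA_PATH))) as conn2:
    print (list_tables (conn2))  # `conn2` is closed automatically on exit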
Fin! If you've reached this point and all tests above pass, you are ready to submit your solution to this problem. Don't forget to save your work prior to submitting.