Version history:
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of Numpy, pandas, database organization, graph abstractions, and basic Python (for interfacing with other Python libraries). It has 11 exercises, numbered 0 to 10. There are 21 available points. However, to earn 100% the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Free points! This exam includes one exercise, Exercise 3, whose points are "free." However, to get these points you need to read some text and submit the notebook to the autograder at least once.
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. Exercises are not necessarily ordered by difficulty, but higher point values usually imply more difficult tasks.
Demo cells: Code cells that start with the comment ### define demo inputs will load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for later demos to work properly, but they do not affect the test cells. The data loaded by these cells may be large (at least in terms of human readability). You are free to inspect them, but we did not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Final reminders:
Overall workflow. This notebook has six (6) parts with about 1-3 exercises each.
Skim the code cell below and then run it. Take note of the standard preloaded modules, numpy as np, pandas as pd, and sqlite3 as db, any or all of which you may need to construct your solutions.
The other functions are used by our demo and testing code. You can ignore them unless an exercise asks you to do otherwise.
### Global Imports
### BEGIN HIDDEN TESTS
%load_ext autoreload
%autoreload 2
import dill
import hashlib
if False:  # set to True to set up
    REGENERATE_OUTPUTS = False
    def hash_check(f1, f2, verbose=True):
        with open(f1, 'rb') as f:
            h1 = hashlib.md5(f.read()).hexdigest()
        with open(f2, 'rb') as f:
            h2 = hashlib.md5(f.read()).hexdigest()
        if verbose:
            print(h1)
            print(h2)
        assert h1 == h2, f'The file "{f1}" has been modified'
    with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
        dill.dump(hash_check, f)
    del hash_check
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
for fname in ['__init__.py', 'utils.py']:
    hash_check(f'cse6040/{fname}', f'resource/asnlib/public/cse6040/{fname}')
del hash_check
### END HIDDEN TESTS
# Standard Python modules
import sys
import numpy as np
import pandas as pd
import sqlite3 as db
# Extra functions this notebook needs. Ignore these unless
# an exercise asks you to do otherwise.
import networkx as nx
import cse6040
from cse6040.utils import load_text_from_file, load_df_from_file, load_obj_from_file, load_table_from_db
In case it's helpful, here are the versions of Python and standard modules you are using:
print("* Python version: {}".format(sys.version.replace('\n', ' ')))
print(f"* Numpy version: {np.__version__}")
print(f"* pandas version: {pd.__version__}")
print(f"* sqlite3 version: {db.version}")
The actual Goodreads data is provided via a SQLite3 database. However, only some exercises require SQL; most exercises were designed with pandas in mind.
Nevertheless, even some of the pandas exercises can be solved using SQL. The cell below defines the function, dfs_to_conn, which can be used to create in-memory database connections. If you pass in a dictionary mapping table names to pandas DataFrame objects, then dfs_to_conn will return a sqlite3 connection with all of the data in the DataFrame objects available under the names given as keys. You are also free to write to the in-memory database by creating tables, inserting, deleting, or updating records, etc. Anything that SQLite3 allows should work.
Example:
my_df = pd.DataFrame({'A':[1,2,3], 'B': [4,5,6], 'C':['x', 'y', 'z']})
print(my_df)
# A B C
# 0 1 4 x
# 1 2 5 y
# 2 3 6 z
conn = dfs_to_conn({'my_table': my_df})
cur = conn.cursor()
cur.execute('select A, B, C from my_table')
result = cur.fetchall()
conn.close()
print(result) # list of tuples, each tuple is a row
#[(1, 4, 'x'), (2, 5, 'y'), (3, 6, 'z')]
def dfs_to_conn(conn_dfs, index=False):
    import sqlite3
    conn = sqlite3.connect(':memory:')
    for table_name, df in conn_dfs.items():
        df.to_sql(table_name, conn, if_exists='replace', index=index)
    return conn
The Goodreads database connection (grdbconn)

Some of the Goodreads data is stored in a SQLite3 database. The code cell below opens a read-only connection to it named grdbconn.
For now, don't worry about what's there. We will explain any tables you need in the exercises that use them.
# Goodreads database connection:
grdbconn = db.connect('file:resource/asnlib/publicdata/goodreads.db?mode=ro', uri=True)
Includes Exercise 0 (2 points) and Exercise 1 (1 point).
The Goodreads dataset includes user-book interactions. A "user-book interaction" means the user "did something" with the book on the Goodreads website:
These interactions are recorded in a SQL table called Interactions. Let's have a quick look for one of the users, whose integer ID is 840218:
pd.read_sql(r"SELECT * FROM Interactions WHERE user_id=840218", grdbconn)
Each row shows how this user interacted with some book. This user interacted with four books. However, they saved book 838000 (row 1) but did nothing else with it; that is, they did not read it, rate it, or review it. They did rate book 1012, giving it 5 stars, as well as books 38884 and 49559, giving both 4 stars. They did not review any book (is_reviewed=0). Had they done so, is_reviewed would be 1. All values are integers.
Exercise 0: summarize_interactions_str

You are asked to write a summary report of the overall interactions. Complete the function
def summarize_interactions_str(conn):
...
so that it does the following.
Inputs: The input is a SQLite3 database connection containing a table named Interactions with the fields user_id, book_id, is_read, rating, and is_reviewed, all containing integer values.
Your task: Calculate the following:
- the total number of interactions;
- the number of unique users;
- the number of unique books;
- the number of reads, i.e., rows where is_read equals 1;
- the number of ratings, i.e., rows where rating is greater than 0;
- the number of reviews, i.e., rows where is_reviewed equals 1.

Output: Generate and return a string that reports these results. The string should be formatted as follows:
There are 370,818 interactions.
- Unique users: 2,000
- Unique books: 138,633
- Number of reads: 208,701 (56.3% of all interactions)
- Number of ratings: 194,243 (52.4%)
- Number of reviews: 23,720 (6.4%)
In particular:
- Separate thousands with commas.
- Report each percentage to one decimal digit (keep a trailing 0, e.g., 37.0%).

Additional notes and hints: each interaction corresponds to a (user_id, book_id) pair.

### Demo: Recall Python's f-strings
print(f"`pi` to 2 decimal digits: `{3.14159265358979:0.2f}`")
print(f"Behold: `{1234567890:,}` -- neat!")
### Define demo inputs
demo_conn_ex0 = db.connect(f'file:resource/asnlib/publicdata/demo_ex0.db?mode=ro', uri=True)
print("First five rows of the demo database:")
pd.read_sql(r"SELECT * FROM Interactions LIMIT 5", demo_conn_ex0)
The demo included in the solution cell below should display the following output:
There are 12,345 interactions.
- Unique users: 1,766
- Unique books: 9,348
- Number of reads: 6,844 (55.4% of all interactions)
- Number of ratings: 6,389 (51.8%)
- Number of reviews: 744 (6.0%)
### Exercise 0 solution ###
def summarize_interactions_str(conn):
    # Use or adapt this template as you see fit:
    template = """There are {} interactions.
- Unique users: {}
- Unique books: {}
- Number of reads: {} ({}% of all interactions)
- Number of ratings: {} ({}%)
- Number of reviews: {} ({}%)"""
    ### BEGIN SOLUTION
    df = pd.read_sql("SELECT * FROM Interactions", conn)
    n_users = len(df['user_id'].unique())
    n_books = len(df['book_id'].unique())
    n_reads = df['is_read'].sum()
    n_rated = (df['rating'] > 0).sum()
    n_rev = df['is_reviewed'].sum()
    template = f"""There are {len(df):,} interactions.
- Unique users: {n_users:,}
- Unique books: {n_books:,}
- Number of reads: {n_reads:,} ({n_reads/len(df)*100:.1f}% of all interactions)
- Number of ratings: {n_rated:,} ({n_rated/len(df)*100:.1f}%)
- Number of reviews: {n_rev:,} ({n_rev/len(df)*100:.1f}%)"""
    return template
    ### END SOLUTION
### demo function call
print(summarize_interactions_str(demo_conn_ex0))
The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex0
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_0',
    'func': summarize_interactions_str,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'conn': {
            'dtype': 'db',  # data type of param.
            'check_modified': False,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'str',
            'check_dtype': True,
            'check_col_dtypes': True,   # Ignored if dtype is not df
            'check_col_order': True,    # Ignored if dtype is not df
            'check_row_order': True,    # Ignored if dtype is not df
            'check_column_type': True,  # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(10):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(10):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
RUN ME: A correct implementation of summarize_interactions_str, when run on the full Goodreads dataset, would produce the following report:
print(f"\n=== Report on the full dataset ===\n\n{load_text_from_file('ex0.txt')}")
Exercise 1: count_interactions_by

Suppose we want to group the interactions and count the number by group. For example, we might want to know, for each unique user ID, how many interactions there are. Complete the function
def count_interactions_by(col, conn):
...
so that it does the following.
Inputs:
- col: The name of a column.
- conn: A database connection containing a table named Interactions.

Your task: For each unique value in column col of the Interactions table, count how many interactions (rows) there are.

Output: Return a dataframe with two columns:
- col: A column with the same name as the given input column, holding the unique values.
- 'count': A column with the number of interactions for each unique value.

Refer to the demo cell below for an example of this output.

Additional notes and hints: You may assume that col holds a valid column name. The exact order of rows and columns in your output does not matter.
Example:
### Define demo inputs ###
demo_col_ex1 = 'user_id'
demo_conn_ex1 = db.connect(f'file:resource/asnlib/publicdata/demo_ex1.db?mode=ro', uri=True)
display(pd.read_sql("SELECT * FROM Interactions", demo_conn_ex1))
Calling count_interactions_by(demo_col_ex1, demo_conn_ex1) should produce the following output:

| user_id | count |
|---|---|
| 569241 | 3 |
| 604656 | 1 |
| 607817 | 4 |

However, calling count_interactions_by('is_read', demo_conn_ex1) would return a two-row DataFrame where the count of 0 and 1 values is 4 each.
### Exercise 1 solution
def count_interactions_by(col, conn):
    ### BEGIN SOLUTION
    query = f"SELECT {col}, COUNT(*) AS count FROM Interactions GROUP BY {col}"
    return pd.read_sql(query, conn)
    ### END SOLUTION
### demo function calls ###
display(count_interactions_by(demo_col_ex1, demo_conn_ex1))
display(count_interactions_by('is_read', demo_conn_ex1))
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex1
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_1',
    'func': count_interactions_by,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'col': {
            'dtype': 'str',  # data type of param.
            'check_modified': False,
        },
        'conn': {
            'dtype': 'db',
            'check_modified': False
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True,   # Ignored if dtype is not df
            'check_col_order': False,   # Ignored if dtype is not df
            'check_row_order': False,   # Ignored if dtype is not df
            # 'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
RUN ME: A correct implementation of count_interactions_by, when run on the full Goodreads dataset for the columns is_read, rating, and is_reviewed, would produce the following:

print(f"\n=== `count_interactions_by` on the full dataset ===\n")
for col_ in ['is_read', 'rating', 'is_reviewed']:
    display(load_df_from_file(f"ex1-{col_}.df"))
Aside (skip if pressed for time): From these results, you might observe a hint at a phenomenon known as a monotonic behavior chain: the total number of interactions > the number who read > the number who rate > the number who review. Such phenomena have been used to improve automatic generation of item recommendations.
Includes Exercise 2 (3 points).
Many types of real-world data have power law distributions. Roughly speaking, a probability density $f(x)$ is a power law if it behaves like $\dfrac{1}{x^d}$ for some constant $d$ when $x$ is "large," i.e., as one approaches the tail of the distribution. Let's see if there are any power laws in our data.
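A quick way to see why log-log plots appear later in this part: if $f(x) \approx C/x^d$, then $\log f(x) \approx \log C - d \log x$, a straight line with slope $-d$ on log-log axes. A minimal sketch with synthetic (not Goodreads) data:

import numpy as np
x = np.logspace(0, 4, 50)                        # 1 to 10,000
f = x ** (-2.0)                                  # a power law with d = 2
slope = np.polyfit(np.log(x), np.log(f), 1)[0]   # fit a line in log-log space
print(round(slope, 6))                           # -> -2.0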
For instance, suppose you have a pandas Series that shows, for each user, how many books they interacted with:
ux_counts = load_df_from_file(f"ex1-user_id.df").set_index('user_id')['count']
ux_counts
The index is a user ID and the value is an integer count of how many interaction-rows are associated with them.
Exercise 2: log2bin_count

Given a Series object holding values that can range from 1 to $n$, inclusive, we wish to count how many of those integers lie within the log-two bins $\left[ 2^i, 2^{i+1} \right)$ for $i = 0, 1, 2, \ldots$
Complete the function,
def log2bin_count(series):
...
to compute these counts.
Inputs: The input series is a pandas Series object holding the values.

Your tasks will involve, most likely, these steps: determine the bins, assign each value to its bin, count the values in each bin, and assemble the result into a DataFrame.
Outputs: Your function should return a DataFrame with two columns:
- bin_start: The value of the left edge of a bin; these are integers starting at 1, all of the form $2^i$.
- count: The number of values in series that lie in $\left[ 2^i, 2^{i+1} \right)$, also an integer.

See the demo below for an example.
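For intuition, the left edge of the bin containing a value $v$ is $2^{\lfloor \log_2 v \rfloor}$. A small sketch with made-up values (you do not have to compute the bins this way):

import numpy as np
values = np.array([1, 3, 49, 76, 78, 119, 300])
bin_start = 2 ** np.floor(np.log2(values)).astype(int)
print(bin_start)   # -> [  1   2  32  64  64  64 256]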
Additional notes and hints. You might find pandas.cut (pd.cut) useful, but you certainly do not have to use it.

Example/demo: Suppose the input Series looks like the following:
### Define demo inputs ###
demo_series_ex2 = load_df_from_file('demo_ex2.df').set_index('user_id')['count']
display(demo_series_ex2)
Then a correct solution would produce:
| bin_start | count |
|---|---|
| 64 | 3 |
| 256 | 2 |
| 128 | 2 |
| 32 | 2 |
| 16 | 1 |

There is just one input value in $[16, 32)$. But in the bin $[64, 128)$, there are three input values: 76, 78, and 119.
### Exercise 2 solution
def log2bin_count(series):
    ### BEGIN SOLUTION
    from pandas import cut
    bins = make_log2bins(series)
    bin_owners = cut(series, bins=bins, labels=bins[:-1], right=False).astype(int)
    counts = bin_owners.value_counts()
    df = counts.to_frame().reset_index()
    df = df.rename(columns={'index': 'bin_start', 0: 'count'})
    return df

def make_log2bins(series):
    from numpy import log2, ceil, arange
    pmax = int(ceil(log2(series.max()))) + 1
    return 2**arange(pmax)
### END SOLUTION
### demo function call ###
log2bin_count(demo_series_ex2)
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex2
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_2',
    'func': log2bin_count,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'series': {
            'dtype': 'series',
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True,  # Ignored if dtype is not df
            'check_col_order': False,  # Ignored if dtype is not df
            'check_row_order': False,  # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
RUN ME: A correct implementation of log2bin_count, when run on ux_counts from the full Goodreads dataset, would produce the following:
ux_counts_log2bins = load_df_from_file('ex2-log2bin_count.df')
ax = cse6040.utils.plot_series_loglog(ux_counts_log2bins.set_index('bin_start')['count'], base=2)
ax.set_xlabel('# book interactions (binned)')
ax.set_ylabel('fraction of users', rotation=0, horizontalalignment='right')
ax.set_title('Are sampled user-book interactions power-law-like?')
ax.set_aspect(1.0)
Aside (skip if pressed for time): These interactions do, indeed, appear to follow a power law in the tail of the distribution. In fact, the data in this notebook is a relatively small sample of the full dataset, which consists of hundreds of millions of interactions and has an even longer tail consistent with a power law.
Includes Exercise 3 (the freebie; 2 points).
Your next major task is to learn a little bit about NetworkX, a popular Python package for analyzing relational data (graphs with vertices and edges).
You do not have to write any code in this part; just read, learn, and enjoy! However, you do need to run Exercise 3's test cell and submit the exam to get its "free" points.
Also, do pay attention: subsequent exercises will use some of these concepts.
One way to use NetworkX is to create your graph as an edge list, which is a Python list of tuples. Each tuple (u, v, w) means that vertex u connects to vertex v, and that this edge from u to v has a weight w.
For example, here are the first five elements of an edge list for a randomly generated graph.
demo_edge_list = cse6040.utils.random_clusters(4, 6, rng_or_seed=1_234, verbose=True)
print(f"\nThe `demo_edge_list` has {len(demo_edge_list)} tuples. The first ten are:")
demo_edge_list[:10]
Let's use NetworkX to draw the graph that this edge list represents.
(We've written wrappers around the corresponding NetworkX routines to hide some unnecessary detail.)
from cse6040.utils import to_nx, graph_spy
demo_G = to_nx(demo_edge_list) # Convert edge list to NetworkX graph
graph_spy(demo_G, style='graph', with_labels=True, figsize=(5, 5)); # Draw a picture
In this picture, the numbered circles are vertices and arrows indicate edges that go from one vertex to another. The edges are weighted, so in the picture, edges with "large" or "heavy" weights are darker and thicker than "small" or "light" weights.
The picture shows clear structure: there appear to be four clusters of vertices connected by heavy edges, and between these clusters there are only light edges.
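The to_nx and graph_spy helpers above wrap standard NetworkX calls. For reference, here is a minimal sketch of how you might build such a weighted directed graph with NetworkX directly (a plain illustration, not necessarily what the wrappers do):

import networkx as nx
edge_list = [(0, 1, 5.0), (1, 2, 0.5), (2, 0, 3.0)]   # (u, v, w) tuples
G = nx.DiGraph()                                      # directed: edges go from u to v
G.add_weighted_edges_from(edge_list)                  # stores w as the 'weight' attribute
print(G.number_of_edges(), G[0][1]['weight'])         # -> 3 5.0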
To summarize, the NetworkX concepts you just saw are:
- Edge lists of tuples (u, v, w), where each tuple represents a weighted directed edge from u to v with weight w.

When you are ready, execute the cell below and submit to be sure your free points for Exercise 3 are recorded.
### test_cell_ex3 — a freebie ###
pass
print('Passed! Please submit.')
Includes Exercise 4 (1 point) and Exercise 5 (2 points).
User-user interactions. Our dataset tells us which users have viewed, read, rated, and/or reviewed the same books. We will use this fact to "connect" users to one another in a graph, and then use NetworkX to find clusters ("communities") of users.
Exercise 4: form_analysis_sample

If a user gave a book a rating of 4 stars or more, that is a strong signal of interest in the book. Let's start by focusing on those interactions. Complete the following function as specified.
def form_analysis_sample(conn):
...
Inputs: conn is a database connection to a database with an Interactions table, as used in Exercises 0 and 1.

Your task: Return the subset of interactions where the rating is 4 or more.

Outputs: Return the subset of rows of Interactions as a pandas DataFrame.

Example. Recall the demo Interactions table from Exercise 1:
### Define demo inputs ###
demo_conn_ex4 = db.connect(f'file:resource/asnlib/publicdata/demo_ex1.db?mode=ro', uri=True)
display(pd.read_sql("SELECT * FROM Interactions", demo_conn_ex4))
# use the naming convention `demo_<parameter name>_ex<exercise number>`
# for example if the function for exercise 3 has a parameter `df`, the demo variable should be named `demo_df_ex3`
A correct implementation of form_analysis_sample should return the following DataFrame:

| user_id | book_id | is_read | rating | is_reviewed |
|---|---|---|---|---|
| 569241 | 47199 | 1 | 5 | 1 |
| 569241 | 47383 | 1 | 5 | 1 |
| 604656 | 2345195 | 1 | 5 | 1 |

This output includes just the three rows where rating is 5.

Note: Although this example does not contain ratings with the value 4, if it did, they would be included in the output.
### Exercise 4 solution
def form_analysis_sample(conn):
    ### BEGIN SOLUTION
    return pd.read_sql("SELECT * FROM Interactions WHERE rating >= 4", conn)
    ### END SOLUTION
### demo function call ###
form_analysis_sample(demo_conn_ex4)
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex4
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_4',
    'func': form_analysis_sample,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'conn': {
            'dtype': 'db',  # data type of param.
            'check_modified': False,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True,  # Ignored if dtype is not df
            'check_col_order': False,  # Ignored if dtype is not df
            'check_row_order': False,  # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Exercise 5: connect_users

Given the analysis sample from Exercise 4, let's "connect" users. Let's say that two users a and b are connected if they both gave ratings of 4 or higher to the same book. The number of unique books they both rated this way is a measure of how strong their connection is.
Complete the following function to help identify these connections.
def connect_users(ubdf, threshold):
...
Inputs:
- ubdf: A "user-book" dataframe having these two columns: user_id and book_id. Each row indicates that a given user gave a given book a rating of 4 or higher.
- threshold: An integer threshold on connection strength.

Your tasks: Determine which pairs of users are connected. Count how many books connect them. Drop self-pairs (user_id_x == user_id_y), as well as any pairs with fewer than threshold connections.
Outputs: Return a new DataFrame with three columns:
- user_id_x: A user ID.
- user_id_y: Another user ID.
- count: The number of books they both rated in common. Recall that this value should be >= threshold.

Additional notes and hints.
- Drop self-pairs, i.e., rows where user_id_x == user_id_y.
- Connections are symmetric: if a pair (a, b) has a count k at or above the threshold, then both (a, b, k) and (b, a, k) should be rows in the output table.
- The output should be a DataFrame with the specified columns.

Aside: For really huge datasets (not what is included in this exam), dropping users with fewer than threshold ratings before looking for pairs might be a bit faster.
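As noted in the introduction, many pandas exercises also admit SQL solutions. The pair-counting here is a self-join on book_id; a hedged sketch, assuming you have loaded ubdf into an in-memory table named UB (a hypothetical table name), e.g., via dfs_to_conn:

# Sketch only: `UB` is a hypothetical table name for `ubdf`.
conn = dfs_to_conn({'UB': ubdf})
query = """
SELECT a.user_id AS user_id_x, b.user_id AS user_id_y, COUNT(*) AS count
FROM UB a
JOIN UB b ON a.book_id = b.book_id AND a.user_id <> b.user_id
GROUP BY a.user_id, b.user_id
HAVING COUNT(*) >= ?
"""
uudf_sql = pd.read_sql(query, conn, params=(threshold,))
conn.close()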
Example: Suppose the inputs are the DataFrame shown below with a target connection threshold of 2:
### Define demo inputs ###
demo_ubdf_ex5 = load_df_from_file("demo_ex5.df").sort_values(['book_id', 'user_id']).reset_index(drop=True)
demo_threshold_ex5 = 2
display(demo_ubdf_ex5)
For this input, connect_users should produce:

| user_id_x | user_id_y | count |
|---|---|---|
| 0 | 2 | 2 |
| 0 | 3 | 2 |
| 2 | 0 | 2 |
| 3 | 0 | 2 |

Users 0 and 2 both rated books 7 and 19, so they meet the threshold of having rated 2 books in common. User 1 did not rate any books in common with any other user, and so they do not appear in any pair of the output.
### Exercise 5 solution
def connect_users(ubdf, threshold):
    ### BEGIN SOLUTION
    uudf = ubdf.merge(ubdf, on='book_id') \
               .groupby(['user_id_x', 'user_id_y']) \
               .size() \
               .reset_index() \
               .rename(columns={0: 'count'})
    uudf = uudf[uudf['user_id_x'] != uudf['user_id_y']]
    uudf = uudf[uudf['count'] >= threshold]
    uudf = uudf.reset_index(drop=True)
    return uudf
    ### END SOLUTION
### END SOLUTION
### demo function call ###
connect_users(demo_ubdf_ex5, demo_threshold_ex5)
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex5
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_5',
    'func': connect_users,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'ubdf': {
            'dtype': 'df',  # data type of param.
            'check_modified': True
        },
        'threshold': {
            'dtype': 'int',
            'check_modified': False
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True,  # Ignored if dtype is not df
            'check_col_order': False,  # Ignored if dtype is not df
            'check_row_order': False,  # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
RUN ME: From a correct implementation of connect_users, one way we can "draw" the connectivity is to form a sparse matrix where nonzeros represent connections. Here is a picture of this matrix for the full dataset, using a threshold of 2:

uudf = load_df_from_file('ex5.df')  # user-user table
print("A sample of connections:")
display(uudf.head())
if False:  # Disabled due to a NetworkX version incompatibility issue (fix pending)
    uudf_G = cse6040.utils.to_nx(uudf.to_records(index=False))
    ax_ex5 = cse6040.utils.graph_spy(uudf_G, markersize=0.01)
    ax_ex5.set_title('Spy plot: user-user interactions')
    ax_ex5.set_xlabel('user id')
    ax_ex5.set_ylabel('user id', rotation=0, horizontalalignment='right');
else:
    cse6040.utils.display_image_from_file('resource/asnlib/public/demo-user-user-spy.png')
Aside (skip if pressed for time): The "grid-like" pattern you might see suggests that there are groups or clusters of interconnected users in the data. Our next task will try to identify them.
Includes Exercise 6 (2 points), Exercise 7 (1 point), and Exercise 8 (3 points).
The NetworkX package contains several algorithms for detecting communities, that is, clusters of "strongly interconnected" vertices in a graph (recall Part C).
We ran one of these algorithms on a graph formed from the user-user interactions you calculated in Part D. The algorithm grouped users (graph vertices) into clusters.
It returned these clusters as a list of sets, where each set is a "community" of user IDs grouped together. Since users were connected for liking the same books, it's possible users in the same community have similar tastes.
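We have not said which community-detection algorithm we ran, but NetworkX ships several. For example, here is a small sketch using greedy modularity maximization on a toy graph (not necessarily the algorithm used here):

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
# Toy graph: two triangles joined by a single edge
G_toy = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)])
comms_toy = greedy_modularity_communities(G_toy)
print([set(c) for c in comms_toy])   # e.g., [{0, 1, 2}, {3, 4, 5}]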
Here is the communities object that NetworkX produced for us:
communities = load_obj_from_file('demo_ex6.obj')
It is a list of sets:
type(communities), type(communities[0])
Here is how many communities there are:
len(communities)
The sizes of the 6 communities are:
[len(c) for c in communities]
Let's print the smaller two:
print("Community 1:", communities[1])
print("Community 4:", communities[4])
The values you see are user IDs.
Exercise 6: assign_communities

To merge this data with our existing database, we need to convert the Python communities data structure into a DataFrame. Complete the function below to aid in this task:
def assign_communities(communities):
...
Inputs: The input communities is a list of sets of integers, as in the previous example.

Your task: Convert this input into a dataframe.

Returns: Your function should return a DataFrame with these columns:
- user_id: A user ID (an integer).
- comm_id: The ID of the community it belongs to (also an integer).

The community ID is its index in communities. That is, community 0 is stored in communities[0], community 1 is in communities[1], and so on.
Example: Consider this set of communities:
### Define demo inputs ###
demo_communities_ex6 = [{1, 3, 10, 17}, {2, 6, 13, 15}, {0, 5, 11, 16}, {9, 14}, {4, 7, 8, 12}]
A correct implementation of assign_communities would produce this result:

| user_id | comm_id |
|---|---|
| 1 | 0 |
| 10 | 0 |
| 3 | 0 |
| 17 | 0 |
| 2 | 1 |
| 13 | 1 |
| 6 | 1 |
| 15 | 1 |
| 0 | 2 |
| 16 | 2 |
| 11 | 2 |
| 5 | 2 |
| 9 | 3 |
| 14 | 3 |
| 8 | 4 |
| 4 | 4 |
| 12 | 4 |
| 7 | 4 |
### Exercise 6 solution
def assign_communities(communities):
    ### BEGIN SOLUTION
    from pandas import DataFrame
    all_uids = []
    all_cids = []
    for cid, uids in enumerate(communities):
        all_uids += list(uids)
        all_cids += [cid] * len(uids)
    return DataFrame({'user_id': all_uids, 'comm_id': all_cids})
    ### END SOLUTION
### END SOLUTION
### demo function call ###
assign_communities(demo_communities_ex6)
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex6
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_6',
    'func': assign_communities,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'communities': {
            'dtype': 'list',  # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True,  # Ignored if dtype is not df
            'check_col_order': False,  # Ignored if dtype is not df
            'check_row_order': False,  # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Exercise 7: means_by_community

Suppose we wish to calculate means (averages) of the interaction data by community. Implement the function,
def means_by_community(intdf, comdf):
...
to perform this task.
Inputs:
- intdf: An interactions DataFrame with columns user_id, book_id, is_read, rating, and is_reviewed.
- comdf: A communities DataFrame with columns user_id and comm_id.

Your task: Join these DataFrames and then return a new DataFrame with the mean values of the is_read, rating, and is_reviewed columns by community.

Outputs: Your function should return a new DataFrame with these columns:
- comm_id: An integer community ID, one per row.
- is_read, rating, is_reviewed: The mean value of each column over all rows of intdf for all users of the community. These should be stored as float values.

Additional notes: A user ID might not appear in both inputs. Such users should not be part of any means calculation.
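To see why the default (inner) pandas merge handles this requirement, here is a tiny sketch with made-up rows; a user appearing in only one input drops out of the join:

left = pd.DataFrame({'user_id': [1, 2, 34369], 'rating': [5, 4, 3]})
right = pd.DataFrame({'user_id': [1, 2], 'comm_id': [0, 0]})
print(left.merge(right, on='user_id'))   # inner join by default: user 34369 is gone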
Example: Consider the following two inputs:
### Define demo inputs ###
demo_intdf_ex7 = load_table_from_db("Interactions", "demo_ex7.db").sort_values(by='user_id')
demo_comdf_ex7 = load_table_from_db("Communities", "demo_ex7.db").sort_values(by='user_id')
display(demo_intdf_ex7)
display(demo_comdf_ex7)
A correct implementation of means_by_community will return:

| comm_id | is_read | rating | is_reviewed |
|---|---|---|---|
| 0 | 1 | 4.5 | 0 |
| 3 | 1 | 5 | 0 |
| 5 | 1 | 5 | 0 |

Observe that user 34369 does not belong to any community. Therefore, none of the final averages should be affected by that user's data.
### Exercise 7 solution
def means_by_community(intdf, comdf):
    ### BEGIN SOLUTION
    VALUES = ['is_read', 'rating', 'is_reviewed']
    df = intdf.merge(comdf, on='user_id')
    df = df.groupby('comm_id')[VALUES].mean().reset_index()
    for c in VALUES:
        df[c] = df[c].astype(float)  # paranoia?
    return df
    ### END SOLUTION
### END SOLUTION
### demo function call ###
demo_result_ex7 = means_by_community(demo_intdf_ex7, demo_comdf_ex7)
display(demo_result_ex7)
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex7
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_7',
    'func': means_by_community,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'intdf': {
            'dtype': 'df',  # data type of param.
            'check_modified': True,
        },
        'comdf': {
            'dtype': 'df',
            'check_modified': True
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True,  # Ignored if dtype is not df
            'check_col_order': False,  # Ignored if dtype is not df
            'check_row_order': False,  # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
RUN ME: With a correct means_by_community, we can see whether the communities differ in how they read, rate, and review books. Here is what would happen if we ran it on the full dataset:
ex7_means = load_df_from_file('ex7-means.df')
print(f"Recall: community sizes: {[(k, len(c)) for k, c in enumerate(communities)]}")
ex7_means
Exercise 8: get_topreads_by_community

Suppose we merge the community information into the interactions database. Can we then identify which books each community is "the most interested in"? Complete the following function to help answer this question:
def get_topreads_by_community(xcdf, rank):
...
Inputs: There are two inputs:
- xcdf is a DataFrame with the following columns:
  - user_id: A user ID (integer)
  - book_id: A book ID (integer) that this user read
  - comm_id: The community ID to which the user belongs (integer)
- rank is an integer indicating how many of the top books we want to return. For instance, if rank=5, then we want results for just the top 5 books in each community.

Your task: For each community, calculate what percentage of its users read each book. That is, we would like to be able to see something like "in Community 2, 25% of the users read book 238." We then want to identify the top rank books.

There are several strategies for this exercise, but you might consider something along these lines: count the number of users in each community, count how many users in each community read each book, combine these counts into percentages, and then retain the top rank books in each community.

Outputs: Your function should return a new DataFrame with the following columns:
- comm_id: The community ID
- book_id: A book ID that was read in that community
- percent: The percentage of the community that read that book
- comm_size: The number of users in the community

As noted above, return at most the top rank books per community. In the event of ties, retain books with the lowest ID. (This choice is arbitrary but will simplify your implementation.)

Additional notes and hints: If your code calculates a fraction, don't forget to multiply by 100 to get a percentage value for your final output.
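One common pattern for "top rank rows per group" is to sort and then keep the first rows of each group. A sketch on a toy frame (made-up data):

toy = pd.DataFrame({'comm_id': [0, 0, 0, 1, 1],
                    'book_id': [10, 12, 11, 13, 10],
                    'percent': [7.9, 5.3, 5.3, 12.5, 9.4]})
top2 = (toy.sort_values(['comm_id', 'percent', 'book_id'],
                        ascending=[True, False, True])   # ties: keep lowest book_id first
           .groupby('comm_id')
           .head(2))                                     # first 2 rows per community
print(top2)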
Example: Consider this input dataframe and a target rank of 2:
### Define demo inputs ###
demo_xcdf_ex8 = load_df_from_file('demo_ex8.df').reset_index(drop=True)
demo_rank_ex8 = 2
demo_xcdf_ex8
From the demo input shown above, your function should return:
| comm_id | book_id | percent | comm_size |
|---|---|---|---|
| 0 | 821 | 7.89474 | 38 |
| 0 | 536 | 5.26316 | 38 |
| 3 | 938 | 12.5 | 32 |
| 3 | 943 | 9.375 | 32 |
| 5 | 1386 | 12 | 25 |
| 5 | 1473 | 12 | 25 |
### Exercise 8 solution
def get_topreads_by_community(xcdf, rank=5):
    ### BEGIN SOLUTION
    sizesdf = count_users_by_comm(xcdf, outcol='comm_size')
    readersdf = count_readers_by_comm(xcdf, outcol='num_read')
    df = readersdf.merge(sizesdf, on='comm_id')
    df['percent'] = df['num_read'] / df['comm_size'] * 100
    dftop = df.sort_values(['comm_id', 'percent', 'book_id'], ascending=[True, False, True]) \
              .reset_index(drop=True) \
              .groupby('comm_id').apply(lambda df: df.iloc[:rank]).reset_index(drop=True)
    return dftop[['comm_id', 'book_id', 'percent', 'comm_size']]

def count_users_by_comm(xcdf, **kwargs):
    """Count the number of users in each community."""
    cudf = xcdf[['comm_id', 'user_id']].drop_duplicates()
    return count_by_group(cudf, 'comm_id', **kwargs)

def count_readers_by_comm(xcdf, **kwargs):
    """Count the number of readers of each book in each community."""
    return count_by_group(xcdf, ['comm_id', 'book_id'], **kwargs)

def count_by_group(df, cols, outcol='count'):
    return df.groupby(cols).size().reset_index().rename(columns={0: outcol})
### END SOLUTION
### END SOLUTION
### demo function call ###
get_topreads_by_community(demo_xcdf_ex8, demo_rank_ex8)
RUN ME: If your function is working correctly, it would identify the following top books by community on the full dataset:
ex8_topreads = load_df_from_file('ex8-output.df')
ex8_topreads
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex8
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_8',
    'func': get_topreads_by_community,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'xcdf': {
            'dtype': 'df',  # data type of param.
            'check_modified': True,
        },
        'rank': {
            'dtype': 'int',
            'check_modified': False
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True,  # Ignored if dtype is not df
            'check_col_order': False,  # Ignored if dtype is not df
            'check_row_order': False,  # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Includes Exercises 9 and 10 (2 points each).
To interpret the communities, we need to bring in some book-inventory metadata, like book titles and genres. Once we've done so, will the communities make sense?
The original dataset includes information on genres for each book:
genres = pd.read_sql("SELECT * FROM Genres", grdbconn)
genres
It's a bit messy, however: the genre information is stored as a JSON-formatted Python string encoding a genre vector:
# Inspect the very first genre entry:
print(f"* Type: `{type(genres['genres'].iloc[0])}`")
print(f"* Value: '{genres['genres'].iloc[0]}'")
This genre vector says that this particular book mixes three genres: fiction, romance, and "mystery, thriller, crime" (considered a single genre). Each value measures the relevance of that genre to the book.
Roughly speaking, let's interpret 555 as meaning this book is 555 / (555+23+10) ~ 94.4% "fiction" and 23 / (555+23+10) ~ 3.9% "romance."
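A quick check of that arithmetic in Python:

vec = {"fiction": 555, "romance": 23, "mystery, thriller, crime": 10}
total = sum(vec.values())   # 588
print({k: round(v / total * 100, 1) for k, v in vec.items()})
# -> {'fiction': 94.4, 'romance': 3.9, 'mystery, thriller, crime': 1.7}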
The database stores these genre vectors as strings. However, we can easily convert them to Python dictionaries using the following helper function, from_json_str:
def from_json_str(s):
    """Parses the JSON string `s` and returns a Python object."""
    from json import loads
    return loads(s)

# Demo #
print("iloc 0:", from_json_str(genres['genres'].iloc[0]))
print("iloc 1:", from_json_str(genres['genres'].iloc[1]))
We will treat these as (mathematical) vectors that we can "add." Here is a simple function to compute the sum of two genre vectors:
def add_genre_vecs(x, y):
    """Returns the sum of two genre vectors."""
    from collections import defaultdict
    z = defaultdict(int)
    for k, v in x.items():
        z[k] += v
    for k, v in y.items():
        z[k] += v
    return dict(z)  # Converts into a regular Python dict
# Demo: start with two genre vectors, converted to `dict`:
demo_genre_vec_a = from_json_str(genres['genres'].iloc[0])
demo_genre_vec_b = from_json_str(genres['genres'].iloc[1])
# Add them:
add_genre_vecs(demo_genre_vec_a, demo_genre_vec_b)
Exercise 9: merge_genre_vecs

Suppose you are given a pandas Series whose values are JSON strings encoding individual genre vectors. Complete the function,
def merge_genre_vecs(series):
...
so that it combines the genre vectors into a single, normalized genre vector.
Inputs: The input is a Series object containing Python strings. Each string is a JSON-formatted genre vector.

Your task:
- Convert each JSON string into a Python dictionary, e.g., using from_json_str from above.
- Combine the dictionaries by summing their values, e.g., using add_genre_vecs from above.

The result of the previous two steps is a single dictionary. The final step is to normalize this result. That is, divide each value of the result by the sum of all the values.

Outputs: Your function should return the normalized genre vector as a Python dictionary.
Example: Consider the following example input, a Series of JSON strings:
### Define demo inputs ###
demo_series_ex9 = pd.Series(
['{"fiction": 555, "romance": 23, "mystery, thriller, crime": 10}',
'{"non-fiction": 534, "history, historical fiction, biography": 178, "fiction": 16, "comics, graphic": 6}',
'{"non-fiction": 163}',
'{"fiction": 425, "history, historical fiction, biography": 330, "young-adult": 93, "children": 190}',
'{"fantasy, paranormal": 1}'])
print(demo_series_ex9)
A correct merge_genre_vecs implementation should return the dictionary,
{'fiction': 0.39461172741679873,
'romance': 0.009112519809825673,
'mystery, thriller, crime': 0.003961965134706815,
'non-fiction': 0.27614896988906495,
'history, historical fiction, biography': 0.20126782884310618,
'comics, graphic': 0.002377179080824089,
'young-adult': 0.036846275752773376,
'children': 0.07527733755942947,
'fantasy, paranormal': 0.0003961965134706815}
### Exercise 9 solution
def merge_genre_vecs(series):
    ### BEGIN SOLUTION
    vec_list = series.apply(from_json_str).to_list()
    vec = merge_dicts(vec_list)
    norm_vec = normalize_dict(vec)
    return norm_vec

def merge_dicts(list_of_dicts):
    from collections import defaultdict
    merged = defaultdict(int)
    for d in list_of_dicts:
        for k, v in d.items():
            merged[k] += v
    return merged

def normalize_dict(d):
    total = sum(d.values())
    return {k: v/total for k, v in d.items()}
### END SOLUTION
### demo function call ###
merge_genre_vecs(demo_series_ex9)
The cell below will test your solution for Exercise 9. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex9
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_9',
    'func': merge_genre_vecs,  # replace this with the function defined above
    'inputs': {  # input config dict. keys are parameter names
        'series': {
            'dtype': 'series',  # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'dict',
            'check_dtype': True,
            'check_col_dtypes': True,  # Ignored if dtype is not df
            'check_col_order': True,   # Ignored if dtype is not df
            'check_row_order': True,   # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Exercise 10: combine_all_data

The final step in our analysis is to combine several pieces of information into a final DataFrame. In particular, we'd like to take the "top reads" results from Exercise 8 and add in (a) book titles and (b) book genres. Complete the function so that it carries out this task.
def combine_all_data(topdf, book2inv, invdf, genresdf):
...
Inputs: The inputs consist of four DataFrame objects.
- topdf: A dataframe of the top reads by community (e.g., from Ex. 8). Its columns are:
  - 'comm_id': An integer community ID
  - 'book_id': An integer book ID
  - 'comm_size': The number of users in the community
  - 'percent': The percentage of community users that read the given book
- book2inv: A dataframe to convert book IDs into "inventory IDs." It has two columns:
  - 'book_id': An integer book ID
  - 'inv_id': An inventory ID, which can be used to link the book to its title and genre
- invdf: An inventory of books. Its columns include:
  - 'inv_id': An integer inventory ID
  - 'title': The book's title, a string
  - 'description': A brief description of the book
- genresdf: Genre vectors, encoded as JSON strings. Its columns are:
  - 'inv_id': The integer inventory ID
  - 'genres': The genre vector (as a JSON string)

Your task: Merge all of this data into a single DataFrame. Perform a series of left-merges (the pandas equivalent of left-joins), starting with topdf and using either book_id or inv_id to link the dataframes. By doing left-joins, you preserve all of the rows of topdf.
Outputs: Your function should return the merged DataFrame. It will have only the columns listed above: 'comm_id', 'book_id', 'comm_size', 'percent', 'inv_id', 'title', 'description', 'genres'.
Additional notes: You do not need to convert any of the fields; you just need to arrange the merges correctly.
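To see why the how='left' setting matters, here is a minimal sketch using hypothetical toy dataframes (an illustration of left-merge semantics, not the exercise solution or its data): the left-merge keeps every row of the left table, filling unmatched keys with NaN.
import pandas as pd

toy_left = pd.DataFrame({'book_id': [1, 2, 3], 'percent': [50.0, 25.0, 10.0]})
toy_right = pd.DataFrame({'book_id': [1, 3], 'title': ['A', 'B']})

# `how='left'` preserves all three rows of `toy_left`; book_id 2 has no
# match in `toy_right`, so its `title` comes back as NaN instead of the
# row being dropped (as would happen with the default inner join).
print(toy_left.merge(toy_right, on='book_id', how='left'))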
Example: The following cell loads some demo inputs that you can use for testing and debugging. (Because there are several of these, we have refrained from printing them. However, you can use the next cell to write code to explore them.)
### Define demo inputs ###
demo_topdf_ex10 = load_df_from_file("demo_ex10-topdf.df")
demo_book2inv_ex10 = load_df_from_file("demo_ex10-book2inv.df")
demo_invdf_ex10 = load_df_from_file("demo_ex10-invdf.df")
demo_genresdf_ex10 = load_df_from_file("demo_ex10-genresdf.df")
# Use this cell to `display`, `print`, or otherwise explore those demo inputs
A correctly functioning combine_all_data will produce the following output on the demo inputs:
| comm_id | book_id | comm_size | percent | inv_id | title | description | genres |
|---|---|---|---|---|---|---|---|
| 0 | 821 | 868 | 22.5806 | 5470 | 1984 | The year 1... | {"fiction": 25686, "fantasy, paranormal": 1776, "young-adult": 233} |
| 0 | 943 | 868 | 21.4286 | 3 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) | Harry Pott... | {"fantasy, paranormal": 54156, "young-adult": 17058, "fiction": 15016, "children": 11213, "mystery, thriller, crime": 668} |
| 2 | 49734 | 36 | 22.2222 | 6604887 | أنت لي | | {"romance": 31, "fiction": 9} |
| 2 | 23164 | 36 | 19.4444 | 7704143 | تراب الماس | "llmr@ lth... | {"fiction": 27, "mystery, thriller, crime": 32} |
| 3 | 943 | 340 | 77.0588 | 3 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) | Harry Pott... | {"fantasy, paranormal": 54156, "young-adult": 17058, "fiction": 15016, "children": 11213, "mystery, thriller, crime": 668} |
| 3 | 941 | 340 | 74.1176 | 5 | Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) | Harry Pott... | {"fiction": 12103, "children": 8558, "fantasy, paranormal": 4639, "young-adult": 1513, "mystery, thriller, crime": 537} |
| 4 | 139433 | 6 | 50 | 148849 | شازده کوچولو | shzdh khwc... | {"fiction": 5481, "fantasy, paranormal": 3847, "children": 8886, "young-adult": 1127} |
### Exercise 10 solution
def combine_all_data(topdf, book2inv, invdf, genresdf):
### BEGIN SOLUTION
return topdf[['comm_id', 'book_id', 'comm_size', 'percent']] \
.merge(book2inv[['book_id', 'inv_id']], on='book_id', how='left') \
.merge(invdf[['inv_id', 'title', 'description']], on='inv_id', how='left') \
.merge(genresdf, on='inv_id', how='left') \
.sort_values(['comm_id', 'percent'], ascending=[True, False])
### END SOLUTION
### demo function call ###
combine_all_data(demo_topdf_ex10, demo_book2inv_ex10, demo_invdf_ex10, demo_genresdf_ex10)
The cell below will test your solution for Exercise 10. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars: Input variables for your solution.
- original_input_vars: Copy of the input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars: Outputs returned by your solution.
- true_output_vars: The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.
### test_cell_ex10
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file': 'tc_10',
'func': combine_all_data, # replace this with the function defined above
'inputs': { # input config dict. keys are parameter names
'topdf': {'dtype': 'df', 'check_modified': True},
'book2inv': {'dtype': 'df', 'check_modified': True},
'invdf': {'dtype': 'df', 'check_modified': True},
'genresdf': {'dtype': 'df', 'check_modified': True},
},
'outputs': {
'output_0': {
'index': 0,
'dtype': 'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'KcDYbXW2PCmBcMV5nAQgPNxlLOehBn-VPhzXUPm_8Yc=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
RUN ME: If combine_all_data is working and applied to the full Goodreads dataset, here are the results:
ex10_final = load_df_from_file('ex10-final.df')
ex10_final_groups = ex10_final.groupby('comm_id')
for comm_id in ex10_final_groups.groups.keys():
display(ex10_final_groups.get_group(comm_id))
Scan the titles, descriptions, and genres. Do the community labels appear to identify distinct communities?
If you have made it this far, that's it — congratulations on completing the exam. Don't forget to submit!
# Close the database connection, if one is still open
try:
    grdbconn.close()
except Exception:
    print("The Goodreads database connection may already be closed.")
Postscript. Had you gotten everything right, we could have performed one final analysis on the preceding result. Suppose you compute the normalized genre vector for each community and then plot its components, as shown below.
Darker bars correspond to more highly weighted components. You can see that the community genre-vectors are distinct from one another, albeit with some (expected) overlap. That is at least some additional evidence that this initial grouping may be meaningful for helping users find similar users and appropriate book recommendations. Keep in mind that the analysis in this notebook operated on only a small fraction of the complete dataset; with more data, even more distinct communities might emerge.
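In case you are curious, here is one sketch of how that analysis could be reproduced. This is an illustration only, not part of the graded exam: it assumes the ex10_final dataframe loaded above (with 'comm_id' and JSON-encoded 'genres' columns) and that matplotlib is available; the helper name community_genre_matrix is our own.
import json
import pandas as pd
import matplotlib.pyplot as plt

def community_genre_matrix(df):
    # Sum the raw genre counts within each community...
    rows = {}
    for comm_id, group in df.groupby('comm_id'):
        totals = {}
        for g in group['genres'].dropna():
            for genre, count in json.loads(g).items():
                totals[genre] = totals.get(genre, 0) + count
        rows[comm_id] = totals
    mat = pd.DataFrame.from_dict(rows, orient='index').fillna(0.0)
    # ...then L1-normalize each row so communities of different
    # sizes are comparable (each row sums to 1).
    return mat.div(mat.sum(axis=1), axis=0)

genre_mat = community_genre_matrix(ex10_final)
plt.imshow(genre_mat.values, cmap='Greys', aspect='auto')  # darker = heavier weight
plt.xticks(range(genre_mat.shape[1]), genre_mat.columns, rotation=90)
plt.yticks(range(genre_mat.shape[0]), genre_mat.index)
plt.xlabel('genre')
plt.ylabel('community')
plt.tight_layout()
plt.show()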
Want to explore this dataset on your own? Refer to the Goodreads Dataset. It was originally collected in 2017 by researchers at the University of California, San Diego. It is quite extensive, and what we did in this exam barely scratches the surface of what is possible!