Version 1.0.1
Change History
This problem builds on your knowledge of Pandas, base Python data structures, and using new tools. (Some exercises require you to use very basic features of the networkx package, which is well documented.) It has 9 exercises, numbered 0 to 8. There are 17 available points. However, the threshold to earn 100% is 14 points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this threshold.)
Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
The point values of individual exercises are as follows:
Before we can do any analysis, we have to read the data from the file it is stored in. We have defined load_data and are using it to read from the data file.
### BEGIN HIDDEN TESTS
if True:
    import dill
    import hashlib
    def hash_check(f1, f2, verbose=True):
        with open(f1, 'rb') as f:
            h1 = hashlib.md5(f.read()).hexdigest()
        with open(f2, 'rb') as f:
            h2 = hashlib.md5(f.read()).hexdigest()
        if verbose:
            print(h1)
            print(h2)
        assert h1 == h2, f'The file "{f1}" has been modified'
    with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
        dill.dump(hash_check, f)
    del hash_check
    with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
        hash_check = dill.load(f)
    for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
        hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
    del hash_check
### END HIDDEN TESTS
def load_data(path):
    import pandas as pd
    return pd.read_csv(path, names=['film_id', 'film_name', 'actor', 'year'], skiprows=1)
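As a rough illustration of what the names and skiprows arguments do in load_data, the sketch below reads a small in-memory CSV. The sample rows and the header names in csv_text are made up for this example; they are not taken from the actual data file.

```python
import io
import pandas as pd

# Hypothetical CSV text; the real file's header names and rows may differ.
csv_text = (
    "id,title,star,released\n"
    "1,Film A,Actor One,2011\n"
    "1,Film A,Actor Two,2011\n"
    "2,Film B,Actor One,2012\n"
)

# skiprows=1 discards the file's own header line; names= supplies ours instead.
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    names=['film_id', 'film_name', 'actor', 'year'],
    skiprows=1,
)
print(sample_df.columns.tolist())
print(sample_df.shape)
```

Without skiprows=1, the original header line would be read as a data row, because names= tells pandas that the file has no header of its own.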
The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex0
from tester_fw.testers import Tester_ex0
tester = Tester_ex0()
for _ in range(20):
    try:
        tester.run_test(load_data)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex0(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(load_data)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Next we need to explore our data. Complete the function explore_data to return a tuple, t, with the following elements.
t[0] - tuple - the shape of df
t[1] - pd.DataFrame - the first five rows of df
t[2] - dict - mapping year (int) to the number of films released that year (int)
The input df is a pd.DataFrame with the following columns:
'film_id' - unique integer associated with a film
'film_name' - the name of a film
'actor' - the name of an actor who starred in the film
'year' - the year in which the film was released
Each row in df indicates an instance of an actor starring in a film, so it is possible that there will be multiple rows with the same 'film_name' and 'film_id'.
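Because each row is one actor in one film, counting rows per year would overcount films. The sketch below, on a made-up DataFrame (toy_df and its contents are hypothetical), shows one way to count distinct film_id values per year instead.

```python
import pandas as pd

# Hypothetical toy data: film 1 has two actors, so it occupies two rows.
toy_df = pd.DataFrame({
    'film_id':   [1, 1, 2, 3],
    'film_name': ['Film A', 'Film A', 'Film B', 'Film C'],
    'actor':     ['Actor One', 'Actor Two', 'Actor One', 'Actor Three'],
    'year':      [2011, 2011, 2011, 2012],
})

rows_per_year = toy_df.groupby('year').size().to_dict()
films_per_year = toy_df.groupby('year')['film_id'].nunique().to_dict()

print(rows_per_year)   # counts actor appearances, not films
print(films_per_year)  # counts distinct films
```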
def explore_data(df):
    ### BEGIN SOLUTION
    # Count distinct films per year; several rows (one per actor) can share a film_id.
    films_per_year = df.groupby('year')['film_id'].nunique().to_dict()
    return df.shape, df.head(5), films_per_year
    ### END SOLUTION
The demo cell below should display the following output:
((15, 4),
film_id film_name actor \
8277 1599 Before I Fall Medalion Rahimi
6730 1150 A Million Ways to Die in the West Seth MacFarlane
5770 934 The Mortal Instruments: City of Bones Jamie Campbell Bower
10007 1883 Avengers: Infinity War Chris Pratt
9831 1855 Isle of Dogs Bob Balaban
year
8277 2017
6730 2014
5770 2013
10007 2018
9831 2018 ,
{2011: 2, 2012: 1, 2013: 2, 2014: 1, 2016: 1, 2017: 3, 2018: 4, 2019: 1})
### define demo inputs
import pickle
with open('resource/asnlib/publicdata/movie_data.pkl', 'rb') as f:
    movie_data = pickle.load(f)
demo_df_ex1 = movie_data.sample(15, random_state=6040)
### call demo function
explore_data(demo_df_ex1)
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex1
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex1
tester = Tester_ex1()
for _ in range(20):
    try:
        tester.run_test(explore_data)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex1(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(explore_data)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
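We will continue our exploration by identifying prolific actors. Complete the function top_10_actors to accomplish the following:
Return a pd.DataFrame with two columns, 'actor' (string) and 'count' (int), indicating the actor's name and the number of films they have appeared in.
Keep the 10 actors with the highest 'count'. If other actors tie with the lowest included 'count', include them as well.
Sort the result by 'count' in descending order, break ties by 'actor' in ascending order, and reset the index.
The input df will be as described in exercise 1.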
def top_10_actors(df):
    ### BEGIN SOLUTION
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.copy()
    df['actor'] = df['actor'].astype('string')
    return df.groupby('actor').size()\
        .reset_index(name='count')\
        .sort_values(['count', 'actor'], ascending=[False, True])\
        .nlargest(n=10, columns='count', keep='all')\
        .reset_index(drop=True)
    ### END SOLUTION
The demo cell below should display the following output:
actor count
0 Chloë Grace Moretz 8
1 Anna Kendrick 7
2 Jennifer Lawrence 7
3 Kevin Hart 7
4 Kristen Wiig 7
5 Melissa Leo 7
6 Melissa McCarthy 7
7 Ryan Reynolds 7
8 Bill Hader 6
9 Bryan Cranston 6
10 Christina Hendricks 6
11 Dan Stevens 6
12 Danny Glover 6
13 Idris Elba 6
14 James McAvoy 6
15 Maya Rudolph 6
16 Morgan Freeman 6
17 Nicolas Cage 6
18 Rose Byrne 6
19 Sylvester Stallone 6
Notice how all of the actors appearing in 6 or more movies are included.
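The tie handling noted above matches what DataFrame.nlargest does with keep='all'. A minimal sketch on made-up counts (the names and numbers are hypothetical), asking for the top 2:

```python
import pandas as pd

# Hypothetical counts: B and C tie for second place.
counts = pd.DataFrame({
    'actor': ['A', 'B', 'C', 'D'],
    'count': [5, 3, 3, 1],
})

top_first = counts.nlargest(n=2, columns='count')            # default keep='first'
top_all = counts.nlargest(n=2, columns='count', keep='all')  # keeps every tied row

print(len(top_first))
print(sorted(top_all['actor']))
```

With the default, exactly two rows come back and one of the tied actors is dropped; with keep='all', both tied actors are retained, so the result can have more than n rows.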
### define demo inputs
import pickle
with open('resource/asnlib/publicdata/movie_data.pkl', 'rb') as f:
    movie_data = pickle.load(f)
demo_df_ex2 = movie_data.sample(3000, random_state=6040)
### call demo function
print(top_10_actors(demo_df_ex2))
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex2
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex2
tester = Tester_ex2()
for _ in range(50):
    try:
        tester.run_test(top_10_actors)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex2(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(top_10_actors)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We will continue our exploration with a look at which years an actor has appeared in movies. Complete the function actor_years to determine which years the given actor has appeared in movies based on the data in df. Your output should meet the following requirements:
Return a dict mapping the actor's name to a list of integers (int) containing the years in which this actor appeared in films.
The years should be sorted in ascending order, and each year should appear only once.
The input df is a pd.DataFrame of the same form described in exercise 1.
def actor_years(df, actor):
    ### BEGIN SOLUTION
    # Select this actor's rows, then keep each year once, in ascending order.
    return {actor: df.loc[df['actor'] == actor, 'year'].sort_values().unique().tolist()}
    ### END SOLUTION
The demo cell below should display the following output:
{'James Franco': [2012, 2013]}
### define demo inputs
import pickle
with open('resource/asnlib/publicdata/movie_data.pkl', 'rb') as f:
    movie_data = pickle.load(f)
demo_df_ex3 = movie_data.sample(3000, random_state=6040)
### call demo function
actor_years(demo_df_ex3, 'James Franco')
The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex3
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex3
tester = Tester_ex3()
for _ in range(20):
    try:
        tester.run_test(actor_years)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex3(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(actor_years)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
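For our last exercise in exploration, we want to see some summary statistics on how many actors participated in a movie. Complete the function movie_size_by_year to accomplish the following:
Determine the size of each film, i.e. the number of actors appearing in it.
For each year, compute the minimum, maximum, and mean film size. All values should be of type int.
Return the results as a nested dictionary of the form {year: {'min': minimum size, 'max': maximum size, 'mean': mean size (rounded to the nearest integer)}}
### Define movie_size_by_year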
def movie_size_by_year(df):
    ### BEGIN SOLUTION
    import pandas as pd
    def summarize_year(group):
        # Size of a film = number of rows (actors) sharing its film_id.
        counts = group.groupby('film_id').size()
        return pd.Series({
            'min': counts.min(),
            'max': counts.max(),
            'mean': round(counts.mean())
        })
    return df.groupby('year')\
        .apply(summarize_year)\
        .to_dict('index')
    ### END SOLUTION
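The two-level grouping can be seen on a small made-up table (toy_df is hypothetical): first count rows per film to get each film's size, then summarize those sizes per year. This sketch uses agg rather than the apply-based approach above, so treat it as an alternative illustration rather than the reference method.

```python
import pandas as pd

# Hypothetical toy data: one row per (film, actor) pair.
toy_df = pd.DataFrame({
    'film_id': [1, 1, 1, 2, 3, 3],
    'actor':   ['A', 'B', 'C', 'A', 'B', 'D'],
    'year':    [2011, 2011, 2011, 2011, 2012, 2012],
})

# Step 1: film size = number of rows sharing a (year, film_id) pair.
film_sizes = toy_df.groupby(['year', 'film_id']).size()

# Step 2: summarize the sizes within each year.
summary = film_sizes.groupby(level='year').agg(['min', 'max', 'mean'])
summary['mean'] = summary['mean'].round().astype(int)

result = summary.to_dict('index')
print(result)
```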
The demo cell below should display the following output:
{2010: {'min': 1, 'max': 8, 'mean': 2},
2011: {'min': 1, 'max': 7, 'mean': 2},
2012: {'min': 1, 'max': 8, 'mean': 2},
2013: {'min': 1, 'max': 13, 'mean': 2},
2014: {'min': 1, 'max': 4, 'mean': 1},
2015: {'min': 1, 'max': 4, 'mean': 1},
2016: {'min': 1, 'max': 2, 'mean': 1},
2017: {'min': 1, 'max': 6, 'mean': 2},
2018: {'min': 1, 'max': 6, 'mean': 2},
2019: {'min': 1, 'max': 6, 'mean': 2}}
### define demo inputs
import pickle
with open('resource/asnlib/publicdata/movie_data.pkl', 'rb') as f:
    movie_data = pickle.load(f)
demo_df_ex4 = movie_data.sample(3000, random_state=6040)
movie_size_by_year(demo_df_ex4)
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex4
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex4
tester = Tester_ex4()
for _ in range(20):
    try:
        tester.run_test(movie_size_by_year)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex4(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(movie_size_by_year)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We ultimately want to do some network analytics using this data. Our first task to that end is to define our data in terms of a network. Here are the particulars of what we want in the network.
Each node is an actor, and an edge connects two actors who were costars, that is, who appeared in the same film.
Complete the function make_network_dict to process the data from df into this graph structure. The graph should be returned in a nested "dictionary of sets" structure.
For example, {'Alice': {'Bob', 'Alice', 'Charlie'}, 'Bob': {'Alice', 'Bob', 'Charlie'}, 'Charlie': {'Alice', 'Bob', 'Charlie'}} indicates that there is an edge between Alice and Bob, an edge between Bob and Charlie, and an edge between Alice and Charlie. Instead of storing all the redundant information, we would store just {'Alice': {'Bob', 'Charlie'}, 'Bob': {'Charlie'}}.
Consider using .merge to determine all pairs of costars. Once you have that, you can worry about taking out the redundant information.
def make_network_dict(df):
    ### BEGIN SOLUTION
    from collections import defaultdict
    d = defaultdict(set)
    # Self-merge on film_id pairs every actor with every actor in the same film;
    # keeping only actor_x < actor_y drops self-pairs and mirrored duplicates.
    actor_pairs = df[['film_id', 'actor']]\
        .merge(df[['film_id', 'actor']], how='inner', on='film_id')\
        .query('actor_x < actor_y')\
        .drop(columns='film_id')
    for actor_x, actor_y in actor_pairs.itertuples(index=False):
        d[actor_x].add(actor_y)
    return dict(d)
    ### END SOLUTION
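To make the "store each edge once" idea concrete, here is a small pure-Python sketch, independent of pandas. The helper name dedupe_edges is made up for this example; it keeps an edge only under the alphabetically smaller name, mirroring the actor_x < actor_y filter in the solution above.

```python
from collections import defaultdict

def dedupe_edges(full):
    """Collapse a symmetric dict of sets so each edge is stored once.

    `full` maps each name to the set of names it is connected to
    (possibly including itself and mirrored entries).
    """
    compact = defaultdict(set)
    for a, neighbors in full.items():
        for b in neighbors:
            if a < b:  # skips self-pairs (a == b) and mirrored pairs (a > b)
                compact[a].add(b)
    return dict(compact)

full = {
    'Alice': {'Bob', 'Alice', 'Charlie'},
    'Bob': {'Alice', 'Bob', 'Charlie'},
    'Charlie': {'Alice', 'Bob', 'Charlie'},
}
print(dedupe_edges(full))
```

Note that 'Charlie' does not appear as a key in the result, since every edge involving Charlie is already recorded under an alphabetically earlier name.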
The demo cell below should display the following output:
{'Kian Lawley': {'Medalion Rahimi'},
'Maria Dizzia': {'Wendell Pierce'},
'Chosen Jacobs': {'Sophia Lillis'},
'David Ogden Stiers': {'Jesse Corti'},
'Jason Clarke': {'Kate Mara'},
'Reese Witherspoon': {'Sarah Paulson'},
'Olivia Munn': {'Zach Woods'},
'Faye Dunaway': {'Lucien Laviscount'},
'Alec Baldwin': {'Rebecca Ferguson'},
'Pierce Brosnan': {'Steve Coogan'},
'Dakota Johnson': {'Rhys Ifans'},
'Bokeem Woodbine': {'Flea'},
'Nicolas Cage': {'Robert Sheehan'},
'Bruce Dern': {'Kerry Washington'},
'Richard Jenkins': {'Sam Shepard'},
'Jessica Madsen': {'Vanessa Grasse'},
'Jason White': {'Kristen Wiig'},
'Robert Davi': {'Stephen Dorff'},
'Maggie Gyllenhaal': {'Marianne Jean-Baptiste'},
'Katherine Langford': {'Keiynan Lonsdale'},
"Denis O'Hare": {'Judi Dench'},
'Katherine Heigl': {'Michelle Pfeiffer', 'Simon Kassianides'},
'Craig Robinson': {'Emma Watson'},
'Colton Dunn': {'Nichole Bloom'},
'Daniel Sunjata': {'Jennifer Carpenter'},
'Aly Michalka': {'Cheri Oteri'},
'John Lithgow': {'Mark Duplass'},
'Ewan McGregor': {'Julianne Nicholson'},
'Chris Pine': {'Kathryn Hahn'},
'David Warner': {'Jonathan Hyde'}}
### define demo inputs
import pickle
with open('resource/asnlib/publicdata/movie_data.pkl', 'rb') as f:
    movie_data = pickle.load(f)
demo_df_ex5 = movie_data.sample(300, random_state=6040)
### call demo function
make_network_dict(demo_df_ex5)
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex5
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex5
tester = Tester_ex5()
for _ in range(20):
    try:
        tester.run_test(make_network_dict)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex5(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(make_network_dict)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Now that we have our dictionary which maps actor names to a set of that actor's costars, we are going to use the networkx package to perform some graph analysis. The networkx framework is based on the Graph object - a Graph holds data about the graph structure, which is made of nodes and edges among other attributes. Your task for this exercise will be to add edges to a networkx.Graph object based on a dict of sets.
Complete the function to_nx(dos). Your solution should iterate through the parameter dos, a dict which maps actors to a set of their costars. For each costar pair implied by the input, add an edge to the Graph object, g. We have provided some "wrapper" code to take care of constructing a Graph object, g, and returning it. All you have to do is add edges to it.
Note: Check the networkx documentation to find how to add edges to a graph. Part of what this exercise is evaluating is your ability to find, read, and understand information on new packages well enough to get started performing its basic tasks. The information is easy to find and straightforward in this case.
import networkx as nx
def to_nx(dos):
    g = nx.Graph()
    ### BEGIN SOLUTION
    for actor_x, costars in dos.items():
        for actor_y in costars:
            g.add_edge(actor_x, actor_y)
    ### END SOLUTION
    return g
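A short sketch of the Graph behavior this exercise relies on, using made-up node names: adding an edge creates any missing nodes, and because nx.Graph is undirected, adding the same pair in the opposite order does not create a second edge.

```python
import networkx as nx

g = nx.Graph()
g.add_edge('Alice', 'Bob')      # creates both nodes and one edge
g.add_edge('Bob', 'Alice')      # same undirected edge, so nothing new is added
g.add_edge('Alice', 'Charlie')  # creates the 'Charlie' node and a second edge

print(g.number_of_nodes())
print(g.number_of_edges())
print(g.has_edge('Bob', 'Alice'))
```

This is why a "dictionary of sets" that stores each pair only once is enough to rebuild the full undirected graph.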
The demo cell below should display the following output:
{('Aaron Eckhart', 'Bill Nighy'),
('Aaron Eckhart', 'Cory Hardrict'),
('Aaron Eckhart', 'Nicole Kidman'),
('Aaron Eckhart', 'Ramón Rodríguez'),
('Akie Kotabe', 'Salma Hayek'),
('Akie Kotabe', 'Togo Igawa'),
('Akiva Schaffer', 'Cheri Oteri'),
('Akiva Schaffer', 'Jon Lovitz'),
('Akiva Schaffer', 'Nick Swardson'),
('Akiva Schaffer', "Shaquille O'Neal"),
('Alan Tudyk', 'Gal Gadot'),
('Alan Tudyk', 'Jennifer Lopez'),
('Alan Tudyk', 'John Leguizamo'),
('Alan Tudyk', 'Nicki Minaj'),
('Albert Tsai', 'Chloe Bennet'),
('Albert Tsai', 'Eddie Izzard'),
('Albert Tsai', 'Sarah Paulson'),
('Albert Tsai', 'Tenzing Norgay Trainor'),
('Chris Marquette', 'Alice Braga'),
('Chris Marquette', 'Ciarán Hinds'),
('Chris Marquette', 'Michael Sheen'),
('Chris Marquette', 'Rutger Hauer'),
('Chris Marquette', 'Stana Katic'),
('David Cross', 'Alison Brie'),
('David Cross', 'Gary Oldman'),
('David Cross', 'Jason Lee'),
('David Cross', 'Jesse Plemons'),
('David Cross', 'Michelle Yeoh'),
('Jeffrey Johnson', 'Bailee Madison'),
('Jeffrey Johnson', 'Ralph Waite'),
('Jeffrey Johnson', 'Robyn Lively'),
('Jeffrey Johnson', 'Tanner Maguire'),
('Jennifer Sipes', 'Christy Carlson Romano'),
('Jennifer Sipes', 'Nick Stahl'),
('Jennifer Sipes', 'Stephanie Honoré'),
('Jesse Bernstein', 'Johnny Sneed'),
('Megan Mullally', 'Aaron Paul'),
('Megan Mullally', 'Natalie Dreyfuss'),
('Megan Mullally', 'Octavia Spencer'),
('Megan Mullally', 'Richmond Arquette'),
('Mia Kirshner', 'Allie MacDonald'),
('Payman Maadi', 'Adria Arjona'),
('Payman Maadi', 'Ben Hardy'),
('Payman Maadi', 'Dave Franco'),
('Sophie Lowe', "James D'Arcy"),
('Sophie Lowe', 'Rhys Wakefield'),
('Zoe Saldana', 'Andrea Libman'),
('Zoe Saldana', 'Casey Affleck'),
('Zoe Saldana', 'Idris Elba'),
('Zoe Saldana', 'Method Man'),
('Zoe Saldana', 'Sylvester Stallone')}
### define demo inputs
import pickle
import numpy as np
rng = np.random.default_rng(6040)
with open('resource/asnlib/publicdata/network_dict.pkl', 'rb') as f:
    network_dict = pickle.load(f)
demo_dos_ex6 = {k: {v for v in rng.choice(network_dict[k], 5)} for k in rng.choice(list(network_dict.keys()), 15)}
### call demo function
set(to_nx(demo_dos_ex6).edges)
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex6
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex6
tester = Tester_ex6()
for _ in range(20):
    try:
        tester.run_test(to_nx)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex6(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(to_nx)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
One thing that the networkx package makes relatively easy is calculating the degree of each of the nodes in our graph. Here, degree is interpreted as the number of unique costars each actor has. If you have a graph g, then g.degree() will return an object that maps each node to its degree (see note).
Complete the function high_degree_actors(g, n):
Given the inputs described below, determine the degree of each actor in the graph, g. Return a pd.DataFrame with 2 columns ('actor' and 'degree'), indicating an actor's name and degree. The output should have records for only the actors with the n highest degrees. In the case of ties (two or more actors having the same degree), all of the actors with the lowest included degree should be included. (For example, if there's a 3-way tie for 10th place and n=10, then all 3 of the actors involved in the tie should be included in the output.) If n is None, all of the actors should be included.
Sort your results by degree (descending order) and break ties (multiple actors with the same degree) by sorting them in alphabetical order based on the actor's name.
The index of the result should be sequential numbers, starting with 0.
g - a networkx graph object having actor names as nodes and edges indicating whether the actors were costars based on our data.
n - int indicating how many actors to return. This argument is optional for the user and has a default value of None.
Note: One complication is that g.degree() isn't a dict. Keep in mind that it can be cast to a dict.
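A minimal sketch of the cast mentioned in the note, on a made-up graph (the node names are hypothetical):

```python
import networkx as nx

# Hypothetical toy graph: Alice has two costars; Bob and Charlie have one each.
g = nx.Graph()
g.add_edges_from([('Alice', 'Bob'), ('Alice', 'Charlie')])

degree_view = g.degree()         # a view object, not a dict
degree_dict = dict(degree_view)  # plain dict mapping node -> degree

print(isinstance(degree_view, dict))
print(degree_dict)
```

Once the degrees are in a plain dict, its items can be passed to a DataFrame constructor or sorted like any other collection of pairs.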
def high_degree_actors(g, n=None):
    ### BEGIN SOLUTION
    import pandas as pd
    # Casting g.degree() to a dict gives node -> degree; each item becomes one row.
    actor_df = pd.DataFrame(list(dict(g.degree()).items()), columns=['actor', 'degree'])
    if n is None:
        n = actor_df.shape[0]
    return actor_df.nlargest(n=n, columns='degree', keep='all')\
        .sort_values(['degree', 'actor'], ascending=[False, True])\
        .reset_index(drop=True)
    ### END SOLUTION
The demo cell below should display the following output:
actor degree
0 Elizabeth Banks 9
1 Emma Stone 9
2 Bradley Cooper 8
3 Anthony Mackie 7
4 Michael Peña 7
5 Maya Rudolph 6
6 Richard Jenkins 6
7 Stanley Tucci 6
8 Steve Carell 6
Notice how 9 actors are included even though n = 7.
### define demo inputs
import pickle
with open('resource/asnlib/publicdata/movie_network.pkl', 'rb') as f:
    movie_network = pickle.load(f)
demo_g_ex7 = movie_network.subgraph({a for a, _ in sorted(movie_network.degree, key=lambda t:-t[1])[:20]})
demo_n_ex7 = 7
### call demo function
print(high_degree_actors(demo_g_ex7, demo_n_ex7))
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex7
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex7
tester = Tester_ex7()
for _ in range(20):
    try:
        tester.run_test(high_degree_actors)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex7(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(high_degree_actors)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Another place where networkx shines is in its built-in graph algorithms, like community detection. We have calculated the communities using networkx (check the docs for info on how to do this yourself) and have the communities variable set to a list of sets (you can iterate over communities like a list, and each set is the names of all the actors in one community).
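The text does not say which algorithm produced communities. As one possibility only, the sketch below runs networkx's greedy_modularity_communities on a made-up graph and converts the result to a list of sets; the choice of algorithm, the graph, and the name toy_communities are all assumptions for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical toy graph: two triangles joined by a single bridge edge.
g = nx.Graph()
g.add_edges_from([
    ('A', 'B'), ('B', 'C'), ('A', 'C'),   # first triangle
    ('X', 'Y'), ('Y', 'Z'), ('X', 'Z'),   # second triangle
    ('C', 'X'),                           # bridge between the triangles
])

# The function returns frozensets; convert them to plain sets.
toy_communities = [set(c) for c in greedy_modularity_communities(g)]

# In a partition, every node belongs to exactly one community.
membership_counts = {n: sum(n in c for c in toy_communities) for n in g.nodes}
print(toy_communities)
print(membership_counts)
```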
Given
communities - a list containing sets indicating membership to a particular community. The communities are a partition of the actors, so you can safely assume that an actor will only appear in one of these sets.
degrees - a pd.DataFrame with columns 'actor' and 'degree' indicating the degree of each actor in the DataFrame
actor - an actor's name
Complete the function notable_actors_in_comm. Your solution should accomplish the following:
Return a pd.DataFrame with two columns ('actor' and 'degree') including the top 10 actors, by degree, in the same community as the given actor.
def notable_actors_in_comm(communities, degrees, actor):
    assert actor in {a for c in communities for a in c}, 'The given actor was not found in any of the communities!'
    ### BEGIN SOLUTION
    # The communities partition the actors, so exactly one set contains `actor`.
    community = next(c for c in communities if actor in c)
    return degrees[degrees['actor'].isin(community)]\
        .nlargest(10, 'degree', 'all')\
        .reset_index(drop=True)
    ### END SOLUTION
The demo cell below should display the following output:
actor degree
0 Bryan Cranston 135
1 Anthony Mackie 116
2 Johnny Depp 115
3 Idris Elba 112
4 Joel Edgerton 109
5 James Franco 107
6 Jessica Chastain 107
7 Jeremy Renner 105
8 Chris Hemsworth 104
9 Zoe Saldana 104
### define demo inputs
import pickle
path = 'resource/asnlib/publicdata/communities.pkl'
with open(path, 'rb') as f:
    communities = pickle.load(f)
path = 'resource/asnlib/publicdata/degrees.pkl'
with open(path, 'rb') as f:
    degrees = pickle.load(f)
demo_actor_ex8 = 'Christian Bale'
### call demo function
print(notable_actors_in_comm(communities, degrees, demo_actor_ex8))
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### test_cell_ex8
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', 'tester_6040.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester_ex8
tester = Tester_ex8()
for _ in range(20):
    try:
        tester.run_test(notable_actors_in_comm)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex8(key=b'dMEwhBXhC_7FMtUtO_dDWItK2sJ5qq5Xn38TPo4uFlM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(notable_actors_in_comm)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Fin. This is the end of the exam. If you haven't already, submit your work.