Midterm 1, Fall 2022: Reddit API Data
Version 1.0.2
Version History
1.0.2 - 2022/10/03 - Increase trials in test cells 3a and 8.
1.0.1 - 2022-10-01 - Increase trials in test cell 3b.
1.0.0 - 2022-09-30 - Initial release.
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of string operations, base Python data structures, and implementing mathematics. It has 9 exercises, numbered 0 to 8. There are 18 available points. However, to earn 100% the threshold is 14 points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment `### define demo inputs` load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Final reminders:
Reddit is a popular online discussion board where users make hundreds of millions of posts and billions of comments annually. Discussion topics run the gamut from Spongebob Squarepants memes to advanced science and literature. The posts on the more active topics (called "sub-reddits") are a veritable treasure trove of information on the "pulse" of that topic. Additionally, all of this information is freely and publicly accessible via a REST API.
In this course we are interested in learning Python. There's a sub-reddit for that! (r/learnpython). For this problem we used the REST API to pull the "all-time" top 800 posts from this sub-reddit (as of summer 2022) for analysis.
In this notebook you will handle the paginated JSON responses from the API and perform two simple but meaningful analyses to extract insights from the data.
### Global Imports
import json
import re
from collections import defaultdict
# Don't worry about this hidden test. It's not actually testing anything, just starting some validation tools.
### BEGIN HIDDEN TESTS
if False: # set to True to set up
    import dill
    import hashlib
    def hash_check(f1, f2, verbose=True):
        with open(f1, 'rb') as f:
            h1 = hashlib.md5(f.read()).hexdigest()
        with open(f2, 'rb') as f:
            h2 = hashlib.md5(f.read()).hexdigest()
        if verbose:
            print(h1)
            print(h2)
        assert h1 == h2, f'The file "{f1}" has been modified'
    with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
        dill.dump(hash_check, f)
    del hash_check
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
### END HIDDEN TESTS
This one's a freebie!
To start things off we will load the result of the requests we made against the API into the notebook by reading a JSON file.
### Exercise 0 solution
# Load data - 1 pt freebie!
with open('resource/asnlib/publicdata/learnpython_top_800_all.json') as f:
    raw_reddit_data = json.load(f)
with open('resource/asnlib/publicdata/stopwords.json') as f:
    stopwords = set([s.replace("'", "") for s in json.load(f)])
The cell below will verify that the data is loaded into the notebook.
### test_cell_ex0
assert isinstance(raw_reddit_data, list), "Make sure you ran the solution cell for this exercise!"
assert isinstance(stopwords, set), "Make sure you ran the solution cell for this exercise!"
print('Passed! Please submit.')
Motivation (don't dwell on this)
Now that the data's loaded, we have to make sense of it. Here's a printout of the first 1000 characters of the JSON we just loaded... (just take a glance, it's not worth spending time trying to visually parse this.)
[{'kind': 'Listing', 'data': {'after': 't3_lra4l2', 'dist': 100, 'modhash': '', 'geo_filter': '', 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'learnpython', 'selftext': 'Hi all, \n\nFirstly this is going to be a long post to hopefully help people genuinely looking to commit to becoming a developer by sharing my story of how I went from absolutely zero knowledge of programming (as you can see by my post history) to landing my first python developer role.\n\nLocation: UK\n\nTo kick things off about a year ago I wasnt happy with the job(s) I was doing, long hours, very low pay, so I came across python by chance. Yes I admit the money was what attracted me alone to start off with as I am quite a money motivated person. Ofcourse I knew and still know it will be a long journey to reach the salaries offered but I have managed to finally get my first step on the ladder by landing a job as a python developer. Enough of the story, lets get on with it.\n\nI will lis
Even though this is technically human readable, there are too many levels and too much text to make sense of it visually. We know that since it's JSON, everything is stored in a nested structure of `list`s and `dict`s. We can programmatically analyze the structure!
Requirements
Define the function `analyze_structure(some_object)`. The input `some_object` will either be a `list` or a `dict`.

For a `list` input the function should return a `dict` with the following keys: `{'type', 'len', 'value_type'}`.

For a `dict` input the function should return a `dict` with the following keys: `{'type', 'keys', 'value_types'}`.
See the tables below for more information on the key/value pairs.

If the input is a `list`, your result should have these key/value pairs:

key | value | type of value | examples (comma separated)
---|---|---|---
`"type"` | `"list"` | `str` | The string `"list"` is the only acceptable value
`"len"` | length of `some_object` | `int` | `1`, `4`, `500`
`"value_type"` | the type of the first element of `some_object` (cast to a `str`) | `str` | `"<class 'str'>"`, `"<class 'dict'>"`
If the input is a `dict`, your result should have these key/value pairs:

key | value | type of value | examples (comma separated)
---|---|---|---
`"type"` | `"dict"` | `str` | The string `"dict"` is the only acceptable value
`"keys"` | set of `some_object`'s keys | `set` | `{'data', 'kind'}`
`"value_types"` | dict mapping each of `some_object`'s keys to the type (cast to `str`) of the value associated with it | dict mapping `str` keys to `str` values | `{'kind': "<class 'str'>", 'data': "<class 'dict'>"}`
Note: Don't forget to cast the `'value_type'` for `list`s and the values of `'value_types'` for `dict`s to strings!

Note: You can use `str(type(x))` to cast the type of `x` to a `str`.
Note this demo will run your solution 5 times, diving level by level into the raw data. Each individual run's output is separated by a blank line.
### Exercise 1 solution
def analyze_structure(some_object):
    assert isinstance(some_object, (list, dict)), f'argument must be `list` or `dict`, {type(some_object)} was given.'
    if isinstance(some_object, list):
        # Handle the `list` case
        ### BEGIN SOLUTION
        return {'type': 'list',
                'len': len(some_object),
                'value_type': str(type(some_object[0]))}
        ### END SOLUTION
    elif isinstance(some_object, dict):
        # Handle the `dict` case
        ### BEGIN SOLUTION
        return {'type': 'dict',
                'keys': set(some_object.keys()),
                'value_types': {k: str(type(v)) for k, v in some_object.items()}}
        ### END SOLUTION
    else:
        assert False  # This code will never execute
### demo function calls
print(analyze_structure(raw_reddit_data))
print()
print(analyze_structure(raw_reddit_data[0]))
print()
print(analyze_structure(raw_reddit_data[0]['data']))
print()
print(analyze_structure(raw_reddit_data[0]['data']['children']))
print()
print(analyze_structure(raw_reddit_data[0]['data']['children'][0]))
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### test_cell_ex1
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_1',
    'func': analyze_structure, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'some_object': {
            'dtype': 'dict', # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'dict',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True,  # Ignored if dtype is not df
            'check_row_order': True,  # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Motivation (don't dwell on this)
In order to maintain server performance and keep response sizes reasonable, the Reddit API paginates results (i.e. it only gives back a slice of the full result). `raw_reddit_data` is actually a `list` of those slices (each slice is a `dict`). Based on the analysis we were able to do with `analyze_structure` in the previous exercise, we can tell that `raw_reddit_data[i]['data']['children']` contains a `list` of all the post data from "page" `i`. We need to combine all of these lists into a single `list` to remove some unnecessary complexity from later analysis tasks.
Requirements
Define the function `combine_child_data(raw_reddit_data)`.

The input `raw_reddit_data` is a `list` of `dict`s such that for any `i` between `0` and `len(raw_reddit_data) - 1`, there will exist a `list` `raw_reddit_data[i]['data']['children']`.

The output should be a single new `list` which contains all of the elements of each "children" list, with the order preserved.
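The flattening described above can also be sketched as a single list comprehension. This is just an illustrative alternative with made-up page data, not necessarily the graded solution:

```python
# Hypothetical two-page sample mimicking raw_reddit_data's shape
pages = [
    {'data': {'children': ['child_0_0', 'child_0_1']}},
    {'data': {'children': ['child_1_0', 'child_1_1']}},
]

# The outer loop walks pages in order; the inner loop walks each page's children in order
flat = [child for page in pages for child in page['data']['children']]
print(flat)  # ['child_0_0', 'child_0_1', 'child_1_0', 'child_1_1']
```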
### Define demo inputs
# child_i_j indicates the j-th element in "list of children" i
demo_raw_reddit_data_ex2 = [
    {'data': {'children': ['child_0_0', 'child_0_1']}},
    {'data': {'children': ['child_1_0', 'child_1_1']}},
    {'data': {'children': ['child_2_0', 'child_2_1']}}
]
The demo included in the solution cell below should display the following output:
['child_0_0', 'child_0_1', 'child_1_0', 'child_1_1', 'child_2_0', 'child_2_1']
Note - the "lists of children" in this example are greatly simplified to illustrate the top-level structure of `raw_reddit_data`, the output structure, and the ordering requirement.
### Exercise 2 solution
def combine_child_data(raw_reddit_data):
    ### BEGIN SOLUTION
    rv = []
    for page in raw_reddit_data:
        rv += page['data']['children']
    return rv
    ### END SOLUTION
### demo function call
combine_child_data(demo_raw_reddit_data_ex2)
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### test_cell_ex2
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_2',
    'func': combine_child_data, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'raw_reddit_data': {
            'dtype': 'list', # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'list',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True,  # Ignored if dtype is not df
            'check_row_order': True,  # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We ran the following code to combine the post data for the entire data set and stored the result in `resource/asnlib/publicdata/children.json`.

children = combine_child_data(raw_reddit_data)
with open('resource/asnlib/publicdata/children.json', 'w') as f:
    json.dump(children, f)
clean_unicode

Reddit posts allow for unicode special characters and html special characters (letters with accents, emojis, etc.). These characters will cause problems with our analysis. We have provided `clean_unicode(text)`, which takes a string input `text` and returns that string with any such characters replaced with the empty string `''`. You will have to use it in the solution to the next exercise (or figure out a way to strip them out yourself...).

def clean_unicode(text):
    text = re.sub(r'&.*?;', '', text)
    encoded = text.encode('unicode_escape')
    return re.sub(br"\\u....|\\U........|\\x..", b"", encoded).decode("unicode-escape")
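To get a feel for what `clean_unicode` does before relying on it, here is a small self-contained demo. The sample strings are our own, not drawn from the data set:

```python
import re

def clean_unicode(text):
    # Same helper as above: strip html entities (e.g. &amp;), then encode with
    # 'unicode_escape' and delete the resulting \u..., \U..., and \x.. escapes
    text = re.sub(r'&.*?;', '', text)
    encoded = text.encode('unicode_escape')
    return re.sub(br"\\u....|\\U........|\\x..", b"", encoded).decode("unicode-escape")

print(clean_unicode('hello \U0001F600 world'))   # 'hello  world' -- the emoji is gone
print(clean_unicode('caf\u00e9 &amp; tea'))      # 'caf  tea' -- accent and entity removed
```

Note that removed characters leave nothing behind (not even a space), which is why the later "consecutive spaces" cleanup step matters.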
Motivation (don't dwell on this)
The text in a Reddit post can contain a lot of characters and formatting which are not useful to our analysis. We need to clean up the text and standardize it.
There's a lot of cleaning to be done, so we will be taking care of it in two phases. This is the first phase.
Requirements
Define the function `clean_text_phase_0(text)`, which takes one input: `text`, a `str` to be cleaned. All of the following transformations must be performed on `text` and the end result returned as a `str`. Some transformations may conflict with others, but using the order provided will give the correct result.
step number | what to transform | how to identify | transformation | recommendation
---|---|---|---|---
0 | all letters | NA | convert to lower case | use `str` methods
1 | special unicode characters | handled by `clean_unicode` | use `clean_unicode` | use the provided helper function
2 | urls | the sequence `'http'` followed by at least one non-space character, i.e. `'the http module'` does not contain a url, but `http://cse6040.gatech.edu` is a url | remove | use `re` - remove pattern matches
3 | space indicators | any space character, `-`, `+`, or `=` | replace with `' '` | use `re` - replace pattern matches with a single space
4 | possessive ending | `"'s"` (does not have to occur at the end of a word) | remove | use `re` - remove pattern matches
5 | symbols | `*`, `_`, `:`, `#`, `.`, `,` (the "comma" character), `~`, `?`, `!`, `$`, `%`, `;`, `(`, `)`, `[`, `]`, `/`, `\`, `{`, `}`, `'`, `<`, `>`, `&`, `` ` ``, `"` | remove | use `re` - remove pattern matches. All symbols are in the demo sample data
6 | consecutive spaces | consecutive spaces, e.g. `'  '` | replace with a single space | use `str` methods to split into a "list of words", then re-combine with `str` methods
The term "space character" should be taken to mean any character which matches the regex pattern `'\s'`. This includes but is not limited to spaces, newlines, and tabs. The term "non-space character" should be taken to mean all other characters.

The term "word" should be taken to mean a series of non-space characters which are wrapped by either space characters or the start or end of the string. We would interpret the string `'these are all words'` to have 4 words.
### Define demo inputs
with open('resource/asnlib/publicdata/children.json') as f:
    children = json.load(f)
demo_text_ex3a = '''
😀 I just love #Python. You will learn to love it, too!!! (Just as much as I do).
Check out my site at https://cse6040.gatech.edu. It's not much ~~ but I think it's gr8!
It's really cool & fun! The "bee's knees" if you will. 🐝
The request module's functionality really makes HTTP requests a breeze!
allon*_:#.,~?!$%;()\[\]\'/{}\\\'<>&`"eword
"within" should remain, but "it" should not
'''
demo_text_ex3a
Note: This is a simplified example which admittedly includes some nonsense. It checks most but not all requirements. If you need help comparing your output and the expected output you can use https://text-compare.com/.
### Exercise 3a solution
def clean_text_phase_0(text):
    ### BEGIN SOLUTION
    clean = text.lower()  # Convert to lowercase
    clean = clean_unicode(clean)  # Remove special unicode/html characters
    clean = re.sub(r'http[^\s]+', '', clean)  # Remove urls
    clean = re.sub(r'[\s\-+=]', ' ', clean)  # Replace any space character, '-', '+', or '=' with ' '
    clean = re.sub(r'\'s', '', clean)  # Remove "'s"
    clean = re.sub(r'[*_:#.,~?!$%;()\[\]\'/{}\\\'<>&`"]', '', clean)  # Remove punctuation and grouping symbols
    clean = ' '.join(clean.split())  # Collapse consecutive spaces
    ## Moving to another solution
    # clean = ' '.join(s for s in clean.split() if (re.search(r'[0-9]', s) is None and s not in stopwords)) # remove any "words" (sequences without spaces) containing numbers
    return clean
    ### END SOLUTION
### demo function call
clean_text_phase_0(demo_text_ex3a)
The cell below will test your solution for Exercise 3a. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

Depending on the test cases drawn in this cell, a `DeprecationWarning` may be raised. This is nothing to be concerned about. It indicates that a feature we are using may be disabled in a future release, but it will work for now.
### test_cell_ex3a
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_3a',
    'func': clean_text_phase_0, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'text': {
            'dtype': 'str', # data type of param.
            'check_modified': True,
        },
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'str',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True,  # Ignored if dtype is not df
            'check_row_order': True,  # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We ran the code below to apply the first phase of cleaning to each post and store the result in `resource/asnlib/publicdata/clean_children_phase_0.json`.

clean_children_phase_0 = {child['data']['id']: clean_text_phase_0(child['data']['selftext']) for child in children}
with open('resource/asnlib/publicdata/clean_children_phase_0.json', 'w') as f:
    json.dump(clean_children_phase_0, f, indent=4)
Motivation (don't dwell on this)
The text in a Reddit post can contain a lot of characters and formatting which are not useful to our analysis. We need to clean up the text and standardize it.
There's a lot of cleaning to be done, so we will be taking care of it in two phases. This is the second phase.
Requirements
Define the function `clean_text_phase_1(text, stopwords)`, which takes two inputs: `text`, a `str` to be cleaned, and `stopwords`, a `set` of `str` which should not be included in the final result. All of the following transformations must be performed on `text` and the end result returned as a `str`. Word removals should be done before dealing with the spaces. (Or you can take care of all 4 requirements in a clever one-liner!)

- Remove any word which contains a numeral (0-9).
- Remove any word which appears in the set `stopwords`.
- Separate the remaining words with single spaces.
- The result must not have any leading or trailing space characters.

The term "space character" should be taken to mean any character which matches the regex pattern `'\s'`. This includes but is not limited to spaces, newlines, and tabs. The term "non-space character" should be taken to mean all other characters.

The term "word" should be taken to mean a series of non-space characters which are wrapped by either space characters or the start or end of the string. We would interpret the string `'these are all words'` to have 4 words.
### Define demo inputs
with open('resource/asnlib/publicdata/children.json') as f:
    children = json.load(f)
demo_stopwords_ex3b = {'as', 'it', 'i', 'the'}
demo_text_ex3b = 'i just love python you will learn to love it too just as ' + \
'much as i do check out my site at it not much but i think it gr8 it really ' + \
'cool fun the bee knees if you will the request module functionality really ' + \
'makes http requests a breeze alloneword within should remain but it should not'
demo_text_ex3b
Note: This is a simplified example which admittedly includes some nonsense. It checks most but not all requirements. If you need help comparing your output and the expected output you can use https://text-compare.com/.
### Exercise 3b solution
def clean_text_phase_1(text, stopwords):
    ### BEGIN SOLUTION
    clean = text
    clean = ' '.join(s for s in clean.split() if (re.search(r'[0-9]', s) is None and s not in stopwords))
    return clean
    ### END SOLUTION
### demo function call
clean_text_phase_1(demo_text_ex3b, demo_stopwords_ex3b)
The cell below will test your solution for Exercise 3b. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

Depending on the test cases drawn in this cell, a `DeprecationWarning` may be raised. This is nothing to be concerned about. It indicates that a feature we are using may be disabled in a future release, but it will work for now.
### test_cell_ex3b
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_3b',
    'func': clean_text_phase_1, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'text': {
            'dtype': 'str', # data type of param.
            'check_modified': True,
        },
        'stopwords': {
            'dtype': 'set',
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'str',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True,  # Ignored if dtype is not df
            'check_row_order': True,  # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We ran the code below to apply the second phase of cleaning to `clean_children_phase_0` and stored the result in `resource/asnlib/publicdata/clean_children.json`.

clean_children = {id: clean_text_phase_1(text, stopwords) for id, text in clean_children_phase_0.items()}
with open('resource/asnlib/publicdata/clean_children.json', 'w') as f:
    json.dump(clean_children, f, indent=4)
Motivation (don't dwell on this)
We are eventually going to perform analysis to determine how similar posts are to one another based on the text. In order to perform this analysis we have to vectorize the text. (Don't worry, the only math involved is counting.)
Requirements
Define the function `vectorize_text(text)`. The input `text` is the cleaned text from a single post. The output should be a `dict` mapping each unique word (consecutive non-space characters separated by spaces) to the number of times that word occurs in `text`.

Note: any instance of `dict` will be suitable as an output. In other words, `defaultdict` is acceptable.
### Define demo inputs
with open('resource/asnlib/publicdata/clean_children.json') as f:
    clean_children = json.load(f)
demo_text_ex4 = 'foo bar baz baz tux foo tux foo tux bar kat'
The demo included in the solution cell below should display the following output:
{'foo': 3, 'bar': 2, 'baz': 2, 'tux': 3, 'kat': 1}
Note: This is a simplified example to demonstrate the structure of the input and output.

Note: If using something besides a base `dict`, you may see something slightly different. You can use `isinstance(demo_output_ex4, dict)` to check if you are returning an instance of `dict`. (It should return `True`.)
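Since `collections.Counter` is itself a subclass of `dict`, the note above means a `Counter`-based vectorizer would also pass the `isinstance` check. A minimal sketch (the name `vectorize_text_alt` is ours, for illustration only, not the graded solution):

```python
from collections import Counter

def vectorize_text_alt(text):
    # Counter tallies each word produced by split(); Counter subclasses dict
    return Counter(text.split())

demo = vectorize_text_alt('foo bar baz baz tux foo tux foo tux bar kat')
print(isinstance(demo, dict))  # True
print(demo['foo'], demo['kat'])  # 3 1
```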
### Exercise 4 solution
def vectorize_text(text):
    ### BEGIN SOLUTION
    from collections import defaultdict
    vector = defaultdict(int)
    for word in text.split():
        vector[word] += 1
    return dict(vector)
    ### END SOLUTION
### demo function call
demo_output_ex4 = vectorize_text(demo_text_ex4)
demo_output_ex4
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### test_cell_ex4
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_4',
    'func': vectorize_text, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'text': {
            'dtype': 'str', # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'dict',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True,  # Ignored if dtype is not df
            'check_row_order': True,  # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We ran the following code to vectorize each cleaned post and store the result in `resource/asnlib/publicdata/vectorized_children.json`.

vectorized_children = {id: vectorize_text(text) for id, text in clean_children.items()}
with open('resource/asnlib/publicdata/vectorized_children.json', 'w') as f:
    json.dump(vectorized_children, f, indent=4)
Motivation (don't dwell on this)
One potentially useful metric is the fraction of posts in which a given keyword appears. For r/learnpython, this metric would give insight into trending topics in the Python space. Other developers have used similar metrics, enhanced with sentiment analysis, as part of profitable (and unprofitable) stock trading strategies.
Requirements
Define the function `keyword_usage(vectors, keyword)`. The input `vectors` is a dictionary mapping post ids (`str`) to vectors (`dict`s mapping words (`str`) to counts (`int`)). The structure of each vector is the same as the output of Exercise 4. The input `keyword` is a `str` representing the keyword we are interested in.
The function should return a `str` matching this format:

'The keyword "<keyword>" occurs in <pct>% of posts in the sample.'

- `<keyword>` is the `keyword` parameter.
- `<pct>` is the percentage of posts which contain at least one instance of `keyword`.
- `<pct>` must be rounded to 2 decimal places and include 2 decimal places (i.e. `15.00` is acceptable, but `15`, `15.0`, and `15.000` are not).
- Use this formula for calculating `<pct>`: $\text{<pct>} = 100\times\frac{\text{\# of posts containing keyword}}{\text{\# of posts}}$

Note: Format strings are a great tool for accomplishing this task.
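To see why format strings help here, a minimal sketch (the counts below are made up for illustration and are not exam data) shows that the `:.2f` format spec handles both the rounding and the fixed two decimal places in one step, whereas `round()` alone does not:

```python
m, n = 3, 4              # illustrative: 3 of 4 posts contain the keyword
pct = 100 * m / n        # 75.0
print(f"{pct:.2f}")      # the :.2f spec always emits exactly 2 decimal places
print(round(pct, 2))     # round() alone drops the trailing zero when printed
```

The first `print` produces `75.00`, which satisfies the requirement; the second produces `75.0`, which does not.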
### Define demo inputs
with open('resource/asnlib/publicdata/vectorized_children.json') as f:
    vectorized_children = json.load(f)

demo_vectors_ex5 = {
    'post_0': {'pycharm': 1, 'python': 3, 'ide': 2, 'jupyter': 3},
    'post_1': {'python': 1, 'ide': 2, 'jupyter': 3},
    'post_2': {'python': 3, 'jupyter': 5},
    'post_3': {'jupyter': 2}
}
demo_keyword_0_ex5 = 'python'
demo_keyword_1_ex5 = 'jupyter'
demo_keyword_2_ex5 = 'ide'
demo_keyword_3_ex5 = 'pycharm'
The demo included in the solution cell below should display the following output:
The keyword "python" occurs in 75.00% of posts in the sample.
The keyword "jupyter" occurs in 100.00% of posts in the sample.
The keyword "ide" occurs in 50.00% of posts in the sample.
The keyword "pycharm" occurs in 25.00% of posts in the sample.
Note This is a simplified example to demonstrate the structure of the inputs and outputs.
Note The demo calls `keyword_usage` four times. Each line of output comes from a single function call.
### Exercise 5 solution
def keyword_usage(vectors, keyword):
    ### BEGIN SOLUTION
    n = len(vectors)
    m = sum(keyword in vector for vector in vectors.values())
    return f"The keyword \"{keyword}\" occurs in {round(100*m/n, 2):.2f}% of posts in the sample."
    ### END SOLUTION

print(keyword_usage(demo_vectors_ex5, demo_keyword_0_ex5))
print(keyword_usage(demo_vectors_ex5, demo_keyword_1_ex5))
print(keyword_usage(demo_vectors_ex5, demo_keyword_2_ex5))
print(keyword_usage(demo_vectors_ex5, demo_keyword_3_ex5))
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of the input variables from prior to running your solution. These should be the same as `input_vars`; otherwise, the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex5
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_5',
    'func': keyword_usage, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'vectors': {
            'dtype': 'dict', # data type of param.
            'check_modified': True,
        },
        'keyword': {
            'dtype': 'str', # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'str',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Read this if you want more information on how we are calculating similarity. All formulas will be given in the exercises where they are used.
The dictionaries we created for each post's text in the earlier exercises can also be thought of as vectors in a high-dimensional space. In this context we can say that there is an angle $\theta \in [0, \pi/2]$ between any two of these vectors $v_0$ and $v_1$. The more similar the two posts are, the smaller the angle will be, and thus the larger $\cos\theta$ will be. We can calculate $\cos\theta$ with the following formula.
$$\cos\theta = \frac{v_0 \cdot v_1}{\|v_0\|\|v_1\|}$$
Let's denote the unit vectors with the same direction as $v_0$ and $v_1$ by $\hat{v_0}$ and $\hat{v_1}$ respectively. Then the formula becomes simpler: $$\cos\theta = \hat{v_0} \cdot \hat{v_1}$$
We can calculate any unit vector $\hat{v_i} \in \mathbb{R}^n$ with the following: $$\hat{v_i} = \frac{v_i}{\|v_i\|}$$ where $$\|v_i\| = \sqrt{\sum_{j = 0}^{n-1}{v_{i,j}^2}}$$
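To make these formulas concrete, here is a small worked example (toy data, not part of any exercise) computing $\cos\theta$ for two word-count vectors:

```python
from math import sqrt

# toy word-count vectors (made-up data for illustration)
v0 = {'python': 4, 'jupyter': 3}
v1 = {'python': 5, 'code': 12}

# dot product over shared keys only: 'python' contributes 4 * 5 = 20
dot = sum(v0[k] * v1[k] for k in v0.keys() & v1.keys())
norm0 = sqrt(sum(x**2 for x in v0.values()))  # sqrt(16 + 9)   = 5.0
norm1 = sqrt(sum(x**2 for x in v1.values()))  # sqrt(25 + 144) = 13.0
print(round(dot / (norm0 * norm1), 5))        # 0.30769
```

Here $\cos\theta = 20 / (5 \times 13) \approx 0.30769$, a moderate similarity: the two "posts" share only the word `python`.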
Motivation (don't dwell on this)
Normalizing the vectors (making them all unit length) before using them for calculation will make the next step easier. We are effectively performing the $\hat{v_i}$ calculation from the Cosine Similarity section.
$$\hat{v_i} = \frac{v_i}{\|v_i\|}$$where $$\|v_i\| = \sqrt{\sum_{j = 0}^{n-1}{v_{i,j}^2}}$$
Requirements
Define the function `normalize_vector(vector)`. The input `vector` ($v_i$ from the math above) will be a `dict` mapping words (`str`) to `int`. Perform the following calculations and return `ret_vec`.

- `SS` = the sum of the squares of each value in `vector`; i.e., $\sum_{j = 0}^{n-1}{v_{i,j}^2}$ from the math above.
- `magnitude` = $\sqrt{\text{SS}}$; i.e., $\|v_i\|$ from the math above.
- `ret_vec` = a dictionary mapping each key in `vector` to the value associated with that key in `vector` divided by `magnitude`; i.e., $\hat{v_i}$ from the math above. For example, if `vector['python']` = `5` and `magnitude` = `50`, then `ret_vec['python']` should be `0.1`.

### Define demo inputs
demo_vector_ex6_0 = {
    'python': 4,
    'jupyter': 3
}
demo_vector_ex6_1 = {
    'python': 5,
    'code': 12
}
The demo included in the solution cell below should display the following output:
{'python': 0.8, 'jupyter': 0.6}
{'python': 0.38461538461538464, 'code': 0.9230769230769231}
Note This demo runs `normalize_vector` twice (once on each input).
Note For this simplified example the intermediate values should be as follows:

input | SS | magnitude
---|---|---
demo_vector_ex6_0 | 25 | 5.0
demo_vector_ex6_1 | 169 | 13.0
### Exercise 6 solution
def normalize_vector(vector):
    ### BEGIN SOLUTION
    from math import sqrt
    ss = sum(v**2 for v in vector.values())
    magnitude = sqrt(ss)
    return {k: v/magnitude for k, v in vector.items()}
    ### END SOLUTION

### demo function call
print(normalize_vector(demo_vector_ex6_0))
print(normalize_vector(demo_vector_ex6_1))
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of the input variables from prior to running your solution. These should be the same as `input_vars`; otherwise, the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex6
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_6',
    'func': normalize_vector, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'vector': {
            'dtype': 'dict', # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'dict',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We used the following code to normalize all the vectors and save the result in `resource/asnlib/publicdata/normalized_children.json`:

normalized_children = {id: normalize_vector(vector) for id, vector in vectorized_children.items()}
with open('resource/asnlib/publicdata/normalized_children.json', 'w') as f:
    json.dump(normalized_children, f, indent=4)
Motivation (don't dwell on this)
Now that we have normalized word vectors for each post, it's time to implement the similarity calculation.
Requirements
Define `calc_similarity(vector_0, vector_1)`. The inputs `vector_0` and `vector_1` will be `dict`s mapping `str` to `float`. Perform the following calculations and return the end result.

- Compute the sum of `vector_0[k]` times `vector_1[k]` for each key `k` that is present in both `vector_0` and `vector_1`.
- Round the sum to 5 decimal places and return it as a `float`.
### Define demo inputs
with open('resource/asnlib/publicdata/normalized_children.json') as f:
    normalized_children = json.load(f)

demo_vector_0_ex7 = {'python': 0.8, 'jupyter': 0.6}
demo_vector_1_ex7 = {'python': 0.38461538461538464, 'code': 0.9230769230769231}
The demo included in the solution cell below should display the following output:
0.30769
1.0
Note This simplified example contains two calls to `calc_similarity`. The first call uses both demo inputs. The second uses `demo_vector_0_ex7` as both `vector_0` and `vector_1` to illustrate that comparing a vector to itself should always give a result of `1.0`, since $\cos(0) = 1$.
### Exercise 7 solution
def calc_similarity(vector_0, vector_1):
    ### BEGIN SOLUTION
    sim = sum(vector_0[key]*vector_1[key] for key in (vector_0.keys() & vector_1.keys()))
    return round(sim, 5)
    ### END SOLUTION

### demo function call
print(calc_similarity(demo_vector_0_ex7, demo_vector_1_ex7))
print(calc_similarity(demo_vector_0_ex7, demo_vector_0_ex7))
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of the input variables from prior to running your solution. These should be the same as `input_vars`; otherwise, the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex7
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_7',
    'func': calc_similarity, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'vector_0': {
            'dtype': 'dict', # data type of param.
            'check_modified': True,
        },
        'vector_1': {
            'dtype': 'dict', # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'float',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We ran the following code to create a similarity matrix, a `dict` mapping `str` to `dict` (the inner `dict` maps `str` to `float`), and write the result to `resource/asnlib/publicdata/similarity_matrix.json`. The value stored in `similarity_matrix[a][b]` is the cosine similarity score between post id `a` and post id `b`.

Based on how the matrix was constructed, it is "square". In other words, if `similarity_matrix[a]` and `similarity_matrix[b]` exist, then `similarity_matrix[a][b]` and `similarity_matrix[b][a]` are guaranteed to exist.
similarity_matrix = {
    id: {other_id: calc_similarity(normalized_children[id], normalized_children[other_id])
         for other_id in normalized_children}
    for id in normalized_children
}
with open('resource/asnlib/publicdata/similarity_matrix.json', 'w') as f:
    json.dump(similarity_matrix, f, indent=4)
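The "square" property described above is easy to sanity-check on any such nested dict. A quick sketch, using a toy matrix (made-up values) rather than the real file:

```python
# toy similarity matrix with made-up values
toy = {
    'a': {'a': 1.0, 'b': 0.5},
    'b': {'a': 0.5, 'b': 1.0},
}
# square: every row has an entry for every post id in the matrix
print(all(set(row) == set(toy) for row in toy.values()))      # True
# symmetric: toy[i][j] == toy[j][i] for every pair of post ids
print(all(toy[i][j] == toy[j][i] for i in toy for j in toy))  # True
```

Both checks hold for the real `similarity_matrix` as well, since every pair of posts was compared in both directions.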
Motivation (don't dwell on this)
We are interested in identifying the posts which are most similar to a given post. We have already done all of the math to capture that information, but it's in yet another huge JSON.
Requirements
Define `n_most_similar(sim_matrix, post_id, n)`. The input `sim_matrix` will be a nested `dict` with the same structure as `similarity_matrix` described in the cell above. The input `post_id` will be the id of the post we are interested in and will be a key of `sim_matrix`. The function should do the following:

- Extract the similarity mapping for `post_id` (i.e., `sim_matrix[post_id]`).
- Convert this mapping into a `list` of `tuple`s as an intermediate result, sorted in descending order of similarity. If there are ties (2 or more posts with the same similarity to `post_id`), break them by sorting the tied posts into alphabetical order.
- Return a `list` of `tuple`s of length `n+1` containing `post_id` and the `n` most similar posts. Tuples should be of the form (post id (`str`), similarity score (`float`)). Note the post associated with `post_id` will always be at the top of the list, since it will have a similarity of `1.0` to itself.
### Define demo inputs
with open('resource/asnlib/publicdata/similarity_matrix.json') as f:
    similarity_matrix = json.load(f)

demo_sim_matrix_ex8 = {
    'post_0': {'post_0': 1.0, 'post_1': 0.79686, 'post_2': 0.67635, 'post_3': 0.52459, 'post_4': 0.2334},
    'post_1': {'post_0': 0.79686, 'post_1': 1.0, 'post_2': 0.65699, 'post_3': 0.45758, 'post_4': 0.11308},
    'post_2': {'post_0': 0.67635, 'post_1': 0.65699, 'post_2': 1.0, 'post_3': 0.51968, 'post_4': 0.16411},
    'post_3': {'post_0': 0.52459, 'post_1': 0.45758, 'post_2': 0.51968, 'post_3': 1.0, 'post_4': 0.33981},
    'post_4': {'post_0': 0.2334, 'post_1': 0.11308, 'post_2': 0.16411, 'post_3': 0.33981, 'post_4': 1.0}
}
demo_n_ex8 = 2
demo_post_id_0 = 'post_0'
demo_post_id_1 = 'post_3'
The demo included in the solution cell below should display the following output:
[('post_0', 1.0), ('post_1', 0.79686), ('post_2', 0.67635)]
[('post_3', 1.0), ('post_0', 0.52459), ('post_2', 0.51968)]
Note This demo calls `n_most_similar` twice. The outputs are separated by a blank line. The first call uses `demo_post_id_0` as `post_id`, and the second uses `demo_post_id_1` as `post_id`.
### Exercise 8 solution
def n_most_similar(sim_matrix, post_id, n):
    ### BEGIN SOLUTION
    # sort descending by similarity; break ties alphabetically by post id
    sims = sorted(sim_matrix[post_id].items(), key=lambda x: (-x[1], x[0]))
    return sims[:n+1]
    ### END SOLUTION

print(n_most_similar(demo_sim_matrix_ex8, demo_post_id_0, demo_n_ex8))
print()
print(n_most_similar(demo_sim_matrix_ex8, demo_post_id_1, demo_n_ex8))
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of the input variables from prior to running your solution. These should be the same as `input_vars`; otherwise, the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex8
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
    'case_file': 'tc_8',
    'func': n_most_similar, # replace this with the function defined above
    'inputs': { # input config dict. keys are parameter names
        'sim_matrix': {
            'dtype': 'dict', # data type of param.
            'check_modified': True,
        },
        'post_id': {
            'dtype': 'str', # data type of param.
            'check_modified': True,
        },
        'n': {
            'dtype': 'int', # data type of param.
            'check_modified': True,
        }
    },
    'outputs': {
        'output_0': {
            'index': 0,
            'dtype': 'list',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Fin. If you have made it this far, congratulations on completing the exam. Don't forget to submit!