Midterm 1, Fall 2022: Reddit API Data

solution
Version 1.0.2

Version History

1.0.2 - 2022-10-03 - Increase trials in test cells 3a and 8.
1.0.1 - 2022-10-01 - Increase trials in test cell 3b.
1.0.0 - 2022-09-30 - Initial release.

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of string operations, base Python data structures, and implementing mathematics. It has 9 exercises, numbered 0 to 8. There are 18 available points. However, to earn 100% the threshold is 14 points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment "### Define demo inputs" load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).

Exercise point breakdown:

  • Exercise 0: 1 point(s)
  • Exercise 1: 2 point(s)
  • Exercise 2: 1 point(s)
  • Exercise 3: 5 point(s) - Contains 2 independent parts - 3a: 3 points; 3b: 2 points
  • Exercise 4: 1 point(s)
  • Exercise 5: 2 point(s)
  • Exercise 6: 2 point(s)
  • Exercise 7: 1 point(s)
  • Exercise 8: 2 point(s)

Final reminders:

  • Submit after every exercise
  • If you have questions check the exam guide
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed, and take short breaks at your leisure

Topic Introduction

Reddit is a popular online discussion board where users make hundreds of millions of posts and billions of comments annually. Discussion topics run the gamut from Spongebob Squarepants memes to advanced science and literature. The posts on the more active topics (called "sub-reddits") are a veritable treasure trove of information on the "pulse" of that topic. Additionally, all of this information is freely and publicly accessible via a REST API.

In this course we are interested in learning Python. There's a sub-reddit for that! (r/learnpython). For this problem we used the REST API to pull the "all-time" top 800 posts from this sub-reddit (as of summer 2022) for analysis.

In this notebook you will handle the paginated JSON responses from the API and perform two simple but meaningful analyses to extract insights from the data.

In [ ]:
### Global Imports
import json
import re
from collections import defaultdict

# Don't worry about this hidden test. It's not actually testing anything, just starting some validation tools.
### BEGIN HIDDEN TESTS
if False: # set to True to set up
    import dill
    import hashlib
    def hash_check(f1, f2, verbose=True):
        with open(f1, 'rb') as f:
            h1 = hashlib.md5(f.read()).hexdigest()
        with open(f2, 'rb') as f:
            h2 = hashlib.md5(f.read()).hexdigest()
        if verbose:
            print(h1)
            print(h2)
        assert h1 == h2, f'The file "{f1}" has been modified'
    with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
        dill.dump(hash_check, f)
    del hash_check
    with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
        hash_check = dill.load(f)
    for fname in ['testers.py', '__init__.py', 'test_utils.py']:
        hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
    del hash_check
### END HIDDEN TESTS

Exercise 0 - (1 Points):

This one's a freebie!

To start things off we will load the result of the requests we made against the API into the notebook by reading a JSON file.

In [ ]:
### Exercise 0 solution
# Load data - 1 pt freebie!
with open('resource/asnlib/publicdata/learnpython_top_800_all.json') as f:
    raw_reddit_data = json.load(f)
with open('resource/asnlib/publicdata/stopwords.json') as f:
    stopwords = set([s.replace("'", "") for s in json.load(f)])

The cell below will verify that the data is loaded into the notebook.

In [ ]:
### test_cell_ex0
assert isinstance(raw_reddit_data, list), "Make sure you ran the solution cell for this exercise!"
assert isinstance(stopwords, set), "Make sure you ran the solution cell for this exercise!"
print('Passed! Please submit.')

Exercise 1 - (2 Points):

Motivation (don't dwell on this)

Now that the data's loaded, we have to make sense of it. Here's a printout of the first 1000 characters of the JSON we just loaded... (just take a glance, it's not worth spending time trying to visually parse this.)

[{'kind': 'Listing', 'data': {'after': 't3_lra4l2', 'dist': 100, 'modhash': '', 'geo_filter': '', 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'learnpython', 'selftext': 'Hi all, \n\nFirstly this is going to be a long post to hopefully help people genuinely looking to commit to becoming a developer by sharing my story of how I went from absolutely zero knowledge of programming (as you can see by my post history) to landing my first python developer role.\n\nLocation: UK\n\nTo kick things off about a year ago I wasnt happy with the job(s) I was doing, long hours, very low pay, so I came across python by chance. Yes I admit the money was what attracted me alone to start off with as I am quite a money motivated person. Ofcourse I knew and still know it will be a long journey to reach the salaries offered but I have managed to finally get my first step on the ladder by landing a job as a python developer. Enough of the story, lets get on with it.\n\nI will lis

Even though this is technically human readable, there are too many levels and too much text to make sense of this visually. We know that since it's JSON, everything is stored in a nested structure of lists and dicts. We can programmatically analyze the structure!

Requirements

Define the function analyze_structure(some_object). The input some_object will either be a list or a dict.

  • For a list input the function should return a dict with the following keys: {'type', 'len','value_type'}.

  • For a dict input the function should return a dict with the following keys: {'type', 'keys','value_types'}.

See the charts below for more information on the key/value pairs:

If the input is a list your result should have these key/value pairs.

| key | value | type of value | examples (comma separated) |
|---|---|---|---|
| "type" | "list" | str | The string "list" is the only acceptable value |
| "len" | length of some_object | int | 1, 4, 500 |
| "value_type" | The type of the first element of some_object (cast to a str) | str | "<class 'str'>", "<class 'dict'>" |

If the input is a dict your result should have these key/value pairs.

| key | value | type of value | examples (comma separated) |
|---|---|---|---|
| "type" | "dict" | str | The string "dict" is the only acceptable value |
| "keys" | set of some_object's keys | set | {'data', 'kind'} |
| "value_types" | dict mapping each of some_object's keys to the type (cast to str) of the value associated with it | dict mapping str keys to str values | {'kind': "<class 'str'>", 'data': "<class 'dict'>"} |

Note: Don't forget to cast the 'value_type' for lists and values of 'value_types' for dicts to strings!

Note: You can use str(type(x)) to cast the type of x to a str.
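
As a quick illustration of the note above, here is a minimal sketch of what str(type(x)) produces for a few built-in values. The list `samples` is a made-up toy input, not part of the Reddit data.

```python
# Toy values (hypothetical, not taken from the Reddit data).
samples = ['text', 42, None, [1, 2], {'a': 1}]

# str(type(x)) gives the printable name of each value's type as a str.
type_names = [str(type(x)) for x in samples]
for name in type_names:
    print(name)

# The type object itself is not a str, which is why the cast matters.
print(type('text') == "<class 'str'>")   # False: a type is being compared to a str
```

The strings printed here have the same form as the examples in the charts above, e.g. "<class 'str'>" and "<class 'dict'>".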

The demo included in the solution cell below should display the following output:

```
{'type': 'list', 'len': 8, 'value_type': "<class 'dict'>"}

{'type': 'dict', 'keys': {'kind', 'data'}, 'value_types': {'kind': "<class 'str'>", 'data': "<class 'dict'>"}}

{'type': 'dict', 'keys': {'geo_filter', 'before', 'modhash', 'after', 'dist', 'children'}, 'value_types': {'after': "<class 'str'>", 'dist': "<class 'int'>", 'modhash': "<class 'str'>", 'geo_filter': "<class 'str'>", 'children': "<class 'list'>", 'before': "<class 'NoneType'>"}}

{'type': 'list', 'len': 100, 'value_type': "<class 'dict'>"}

{'type': 'dict', 'keys': {'kind', 'data'}, 'value_types': {'kind': "<class 'str'>", 'data': "<class 'dict'>"}}
```

Note this demo will run your solution 5 times, diving level by level into the raw data. Each individual run's output is separated by a blank line.

In [ ]:
### Exercise 1 solution
def analyze_structure(some_object):
    assert isinstance(some_object, (list, dict)), f'argument must be `list` or `dict`, {type(some_object)} was given.'
    if isinstance(some_object, list):
        # Handle the `list` case
        ### BEGIN SOLUTION
        return {'type': 'list',
                'len': len(some_object),
                'value_type': str(type(some_object[0]))}
        ### END SOLUTION
    elif isinstance(some_object, dict):
        # Handle the `dict` case
        ### BEGIN SOLUTION
        return {'type': 'dict',
                'keys': set(some_object.keys()),
                'value_types': {k: str(type(v)) for k, v in some_object.items()}}
        ### END SOLUTION
    else:
        assert False # This code will never execute
    
### demo function calls
print(analyze_structure(raw_reddit_data))
print()
print(analyze_structure(raw_reddit_data[0]))
print()
print(analyze_structure(raw_reddit_data[0]['data']))
print()
print(analyze_structure(raw_reddit_data[0]['data']['children']))
print()
print(analyze_structure(raw_reddit_data[0]['data']['children'][0]))

The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex1
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_1', 
    'func': analyze_structure, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'some_object':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'dict',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 2 - (1 Points):

Motivation (don't dwell on this)

In order to maintain server performance and keep response sizes reasonable, the Reddit API paginates results (i.e. it only gives back a slice of the full result). raw_reddit_data is actually a list of those slices (each slice is a dict). Based on the analysis we were able to do with analyze_structure in the previous exercise, we can tell that raw_reddit_data[i]['data']['children'] contains a list of all the post data from "page" i. We need to combine all of these lists into a single list to remove some unnecessary complexity from later analysis tasks.

Requirements

Define the function combine_child_data(raw_reddit_data).

The input raw_reddit_data is a list of dicts such that for any i between 0 and len(raw_reddit_data) - 1, there will exist a list raw_reddit_data[i]['data']['children'].

The output should be a single new list which contains all of the elements of each "children" list with the order preserved.

In [ ]:
### Define demo inputs

# child_i_j indicates the j-th element in "list of children" i
demo_raw_reddit_data_ex2 = [
    {'data':{'children':['child_0_0', 'child_0_1']}},
    {'data':{'children':['child_1_0', 'child_1_1']}},
    {'data':{'children':['child_2_0', 'child_2_1']}}
]

The demo included in the solution cell below should display the following output:

['child_0_0', 'child_0_1', 'child_1_0', 'child_1_1', 'child_2_0', 'child_2_1']

Note - the "lists of children" in this example are greatly simplified to illustrate the top-level structure of raw_reddit_data, the output structure, and the ordering requirement.
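
The flattening described in the requirements can also be sketched with the standard library's itertools.chain. This is only one possible approach, shown on a shortened copy of the demo input; the names `pages` and `combined` are illustrative.

```python
from itertools import chain

# Same toy structure as the demo input: each "page" holds a list of children.
pages = [
    {'data': {'children': ['child_0_0', 'child_0_1']}},
    {'data': {'children': ['child_1_0', 'child_1_1']}},
]

# chain.from_iterable walks each page's list in order and yields its elements,
# so the result is a new list and the input pages are left untouched.
combined = list(chain.from_iterable(page['data']['children'] for page in pages))
print(combined)
```

Because a new list is built, the input is not modified, which matters since the test cell checks whether inputs were changed.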

In [ ]:
### Exercise 2 solution
def combine_child_data(raw_reddit_data):
    ### BEGIN SOLUTION
    rv = []
    for page in raw_reddit_data:
        rv += page['data']['children']
    return rv
    ### END SOLUTION
    
### demo function call
combine_child_data(demo_raw_reddit_data_ex2)

The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex2
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_2', 
    'func': combine_child_data, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'raw_reddit_data':{
            'dtype':'list', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'list',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

We ran the following code to combine the post data for the entire data set and stored the result in resource/asnlib/publicdata/children.json.

children = combine_child_data(raw_reddit_data)
with open('resource/asnlib/publicdata/children.json', 'w') as f:
    json.dump(children, f)

Helper Function clean_unicode

Reddit posts allow for Unicode special characters and HTML special characters (letters with accents, emojis, etc.). These characters will cause problems with our analysis. We have provided clean_unicode(text), which takes a string input, text, and returns that string with any such characters replaced with an empty string ''. You will have to use it in the solution to the next exercise (or figure out a way to strip them out yourself...)

In [ ]:
def clean_unicode(text):
    text = re.sub(r'&.*?;', '', text)
    encoded = text.encode('unicode_escape')
    return re.sub(br"\\u....|\\U........|\\x..", b"", encoded).decode("unicode-escape")
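
To see what the helper does, here is a small sketch that applies it to a made-up string containing an accented letter, an HTML entity, and an emoji. The function is copied from the cell above so the example runs on its own; `sample` and `cleaned` are illustrative names.

```python
import re

def clean_unicode(text):
    # Copy of the helper above, repeated so this example is self-contained.
    text = re.sub(r'&.*?;', '', text)          # drop HTML entities such as &amp;
    encoded = text.encode('unicode_escape')    # non-ASCII becomes \xNN, \uNNNN or \UNNNNNNNN
    return re.sub(br"\\u....|\\U........|\\x..", b"", encoded).decode("unicode-escape")

# Hypothetical input: 'é' (accented letter), '&amp;' (HTML entity), and an emoji.
sample = 'caf\u00e9 &amp; tea \U0001F600'
cleaned = clean_unicode(sample)
print(repr(cleaned))
```

Note that the characters are deleted rather than replaced with a space, so the spaces on either side of a removed character remain. Collapsing those is left to a later cleaning step.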

Exercise 3a - (3 Points):

Motivation (don't dwell on this)

The text in a Reddit post can contain a lot of characters and formatting which are not useful to our analysis. We need to clean up the text and standardize it.

There's a lot of cleaning to be done, so we will be taking care of it in two phases. This is the first phase.

Requirements

Define the function clean_text_phase_0(text) which takes one input, text, a str to be cleaned. All of the following transformations must be performed on text and the end result returned as a str. Some transformations may conflict with others, but using the order provided will give the correct result.

| step number | what to transform | how to identify | transformation | recommendation |
|---|---|---|---|---|
| 0 | all letters | NA | convert to lower case | use str methods |
| 1 | special Unicode characters | handled by clean_unicode | use clean_unicode | use provided helper function |
| 2 | URLs | the sequence 'http' followed by at least one non-space character, e.g. 'the http module' does not contain a URL, but http://cse6040.gatech.edu is a URL | remove | use re - remove pattern matches |
| 3 | space indicators | space character, -, +, or = | replace with ' ' | use re - replace pattern matches with a single space |
| 4 | possessive ending | "'s" (does not have to occur at the end of a word) | remove | use re - remove pattern matches |
| 5 | symbols | *, _, :, #, ., ,(the "comma" character), ~, ?, !, $, %, ;, (, ), [, ], /, \, {, }, ', <, >, &, `, " | remove | use re - remove pattern matches. All symbols are in the demo sample data. |
| 6 | consecutive spaces | two or more spaces in a row | replace with single space | use str methods to split into "list of words", re-combine with str methods |

The term "space character" should be taken to mean any character which matches the regex pattern '\s'. This includes but is not limited to spaces, newlines, and tabs. The term "non-space character" should be taken to mean all other characters.

The term "word" should be taken to mean a series of non-space characters which is wrapped by either space characters or the start or end of a string. We would interpret the string 'these are all words' to have 4 words.
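
The definitions above can be checked on small inputs. The sketch below uses made-up strings and one possible URL pattern (`http\S+`, an assumption that matches the description in step 2) to show how "space character", "non-space character", and "word" behave in practice.

```python
import re

# Step 2: 'http' must be followed by at least one non-space character.
url_pattern = r'http\S+'
no_url = re.sub(url_pattern, '', 'the http module')                   # no match, unchanged
with_url = re.sub(url_pattern, '', 'see http://cse6040.gatech.edu now')

# '\s' matches spaces, tabs, and newlines alike.
spaced = re.sub(r'\s', ' ', 'a\tb\nc')

# str.split() with no arguments splits on runs of whitespace, giving "words";
# joining with ' ' afterwards leaves exactly one space between them.
words = 'these  are\nall\twords'.split()
rejoined = ' '.join(words)

print(repr(no_url), repr(with_url), repr(spaced), words, repr(rejoined))
```

Removing the URL leaves two adjacent spaces behind, which is one reason the consecutive-space step comes last.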

In [ ]:
### Define demo inputs
with open('resource/asnlib/publicdata/children.json') as f:
    children = json.load(f)

demo_text_ex3a = '''
😀 I just love #Python. You will learn to love it, too!!! (Just as much as I do). 


Check out my site at https://cse6040.gatech.edu. It's not much ~~ but I think it's gr8!
        	       It's really cool & fun! The "bee's knees" if you will. 🐝

The request module's functionality really makes HTTP requests a breeze!
allon*_:#.,~?!$%;()\[\]\'/{}\\\'<>&`"eword

"within" should remain, but "it" should not
'''
demo_text_ex3a

The demo included in the solution cell below should display the following output:

```
'i just love python you will learn to love it too just as much as i do check out my site at it not much but i think it gr8 it really cool fun the bee knees if you will the request module functionality really makes http requests a breeze alloneword within should remain but it should not'
```

Note This is a simplified example which admittedly includes some nonsense. It checks most but not all requirements. If you need help comparing your output and the expected output you can use https://text-compare.com/.

In [ ]:
### Exercise 3a solution
def clean_text_phase_0(text):
    ### BEGIN SOLUTION
    clean = text.lower() # Convert to lowercase
    clean = clean_unicode(clean)
    clean = re.sub(r'http[^\s]+', '', clean) # Remove urls
    clean = re.sub(r'[\s\-+=]', ' ', clean) # Replace any space character, '-', '+', or '=' with ' '
    clean = re.sub(r'\'s', '', clean) # Remove "'s"
    clean = re.sub(r'[*_:#.,~?!$%;()\[\]\'/{}\\\'<>&`"]', '', clean) # Remove punctuation and grouping symbols 
    clean = ' '.join(clean.split())
    ## Moving to another solution
    #     clean = ' '.join(s for s in clean.split() if (re.search(r'[0-9]', s) is None and s not in stopwords)) # remove any "words" (sequences without spaces) containing numbers
    return clean 
    ### END SOLUTION
    
### demo function call
clean_text_phase_0(demo_text_ex3a)

The cell below will test your solution for Exercise 3a. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

Depending on the test cases drawn in this cell, a DeprecationWarning may be raised. This is nothing to be concerned about. It indicates that a feature we are using may be disabled in a future release, but it will work for now.

In [ ]:
### test_cell_ex3a
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_3a', 
    'func': clean_text_phase_0, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'text':{
            'dtype':'str', # data type of param.
            'check_modified':True,
        },
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'str',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

We ran the code below to do the following:

  • Clean the text for all posts.
  • Store the cleaned texts in a dictionary mapping the post id to the cleaned text.
  • Store this dictionary in resource/asnlib/publicdata/clean_children_phase_0.json
clean_children_phase_0 = {child['data']['id']: clean_text_phase_0(child['data']['selftext']) for child in children}
with open('resource/asnlib/publicdata/clean_children_phase_0.json', 'w') as f:
    json.dump(clean_children_phase_0, f, indent=4)

Exercise 3b - (2 Points):

Motivation (don't dwell on this)

The text in a Reddit post can contain a lot of characters and formatting which are not useful to our analysis. We need to clean up the text and standardize it.

There's a lot of cleaning to be done, so we will be taking care of it in two phases. This is the second phase.

Requirements

Define the function clean_text_phase_1(text, stopwords) which takes two inputs: text, a str to be cleaned, and stopwords, a set of str which should not be included in the final result. All of the following transformations must be performed on text and the end result returned as a str. Word removals should be done before dealing with the spaces. (Or you can take care of all 4 requirements in a clever one-liner!)

  • Remove any word which contains a number.
  • Remove any word which is a member of the set stopwords.
  • Replace any consecutive spaces with a single space.
  • Remove leading or trailing spaces from the end result. I.e. the end result should start and end with a non-space character.

The term "space character" should be taken to mean any character which matches the regex pattern '\s'. This includes but is not limited to spaces, newlines, and tabs. The term "non-space character" should be taken to mean all other characters.

The term "word" should be taken to mean a series of non-space characters which is wrapped by either space characters or the start or end of a string. We would interpret the string 'these are all words' to have 4 words.
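
As a worked example of these rules, the sketch below filters a made-up string against a tiny, hypothetical stopword set. It is one way to apply the four requirements, not necessarily the intended one; `toy_stopwords`, `toy_text`, `kept`, and `result` are illustrative names.

```python
import re

toy_stopwords = {'it', 'the'}   # hypothetical, much smaller than the real set
toy_text = '  gr8 it   python3 the  snake py2exe rocks '

# Split into words, then keep a word only if it has no digit
# and is not a stopword.
kept = [w for w in toy_text.split()
        if re.search(r'[0-9]', w) is None and w not in toy_stopwords]

# Joining with a single space also takes care of consecutive,
# leading, and trailing spaces.
result = ' '.join(kept)
print(repr(result))
```

Here 'gr8', 'python3', and 'py2exe' are dropped for containing a digit, and 'it' and 'the' are dropped as stopwords.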

In [ ]:
### Define demo inputs
with open('resource/asnlib/publicdata/children.json') as f:
    children = json.load(f)
demo_stopwords_ex3b = {'as', 'it', 'i', 'the'}
demo_text_ex3b = 'i just love python you will learn to love it too just as ' + \
    'much as i do check out my site at it not much but i think it gr8 it really ' + \
    'cool fun the bee knees if you will the request module functionality really ' + \
    'makes http requests a breeze alloneword within should remain but it should not'
demo_text_ex3b

The demo included in the solution cell below should display the following output:

```
'just love python you will learn to love too just much do check out my site at not much but think really cool fun bee knees if you will request module functionality really makes http requests a breeze alloneword within should remain but should not'
```

Note This is a simplified example which admittedly includes some nonsense. It checks most but not all requirements. If you need help comparing your output and the expected output you can use https://text-compare.com/.

In [ ]:
### Exercise 3b solution
def clean_text_phase_1(text, stopwords):
    ### BEGIN SOLUTION
    clean = text    
    clean = ' '.join(s for s in clean.split() if (re.search(r'[0-9]', s) is None and s not in stopwords))
    return clean 
    ### END SOLUTION
    
### demo function call
clean_text_phase_1(demo_text_ex3b, demo_stopwords_ex3b)

The cell below will test your solution for Exercise 3b. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

Depending on the test cases drawn in this cell, a DeprecationWarning may be raised. This is nothing to be concerned about. It indicates that a feature we are using may be disabled in a future release, but it will work for now.

In [ ]:
### test_cell_ex3b
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_3b', 
    'func': clean_text_phase_1, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'text':{
            'dtype':'str', # data type of param.
            'check_modified':True,
        },
        'stopwords':{
            'dtype':'set',
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'str',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

We ran the code below to apply the second phase of cleaning to clean_children_phase_0 and stored the result in resource/asnlib/publicdata/clean_children.json

clean_children = {post_id: clean_text_phase_1(text, stopwords) for post_id, text in clean_children_phase_0.items()}
with open('resource/asnlib/publicdata/clean_children.json', 'w') as f:
    json.dump(clean_children, f, indent=4)

Exercise 4 - (1 Points):

Motivation (don't dwell on this)

We are eventually going to perform analysis to determine how similar posts are to one another based on the text. In order to perform this analysis we have to vectorize the text. (Don't worry, the only math involved is counting.)

Requirements

Define the function vectorize_text(text). The input text is the cleaned text from a single post. The output should be a dict mapping each unique word (consecutive non-space characters separated by spaces) to the number of times that word occurs in text.

Note Any instance of dict is suitable as output. In particular, a defaultdict is acceptable.

In [ ]:
### Define demo inputs

with open('resource/asnlib/publicdata/clean_children.json') as f:
    clean_children = json.load(f)

demo_text_ex4 = 'foo bar baz baz tux foo tux foo tux bar kat'

The demo included in the solution cell below should display the following output:

{'foo': 3, 'bar': 2, 'baz': 2, 'tux': 3, 'kat': 1}

Note This is a simplified example to demonstrate the structure of the input and output.

Note If using something besides a base dict you may see something slightly different. You can use isinstance(demo_output_ex4, dict) to check if you are returning an instance of dict. (It should return True.)
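As a quick illustration of the note above, here is a minimal sketch (not part of the official solution) showing that dict subclasses such as Counter and defaultdict pass the isinstance check. The helper name vectorize_with_counter is hypothetical and is only used for this illustration.

```python
from collections import Counter, defaultdict

def vectorize_with_counter(text):
    # Hypothetical alternative: Counter tallies each whitespace-separated word.
    return Counter(text.split())

counts = vectorize_with_counter('foo bar baz baz tux foo tux foo tux bar kat')

# Both Counter and defaultdict are subclasses of dict.
print(isinstance(counts, dict))            # True
print(isinstance(defaultdict(int), dict))  # True

# Converting to a base dict gives the same mapping as the demo output.
print(dict(counts) == {'foo': 3, 'bar': 2, 'baz': 2, 'tux': 3, 'kat': 1})  # True
```

A Counter may print differently from a base dict (it shows its class name), which is the "slightly different" display the note refers to.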

In [ ]:
### Exercise 4 solution
def vectorize_text(text):
    ### BEGIN SOLUTION
    from collections import defaultdict
    vector = defaultdict(int)
    for word in text.split():
        vector[word] += 1
    return dict(vector)
    ### END SOLUTION
    
### demo function call
demo_output_ex4 = vectorize_text(demo_text_ex4)
demo_output_ex4

The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex4
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_4', 
    'func': vectorize_text, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'text':{
            'dtype':'str', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'dict',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

We ran the code below to do the following:

  • Vectorize the text for all posts.
  • Store the vectors in a dictionary mapping the post id to the associated vector.
  • Store this dictionary in resource/asnlib/publicdata/vectorized_children.json
vectorized_children = {id: vectorize_text(text) for id, text in clean_children.items()}
with open('resource/asnlib/publicdata/vectorized_children.json', 'w') as f:
    json.dump(vectorized_children, f, indent=4)

Exercise 5 - (2 Points):

Motivation (don't dwell on this)

One potentially useful metric for Reddit posts is the fraction of posts in which a certain keyword appears. For r/learnpython, this metric would give insight into trending topics in the Python space. Other developers have used similar metrics, enhanced with sentiment analysis, as part of profitable (and unprofitable) stock trading strategies.

Requirements

Define the function keyword_usage(vectors, keyword). The input vectors is a dictionary mapping post ids (str) to vectors (dict mapping words (str) to counts (int)). The structure of each vector will be the same as the output of Exercise 4. The input keyword is a str representing the keyword we are interested in.

The function should return a str matching this format:

'The keyword "<keyword>" occurs in <pct>% of posts in the sample.'

  • <keyword> is the keyword parameter.
  • <pct> is the percentage of posts which contain at least one instance of keyword. <pct> must be rounded to 2 decimal places and must always display exactly 2 decimal places (e.g., 15.00 is acceptable, but 15, 15.0, and 15.000 are not).
    • Use this formula for calculating <pct>
    • $\text{<pct>} = 100\times\frac{\text{\# of posts containing keyword}}{\text{\# of posts}}$

Note: Format strings are a great tool for accomplishing this task.
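To make the two-decimal requirement concrete, here is a small sketch of the format specifier at work. The counts below are made up for illustration and are not taken from the exam data.

```python
# Illustrative values only; they are not taken from the exam data.
matches, total = 3, 20
pct = 100 * matches / total

print(f"{pct:.2f}")        # 15.00 (two decimal places are always shown)
print(str(round(pct, 2)))  # 15.0  (round alone drops the trailing zero)
```

The `.2f` specifier both rounds and pads, so it is enough on its own to meet the formatting rule.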

In [ ]:
### Define demo inputs
with open('resource/asnlib/publicdata/vectorized_children.json') as f:
    vectorized_children = json.load(f)

demo_vectors_ex5 = {
    'post_0': {'pycharm':1, 'python':3, 'ide': 2, 'jupyter': 3},
    'post_1': {'python':1, 'ide': 2, 'jupyter': 3},
    'post_2': {'python':3, 'jupyter': 5},
    'post_3': {'jupyter': 2}
}

demo_keyword_0_ex5 = 'python'
demo_keyword_1_ex5 = 'jupyter'
demo_keyword_2_ex5 = 'ide'
demo_keyword_3_ex5 = 'pycharm'

The demo included in the solution cell below should display the following output:

The keyword "python" occurs in 75.00% of posts in the sample.
The keyword "jupyter" occurs in 100.00% of posts in the sample.
The keyword "ide" occurs in 50.00% of posts in the sample.
The keyword "pycharm" occurs in 25.00% of posts in the sample.

Note This is a simplified example to demonstrate the structure of the inputs and outputs.

Note The demo calls keyword_usage four times. Each line of output comes from a single function call.

In [ ]:
### Exercise 5 solution
def keyword_usage(vectors, keyword):
    ### BEGIN SOLUTION
    n = len(vectors)
    m = sum(keyword in vector.keys() for vector in vectors.values())
    return f"The keyword \"{keyword}\" occurs in {round(100*m/n, 2):.2f}% of posts in the sample."
    ### END SOLUTION
    
print(keyword_usage(demo_vectors_ex5, demo_keyword_0_ex5))
print(keyword_usage(demo_vectors_ex5, demo_keyword_1_ex5))
print(keyword_usage(demo_vectors_ex5, demo_keyword_2_ex5))
print(keyword_usage(demo_vectors_ex5, demo_keyword_3_ex5))

The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex5
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_5', 
    'func': keyword_usage, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'vectors':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        },
        'keyword':{
            'dtype':'str', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Cosine Similarity

Read this if you want more information on how we are calculating similarity. All formulas will be given in the exercises where they are used.

The dictionaries we created for each post's text in the earlier exercises can also be thought of as vectors in a high dimensional space. In this context we can say that there is an angle $\theta \in [0, \pi/2]$ between any two of these vectors $v_0$ and $v_1$. The more similar the two posts are, the smaller the angle will be, thus the larger $\cos\theta$ will be. We can calculate $\cos\theta$ with the following formula.

$$\cos\theta = \frac{v_0 \cdot v_1}{\|v_0\|\|v_1\|}$$

Let's denote the unit vectors with the same direction as $v_0$ and $v_1$ by $\hat{v_0}$ and $\hat{v_1}$ respectively. Then the formula becomes simpler.

$$\cos\theta = \hat{v_0} \cdot \hat{v_1}$$

We can calculate any unit vector $\hat{v_i} \in \mathbb{R}^n$ with the following:

$$\hat{v_i} = \frac{v_i}{\|v_i\|}$$

where $$\|v_i\| = \sqrt{\sum_{j = 0}^{n-1}{v_{i,j}^2}}$$
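The two formulas for $\cos\theta$ can be checked against each other with a short sketch. The helper names magnitude and dot are hypothetical, and the two small vectors are chosen only for illustration; missing words are treated as zeros, so only shared keys contribute to the dot product.

```python
from math import sqrt, isclose

def magnitude(v):
    # Square root of the sum of squared counts.
    return sqrt(sum(x ** 2 for x in v.values()))

def dot(a, b):
    # Words missing from either vector contribute zero, so only shared keys matter.
    return sum(a[k] * b[k] for k in a.keys() & b.keys())

v0 = {'python': 4, 'jupyter': 3}
v1 = {'python': 5, 'code': 12}

# Direct formula: dot product divided by the product of magnitudes.
cos_direct = dot(v0, v1) / (magnitude(v0) * magnitude(v1))

# Unit-vector formula: normalize first, then take the dot product.
u0 = {k: x / magnitude(v0) for k, x in v0.items()}
u1 = {k: x / magnitude(v1) for k, x in v1.items()}
cos_unit = dot(u0, u1)

print(round(cos_direct, 5), round(cos_unit, 5))  # 0.30769 0.30769
```

Both routes agree up to floating-point rounding, which is why normalizing once up front simplifies the later similarity step.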

Exercise 6 - (2 Points):

Motivation (don't dwell on this)

Normalizing the vectors (making them all unit length) before using them for calculation will make the next step easier. We are effectively performing the $\hat{v_i}$ calculation from the Cosine Similarity section.

$$\hat{v_i} = \frac{v_i}{\|v_i\|}$$

where $$\|v_i\| = \sqrt{\sum_{j = 0}^{n-1}{v_{i,j}^2}}$$

Requirements

Define the function normalize_vector(vector). The input vector ($v_i$ from the math above) will be a dict mapping words (str) to int. Perform the following calculations and return ret_vec.

  • SS = sum of the squares of each value in vector; i.e., $\sum_{j = 0}^{n-1}{v_{i,j}^2}$ from the math above.
  • magnitude = $\sqrt{\text{SS}}$; i.e., $\|v_i\|$ from the math above.
  • ret_vec = dictionary mapping each key in vector to the value associated with that key in vector divided by magnitude; i.e., $\hat{v_i}$ from the math above.
  • For example if vector['python'] = 5 and magnitude = 50 then ret_vec['python'] should be 0.1.
In [ ]:
### Define demo inputs

demo_vector_ex6_0 = {
    'python': 4,
    'jupyter': 3
}

demo_vector_ex6_1 = {
    'python': 5,
    'code': 12
}

The demo included in the solution cell below should display the following output:

{'python': 0.8, 'jupyter': 0.6}
{'python': 0.38461538461538464, 'code': 0.9230769230769231}

Note This demo runs normalize_vector twice (once on each input).

Note For this simplified example the intermediate values should be as follows:

input              SS   magnitude
demo_vector_ex6_0  25   5.0
demo_vector_ex6_1  169  13.0
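
One way to sanity-check a normalized vector is to confirm that its squared entries sum to 1. The sketch below works through the second row of the intermediate values using a literal copy of that input; it is an illustration of the arithmetic, not the graded solution.

```python
from math import sqrt, isclose

vector = {'python': 5, 'code': 12}
ss = sum(v ** 2 for v in vector.values())  # 25 + 144 = 169
magnitude = sqrt(ss)                        # 13.0
unit = {k: v / magnitude for k, v in vector.items()}

# The squared entries of a unit vector should sum to 1 (up to float rounding).
unit_ss = sum(v ** 2 for v in unit.values())
print(ss, magnitude, isclose(unit_ss, 1.0))  # 169 13.0 True
```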
In [ ]:
### Exercise 6 solution
def normalize_vector(vector):
    ### BEGIN SOLUTION
    from math import sqrt
    ss = sum(v**2 for v in vector.values())
    magnitude = sqrt(ss)
    return {k: v/magnitude for k, v in vector.items()}
    ### END SOLUTION
    
### demo function call
print(normalize_vector(demo_vector_ex6_0))
print(normalize_vector(demo_vector_ex6_1))

The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex6
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_6', 
    'func': normalize_vector, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'vector':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

We used the following code to normalize all the vectors and save the result in resource/asnlib/publicdata/normalized_children.json

normalized_children = {id:normalize_vector(vector) for id, vector in vectorized_children.items()}
with open('resource/asnlib/publicdata/normalized_children.json', 'w') as f:
    json.dump(normalized_children, f, indent=4)

Exercise 7 - (1 Point):

Motivation (don't dwell on this)

Now that we have normalized word vectors for each post, it's time to implement the similarity calculation.

Requirements

Define calc_similarity(vector_0, vector_1). The inputs vector_0 and vector_1 will be dicts mapping str to float. Perform the following calculations and return the end result.

  • Calculate the product of vector_0[k] times vector_1[k] for each key k that is present in both vector_0 and vector_1.
  • Compute the sum of all these products and round to 5 decimal places.
  • Return the rounded sum as a float.
In [ ]:
### Define demo inputs
with open('resource/asnlib/publicdata/normalized_children.json') as f:
    normalized_children = json.load(f)
demo_vector_0_ex7 = {'python': 0.8, 'jupyter': 0.6}
demo_vector_1_ex7 = {'python': 0.38461538461538464, 'code': 0.9230769230769231}

The demo included in the solution cell below should display the following output:

0.30769
1.0

Note This simplified example contains two runs of calc_similarity. The first run uses both demo inputs. The second run uses demo_vector_0_ex7 as both vector_0 and vector_1 to illustrate that comparing a vector to itself should always give the result of 1.0 since $\cos(0) = 1$.
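
The opposite extreme is also worth seeing: two posts with no words in common have no shared keys, so the sum is empty and the similarity is 0. The sketch below uses a hypothetical helper, shared_key_dot, that follows the steps listed in the requirements; the vectors are made up for illustration. It passes 0.0 as the start value to sum so that the empty case still yields a float.

```python
def shared_key_dot(vector_0, vector_1):
    # Hypothetical helper mirroring the steps listed in the requirements.
    shared = vector_0.keys() & vector_1.keys()
    # Starting the sum at 0.0 keeps the result a float even when nothing is shared.
    return round(sum((vector_0[k] * vector_1[k] for k in shared), 0.0), 5)

a = {'python': 0.8, 'jupyter': 0.6}
b = {'pandas': 0.6, 'numpy': 0.8}

print(shared_key_dot(a, b))  # 0.0
print(shared_key_dot(a, a))  # 1.0
```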

In [ ]:
### Exercise 7 solution
def calc_similarity(vector_0, vector_1):
    ### BEGIN SOLUTION
    sim = sum(vector_0[key]*vector_1[key] for key in (vector_0.keys() & vector_1.keys()))
    return round(sim, 5)
    ### END SOLUTION
    
### demo function call
print(calc_similarity(demo_vector_0_ex7, demo_vector_1_ex7))
print(calc_similarity(demo_vector_0_ex7, demo_vector_0_ex7))

The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex7
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_7', 
    'func': calc_similarity, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'vector_0':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        },
        'vector_1':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

We ran the following code to create a similarity matrix, a dict mapping str to dict (each inner dict maps str to float), and to write the result to resource/asnlib/publicdata/similarity_matrix.json. The value stored in similarity_matrix[a][b] is the cosine similarity score between post id a and post id b.

Based on how the matrix was constructed it is "square". In other words, if similarity_matrix[a] and similarity_matrix[b] exist then similarity_matrix[a][b] and similarity_matrix[b][a] are guaranteed to exist.

similarity_matrix = {
    id: {other_id: calc_similarity(normalized_children[id], normalized_children[other_id]) for other_id in normalized_children} \
    for id in normalized_children
    }
with open('resource/asnlib/publicdata/similarity_matrix.json', 'w') as f:
    json.dump(similarity_matrix, f, indent=4)
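
The "square" property described above can be checked mechanically. The sketch below runs the check on a tiny hand-built matrix (the values are illustrative, not loaded from the JSON file); the same expressions should apply to the full similarity_matrix, though that is an assumption we have not verified here.

```python
# Tiny illustrative matrix; not loaded from similarity_matrix.json.
tiny_matrix = {
    'post_0': {'post_0': 1.0, 'post_1': 0.79686},
    'post_1': {'post_0': 0.79686, 'post_1': 1.0},
}

ids = set(tiny_matrix)

# Square: every row has exactly the same keys as the outer dict.
is_square = all(set(row) == ids for row in tiny_matrix.values())

# Symmetric: the score from a to b equals the score from b to a.
is_symmetric = all(tiny_matrix[a][b] == tiny_matrix[b][a] for a in ids for b in ids)

print(is_square, is_symmetric)  # True True
```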

Exercise 8 - (3 Points):

Motivation (don't dwell on this)

We are interested in identifying the posts which are most similar to a given post. We have already done all of the math to capture that information, but it's in yet another huge JSON.

Requirements

Define n_most_similar(sim_matrix, post_id, n). The input sim_matrix will be a nested dict with the same structure as similarity_matrix described in the cell above. The input post_id will be the id of the post we are interested in and will be a key of sim_matrix. The function should do the following:

  • Get the "similarity vector" (sim_matrix[post_id]).
  • Sort the key/value pairs based on the similarity score. It is recommended but not required to have a list of tuples for this intermediate result. They should be sorted in descending order of similarity. If there are ties (2 or more posts with the same similarity to post_id) then break the ties by sorting them into alphabetical order.
  • Return a list of n+1 tuples containing the entry for post_id followed by the n most similar posts. Each tuple should be of the form (post id (str), similarity score (float)).

Note the post associated with post_id will always be at the top of the list since it will have a similarity of 1.0 to itself.
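
Since the demo data contains no ties, here is a small sketch of the tie-breaking rule on a made-up similarity vector. Sorting on the pair (negated score, post id) is one way to get descending scores with alphabetical order among ties; the ids and scores below are hypothetical.

```python
# Hypothetical similarity vector with a tie between 'post_b' and 'post_a'.
sim_vector = {'post_x': 1.0, 'post_b': 0.5, 'post_c': 0.7, 'post_a': 0.5}

# Negating the score sorts it in descending order; the id breaks ties alphabetically.
ranked = sorted(sim_vector.items(), key=lambda pair: (-pair[1], pair[0]))
print(ranked)
# [('post_x', 1.0), ('post_c', 0.7), ('post_a', 0.5), ('post_b', 0.5)]
```

Using sorted(..., reverse=True) with the same tuple key would also reverse the ids, which is why the score is negated instead.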

In [ ]:
### Define demo inputs

with open('resource/asnlib/publicdata/similarity_matrix.json') as f:
    similarity_matrix = json.load(f)

demo_sim_matrix_ex8 = {
  'post_0': {'post_0': 1.0,    'post_1': 0.79686,'post_2': 0.67635,'post_3': 0.52459,'post_4': 0.2334},
  'post_1': {'post_0': 0.79686,'post_1': 1.0,    'post_2': 0.65699,'post_3': 0.45758,'post_4': 0.11308},
  'post_2': {'post_0': 0.67635,'post_1': 0.65699,'post_2': 1.0,    'post_3': 0.51968,'post_4': 0.16411},
  'post_3': {'post_0': 0.52459,'post_1': 0.45758,'post_2': 0.51968,'post_3': 1.0,    'post_4': 0.33981},
  'post_4': {'post_0': 0.2334, 'post_1': 0.11308,'post_2': 0.16411,'post_3': 0.33981,'post_4': 1.0}}

demo_n_ex8 = 2
demo_post_id_0 = 'post_0'
demo_post_id_1 = 'post_3'

The demo included in the solution cell below should display the following output:

[('post_0', 1.0), ('post_1', 0.79686), ('post_2', 0.67635)]

[('post_3', 1.0), ('post_0', 0.52459), ('post_2', 0.51968)]

Note This demo calls n_most_similar twice. The outputs are separated by a blank line. The first uses demo_post_id_0 as post_id, and the second uses demo_post_id_1 as post_id.

In [ ]:
### Exercise 8 solution
def n_most_similar(sim_matrix, post_id, n):
    ### BEGIN SOLUTION
    sims = sorted(sim_matrix[post_id].items(), key=lambda x: (-x[1], x[0]))
    return sims[:n+1]
    ### END SOLUTION
    
print(n_most_similar(demo_sim_matrix_ex8, demo_post_id_0, demo_n_ex8))
print()
print(n_most_similar(demo_sim_matrix_ex8, demo_post_id_1, demo_n_ex8))

The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex8
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_8', 
    'func': n_most_similar, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'sim_matrix':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        },
        'post_id':{
            'dtype':'str', # data type of param.
            'check_modified':True,
        },
        'n':{
            'dtype':'int', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'S90rT5WLPFy08a82F6SXyiKoyeTi33DOHh7bxXASQw0=', path='resource/asnlib/publicdata/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'bE_KdJGq_bhBuoTZRi37O7tu3s38ac4bJUsgUEXnK8s=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(100):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Fin. If you have made it this far, congratulations on completing the exam. Don't forget to submit!