Spring 2025, Midterm 1: Game On!

Version 1.0.1

  • v1.0.1 - Fixed Ex3 type hint; added Ex6 clarification hint

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of built-in Python data structures such as lists and sets, nested data structures, math as code, and basic algorithm concepts. It has 9 exercises numbered 0 to 8. There are 18 available points; however, the threshold to earn 100% is 13 points. (Therefore, once you hit 13 points you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.

Debugging your code: Right before each exercise test cell there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head or a few rows at a time).

Exercise point breakdown:

  • Exercise 0: 2 point(s)

  • Exercise 1: 3 point(s)

  • Exercise 2: 2 point(s)

  • Exercise 3: 3 point(s)

  • Exercise 4: 1 point(s)

  • Exercise 5: 2 point(s)

  • Exercise 6: 2 point(s)

  • Exercise 7: 1 point(s) - FREE

  • Exercise 8: 2 point(s)

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure
In [ ]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from collections import defaultdict, Counter
from math import log
from pprint import pprint

utils.add_from_file('defaultdict_check', plugins)
In [ ]:
with open('resource/asnlib/publicdata/user_items.dill', 'rb') as f:
    users = dill.load(f)

with open('resource/asnlib/publicdata/games.dill', 'rb') as f:
    games = dill.load(f)

The Problem: Creating a Recommender System

Background. As of 2024, Steam is the largest digital distribution platform for selling and distributing video games. It hosts over 30,000 unique titles which are available for consumer purchase. Consumers who purchase games on Steam are provided with an account. These user accounts are tied to a user's game purchases, which makes it possible to see who owns which games. The storefront also tracks information about the games it distributes, such as relevant tags, the number of reviews, the text of individual reviews written by users, and more.

Your overall task. Your goal is to create an individually-tailored recommendation system for Steam. You will create two recommendation systems by taking two different approaches:

  1. Content filtering: you will try to recommend games which are similar to the games a user already likes.
  2. Collaborative filtering: you will try to recommend games which other, similar users seem to like.

At the end, we will combine these results to create an ordered list of recommended games which a user could purchase.

The datasets. You will work with two datasets to solve this problem. Both were obtained from research produced at the University of California, San Diego by Julian McAuley's research team, including work by Wang-Chen Kang. The datasets describe:

  1. A set of games hosted on Steam and information relevant to those games, such as their price, the name of the developer, and tags associated with the game.
  2. A set of user profiles and associated information about them, such as the games they own and the number of hours they have spent playing each game.

Both datasets are provided as Python lists. If you have not already done so, run the cells above this paragraph to load the data into memory.

Part 0: Data Exploration and Cleaning

Before we begin creating a recommendation system, we need to deal with the fact that our data are a bit messy. Let's start by cleaning up our inputs and organizing them so it's easier to work with our information later.

Exercise 0: (2 points)

dictionary_key_frequency

Your task: define dictionary_key_frequency as follows:

To begin, it will be helpful to get a sense for what sorts of attributes we have access to in our input data and how frequently we have access to that information. You will do this by completing the following task:

Calculate the frequencies of the keys found in a list of dictionaries.

Inputs:

  • list_of_dictionaries: A list of dictionaries.

Return:

  • key_frequencies: A dictionary.
    • The keys should be the keys found in the input dictionaries.
    • The values should be the frequencies of the keys.
      • The frequency of a key is calculated as the number of times it appears, divided by the total number of dictionaries.
      • Frequencies should be rounded to six decimal places.

Hints

  • You may find the Counter() data structure, provided by the collections library, helpful. It is not required to solve the problem.
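As a small sketch of the hint above (one possible approach, not the required one), Counter can tally keys across a list of dictionaries, and dividing by the list length gives the frequencies:

```python
# Illustrative sketch with made-up data, not the dataset itself.
from collections import Counter

demo = [{"a": 1, "b": 2}, {"b": 3}]

# Iterating over a dict yields its keys, so this counts every key occurrence.
counts = Counter(key for d in demo for key in d)

# Divide by the number of dictionaries and round to six decimal places.
freqs = {k: round(v / len(demo), 6) for k, v in counts.items()}
print(freqs)  # {'a': 0.5, 'b': 1.0}
```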
In [ ]:
### Solution - Exercise 0  
def dictionary_key_frequency(list_of_dictionaries: list) -> dict:
    ### BEGIN SOLUTION
    keys = [key for element in list_of_dictionaries for key in element]
    key_counts = dict(Counter(keys))
    num_elements = len(list_of_dictionaries)
    for key in key_counts:
        proportion = round(key_counts[key] / num_elements, 6)
        key_counts[key] = proportion

    return key_counts
    ### END SOLUTION

### Demo function call
demo_list_of_dict = [
    {"a": 1, "b": 2},
    {"a": 3, "b": 11, "c": 4},
    {"a": 5, "b": 6, "d": 7},
    {"b": 8, "c": 9}
]
print('Here is the desired output for `demo_list_of_dicts`:')
pprint(dictionary_key_frequency(demo_list_of_dict))
print('-------------------------------------------------')
print('Here are the keys in the dictionary and their frequencies for the demo output:')
pprint(dictionary_key_frequency(games))

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
with open('resource/asnlib/publicdata/dictionary_key_frequencies_demo.dill', 'rb') as fp:
    dictionary_key_frequencies_demo = dill.load(fp)

The demo should display this printed output.

Here is the desired output for `demo_list_of_dicts`:
{'a': 0.75, 'b': 1.0, 'c': 0.5, 'd': 0.25}
-------------------------------------------------
Here are the keys in the dictionary and their frequencies for the demo output:
{'app_name': 0.999938,
 'developer': 0.897339,
 'discount_price': 0.007002,
 'early_access': 1.0,
 'genres': 0.897837,
 'id': 0.999938,
 'metascore': 0.083305,
 'price': 0.95715,
 'publisher': 0.749432,
 'release_date': 0.935678,
 'reviews_url': 0.999938,
 'sentiment': 0.776505,
 'specs': 0.97915,
 'tags': 0.994928,
 'title': 0.936207,
 'url': 1.0}


The cell below will test your solution for dictionary_key_frequency (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 0  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=dictionary_key_frequency,
              ex_name='dictionary_key_frequency',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to dictionary_key_frequency did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=dictionary_key_frequency,
              ex_name='dictionary_key_frequency',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to dictionary_key_frequency did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 1: (3 points)

organize_game_info

Your task: define organize_game_info as follows:

Organize the information in games_list, represented as dictionaries with various attributes, into a more useful format for processing with the recommender system.

Inputs:

  • games_list: A list of dictionaries, where each dictionary represents a game and may or may not contain attributes 'id', 'name', and 'tags'.

    The term "attribute" is not the same as the term "key" in this context. The key for an attribute may or may not be the same as the attribute's name. (e.g. the key for the 'id' attribute may be 'game_id', 'id_number', 'id', or anything else.)

  • attribute_map: A dictionary that maps attribute names to their corresponding keys in the game dictionaries.

    For example: attribute_map['id'] is the key we would use to look up the 'id' attribute in one of the game dictionaries rather than the string 'id'.

  • game_info_lookup: An optional dictionary containing pre-existing game data that should be merged with the newly extracted information. Defaults to None.

Return:

  • game_info_lookup: A dictionary with the structure described below.

Requirements/steps:

  1. Extract relevant information:
    • Extract the 'id', 'name', and 'tags' of each game from the games_list using the attribute_map to determine the correct keys.
    • If the 'name' or 'id' of a game is not present or cannot be resolved, skip the extraction for that game.
    • If the 'tags' of a game are not present or cannot be resolved, set the 'tags' to an empty list.
    • Convert all tags to lowercase and remove any leading/trailing whitespace.
  2. Create a result dictionary:
    • Create a dictionary where the keys are game IDs and the values are dictionaries containing the 'name' and 'tags' of the game.
    • If the same game ID is found for multiple game dictionaries, the first appearance (smaller index in the list) should be processed, and later appearances should be ignored.
  3. Merge with existing information:
    • If game_info_lookup is provided, merge its contents with the newly created result dictionary.
    • In instances where a key is common to the result dictionary and game_info_lookup, the value from game_info_lookup should be kept as part of the post-merge result.
  4. Return the post-merge result dictionary.

Hints

  • If game_info_lookup is provided, make sure you don't modify the input!
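One way to satisfy the merge and no-mutation requirements, sketched here with hypothetical game entries: with dict unpacking, later entries overwrite earlier ones, so listing the existing lookup last keeps its values on collisions, and neither input is modified:

```python
# Hypothetical inputs for illustration only.
existing = {'1': {'name': 'Old Name', 'tags': []}}
new_info = {'1': {'name': 'New Name', 'tags': ['action']},
            '2': {'name': 'Game Two', 'tags': []}}

# Unpack new_info first, then existing: on shared keys, existing wins.
merged = {**new_info, **existing}

print(merged['1']['name'])  # Old Name  (existing value kept)
print('2' in merged)        # True      (new entries still added)
print(existing)             # unchanged: {'1': {'name': 'Old Name', 'tags': []}}
```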
In [ ]:
### Solution - Exercise 1  
def organize_game_info(games_list: list, attribute_map: dict, game_info_lookup=None) -> dict:
    ### BEGIN SOLUTION
    if not game_info_lookup:
        game_info_lookup = dict()
    else:
        game_info_lookup = game_info_lookup.copy()

    id_attr = attribute_map['id']
    name_attr = attribute_map['name']
    tags_attr = attribute_map['tags']
    
    for game in games_list:
        if id_attr not in game or name_attr not in game:
            continue
        id = game[id_attr]
        name = game[name_attr]
        tags = game.get(tags_attr, [])
        if tags:
            tags = [tag.lower().strip() for tag in tags]

        if id not in game_info_lookup:
            game_info_lookup[id] = {'name': name, 'tags': tags}

    return game_info_lookup
    ### END SOLUTION

### Demo function call
def example_organize_game_info(output, ex_num, game_arg, attr_map_arg, tf, ex_IDs):
    print(f'Example {ex_num} ------------------------------------------------------------')
    print('`organize_game_info` will be called with the following attribute map:')
    pprint(attr_map_arg)
    print('\nThe full results are stored in `game_info_lookup_demo`. Here are the results for a subset of the full dictionary.')
    pprint({ID: output[ID] for ID in ex_IDs})
    print(f'\nThe following should show `{tf}`: {"100" in output}\n')
# Example 0 (A small-scale demo)
print(f'Example 0 ------------------------------------------------------------')
print('This example is just to help you understand the logic of the problem.')
games_list = [
    {'name': 'Game A', 'game_tags': ['Action', 'Adventure']}, 
    {'game_id': 2, 'game_name': 'Game B', 'game_tags': ['RPG', 'Strategy']}, 
    {'id': 3, 'game_name': 'Game C', 'game_tags': ['RPG', 'Strategy']},
    {'game_id': 4, 'game_name': 'Game D', 'tags': ['FPS', 'Action']},
    {'game_id': 4, 'game_name': 'Game D', 'game_tags': ['FPS', 'Action']},
    {'game_id': 22, 'game_name': 'Game E', 'game_tags': []}
]
attribute_map = {
    'id': 'game_id',
    'name': 'game_name',
    'tags': 'game_tags'
}
pprint(organize_game_info(games_list, attribute_map))
# Example 1 (without existing dictionary)
attr_map_1 = {
    'id': 'id',
    'name': 'app_name',
    'tags': 'tags'
}
game_info_lookup_demo = organize_game_info(games, attr_map_1)
example_organize_game_info(game_info_lookup_demo, 1, games, attr_map_1, False, ('10', '1002', '100400', '10090'))
# Example 2 (with existing dictionary)
attr_map_2 = {
    'id': 'item_id',
    'name': 'item_name',
    'tags': 'tags'
}
all_user_games = [game for user in users for game in user.get('items', [])]
game_info_lookup_demo = organize_game_info(all_user_games, attr_map_2, game_info_lookup_demo)
example_organize_game_info(game_info_lookup_demo, 2, all_user_games, attr_map_2, True, ('10', '100', '10000'))

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
with open('resource/asnlib/publicdata/game_info_lookup_partial_demo.dill', 'rb') as fp:
    game_info_lookup_partial_demo = dill.load(fp)

with open('resource/asnlib/publicdata/game_info_lookup_demo.dill', 'rb') as fp:
    game_info_lookup_demo = dill.load(fp)

The demo should display this printed output.

Example 0 ------------------------------------------------------------
This example is just to help you understand the logic of the problem.
{2: {'name': 'Game B', 'tags': ['rpg', 'strategy']},
 4: {'name': 'Game D', 'tags': []},
 22: {'name': 'Game E', 'tags': []}}
Example 1 ------------------------------------------------------------
`organize_game_info` will be called with the following attribute map:
{'id': 'id', 'name': 'app_name', 'tags': 'tags'}

The full results are stored in `game_info_lookup_demo`. Here are the results for a subset of the full dictionary.
{'10': {'name': 'Counter-Strike',
        'tags': ['action',
                 'fps',
                 'multiplayer',
                 'shooter',
                 'classic',
                 'team-based',
                 'competitive',
                 'first-person',
                 'tactical',
                 "1990's",
                 'e-sports',
                 'pvp',
                 'military',
                 'strategy',
                 'score attack',
                 'survival',
                 'assassin',
                 '1980s',
                 'ninja',
                 'tower defense']},
 '1002': {'name': 'Rag Doll Kung Fu',
          'tags': ['indie', 'fighting', 'multiplayer']},
 '100400': {'name': 'Silo 2', 'tags': ['animation & modeling', 'software']},
 '10090': {'name': 'Call of Duty: World at War',
           'tags': ['zombies',
                    'world war ii',
                    'fps',
                    'action',
                    'multiplayer',
                    'shooter',
                    'moddable',
                    'co-op',
                    'first-person',
                    'singleplayer',
                    'war',
                    'online co-op',
                    'gore',
                    'historical',
                    'survival',
                    'classic',
                    'tanks',
                    'great soundtrack',
                    'adventure',
                    'horror']}}

The following should show `False`: False

Example 2 ------------------------------------------------------------
`organize_game_info` will be called with the following attribute map:
{'id': 'item_id', 'name': 'item_name', 'tags': 'tags'}

The full results are stored in `game_info_lookup_demo`. Here are the results for a subset of the full dictionary.
{'10': {'name': 'Counter-Strike',
        'tags': ['action',
                 'fps',
                 'multiplayer',
                 'shooter',
                 'classic',
                 'team-based',
                 'competitive',
                 'first-person',
                 'tactical',
                 "1990's",
                 'e-sports',
                 'pvp',
                 'military',
                 'strategy',
                 'score attack',
                 'survival',
                 'assassin',
                 '1980s',
                 'ninja',
                 'tower defense']},
 '100': {'name': 'Counter-Strike: Condition Zero Deleted Scenes', 'tags': []},
 '10000': {'name': 'Enemy Territory: Quake Wars', 'tags': []}}

The following should show `True`: True


The cell below will test your solution for organize_game_info (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 1  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=organize_game_info,
              ex_name='organize_game_info',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to organize_game_info did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=organize_game_info,
              ex_name='organize_game_info',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to organize_game_info did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Part 1: Content Filtering

Background. There are a few different ways to create a recommendation system. One approach is called content filtering. In this approach, we will try to build a model which tries to understand what types of things a person likes. Then, we will try to recommend things which are most similar to the things they tend to like.

We are going to try to do this by building a variant of a common classification model: the Naive Bayes Classifier.

Exercise 2: (2 points)

split_games_by_ownership

Your task: define split_games_by_ownership as follows:

Our model is going to need to know whether a user already owns a game or not.

Return a tuple which contains two sets: the IDs for the games a user owns, and the IDs for the games a user does not own.

Inputs:

  • user: A dictionary representing a user, containing an 'items' key whose value is a list of dictionaries, each with an 'item_id' key.
  • game_lookup: A dictionary where keys are game IDs and values are game details.

    The keys of game_lookup are the ids of all the games which could potentially be owned by a user.

Return:

  • The following elements should be returned as a two-element tuple:
    • owned: a set containing the IDs of every game owned by user.
    • not_owned: a set containing the IDs of every game not owned by user.

Requirements/steps:

  • If a user does not own any games, you should still return an empty set for owned.

Hints

  • A user's games are contained under the items key in their user dictionary.
  • The game IDs are contained under the item_id key for each game owned by the user.
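The two hints above can be sketched as follows, using made-up IDs: a set comprehension collects the owned IDs, and a set difference against the keys of the lookup yields everything else.

```python
# Hypothetical inputs for illustration only.
user = {'items': [{'item_id': 'a'}, {'item_id': 'c'}]}
game_lookup = {'a': {}, 'b': {}, 'c': {}, 'd': {}}

# .get with a default handles users with no 'items' key at all.
owned = {g['item_id'] for g in user.get('items', [])}

# set(dict) is the set of its keys; subtracting leaves the unowned IDs.
not_owned = set(game_lookup) - owned

print(sorted(owned))      # ['a', 'c']
print(sorted(not_owned))  # ['b', 'd']
```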
In [ ]:
### Solution - Exercise 2  
def split_games_by_ownership(user: dict, game_lookup: dict) -> tuple:
    ### BEGIN SOLUTION
    user_games = user.get('items', [])
    owned = set(map(lambda g: g.get('item_id'), user_games))
    all_games = set(game_lookup.keys())
    not_owned = all_games - owned
    return owned, not_owned
    ### END SOLUTION

### Demo function call
# Example 0: Simple demo
demo_user = {
    'user_id': 123,
    'items': [
        {'item_id': 'game_a'},
        {'item_id': 'game_c'}
    ]
}
demo_game_lookup = {
    'game_a': {'name': 'Game A'},
    'game_b': {'name': 'Game B'},
    'game_c': {'name': 'Game C'},
    'game_d': {'name': 'Game D'}
}
print('The following output is designed to help you understand the logic of the question:')
print(split_games_by_ownership(demo_user, demo_game_lookup))
print('-------------------------------------------------------------')
# Example 1: Expected output
owned_demo, not_owned_demo = split_games_by_ownership(users[0], game_info_lookup_demo)
print('The full output is contained in `owned_demo` and `not_owned_demo`.')
print('Here are the first 15 games owned by the first user, ordered by key:')
pprint(set(sorted([game for game in     owned_demo])[:15]))
print('Here are the first 15 games not owned by the first user, ordered by key:')
pprint(set(sorted([game for game in not_owned_demo])[:15]))

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
with open('resource/asnlib/publicdata/split_games_by_ownership_demo.dill', 'rb') as fp:
    owned_demo, not_owned_demo = dill.load(fp)

The demo should display this printed output.

The following output is designed to help you understand the logic of the question:
({'game_a', 'game_c'}, {'game_b', 'game_d'})
-------------------------------------------------------------
The full output is contained in `owned_demo` and `not_owned_demo`.
Here are the first 15 games owned by the first user, ordered by key:
{'10180',
 '10190',
 '102600',
 '104700',
 '104900',
 '105600',
 '10680',
 '107100',
 '107300',
 '107310',
 '108710',
 '11200',
 '113020',
 '113200',
 '116100'}
Here are the first 15 games not owned by the first user, ordered by key:
{'10',
 '100',
 '10000',
 '1002',
 '100400',
 '100410',
 '10080',
 '10090',
 '100970',
 '100980',
 '10100',
 '10110',
 '10120',
 '10130',
 '10140'}


The cell below will test your solution for split_games_by_ownership (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 2  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=split_games_by_ownership,
              ex_name='split_games_by_ownership',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=25)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to split_games_by_ownership did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=split_games_by_ownership,
              ex_name='split_games_by_ownership',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=5,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to split_games_by_ownership did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 3: (3 points)

get_tag_probabilities

Your task: define get_tag_probabilities as follows:

A Naive Bayes Classifier needs to know the probability that an item chosen at random from a class (classification group) has a given attribute.

Create a dictionary mapping each tag to the smoothed probability that a game in the population has that tag.

Inputs:

  • game_IDs: A set of game IDs which define a class of games. These IDs should all be present as keys in game_lookup.
  • game_lookup: A dictionary mapping game IDs to game metadata.
    • Each game's metadata is a dictionary including a tags key mapped to a list of that game's tags.
  • alpha: A provided smoothing constant which is an integer.

Return:

  • smoothed_outputs: A dictionary containing the following elements:
    • tag_probabilities: A dictionary with every tag in the games provided by game_IDs as keys and their smoothed probabilities as the values.
    • smoothed_default: The floating-point value of the formula when $N_i = 0$ (see formula below).
    • bign_i: A dictionary mapping each tag $i$ to $N_i$, the number of games in the class specified by game_IDs which have that tag (see formula below).
    • bign: The number of games in the class defined by game_IDs, represented by $N$ (see formula below).
    • d: The total number of unique tags (i.e., the cardinality of the set of tags) which appear across ALL the games found in game_lookup, represented by $d$ (see formula below).

Requirements/steps:

  • To calculate a smoothed probability, use the following formula: $$\hat{\theta}_i = \frac{N_i + \alpha}{N + (\alpha * d)}$$
  • In this formula:
    • $\hat{\theta}_i$: The smoothed probability for tag $i$.
    • $N_i$: The number of games in the class which have tag $i$. Put another way, the number of times tag $i$ appears in the games specified by game_IDs.
    • $N$: The number of games in the class defined by game_IDs.
    • $d$: The dimensionality of the data or the total number of unique tags (or the cardinality of the tags) which appear across ALL the games found in game_lookup.
    • $\alpha$: A provided smoothing constant.

Hints

  • You will need to get the tags associated with each game in game_IDs from game_lookup.
  • To calculate $N$ and $d$, you may find the len() function and the set() data structure useful.
  • To make counting the tag frequencies easier, you may find the Counter() data structure provided by the collections library helpful. It is not required to solve the problem.
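A worked instance of the smoothing formula, using made-up numbers (illustrative only, not drawn from the dataset), may help you check your arithmetic:

```python
# Suppose a tag appears in 3 of the 10 games in the class (N_i = 3, N = 10),
# there are 5 unique tags across all of game_lookup (d = 5), and alpha = 1.
N_i, N, d, alpha = 3, 10, 5, 1

theta_i = (N_i + alpha) / (N + alpha * d)
print(round(theta_i, 6))  # 0.266667  (= 4/15)

# The smoothed default is the same formula evaluated with N_i = 0.
default = alpha / (N + alpha * d)
print(round(default, 6))  # 0.066667  (= 1/15)
```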
In [ ]:
### Solution - Exercise 3  
def get_tag_probabilities(game_IDs: set, game_lookup: dict, alpha: int) -> dict:
    ### BEGIN SOLUTION
    tags_by_game = [game_lookup[game_ID]['tags'] for game_ID in game_IDs]
    total_games = len(tags_by_game)

    all_tags = {
        tag
        for game in game_lookup
        for tag in game_lookup[game]['tags']
    }
    n_tags = len(all_tags)
    
    tag_counts = Counter()
    for game_tags in tags_by_game:
        tag_counts.update(game_tags)
    bign_i = tag_counts.copy()
    for tag in tag_counts:
        tag_counts[tag] += alpha
        tag_counts[tag] /= total_games + (alpha * n_tags)
        
    smoothed_default = alpha / (total_games + (alpha * n_tags))

    output = {
        'tag_probabilities': dict(tag_counts),
        'smoothed_default': smoothed_default,
        'bign_i': dict(bign_i),
        'bign': total_games,
        'd': n_tags
    }
    return output
    ### END SOLUTION

### Demo function call
owned_probs_demo     = get_tag_probabilities(    owned_demo, game_info_lookup_demo, 1)
not_owned_probs_demo = get_tag_probabilities(not_owned_demo, game_info_lookup_demo, 1)
print('The full results are in `owned_probs_demo`.')
# Results for owned games
print('Here are the probabilities of a subset tags within the games owned by the user:')
pprint({tag:     owned_probs_demo['tag_probabilities'][tag] for tag in ('comedy', 'fps', 'strategy', 'puzzle')})
print(f'Your `smoothed_default` value for the games owned by the user is: {owned_probs_demo["smoothed_default"]}')
print('Here are the counts for a subset tags for `bign_i`:')
pprint({tag:     owned_probs_demo['bign_i'][tag] for tag in ('comedy', 'fps', 'strategy', 'puzzle')})
print(f'Your `bign` value for the games owned by the user is: {owned_probs_demo["bign"]}')
print(f'Your `d` value for the games owned by the user is: {owned_probs_demo["d"]}')
# Results for non-owned games
print('------------------------------------------------------------------')
print('Here are the probabilities of a subset tags within the games not owned by the user:')
pprint({tag: not_owned_probs_demo['tag_probabilities'][tag] for tag in ('comedy', 'fps', 'strategy', 'puzzle')})
print(f'Your `smoothed_default` value for the games not owned by the user is: {not_owned_probs_demo["smoothed_default"]}')
print('Here are the counts for a subset tags for `bign_i`:')
pprint({tag:     not_owned_probs_demo['bign_i'][tag] for tag in ('comedy', 'fps', 'strategy', 'puzzle')})
print(f'Your `bign` value for the games not owned by the user is: {not_owned_probs_demo["bign"]}')
print(f'Your `d` value for the games not owned by the user is: {not_owned_probs_demo["d"]}')

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
with open('resource/asnlib/publicdata/owned_probs_demo.dill', 'rb') as fp:
    owned_probs_demo     = dill.load(fp)

with open('resource/asnlib/publicdata/not_owned_probs_demo.dill', 'rb') as fp:
    not_owned_probs_demo = dill.load(fp)

The demo should display this printed output.

The full results are in `owned_probs_demo`.
Here are the probabilities of a subset tags within the games owned by the user:
{'comedy': 0.10193321616871705,
 'fps': 0.08787346221441125,
 'puzzle': 0.08084358523725835,
 'strategy': 0.11072056239015818}
Your `smoothed_default` value for the games owned by the user is: 0.0017574692442882249
Here are the counts for a subset of tags for `bign_i`:
{'comedy': 57, 'fps': 49, 'puzzle': 45, 'strategy': 62}
Your `bign` value for the games owned by the user is: 230
Your `d` value for the games owned by the user is: 339
------------------------------------------------------------------
Here are the probabilities of a subset of tags within the games not owned by the user:
{'comedy': 0.02604415274463007,
 'fps': 0.028639618138424822,
 'puzzle': 0.06166467780429594,
 'strategy': 0.22389618138424822}
Your `smoothed_default` value for the games not owned by the user is: 2.983293556085919e-05
Here are the counts for a subset of tags for `bign_i`:
{'comedy': 872, 'fps': 959, 'puzzle': 2066, 'strategy': 7504}
Your `bign` value for the games not owned by the user is: 33181
Your `d` value for the games not owned by the user is: 339


The cell below will test your solution for get_tag_probabilities (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 3  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=get_tag_probabilities,
              ex_name='get_tag_probabilities',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to get_tag_probabilities did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=get_tag_probabilities,
              ex_name='get_tag_probabilities',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=5,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to get_tag_probabilities did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 4: (1 point)

tag_probability_with_default

Your task: define tag_probability_with_default as follows:

It would be helpful to bundle our smoothed_default together with our collection of smoothed probabilities.

Create a default dictionary which returns the smoothed_default for a missing key.

Inputs:

  • tag_probabilities: Our dictionary which maps tags to probabilities.
  • smoothed_default: Our default, smoothed value, which is a float.

Return:

  • tag_probability_default: A default dictionary containing the values from tag_probabilities which returns the smoothed_default for missing keys.

Requirements/steps:

Hints

  • You can take several approaches to solve this problem. One is to loop over each key-value pair and update the new default dictionary. Alternatively, you can just pass the dictionary to the defaultdict constructor. Either approach works.
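One subtlety worth knowing (this is background, not part of the required solution): a defaultdict inserts a missing key the first time it is accessed with square brackets, whereas a membership test with in does not insert anything. A minimal sketch of that behavior, using made-up tags and probabilities:

```python
from collections import defaultdict

# Build a defaultdict from an existing mapping; the lambda supplies the
# value returned (and stored!) for any missing key.
probs = defaultdict(lambda: 0.5, {'comedy': 0.9})

print('fps' in probs)   # False -- membership tests do not insert keys
print(probs['fps'])     # 0.5   -- bracket access returns the default...
print('fps' in probs)   # True  -- ...and inserts the key as a side effect
```

This insert-on-access side effect is why the demo cell below checks membership of its mystery key before ever looking that key up.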
In [ ]:
### Solution - Exercise 4  
def tag_probability_with_default(tag_probabilities: dict, smoothed_default):
    ### BEGIN SOLUTION
    tag_probability_default = defaultdict(
        lambda: smoothed_default,
        tag_probabilities
    )
    return tag_probability_default
    ### END SOLUTION

### Demo function call
owned_probs_default_demo     = tag_probability_with_default(**{k:v for k,v in owned_probs_demo.items() if k in ['tag_probabilities','smoothed_default']})
not_owned_probs_default_demo = tag_probability_with_default(**{k:v for k,v in not_owned_probs_demo.items() if k in ['tag_probabilities','smoothed_default']})
try:
    assert "A SUPER SPECIAL MYSTERY KEY" not in owned_probs_default_demo
except:
    print('Did you insert the random key into the dictionary before we checked for it?')
try:
    assert isinstance(owned_probs_default_demo, defaultdict), "Are you SURE you're returning a default dictionary?"
except:
    print('Are you sure your dictionary is a defaultdict?')
print(f'Default value of one of our dictionaries: {owned_probs_demo["smoothed_default"]}')
print(f'Value associated with absent key:         {owned_probs_default_demo["A SUPER SPECIAL MYSTERY KEY"]}')

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
with open('resource/asnlib/publicdata/owned_probs_default_demo.dill', 'rb') as fp:
    owned_probs_default_demo     = dill.load(fp)
with open('resource/asnlib/publicdata/not_owned_probs_default_demo.dill', 'rb') as fp:
    not_owned_probs_default_demo = dill.load(fp)

The demo should display this printed output.

Default value of one of our dictionaries: 0.0017574692442882249
Value associated with absent key:         0.0017574692442882249


The cell below will test your solution for tag_probability_with_default (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 4  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.defaultdict_check(tag_probability_with_default),
              ex_name='tag_probability_with_default',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to tag_probability_with_default did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.defaultdict_check(tag_probability_with_default),
              ex_name='tag_probability_with_default',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to tag_probability_with_default did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 5: (2 points)

bernoulli_conditional

Your task: define bernoulli_conditional as follows:

A Naive Bayes model requires some conditional probability definition to function.

Define our conditional probability function as outlined below.

Inputs:

  • x_tag: Some tag, possibly contained as a key in prob_lookup.
  • present: An integer equal to 0 or 1, as defined below.
  • prob_lookup: A default dictionary from exercise 4. It will return the smoothed probability for x_tag, even when x_tag is a missing key.

Return:

  • conditional_prob: The Bernoulli conditional probability, as defined below.

Requirements/steps:

  • Return the Bernoulli conditional probability, which we can write with the following notation: $$P(x_i|y)$$
  • You can understand this step as computing the conditional probability that our random variable is in one of two states.

Hints

  • $P(x_i=1|y)$ is the value contained in prob_lookup, associated with x_tag.
    • Since a Bernoulli variable can be equal to 1 or 0, you will need to figure out what to do when $x_i = 0$. The Wikipedia article on the Bernoulli distribution may be useful!
  • $x_i$ is given by our input variable, i.e., $x_i=present$.
  • You might find it easiest to break the equation into multiple parts and add them at the end.
  • Alternatively, you might realize that you can simply use an if statement, depending on the value of present. That works too!
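The hints above amount to the Bernoulli probability mass function. Writing $p = P(x_i=1|y)$ for the value stored in prob_lookup, we have

$$P(x_i = k \mid y) = p^k(1-p)^{1-k}, \quad k \in \{0, 1\},$$

which, since $k$ is either 0 or 1, can equivalently be written additively as $p \cdot k + (1-p)(1-k)$. Either form evaluates to $p$ when the tag is present ($k=1$) and to $1-p$ when it is absent ($k=0$).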
In [ ]:
### Solution - Exercise 5  
def bernoulli_conditional(x_tag: str, present: int, prob_lookup: dict):
    ### BEGIN SOLUTION
    A =      prob_lookup[x_tag]  * present
    B = (1 - prob_lookup[x_tag]) * (1 - present)
    return A + B
    ### END SOLUTION

### Demo function call
# Example 0: Simple Case
prob_lookups = {"category_A": 0.7, "category_B": 0.3}

prob_given_A_present      = bernoulli_conditional("category_A", 1, prob_lookups)
prob_given_A_not_present  = bernoulli_conditional("category_A", 0, prob_lookups)

prob_given_B_present      = bernoulli_conditional("category_B", 1, prob_lookups)
prob_given_B_not_present  = bernoulli_conditional("category_B", 0, prob_lookups)
print(f"Category A has a {prob_given_A_present:.3} probability of being present and a {prob_given_A_not_present:.3} probability of not being present.")
print(f"Category B has a {prob_given_B_present:.3} probability of being present and a {prob_given_B_not_present:.3} probability of not being present.")
print("------------------------------------------------------------")

# Example 1: Demo Output
bernoulli_conditional_prob_demo = bernoulli_conditional('strategy', 0, owned_probs_default_demo)
print(f'Your conditional function produced the following probability when the "strategy" tag is absent:\n{bernoulli_conditional_prob_demo}')
bernoulli_conditional_prob_demo = bernoulli_conditional('strategy', 1, owned_probs_default_demo)
print(f'Your conditional function produced the following probability when the "strategy" tag is present:\n{bernoulli_conditional_prob_demo}')

The demo should display this printed output.

Category A has a 0.7 probability of being present and a 0.3 probability of not being present.
Category B has a 0.3 probability of being present and a 0.7 probability of not being present.
------------------------------------------------------------
Your conditional function produced the following probability when the "strategy" tag is absent:
0.8892794376098418
Your conditional function produced the following probability when the "strategy" tag is present:
0.11072056239015818


The cell below will test your solution for bernoulli_conditional (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 5  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=bernoulli_conditional,
              ex_name='bernoulli_conditional',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to bernoulli_conditional did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=bernoulli_conditional,
              ex_name='bernoulli_conditional',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to bernoulli_conditional did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Aside: What IS a Naive Bayes Model?

IMPORTANT! You do not need to read this section to solve exercise 6! You might find it helpful, but it is not required.

Notes. A Naive Bayes Classifier typically works by using the following formula:

$$\hat{y} = \text{argmax}_y P(y)\prod_{i=1}^n P(x_i|y)$$

We calculate the right-hand side of the equation for each class, $y$, and simply choose the class $y$ which maximizes the result.

To create a recommendation system, we will instead rank each game by the ratio of these products of conditional probabilities under the two classes. In other words, we want to recommend the games with the highest value of the following expression:

$$\frac{\prod_{i=1}^n P(x_i|y_{\text{owned}})}{\prod_{i=1}^n P(x_i|y_{\text{not owned}})}$$

However, due to numerical considerations related to small floating-point values, we'll actually take the logarithms of both the numerator and denominator and calculate the difference. So, the equation we actually want to maximize is this:

$$\sum_{i=1}^n\ln{P(x_i|y_{\text{owned}})} - \sum_{i=1}^n\ln{P(x_i|y_{\text{not owned}})}$$

This is what you will compute in exercise 6.

If you want to learn more, you may find the relevant Scikit-Learn page informative!
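The "numerical considerations" above are easy to demonstrate. With a few hundred tags, the product of the conditional probabilities underflows to zero in double-precision floating point, while the sum of their logarithms stays comfortably in range. A minimal sketch (the probabilities here are made up, not from the exam data):

```python
from math import log, prod

# 500 fictitious conditional probabilities of 0.01 each
probs = [0.01] * 500

print(prod(probs))                 # 0.0 -- 1e-1000 underflows to zero
print(sum(log(p) for p in probs))  # about -2302.6, easily representable
```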

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will load a function needed for Exercise 6.

In [ ]:
### Run Me!!!
demo_b_cond = utils.load_object_from_publicdata('demo_b_cond')

Exercise 6: (2 points)

bernoulli_bayes

Your task: define bernoulli_bayes as follows:

For a given game, return the difference of the log-sums of the conditional probabilities.

Inputs:

  • b_cond: A correctly defined bernoulli_conditional function. You may obtain $P(x_i|y)$ by calling b_cond(tag, present, probs) (where probs is either owned_probs or not_owned_probs).
  • game: A game dictionary e.g. {name: str, tags: list}, in the form provided by exercise 1.
  • owned_probs: A default dictionary containing the probabilities for $P(x_i|y_{\text{owned}})$
  • not_owned_probs: A default dictionary containing the probabilities for $P(x_i|y_{\text{not owned}})$
  • all_tags: A set containing every tag. $x_i$ is an arbitrary tag in all_tags.

Return:

  • log_prob_diff: The difference in the sum of the log probabilities, as defined below.

Requirements/steps:

  • The difference in the log probabilities is defined by this equation: $$\sum_{i=1}^n\ln{P(x_i|y_{\text{owned}})} - \sum_{i=1}^n\ln{P(x_i|y_{\text{not owned}})}$$
  • If a game has no tags, return a value of 0.

Hints

  • You will need to use the log function, provided by the math module.
  • $x_i$ is equal to 1 if a tag is present for game. Otherwise, it should be equal to 0.

Note: the b_cond supplied will fail (raise a KeyError) if you reference any tag outside the feature space defined by all_tags. Use this to your advantage when debugging.

In [ ]:
### Solution - Exercise 6  
def bernoulli_bayes(b_cond, game: dict, owned_probs: dict, not_owned_probs: dict, all_tags: set):
    ### BEGIN SOLUTION
    tags = set(game['tags'])
    if not tags:
        return 0
    owned_prob = 0
    not_owned_prob = 0
    for tag in all_tags:
        present = 1 if tag in tags else 0
        owned_prob     += log(b_cond(tag, present, owned_probs))
        not_owned_prob += log(b_cond(tag, present, not_owned_probs))
    return owned_prob - not_owned_prob
    ### END SOLUTION

### Demo function call
# Part 1: Computing the log-probabilities
all_tags = set(tag for game in game_info_lookup_demo for tag in game_info_lookup_demo[game]['tags'])
print('Here is the difference in the log-probabilities produced for "Call of Duty: World at War":')
bernoulli_bayes_prob_demo = bernoulli_bayes(
    demo_b_cond,
    game_info_lookup_demo['10090'],
    owned_probs_default_demo,
    not_owned_probs_default_demo,
    all_tags
)
print(bernoulli_bayes_prob_demo)
# Part 2: Using the scores to create recommendations
game_items = game_info_lookup_demo.values()
game_order = map(
    lambda g: bernoulli_bayes(
        demo_b_cond,
        g,
        owned_probs_default_demo,
        not_owned_probs_default_demo,
        all_tags
    ),
    game_items
)
sorted_games = sorted(zip(game_items, game_order), key=lambda x: x[1], reverse=True)[:15]
content_recs_demo = [game[0]['name'] for game in sorted_games]
print('Here are the 15 highest rated games for our user, based on content filtering:')
pprint(content_recs_demo)

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will show the expected output for the demo cell above.

In [ ]:
with open('resource/asnlib/publicdata/bernoulli_bayes_recs.dill', 'rb') as fp:
    bernoulli_bayes_recs_demo = dill.load(fp)

The demo should display this printed output.

Here is the difference in the log-probabilities produced for "Call of Duty: World at War":
12.141327968970366
Here are the 15 highest rated games for our user, based on content filtering:
['Call of Juarez® Gunslinger',
 'Call of Duty® 4: Modern Warfare®',
 'Team Fortress 2',
 'Call of Duty®: Ghosts',
 'Death Squared',
 'Counter-Strike: Global Offensive',
 'Killing Floor',
 'Hitman: Absolution™',
 'Grand Theft Auto IV: Complete Edition',
 'Deus Ex: Game of the Year Edition',
 'Saints Row 2',
 'Left 4 Dead 2',
 'Saints Row IV',
 'Antichamber',
 'Left 4 Dead']


The cell below will test your solution for bernoulli_bayes (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 6  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=bernoulli_bayes,
              ex_name='bernoulli_bayes',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to bernoulli_bayes did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=bernoulli_bayes,
              ex_name='bernoulli_bayes',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to bernoulli_bayes did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Part 2: Collaborative Filtering

Background. There is another approach we can take to try to determine which games we should recommend to users. This approach is called collaborative filtering. In this approach, we assume that users who are similar to each other are likely to enjoy similar things.

We can exploit this by calculating some score for similarity between two users. Then, we use those scores as weights and sum a collection of weighted votes to rank our items.

You will implement a version of this approach in exercise 8, using the results from the (free!) exercise 7.
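The weighted-vote idea can be sketched with a toy example before you implement it for real. All of the users, games, and similarity scores below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical similarities between "our" user and three other users
similarity = {'user_1': 0.1, 'user_2': 0.35, 'user_3': 0.2}

# Hypothetical game libraries for those same users
libraries = {
    'user_1': {'Portal 2', 'Terraria'},
    'user_2': {'Portal 2'},
    'user_3': {'Portal 2', 'Left 4 Dead 2'},
}

# Each game accumulates the similarity score of every user who owns it
scores = defaultdict(float)
for user, games in libraries.items():
    for game in games:
        scores[game] += similarity[user]

# 'Portal 2' ends up with 0.1 + 0.35 + 0.2 = 0.65 (up to floating-point
# rounding), so it outranks 'Terraria' (0.1) and 'Left 4 Dead 2' (0.2).
print(dict(scores))
```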

Exercise 7: (1 point) - FREE

cosine_similarity

Example: we have defined cosine_similarity as follows:

IMPORTANT! The following exercise is free, but you MUST run the cells to earn the points!

Define the function to calculate the cosine similarity of two users, given the collections of games they own.

Inputs:

  • game_set_A: A collection of game IDs which are owned by user A.
  • game_set_B: A collection of game IDs which are owned by user B.
  • alpha: A smoothing constant.
  • feature_size: The value $n$ used to scale the smoothing term in the denominator.

Return:

  • smoothed_similarity: The smoothed similarity, as defined below.

Requirements/steps:

  • Calculate the smoothed cosine similarity, which is defined as: $$\frac{\sum_{i=1}^n [A_i * B_i] + \alpha}{(\sqrt{\sum_{i=1}^n A_i^2} * \sqrt{\sum_{i=1}^n B_i^2}) + (\alpha * n)}$$
  • This can be represented much more succinctly by the expression: $$\frac{A \cdot B + \alpha}{||A|| * ||B|| + (\alpha * n)}$$

Notes:

  • We will spend a lot more time talking about vector math like this over the duration of the rest of the course!
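It may help to see why the provided solution below can work with plain sets. Treat each game library as a binary indicator vector over all games (1 if the game is owned, 0 otherwise). For such vectors,

$$A \cdot B = \sum_{i=1}^n A_i B_i = |A \cap B|, \qquad ||A|| = \sqrt{\sum_{i=1}^n A_i^2} = \sqrt{|A|},$$

so the dot product is simply the size of the intersection of the two game sets, and each norm is the square root of the library's size.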
In [ ]:
### Solution - Exercise 7  
def cosine_similarity(game_set_A: set, game_set_B: set, alpha: float, feature_size: int):
    A_mag = len(game_set_A) ** (1/2)
    B_mag = len(game_set_B) ** (1/2)
    A_dot_B = len(game_set_A & game_set_B)

    smoothed_numerator = A_dot_B + alpha
    smoothed_denominator = (A_mag * B_mag) + (alpha * feature_size)

    return smoothed_numerator/smoothed_denominator

### Demo function call
set_A = set(item.get('item_id') for item in users[0].get('items'))
set_B = set(item.get('item_id') for item in users[1].get('items'))
cosine_similarity_demo = cosine_similarity(set_A, set_B, 0.01, 28000)
print(f'Our two chosen users have a cosine similarity of:\n{cosine_similarity_demo}')

The demo should display this printed output.

Our two chosen users have a cosine similarity of:
0.12979939616717187


The test cell below will always pass. Please submit to collect your free points for cosine_similarity (exercise 7).

In [ ]:
### Test Cell - Exercise 7  


print('Passed! Please submit.')

Exercise 8: (2 points)

create_collab_scores

Your task: define create_collab_scores as follows:

Use the weights calculated by our similarity function to create weighted scores for the games in ID_to_games.

Inputs:

  • user_key: A key which can be used to get the games owned by the user out of the ID_to_games dictionary.
  • ID_to_games: A dictionary which maps user IDs to a set of games they own.
  • sim_func: A function which calculates the cosine similarity of two users' game sets. Call it by writing sim_func(user_set_A, user_set_B) where the arguments are two sets of user games obtained from ID_to_games.

Return:

  • game_scores: A dictionary of ALL game scores, calculated as defined below.

Requirements/steps:

  • Calculate the similarities of the user provided by user_key with every other user in ID_to_games.
  • Then, sum the similarities for each game based on which game a user owns.
    • In other words, if 'Portal 2' is owned by three users and their similarity scores are 0.1, 0.35, and 0.2, then the final score for 'Portal 2' should be 0.65.
    • DO NOT add the similarity score produced by comparing the set of games obtained by the user_key with itself!

Hints

  • You may find a defaultdict is useful here, but it is not required.
In [ ]:
### Solution - Exercise 8  
def create_collab_scores(user_key: int, ID_to_games: dict, sim_func):
    ### BEGIN SOLUTION
    game_set_A = ID_to_games[user_key]
    cosine_similarities = {
        user_ID: sim_func(game_set_A, ID_to_games[user_ID])
        for user_ID
        in ID_to_games
    }

    game_scores = defaultdict(int)

    for user_ID in ID_to_games:
        if user_ID == user_key:
            continue
        for game in ID_to_games[user_ID]:
            game_scores[game] += cosine_similarities[user_ID]
            
    return dict(game_scores)
    ### END SOLUTION

### Demo function call
# Create our inputs
user_ID_to_games = {
    user.get('user_id', None): set(
        item.get('item_id', None)
        for item
        in user.get('items')
    ) for user in users
}

def wrap_sim_func(alpha, feature_size):
    def inner(set_A, set_B):
        return cosine_similarity(set_A, set_B, alpha, feature_size)
    return inner

demo_sim_func = wrap_sim_func(0.01, 32000)

# Create our collaborative scores
demo_collab_scores = create_collab_scores(users[0]['user_id'], user_ID_to_games, demo_sim_func)

# Inspect our results
collaborative_recs_demo = sorted(demo_collab_scores, key=lambda game: demo_collab_scores[game], reverse=True)[:15]
collaborative_recs_demo = [game_info_lookup_demo[g]['name'] for g in collaborative_recs_demo]
print(f'Collaborative score for "Call of Duty: World at War":\n{demo_collab_scores["10090"]}')
print('Here are the top 15 recommended games for our user, based on collaborative filtering:')
pprint(collaborative_recs_demo)

Example. RUN ME!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
with open('resource/asnlib/publicdata/demo_collab_scores.dill', 'rb') as fp:
    demo_collab_scores = dill.load(fp)

with open('resource/asnlib/publicdata/demo_collab_recs.dill', 'rb') as fp:
    demo_collab_recs = dill.load(fp)

The demo should display this printed output.

Collaborative score for "Call of Duty: World at War":
61.151157105373954
Here are the top 15 recommended games for our user, based on collaborative filtering:
["Garry's Mod",
 'Counter-Strike: Global Offensive',
 'Left 4 Dead 2',
 'Terraria',
 'Unturned',
 'Portal 2',
 'The Elder Scrolls V: Skyrim',
 'PAYDAY 2',
 'Borderlands 2',
 'Counter-Strike: Source',
 'Warframe',
 'Half-Life 2',
 'Portal',
 "Sid Meier's Civilization® V",
 'Just Cause 2']


The cell below will test your solution for create_collab_scores (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 8  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=create_collab_scores,
              ex_name='create_collab_scores',
              key=b'bXITTMSvN2mdmM8cprP7s1wI32RY8znER4wcdi8MGTY=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to create_collab_scores did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=create_collab_scores,
              ex_name='create_collab_scores',
              key=b'6eyvyfuHYk8oOnG006W1whQrRHKNi9LV7vkKWcQmRwc=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to create_collab_scores did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Fin

If you've made it this far, congratulations! You are done. Please submit your exam! The remainder of this notebook is a reflection on our approach and food for further thought.

Postscript: Combining Our Results

You might reasonably be wondering: is there a way for us to combine our results from content and collaborative filtering?

The answer is: absolutely! One approach is to simply weight our rankings and then do something similar to how we approached exercise 8. The code below shows an example of how we might do this. Note that there are many other approaches we could also take. This is just a relatively simple approach.

In [ ]:
def hybrid_recommendations(content_recs: list, collaborative_recs: list, alpha: float) -> list:
    content_weights       = [(1/rank) * alpha       for rank in range(1, len(content_recs) + 1)]
    collaborative_weights = [(1/rank) * (1 - alpha) for rank in range(1, len(collaborative_recs) + 1)]

    recommendation_sort_weights = defaultdict(int)
    for rec, weight in zip(content_recs, content_weights):
        recommendation_sort_weights[rec] += weight
    for rec, weight in zip(collaborative_recs, collaborative_weights):
        recommendation_sort_weights[rec] += weight

    sorted_recs = sorted(recommendation_sort_weights.items(), key=lambda x: x[1], reverse=True)
    return sorted_recs
In [ ]:
with open('resource/asnlib/publicdata/bernoulli_bayes_recs.dill', 'rb') as fp:
    content_recs_demo = dill.load(fp)
with open('resource/asnlib/publicdata/demo_collab_scores.dill', 'rb') as fp:
    demo_collab_scores = dill.load(fp)
with open('resource/asnlib/publicdata/demo_collab_recs.dill', 'rb') as fp:
    collaborative_recs_demo = dill.load(fp)

hybrid_recs = hybrid_recommendations(content_recs_demo, collaborative_recs_demo, 0.5)
print('Here are our final recommendations, placing equal weight on content and collaborative filtering:')
pprint(hybrid_recs)

A separate question you might have is: "how good is our recommendation system?"

This is a great question. To properly answer this question, we would need to split our data into training and test sets and evaluate some accuracy metric over the test sets. One metric we might be able to use is the MAP@K, or "Mean Average Precision at K". We encourage those of you who are curious to try obtaining some precision metrics and seeing whether you can improve your recommendations!
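As a concrete starting point, average precision at K can be computed as follows. This is a generic sketch under the usual definition; it is not part of the exam, and the function name ap_at_k is our own. MAP@K would then be the mean of this value across all test users:

```python
def ap_at_k(recommended: list, relevant: set, k: int) -> float:
    """Average precision at k: the mean of precision@i over each rank i
    (within the top k) at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i  # precision at this rank
    return precision_sum / min(len(relevant), k)

# Toy example: two of the top three recommendations are actually relevant,
# giving (1/1 + 2/3) / 2, or about 0.833
print(ap_at_k(['Portal 2', 'Terraria', 'Unturned'], {'Portal 2', 'Unturned'}, 3))
```

For evaluation, "relevant" items would typically be the games a held-out test user actually owns.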