Final Exam: Olympics!

Version 1.0.0

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of Pandas, Python data structures, SQL, string manipulation, feature engineering, train/test splits, and model performance metrics. It has 11 exercises, numbered 0 to 10, worth 23 points in total. However, to earn 100%, you only need 14 points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.

Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed. (Be careful when printing large objects; you may want to print just the head, or chunks of rows at a time.)
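For instance, here is a small sketch of previewing a large object rather than printing it whole (the variable name is hypothetical, not one of the exam's debug variables):

```python
import pandas as pd

# Hypothetical large debug object -- preview it instead of printing it all
big_df = pd.DataFrame({'x': range(100_000)})
print(big_df.head(10))      # just the first 10 rows
print(big_df.iloc[50:60])   # or a specific chunk of rows
```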

Exercise point breakdown:

  • Exercise 0: 1 point

  • Exercise 1: 4 points

  • Exercise 2: 1 point

  • Exercise 3: 3 points

  • Exercise 4: 2 points

  • Exercise 5: 2 points

  • Exercise 6: 2 points

  • Exercise 7: 3 points

  • Exercise 8: 2 points

  • Exercise 9: 2 points

  • Exercise 10: 1 point

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed, and take short breaks at your leisure

Run Me!!!

In [1]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time

Task Overview 🏅

For this exam, you are given Olympic athlete data and tasked with analyzing historical data from the Olympic Games to uncover insights and build a predictive model. The primary dataset contains information on athletes, including their demographics, events, medals, and countries across 120 years of Olympic Games (Summer and Winter). A secondary dataset contains information about the countries hosting the Olympics, including the year, cities, and costs.

🔍 Part 1: Exploratory Data Analysis (EDA)

  • Understanding the initial data contained within the datasets (Ex0)
  • Cleaning the athlete datasets to prepare for our analysis (Ex1 and Ex2)
  • Performing in-depth EDA to understand, rank, analyze, and visualize the athlete data (Ex3, Ex4, Ex5)
  • Analyzing trends associated with the costs of hosting the Olympic games (Ex6, Ex7)

🧠 Part 2: Feature Engineering

  • Transforming the categorical data to use within various modeling techniques (Ex8)
  • Applying preprocessing techniques, including imputing missing data and scaling features for modeling (Postscript)

🔮 Part 3: Predictive Modeling

  • Creating training and test sets for our modeling (Ex9)
  • Evaluating model outputs using a variety of model evaluation metrics (Ex10)
  • Predicting 2016 Olympic medal winners based on training data from 1992-2014 using an assortment of models and their hyperparameters (Postscript)

Data Origin

The primary dataset includes 2 components from Kaggle:

  • results_df which includes athlete-level data for over 120 years of Olympic events
  • noc_regions_df which contains country-level enrichment data.

The secondary dataset (included as SQL tables) includes:

  • olympic_costs which details Olympiad budgets and costs from Wikipedia
  • inflation which comprises annual inflation data from 1929-2023 from Investopedia

Run the cell below to load the data!

In [2]:
### Run Me!!!
import pandas as pd
import numpy as np
from pprint import pprint
import sqlite3

results_df=utils.load_object_from_publicdata('results_df.dill')
noc_regions_df=utils.load_object_from_publicdata('noc_regions_df.dill')
conn=sqlite3.connect('resource/asnlib/publicdata/olympic.db')
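If you want to see which tables a connection exposes, one standard query is the sketch below. It is illustrated against an in-memory database (since olympic.db is only available in the exam environment), with hypothetical table schemas:

```python
import sqlite3
import pandas as pd

# Illustration only: an in-memory database stands in for olympic.db here
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE olympic_costs (year INTEGER, cost REAL)')
demo_conn.execute('CREATE TABLE inflation (year INTEGER, rate REAL)')

# List the tables available on the connection
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", demo_conn)
print(tables)
```

Against the real conn, the same query would list olympic_costs and inflation.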

Exercises

Exercise 0: (1 point)

explore_olympics_data__FREE

Example: we have defined explore_olympics_data__FREE as follows:

This is a free exercise! Please run the test cell below to collect your FREE point! We encourage you to review the samples of both results_df and noc_regions_df in the cell below:

  • results_df A DataFrame of historic results from athletes competing in both summer and winter Olympics for over 120 years.
  • noc_regions_df A DataFrame mapping NOCs to their respective country. A National Olympic Committee (NOC) is a country's organization that is responsible for its participation in the Olympic Games.
In [3]:
### Solution - Exercise 0  
def explore_olympics_data__FREE(results_df: pd.DataFrame, noc_regions_df: pd.DataFrame, headsize:int=10) -> tuple:
    results_df_preview = results_df.head(n=headsize)
    noc_regions_df_preview = noc_regions_df.head(n=headsize)
    return (results_df_preview,noc_regions_df_preview)

### Demo function call
results_df_preview, noc_regions_df_preview = explore_olympics_data__FREE(results_df,noc_regions_df)
print(f'results_df DataFrame preview: size {results_df.shape}')
display(results_df.head(5))
print(f'\n\nnoc_regions_df DataFrame preview: size {noc_regions_df.shape}')
display(noc_regions_df_preview.head(5))
#
results_df DataFrame preview: size (271116, 15)
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN

noc_regions_df DataFrame preview: size (230, 2)
NOC region
0 AFG Afghanistan
1 AHO Curacao
2 ALB Albania
3 ALG Algeria
4 AND Andorra
In [4]:
### Test Cell - Exercise 0  


print('Passed! Please submit.')
Passed! Please submit.

Exercise 1: (4 points)

cleanse_results

Your task: define cleanse_results as follows:

'With great risk comes great reward' - Thomas Jefferson. Approach this problem with caution.

Input: results_df: A Pandas DataFrame, as described in Exercise 0

Return: sorted_df: A Pandas DataFrame that has been cleansed to remove duplicate values and sorted

Requirements: To begin, we need to clean up results_df by cleaning names, selecting relevant columns, removing duplicate rows, and sorting the data as described below.

  1. Clean up the athlete Name column: Given that most, but not all, of the names are of the form First Middle Last or First "nickname" Middle Last, Suffix (Maiden), follow the rules below:
    1. Keep the First/Given name (including any ., -, or ' if present)
    2. Keep the Last/Family name if present (including any ., -, or ' if present)
    3. Keep the suffix if present (e.g. Jr, Sr, II, III, IV, V)
    4. Remove the nee/maiden name if present anywhere within the name e.g. (Maiden)
    5. Remove any nicknames e.g. "Susie"
    6. If there is any leading whitespace (e.g. ' ') before the First Name, remove it
    7. A name may contain a comma without a suffix; in that case, the text after the comma often represents a title of nobility. Keep only the name before the comma.
    8. If you've done this correctly:
      • John Bartling Pearson, Jr. -> John Pearson Jr [Rules A, B, C]
      • Susan "Susie" O'Neill -> Susan O'Neill [Rules A, B, E]
      • Alfrd (Arnold-) Hajs (Guttmann-) -> Alfrd Hajs [Rules A, B, D]
      • Britt-Marie Louise Smedh (-Alshammar, -Blomberg) -> Britt-Marie Smedh [Rules A, B, D]
      • Liston Donneal Bochette, III -> Liston Bochette III [Rules A, B, C]
      • H. MacHenry -> H. MacHenry [Rules A, B]
      • Abudoureheman -> Abudoureheman [Rule A]
      • ' Gabrielle Marie "Gabby" Adcock (White-)' -> Gabrielle Adcock #note whitespace at beginning [Rules A, B, D, E, F]
      • Druart, Jr. -> Druart Jr [Rules A, C]
      • Sndor Bernt Ede Blint Kzmr, Count Trk de Szendr -> Sndor Kzmr [Rules A, B, G]
  2. Keep the following columns: ['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Year', 'Season', 'City', 'Sport', 'Event', 'Medal']
  3. Remove any duplicate rows
  4. Sort by 'Year', 'Season', 'NOC', 'Name', 'Event'. All columns should be sorted in ascending order
  5. Reset the indexes
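As a warm-up (not the full graded solution), rules D and E can be sketched with two substitutions; the name below is one of the exam's own examples, and rules A/B would then further reduce it to the first and last names:

```python
import re

# Sketch of rules D and E only (nickname and maiden-name removal)
name = 'Gabrielle Marie "Gabby" Adcock (White-)'
name = re.sub(r'"[^"]*"', '', name)       # rule E: drop quoted nicknames
name = re.sub(r'\([^)]*\)', '', name)     # rule D: drop parenthesized maiden names
name = re.sub(r'\s+', ' ', name).strip()  # tidy leftover whitespace
print(name)  # Gabrielle Marie Adcock
```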
In [5]:
### Solution - Exercise 1  
def cleanse_results(results_df: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    results_df2=results_df.copy()
    def regexfunction(name:str) ->str:
        import re
        name=re.sub(r'"[^"]+"','',name) #remove nickname
        name=re.sub(r'\([^\)]+\)','',name) #remove maidenname
        name=name.strip() #remove whitespace
        finalsuffix=None
        for suffix in ['Jr','Sr','III','II','IV','V']:
            if f', {suffix}' in name:
                finalsuffix=suffix
                name=re.sub(f', {suffix}.?','',name) #remove suffix from name
        #split names on comma to exclude titles of nobility. Then split to get first/last name
        names=name.split(',')[0].split()
        if len(names)>1: #if multiple names
            fname=names[0]
            lname=names[-1]
            if finalsuffix: #add suffix if present e.g. John Bartling Pearson, Jr.
                return ' '.join([fname,lname,finalsuffix])
            return ' '.join([fname,lname]) #without suffix e.g. H. MacHenry
        elif len(names)==1: #if only one name present e.g. Druart
            if finalsuffix: #add suffix if present e.g. Druart, Jr.
                return ' '.join([names[0],finalsuffix])
            return names[0] #without suffix e.g. Abudoureheman

    results_df2.loc[:,'Name']=results_df2['Name'].apply(regexfunction)
    keep_df = results_df2[
        ['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Year', 'Season', 'City', 'Sport', 'Event',
         'Medal']]
    keep_df = keep_df.drop_duplicates()
    sorted_df = keep_df.sort_values(by=['Year', 'Season', 'NOC', 'Name', 'Event']).reset_index(drop=True)
    return sorted_df
    ### END SOLUTION

### Demo function call
cleanse_results_demo = utils.load_object_from_publicdata('cleanse_results_demo.dill')
print(f'cleanse_results_demo shape: {cleanse_results_demo.shape}')
demo_result_sorted_df = cleanse_results(cleanse_results_demo)
demo_result_sorted_df
cleanse_results_demo shape: (13, 15)
Out[5]:
ID Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal
0 116752 Bernard Sutton M 31.0 NaN NaN Great Britain GBR 1924 Winter Chamonix Speed Skating Speed Skating Men's Allround NaN
1 11937 Rune Bjurstrm M 35.0 NaN NaN Sweden SWE 1948 Summer London Athletics Athletics Men's 50 kilometres Walk NaN
2 98249 Murray Rae M 21.0 173.0 73.0 Harmony NZL 1960 Summer Roma Sailing Sailing Mixed Two Person Heavyweight Dinghy NaN
3 78934 Juan Toledo M 19.0 172.0 70.0 Argentina ARG 1964 Summer Tokyo Cycling Cycling Men's Team Pursuit, 4,000 metres NaN
4 37597 Siegfried Flle M 25.0 168.0 62.0 Germany GER 1964 Summer Tokyo Gymnastics Gymnastics Men's Pommelled Horse NaN
5 883 Charles Addo-Odametey M 27.0 170.0 67.0 Ghana GHA 1964 Summer Tokyo Football Football Men's Football NaN
6 883 Charles Addo-Odametey M 31.0 170.0 67.0 Ghana GHA 1968 Summer Mexico City Football Football Men's Football NaN
7 36435 Mark Foster M 18.0 198.0 93.0 Great Britain GBR 1988 Summer Seoul Swimming Swimming Men's 4 x 100 metres Freestyle Relay NaN
8 100869 John Riley Jr M 28.0 195.0 93.0 United States USA 1992 Summer Barcelona Rowing Rowing Men's Quadruple Sculls NaN
9 20066 Viktoriya Chayka F 23.0 164.0 50.0 Belarus BLR 2004 Summer Athina Shooting Shooting Women's Sporting Pistol, 25 metres NaN
10 76644 Bethanie Mattek-Sands F 31.0 168.0 66.0 United States-2 USA 2016 Summer Rio de Janeiro Tennis Tennis Mixed Doubles Gold

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [6]:
cleanse_results_demo_CORRECT=utils.load_object_from_publicdata('cleanse_results_demo_CORRECT.dill')
#display(cleanse_results_demo_CORRECT)

The demo should display this output.

ID Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal
0 116752 Bernard Sutton M 31.00 NaN NaN Great Britain GBR 1924 Winter Chamonix Speed Skating Speed Skating Men's Allround NaN
1 11937 Rune Bjurstrm M 35.00 NaN NaN Sweden SWE 1948 Summer London Athletics Athletics Men's 50 kilometres Walk NaN
2 98249 Murray Rae M 21.00 173.00 73.00 Harmony NZL 1960 Summer Roma Sailing Sailing Mixed Two Person Heavyweight Dinghy NaN
3 78934 Juan Toledo M 19.00 172.00 70.00 Argentina ARG 1964 Summer Tokyo Cycling Cycling Men's Team Pursuit, 4,000 metres NaN
4 37597 Siegfried Flle M 25.00 168.00 62.00 Germany GER 1964 Summer Tokyo Gymnastics Gymnastics Men's Pommelled Horse NaN
5 883 Charles Addo-Odametey M 27.00 170.00 67.00 Ghana GHA 1964 Summer Tokyo Football Football Men's Football NaN
6 883 Charles Addo-Odametey M 31.00 170.00 67.00 Ghana GHA 1968 Summer Mexico City Football Football Men's Football NaN
7 36435 Mark Foster M 18.00 198.00 93.00 Great Britain GBR 1988 Summer Seoul Swimming Swimming Men's 4 x 100 metres Freestyle Relay NaN
8 100869 John Riley Jr M 28.00 195.00 93.00 United States USA 1992 Summer Barcelona Rowing Rowing Men's Quadruple Sculls NaN
9 20066 Viktoriya Chayka F 23.00 164.00 50.00 Belarus BLR 2004 Summer Athina Shooting Shooting Women's Sporting Pistol, 25 metres NaN
10 76644 Bethanie Mattek-Sands F 31.00 168.00 66.00 United States-2 USA 2016 Summer Rio de Janeiro Tennis Tennis Mixed Doubles Gold

The demo should display this printed output.

cleanse_results_demo shape: (13, 15)


The cell below will test your solution for cleanse_results (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [7]:
### Test Cell - Exercise 1  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=cleanse_results,
              ex_name='cleanse_results',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_results did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=cleanse_results,
              ex_name='cleanse_results',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_results did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 6.71 seconds
memory after test: 7.75 MB
memory peak during test: 43.86 MB
initial memory usage: 0.00 MB
Test duration: 6.02 seconds
memory after test: 0.54 MB
memory peak during test: 35.70 MB
Passed! Please submit.

Exercise 2: (1 point)

update_team

Your task: define update_team as follows:

Historically, teams competing in the Olympics did not always share the same name as the country they represented. For example, a team named "Vesper Boat Club" competed for the USA in the 1900 Summer Olympics. These unique team names make it difficult to analyze a country's performance over time. Let's fix this by updating the Team name in sorted_df to accurately reflect the region name for each NOC.

Input:

  • sorted_df: The result of Exercise 1 - A sorted Pandas DataFrame containing results by athlete (provided in solution/demo below)
  • noc_regions_df: Contains region information for each NOC, as described in Exercise 0

Return: updated_df: A Pandas DataFrame containing the information in sorted_df with the Team column updated to contain the region name found in noc_regions_df

Requirements:

  • Update the Team column in sorted_df to match the corresponding region from noc_regions_df, using a left join on the NOC column.
  • Restated another way
    • Use noc_regions_df to find the region name for each NOC. Use a left join on the column NOC with noc_regions_df, keeping all rows in sorted_df
    • Return a copy of sorted_df where the Team name is overwritten with the region value corresponding to each NOC
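The left-join-then-overwrite pattern described above can be sketched on toy frames (the rows here are hypothetical):

```python
import pandas as pd

# Toy frames showing the left-join-then-overwrite pattern
sorted_toy = pd.DataFrame({'Team': ['Vesper Boat Club'], 'NOC': ['USA']})
regions_toy = pd.DataFrame({'NOC': ['USA'], 'region': ['USA']})

merged = sorted_toy.merge(regions_toy, how='left', on='NOC')  # keeps all sorted_toy rows
merged['Team'] = merged['region']    # overwrite Team with the mapped region
merged = merged.drop(columns='region')
print(merged)  # Team is now 'USA'
```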
In [8]:
### Solution - Exercise 2  
def update_team(sorted_df: pd.DataFrame, noc_regions_df: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    updated_df = sorted_df.merge(noc_regions_df[['NOC','region']], how='left', on='NOC')
    updated_df['Team'] = updated_df['region']
    updated_df = updated_df.drop(columns=['region'])
    return updated_df
    ### END SOLUTION

### Demo function call
update_team_sorted_df_demo, update_team_noc_regions_df_demo = utils.load_object_from_publicdata('update_team_demo.dill')
demo_result_updated_df = update_team(update_team_sorted_df_demo, update_team_noc_regions_df_demo)
demo_result_updated_df
Out[8]:
ID Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal
0 93070 Nikolaos Pentsikis M NaN NaN NaN Greece GRE 1906 Summer Athina Football Football Men's Football Bronze
1 70978 Roscoe Lockwood M 24.0 NaN NaN USA USA 1900 Summer Paris Rowing Rowing Men's Coxed Eights Gold
2 129700 Grethe Werner F 24.0 NaN NaN Norway NOR 1952 Summer Helsinki Gymnastics Gymnastics Women's Individual All-Around NaN
3 71554 Snorre Lorgen M 22.0 181.0 82.0 Norway NOR 1992 Summer Barcelona Rowing Rowing Men's Coxless Pairs NaN
4 14738 David Bratton M NaN NaN NaN USA USA 1904 Summer St. Louis Swimming Swimming Men's 4 x 50 Yard Freestyle Relay NaN
5 126721 Roberta Vinci F 33.0 164.0 58.0 Italy ITA 2016 Summer Rio de Janeiro Tennis Tennis Mixed Doubles NaN

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [9]:
update_team_demo_CORRECT=utils.load_object_from_publicdata('update_team_demo_CORRECT.dill')
#display(update_team_demo_CORRECT)

The demo should display this output.

ID Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal
0 93070 Nikolaos Pentsikis M NaN NaN NaN Greece GRE 1906 Summer Athina Football Football Men's Football Bronze
1 70978 Roscoe Lockwood M 24.00 NaN NaN USA USA 1900 Summer Paris Rowing Rowing Men's Coxed Eights Gold
2 129700 Grethe Werner F 24.00 NaN NaN Norway NOR 1952 Summer Helsinki Gymnastics Gymnastics Women's Individual All-Around NaN
3 71554 Snorre Lorgen M 22.00 181.00 82.00 Norway NOR 1992 Summer Barcelona Rowing Rowing Men's Coxless Pairs NaN
4 14738 David Bratton M NaN NaN NaN USA USA 1904 Summer St. Louis Swimming Swimming Men's 4 x 50 Yard Freestyle Relay NaN
5 126721 Roberta Vinci F 33.00 164.00 58.00 Italy ITA 2016 Summer Rio de Janeiro Tennis Tennis Mixed Doubles NaN


The cell below will test your solution for update_team (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [10]:
### Test Cell - Exercise 2  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=update_team,
              ex_name='update_team',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to update_team did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=update_team,
              ex_name='update_team',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to update_team did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 5.51 seconds
memory after test: 0.52 MB
memory peak during test: 33.95 MB
initial memory usage: 0.00 MB
Test duration: 5.52 seconds
memory after test: 0.49 MB
memory peak during test: 33.96 MB
Passed! Please submit.

Exercise 3: (3 points)

rank_teams

Your task: define rank_teams as follows:

Now that we've cleaned up our DataFrame, we can start analyzing the data. Let's start with the basics and rank teams by the number of medals earned!

Input:

  • updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with the updated Team names (provided in solution/demo below)
  • list_of_teams: A list containing team names

Return: ranked_df: A Pandas DataFrame ranking the Teams found in list_of_teams by the total number of medals earned.

Requirements:

  • Filter updated_df to the rows containing the teams listed in list_of_teams
  • Create a DataFrame that contains the columns: Team, Bronze, Silver, Gold, Total with the number of medals each team won of each type. Total should be the sum of all medals earned by each team. pivot_table may be helpful.
  • If a team has not won medals of a particular type, set the value to 0
  • Bronze, Silver, Gold, and Total columns should be of data type int64
  • Sort the DataFrame by the Total in descending order, with ties broken by the Team name in ascending order
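One way to count rows per (Team, Medal) pair, sketched on toy data (the rows are hypothetical, and this is only the counting step, not the full exercise):

```python
import pandas as pd

# Toy data showing a size-count pivot with zeros filled in
toy = pd.DataFrame({'Team': ['USA', 'USA', 'Russia'],
                    'Medal': ['Gold', 'Silver', 'Gold']})
counts = toy.pivot_table(index='Team', columns='Medal',
                         aggfunc='size', fill_value=0)
print(counts)
```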
In [11]:
### Solution - Exercise 3  
def rank_teams(updated_df: pd.DataFrame, list_of_teams: list) -> pd.DataFrame:
    ### BEGIN SOLUTION
    filtered_df = updated_df[updated_df['Team'].isin(list_of_teams)]
    medal_counts = filtered_df.pivot_table(
        index='Team',
        columns='Medal',
        aggfunc='size',
        fill_value=0
    )
    medal_counts.columns.name = None
    medal_counts['Total'] = medal_counts.sum(axis=1)
    medal_counts = medal_counts.reset_index()
    medal_counts[['Bronze', 'Silver', 'Gold', 'Total']] = medal_counts[['Bronze', 'Silver', 'Gold', 'Total']].astype(
        'int64')
    ranked_df = medal_counts.sort_values(by=['Total', 'Team'], ascending=[False, True]).reset_index(drop=True)
    return ranked_df
    ### END SOLUTION

### Demo function call
rank_teams_updated_df_demo,rank_teams_list_of_teams_demo=utils.load_object_from_publicdata('rank_teams_demo.dill')
demo_result_ranked_df = rank_teams(rank_teams_updated_df_demo, rank_teams_list_of_teams_demo)
demo_result_ranked_df
Out[11]:
Team Bronze Gold Silver Total
0 USA 0 5 2 7
1 Russia 1 1 0 2
2 South Korea 1 0 1 2
3 Argentina 0 1 0 1
4 Bulgaria 1 0 0 1
5 Canada 0 1 0 1
6 Czech Republic 1 0 0 1
7 Germany 1 0 0 1
8 India 0 1 0 1

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [12]:
rank_teams_demo_CORRECT=utils.load_object_from_publicdata('rank_teams_demo_CORRECT.dill')
#display(rank_teams_demo_CORRECT)

The demo should display this output.

Team Bronze Gold Silver Total
0 USA 0 5 2 7
1 Russia 1 1 0 2
2 South Korea 1 0 1 2
3 Argentina 0 1 0 1
4 Bulgaria 1 0 0 1
5 Canada 0 1 0 1
6 Czech Republic 1 0 0 1
7 Germany 1 0 0 1
8 India 0 1 0 1


The cell below will test your solution for rank_teams (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [13]:
### Test Cell - Exercise 3  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=rank_teams,
              ex_name='rank_teams',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_teams did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=rank_teams,
              ex_name='rank_teams',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_teams did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 5.93 seconds
memory after test: 0.87 MB
memory peak during test: 42.17 MB
initial memory usage: 0.00 MB
Test duration: 5.96 seconds
memory after test: 0.64 MB
memory peak during test: 42.21 MB
Passed! Please submit.

Exercise 4: (2 points)

roman_numeral

Your task: define roman_numeral as follows:

The Olympic Games use Roman numerals to number each iteration. For example, the Paris 2024 Olympics were officially called the "Games of the XXXIII Olympiad". Sounds fancier than saying "the 33rd Summer Olympic Games", right? Let's now find the corresponding Roman numeral for a given Olympic game!

Input:

  • updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with updated Team names (provided in solution/demo below)
  • year: An integer containing the desired year
  • season: A string containing the desired season (either 'Summer' or 'Winter')

Return: roman_numeral_result: A string containing the Roman numeral for that Olympics.

Requirements:

  • Using updated_df and the year/season provided, find how many Olympics have occurred up to the year provided for that season. Let's call this your Olympic count
  • Use the roman_numeral_dict provided below to convert this Olympic count into the correct Roman numeral

Hint: While other approaches exist, here is a simple algorithm for converting a number to its Roman Numeral equivalent:

  1. Iterate over the key/value pairs in roman_numeral_dict in the same order as listed
  2. While a key is less than or equal to your Olympic count, append that key's value to your roman_numeral_result string, and then subtract the key from your Olympic count

Hint: Each season's Olympic count should be tracked separately: the first Summer Olympics, held in 1896, has a count of 1, and the first Winter Olympics, held in 1924, also has a count of 1. The test cases will use different combinations of seasons and years, so your function should handle those combinations and calculate the Roman numeral based on what is in the input updated_df.
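The hint's greedy conversion, sketched standalone (the helper name to_roman is illustrative, separate from the graded function):

```python
roman_numeral_dict = {1000: "M", 900: "CM", 500: "D", 400: "CD", 100: "C",
                      90: "XC", 50: "L", 40: "XL", 10: "X", 9: "IX",
                      5: "V", 4: "IV", 1: "I"}

def to_roman(n: int) -> str:
    out = ''
    for k, v in roman_numeral_dict.items():  # dicts preserve insertion order
        while k <= n:                        # note: less than OR EQUAL
            out += v
            n -= k
    return out

print(to_roman(33))  # XXXIII
```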

In [14]:
### Solution - Exercise 4  
def roman_numeral(updated_df: pd.DataFrame, year: int, season: str) -> str:
    roman_numeral_dict = {1000: "M", 900: "CM", 500: "D",  400: "CD", 100: "C",  90: "XC", 50: "L", 40: "XL", 10: "X", 9: "IX", 5: "V", 4: "IV", 1: "I"}

    ### BEGIN SOLUTION
    rm_df = updated_df[updated_df['Season']==season]
    rm_df = rm_df[rm_df['Year']<=year]
    rm_df = rm_df[['Year','Season']].drop_duplicates()
    olympics_num = rm_df['Year'].size

    r_num = ''
    for k, v in roman_numeral_dict.items():
        while k <= olympics_num:
            r_num += v
            olympics_num -= k

    return r_num
    ### END SOLUTION

### Demo function call
roman_numeral_updated_df_demo,roman_numeral_year_demo,roman_numeral_season_demo=utils.load_object_from_publicdata('roman_numeral_demo.dill')
demo_result_roman_numeral_result = roman_numeral(roman_numeral_updated_df_demo, roman_numeral_year_demo, roman_numeral_season_demo)
display(demo_result_roman_numeral_result)
'IV'

The demo should display this printed output.

IV


The cell below will test your solution for roman_numeral (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [15]:
### Test Cell - Exercise 4  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=roman_numeral,
              ex_name='roman_numeral',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to roman_numeral did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=roman_numeral,
              ex_name='roman_numeral',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to roman_numeral did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 5.54 seconds
memory after test: 0.65 MB
memory peak during test: 41.83 MB
initial memory usage: 0.00 MB
Test duration: 5.04 seconds
memory after test: 0.60 MB
memory peak during test: 41.81 MB
Passed! Please submit.

Exercise 5: (2 points)

going_for_gold

Your task: define going_for_gold as follows:

Many athletes have won multiple gold medals. The real question is: Who has won the most gold medals relative to the number of times they competed? Calculate each athlete's "gold percentage" as the number of gold medals earned divided by the number of times they competed in any event at the Olympics.

Input:

  • updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with updated Team names (provided in solution/demo below)

Return: A new Pandas DataFrame containing the following columns: Name, Opportunity Count, Golds, Gold Pct

Requirements:

  • For each athlete, determine the number of times they competed in any event at the Olympics (this will be the number of rows for that athlete in updated_df). Call this column Opportunity Count
  • For each athlete, determine the number of gold medals they won. Call this column Golds, and this should be of data type int64
  • For each athlete, calculate their gold percentage as number of gold medals earned / number of times they competed. Call this column Gold Pct, and this should be of data type float64
  • Convert any NaN values to 0
  • Sort your resulting Pandas DataFrame by Gold Pct in descending order, with ties broken by Golds in descending order and then by Name in ascending order
  • Reset the indexes
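On a toy DataFrame (hypothetical names, not the exam data), the per-athlete tallies described above reduce to a single groupby with named aggregations:

```python
import pandas as pd

# Toy stand-in for updated_df: one row per competition entry.
toy = pd.DataFrame({
    'Name':  ['Ann', 'Ann', 'Bea'],
    'Medal': ['Gold', None, 'Gold'],
})

ranked = toy.groupby('Name')['Medal'].agg(
    **{'Opportunity Count': 'size',                      # all rows per athlete
       'Golds': lambda m: int((m == 'Gold').sum())}      # gold medals only
).reset_index()
ranked['Gold Pct'] = ranked['Golds'] / ranked['Opportunity Count']
ranked = ranked.sort_values(['Gold Pct', 'Golds', 'Name'],
                            ascending=[False, False, True]).reset_index(drop=True)
print(ranked)
```

Note that 'size' counts every row, including those where Medal is NaN, which is exactly the "opportunity" notion the requirements describe.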
In [16]:
### Solution - Exercise 5  
def going_for_gold(updated_df: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    df6=updated_df[['Name','Medal']]
    df6=df6.groupby('Name').agg({'Medal':['size',lambda x: (x=='Gold').sum()]}).reset_index()
    df6.columns=['Name','Opportunity Count','Golds']
    df6['Gold Pct']=1.0*df6['Golds']/df6['Opportunity Count']
    df6['Gold Pct']=df6['Gold Pct'].fillna(0.0)
    df6 = df6.sort_values(['Gold Pct','Golds','Name'],ascending=[False,False,True]).reset_index(drop=True)
    return df6
    ### END SOLUTION

### Demo function call
going_for_gold_updated_df_demo = utils.load_object_from_publicdata('going_for_gold_demo.dill')
demo_result_gold_rankings_df = going_for_gold(going_for_gold_updated_df_demo)
demo_result_gold_rankings_df
Out[16]:
Name Opportunity Count Golds Gold Pct
0 Kristin Otto 6 6 1.000000
1 Danuta Kozk 6 5 0.833333
2 Yin Jian 2 1 0.500000

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [17]:
going_for_gold_demo_CORRECT=utils.load_object_from_publicdata('going_for_gold_demo_CORRECT.dill')
#display(going_for_gold_demo_CORRECT)

The demo should display this output.

Name Opportunity Count Golds Gold Pct
0 Kristin Otto 6 6 1.00
1 Danuta Kozk 6 5 0.83
2 Yin Jian 2 1 0.50


The cell below will test your solution for going_for_gold (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [18]:
### Test Cell - Exercise 5  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=going_for_gold,
              ex_name='going_for_gold',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to going_for_gold did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=going_for_gold,
              ex_name='going_for_gold',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to going_for_gold did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 1.76 seconds
memory after test: 0.22 MB
memory peak during test: 4.95 MB
initial memory usage: 0.00 MB
Test duration: 1.77 seconds
memory after test: 0.11 MB
memory peak during test: 4.99 MB
Passed! Please submit.

Exercise 6: (2 points)

olympic_costs

Your task: define olympic_costs as follows:

Hosting the Olympics can be staggeringly expensive, with recent editions costing billions of dollars. But has it always been this way? Let's use inflation data to convert historical Olympic costs and find out!

Input: None

Return: A Python string containing a SQLite query that converts Olympic costs to most-recent-year USD using inflation data

Requirements:

  • Your function must return a Python string containing a SQLite query
  • There are 2 tables present within the database:
    • olympic_costs (with columns year, country, city, cost)
    • inflation (with columns year, interest_rate, cpi_index)
  • Your query must return the following columns:
    • year: from olympic_costs table
    • country: from olympic_costs table
    • cost: from olympic_costs table. Exclude NULL/NaN values
    • maxyear_cpi_index: this represents the cpi_index for the max(year) in the inflation table. In other words, the cpi_index for the latest year in the inflation table.
    • olympicyear_cpi_index: this represents the cpi_index for the year of the Olympics in the inflation table
    • recentyearUSD: this represents $$recentyearUSD = cost \times maxyear\_cpi\_index \div olympicyear\_cpi\_index$$
  • Return the top 10 results
  • Order by
    • recentyearUSD in descending order
    • year in ascending order

Hint: Subqueries/CTEs might be helpful on this exercise!
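For instance (toy tables in an in-memory database, not the exam's conn), a CTE can pin down the latest year's cpi_index, which is then joined against every row:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
CREATE TABLE inflation (year INTEGER, cpi_index REAL);
INSERT INTO inflation VALUES (2000, 100.0), (2020, 150.0);
""")

query = """
WITH latest AS (
    -- one row: the cpi_index of the most recent year
    SELECT cpi_index AS maxyear_cpi_index
    FROM inflation
    WHERE year = (SELECT MAX(year) FROM inflation)
)
SELECT i.year, i.cpi_index, latest.maxyear_cpi_index
FROM inflation i CROSS JOIN latest
ORDER BY i.year
"""
print(db.execute(query).fetchall())
# [(2000, 100.0, 150.0), (2020, 150.0, 150.0)]
```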

In [19]:
### Solution - Exercise 6  
def olympic_costs() -> str:
    ### BEGIN SOLUTION
    query=f"""
    select c.year,c.country,c.cost
    ,m.cpi_index maxyear_cpi_index
    ,i.cpi_index olympicyear_cpi_index
    ,1.0*c.cost*m.cpi_index/i.cpi_index recentyearUSD
    from olympic_costs c
    join inflation i on i.year=c.year
    join ( select a.cpi_index
            from inflation a
            join (select max(year) maxyear from inflation)b on a.year=b.maxyear
        )m
    where c.cost is not null
    order by recentyearUSD desc,c.year
    limit 10
    """
    return query
    ### END SOLUTION

### Demo function call
demo_result_olympic_costs_results = pd.read_sql(olympic_costs(),conn)
demo_result_olympic_costs_results
Out[19]:
year country cost maxyear_cpi_index olympicyear_cpi_index recentyearUSD
0 2014 Russia 5.100000e+10 1780.850595 1362.086956 6.667958e+10
1 2008 China 4.400000e+10 1780.850595 1219.224262 6.426826e+10
2 2004 Greece 1.500000e+10 1780.850595 1103.962662 2.419716e+10
3 2020 Japan 1.540000e+10 1780.850595 1511.379619 1.814574e+10
4 2016 Brazil 1.310000e+10 1780.850595 1400.425618 1.665861e+10
5 2012 United Kingdom 8.361285e+09 1780.850595 1331.307135 1.118465e+10
6 1988 South Korea 4.000000e+09 1780.850595 698.705866 1.019514e+10
7 1972 West Germany 1.062908e+09 1780.850595 246.928322 7.665708e+09
8 2000 Australia 4.290000e+09 1780.850595 1008.059612 7.578767e+09
9 2022 China 3.900000e+09 1780.850595 1722.292645 4.032600e+09

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [20]:
olympic_costs_demo_CORRECT=utils.load_object_from_publicdata('olympic_costs_demo_CORRECT.dill')
#display(olympic_costs_demo_CORRECT)

The demo should display this output.

year country cost maxyear_cpi_index olympicyear_cpi_index recentyearUSD
0 2014 Russia 51000000000.00 1780.85 1362.09 66679575719.43
1 2008 China 44000000000.00 1780.85 1219.22 64268263552.70
2 2004 Greece 15000000000.00 1780.85 1103.96 24197157965.84
3 2020 Japan 15400000000.00 1780.85 1511.38 18145738380.00
4 2016 Brazil 13100000000.00 1780.85 1400.43 16658608993.91
5 2012 United Kingdom 8361284628.83 1780.85 1331.31 11184645759.10
6 1988 South Korea 4000000000.00 1780.85 698.71 10195137511.27
7 1972 West Germany 1062908000.00 1780.85 246.93 7665707723.59
8 2000 Australia 4290000000.00 1780.85 1008.06 7578767131.20
9 2022 China 3900000000.00 1780.85 1722.29 4032600000.00


The cell below will test your solution for olympic_costs (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [21]:
### Test Cell - Exercise 6  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(olympic_costs),
              ex_name='olympic_costs',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to olympic_costs did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(olympic_costs),
              ex_name='olympic_costs',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to olympic_costs did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 0.90 seconds
memory after test: 0.51 MB
memory peak during test: 5.80 MB
initial memory usage: 0.00 MB
Test duration: 0.89 seconds
memory after test: 0.32 MB
memory peak during test: 5.86 MB
Passed! Please submit.

Exercise 7: (3 points)

olympic_countryrank

Your task: define olympic_countryrank as follows:

Input: None

Return: A Python string containing a SQLite query that returns the 3 most recent years and cities for each host country, ranks those results ordinally, and tallies a running cost for each host country.

Requirements:

  • Your function must return a Python string containing a SQLite query
  • You will need the following table present within the database:
    • olympic_costs (year, country, city, cost)
  • The query must return the following columns:
    • country: from the table
    • year: from the table
    • city: from the table
    • countryrank: for each host country, this represents an ordinal ranking of that host country's most recent Olympics
    • countryrunningcosts: for each host country, this represents a running cost (or a cumulative cost across the host country's Olympics)
  • Only keep the host country's last 3 Olympics. If the host country has not hosted 3 Olympics, keep all Olympics the country has hosted.
  • Remove any occurrences where cost is NULL
  • Order by:
    • country in ascending order
    • countryrank in ascending order

Hints:

  • Looking specifically at Australia, we see that the 3 most recently hosted Olympics in the olympic_costs table include Brisbane (2032), Sydney (2000), and Melbourne (1956)
  • To calculate countryrank, the 2032 Olympics would get a countryrank of 1, the 2000 Olympics would get 2, and the 1956 would get 3
  • To calculate countryrunningcosts, we see that the country spent:
      - USD 4.5 billion in 2032
      - USD 4.29 billion in 2000
      - USD 3.51 million in 1956 
  • Therefore, countryrunningcosts would be:
      - 4.5e9 for 2032
      - 4.5e9+4.29e9=8.79e9 for 2000 
      - 4.5e9+4.29e9+3.51e6=8.79351e9 for 1956
  • SQLite Window functions and SQLite Rank functions may be helpful.
  • Subqueries/CTEs might also be helpful on this exercise!
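As a small illustration of those window functions (toy table in an in-memory database; window functions require SQLite 3.25+), RANK and a windowed SUM produce exactly the countryrank/running-cost pattern walked through in the Australia example:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
CREATE TABLE costs (country TEXT, year INTEGER, cost REAL);
INSERT INTO costs VALUES
    ('Australia', 2032, 4.5e9),
    ('Australia', 2000, 4.29e9),
    ('Australia', 1956, 3.51e6);
""")

query = """
SELECT country, year,
       RANK() OVER (PARTITION BY country ORDER BY year DESC)    AS countryrank,
       -- with an ORDER BY inside the window, SUM is cumulative
       SUM(cost) OVER (PARTITION BY country ORDER BY year DESC) AS runningcost
FROM costs
ORDER BY country, countryrank
"""
for row in db.execute(query):
    print(row)
```

The same window specification (PARTITION BY country ORDER BY year DESC) drives both columns, so the rank and the running total always agree on which Olympics came "first".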
In [22]:
### Solution - Exercise 7  
def olympic_countryrank() -> str:
    ### BEGIN SOLUTION
    query=f"""
    select country,year,city,countryrank,countryrunningcosts
    from (
        select country,year,city
        ,RANK() OVER(PARTITION BY country ORDER BY year desc) countryrank
        ,SUM(cost) OVER(PARTITION BY country ORDER BY year desc) countryrunningcosts
        from olympic_costs
        where cost is NOT NULL
    )a
    where countryrank<=3
    order by country,countryrank
    """
    return query
    ### END SOLUTION

### Demo function call
demo_result_olympic_countryrank_results = pd.read_sql(olympic_countryrank(),conn)
# display(demo_result_olympic_countryrank)
demo_result_olympic_countryrank_results
Out[22]:
country year city countryrank countryrunningcosts
0 Australia 2032 Brisbane 1 4.500000e+09
1 Australia 2000 Sydney 2 8.790000e+09
2 Australia 1956 Melbourne 3 8.793510e+09
3 Brazil 2016 Rio de Janeiro 1 1.310000e+10
4 Canada 2010 Vancouver 1 1.260000e+09
5 Canada 1988 Calgary 2 1.570980e+09
6 Canada 1976 Montreal 3 1.717950e+09
7 China 2022 Beijing 1 3.900000e+09
8 China 2008 Beijing 2 4.790000e+10
9 France 2030 French Alps 1 2.300000e+09
10 France 2024 Paris 2 1.050000e+10
11 Greece 2004 Athens 1 1.500000e+10
12 Italy 2026 Milan and Cortina d'Ampezzo 1 1.600000e+09
13 Italy 2006 Turin 2 2.300000e+09
14 Japan 2020 Tokyo 1 1.540000e+10
15 Japan 1964 Tokyo 2 1.547200e+10
16 Mexico 1968 Mexico City 1 1.760000e+08
17 Norway 1994 Lillehammer 1 1.100000e+09
18 Russia 2014 Sochi 1 5.100000e+10
19 Socialist Federal Republic of Yugoslavia 1984 Sarajevo 1 5.540000e+07
20 South Korea 2018 Pyeongchang 1 2.190000e+09
21 South Korea 1988 Seoul 2 6.190000e+09
22 Soviet Union 1980 Moscow 1 2.310000e+08
23 Spain 1992 Barcelona 1 8.500000e+08
24 United Kingdom 2012 London 1 8.361285e+09
25 United Kingdom 1948 London 2 8.362252e+09
26 United States 2034 Salt Lake City–Utah 1 3.900000e+09
27 United States 2028 Los Angeles 2 1.070000e+10
28 United States 2002 Salt Lake City 3 1.270000e+10
29 West Germany 1972 Munich 1 1.062908e+09

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [23]:
olympic_countryrank_demo_CORRECT=utils.load_object_from_publicdata('olympic_countryrank_demo_CORRECT.dill')
#display(olympic_countryrank_demo_CORRECT)

The demo should display this output.

country year city countryrank countryrunningcosts
0 Australia 2032 Brisbane 1 4500000000.00
1 Australia 2000 Sydney 2 8790000000.00
2 Australia 1956 Melbourne 3 8793510000.00
3 Brazil 2016 Rio de Janeiro 1 13100000000.00
4 Canada 2010 Vancouver 1 1260000000.00
5 Canada 1988 Calgary 2 1570980000.00
6 Canada 1976 Montreal 3 1717950000.00
7 China 2022 Beijing 1 3900000000.00
8 China 2008 Beijing 2 47900000000.00
9 France 2030 French Alps 1 2300000000.00
10 France 2024 Paris 2 10500000000.00
11 Greece 2004 Athens 1 15000000000.00
12 Italy 2026 Milan and Cortina d'Ampezzo 1 1600000000.00
13 Italy 2006 Turin 2 2300000000.00
14 Japan 2020 Tokyo 1 15400000000.00
15 Japan 1964 Tokyo 2 15472000000.00
16 Mexico 1968 Mexico City 1 176000000.00
17 Norway 1994 Lillehammer 1 1100000000.00
18 Russia 2014 Sochi 1 51000000000.00
19 Socialist Federal Republic of Yugoslavia 1984 Sarajevo 1 55400000.00
20 South Korea 2018 Pyeongchang 1 2190000000.00
21 South Korea 1988 Seoul 2 6190000000.00
22 Soviet Union 1980 Moscow 1 231000000.00
23 Spain 1992 Barcelona 1 850000000.00
24 United Kingdom 2012 London 1 8361284628.83
25 United Kingdom 1948 London 2 8362251972.59
26 United States 2034 Salt Lake City–Utah 1 3900000000.00
27 United States 2028 Los Angeles 2 10700000000.00
28 United States 2002 Salt Lake City 3 12700000000.00
29 West Germany 1972 Munich 1 1062908000.00


The cell below will test your solution for olympic_countryrank (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [24]:
### Test Cell - Exercise 7  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(olympic_countryrank),
              ex_name='olympic_countryrank',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to olympic_countryrank did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(olympic_countryrank),
              ex_name='olympic_countryrank',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to olympic_countryrank did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 0.72 seconds
memory after test: 0.27 MB
memory peak during test: 3.11 MB
initial memory usage: 0.00 MB
Test duration: 0.74 seconds
memory after test: 0.43 MB
memory peak during test: 3.26 MB
Passed! Please submit.

Exercise 8: (2 points)

label_encoding

Your task: define label_encoding as follows:

There are different techniques for converting categorical data, such as strings, into a numerical representation. One such technique is label encoding, in which each unique category is assigned a unique integer, thereby converting categorical data into numerical values. Let's practice this encoding technique on our updated_df, which contains athlete results over time.

Input:

  • updated_df: The result of Exercise 2 - A sorted Pandas DataFrame containing results by athlete with updated Team names (provided in solution/demo below)

Return: A tuple containing the following:

  • encoding_map: A dictionary of dictionaries where the outermost keys are the categorical column names, and their values are dictionaries mapping the unique values in the column to their numerical encoding.
  • encoded_df: A new Pandas DataFrame where all categorical columns have been encoded using the mapping you generated. All column names should remain the same.

Requirements:

  1. Determine which columns in the input DataFrame are categorical. Categorical columns are columns with a data type of object
  2. Replace any NaN values in these categorical columns with the string 'NONE'
  3. For each of these categorical columns, find the unique values in that column
  4. Sort the unique values in that column in ascending alphabetical order
  5. Map the sorted, unique values to numerical values starting with 0 for the first value, then 1, then 2, and so on
  6. Create a dictionary where each key is the name of a categorical column, and each value is a dictionary that maps the column's unique values to their corresponding integer encodings
  7. Use the mappings to encode updated_df
  8. Return a tuple containing both the mapping dictionary and the encoded DataFrame
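On a single toy column (hypothetical data), steps 2 through 7 look like this:

```python
import pandas as pd

toy = pd.DataFrame({'Medal': ['Bronze', None, 'Gold']})

col = toy['Medal'].fillna('NONE')                             # step 2
mapping = {v: i for i, v in enumerate(sorted(col.unique()))}  # steps 3-5
toy['Medal'] = col.map(mapping)                               # step 7

print(mapping)                # {'Bronze': 0, 'Gold': 1, 'NONE': 2}
print(toy['Medal'].tolist())  # [0, 2, 1]
```

The full solution applies this per column and collects each column's mapping into the outer dictionary (step 6).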
In [25]:
### Solution - Exercise 8  
def label_encoding(updated_df: pd.DataFrame) -> tuple:
    ### BEGIN SOLUTION
    categorical_col_list = []
    df = updated_df.copy()
    mapping_dict = {}

    for col in df.columns:
        if df[col].dtype == 'object':
            categorical_col_list.append(col)
            df[col] = df[col].fillna('NONE')
            unique_vals_sorted = list(df[col].drop_duplicates().sort_values())
            col_dict = dict(zip(unique_vals_sorted, range(len(unique_vals_sorted))))
            df[col] = df[col].map(col_dict)
            mapping_dict[col] = col_dict

    return mapping_dict, df.reset_index(drop=True)
    ### END SOLUTION

### Demo function call
label_encoding_updated_df_demo = utils.load_object_from_publicdata('label_encoding_demo.dill')
(demo_result_encoding_map, demo_result_encoded_df) = label_encoding(label_encoding_updated_df_demo)
print('Your Encoding Map: ')
pprint(demo_result_encoding_map)
print('\nYour Encoded DataFrame: ')
demo_result_encoded_df
Your Encoding Map: 
{'City': {'Atlanta': 0,
          'Beijing': 1,
          'Berlin': 2,
          'London': 3,
          'Moskva': 4,
          'Sydney': 5},
 'Event': {"Athletics Men's Discus Throw": 0,
           "Basketball Men's Basketball": 1,
           "Canoeing Men's Kayak Fours, 1,000 metres": 2,
           "Cycling Men's Road Race, Individual": 3,
           "Cycling Men's Team Pursuit, 4,000 metres": 4,
           "Gymnastics Men's Team All-Around": 5,
           "Rowing Men's Quadruple Sculls": 6},
 'Medal': {'Bronze': 0, 'NONE': 1},
 'NOC': {'AUS': 0, 'GBR': 1, 'ITA': 2, 'MEX': 3, 'ROU': 4, 'RUS': 5, 'UZB': 6},
 'Name': {'Aleksandr Kolobnev': 0,
          'Alessandro Corona': 1,
          'Aurelian Georgescu': 2,
          'Charles King': 3,
          'Ignacio Porchas': 4,
          'Robert Lee': 5,
          'Roman Poltoratsky': 6},
 'Season': {'Summer': 0},
 'Sex': {'M': 0},
 'Sport': {'Athletics': 0,
           'Basketball': 1,
           'Canoeing': 2,
           'Cycling': 3,
           'Gymnastics': 4,
           'Rowing': 5},
 'Team': {'Australia': 0,
          'Italy': 1,
          'Mexico': 2,
          'Romania': 3,
          'Russia': 4,
          'UK': 5,
          'Uzbekistan': 6}}

Your Encoded DataFrame: 
Out[25]:
ID Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal
0 60470 3 0 25.0 NaN NaN 5 1 1936 0 2 3 4 0
1 102361 4 0 24.0 NaN NaN 2 3 1948 0 3 1 1 1
2 95936 6 0 28.0 195.0 120.0 6 6 2000 0 5 0 0 1
3 23136 1 0 24.0 195.0 95.0 1 2 1996 0 0 5 6 1
4 62253 0 0 27.0 174.0 64.0 4 5 2008 0 1 3 3 0
5 39579 2 0 21.0 174.0 64.0 3 4 1980 0 4 4 5 1
6 68088 5 0 24.0 171.0 70.0 0 0 1980 0 4 2 2 1

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [26]:
encoding_map_demo_CORRECT, encoded_df_demo_CORRECT=utils.load_object_from_publicdata('label_encoding_demo_CORRECT.dill')
# pprint(encoding_map_demo_CORRECT)
# display(encoded_df_demo_CORRECT)

The demo should display this output.

ID Name Sex Age Height Weight Team NOC Year Season City Sport Event Medal
0 60470 3 0 25.00 NaN NaN 5 1 1936 0 2 3 4 0
1 102361 4 0 24.00 NaN NaN 2 3 1948 0 3 1 1 1
2 95936 6 0 28.00 195.00 120.00 6 6 2000 0 5 0 0 1
3 23136 1 0 24.00 195.00 95.00 1 2 1996 0 0 5 6 1
4 62253 0 0 27.00 174.00 64.00 4 5 2008 0 1 3 3 0
5 39579 2 0 21.00 174.00 64.00 3 4 1980 0 4 4 5 1
6 68088 5 0 24.00 171.00 70.00 0 0 1980 0 4 2 2 1

The demo should display this printed output.

Your Encoding Map: 
{'City': {'Atlanta': 0,
          'Beijing': 1,
          'Berlin': 2,
          'London': 3,
          'Moskva': 4,
          'Sydney': 5},
 'Event': {"Athletics Men's Discus Throw": 0,
           "Basketball Men's Basketball": 1,
           "Canoeing Men's Kayak Fours, 1,000 metres": 2,
           "Cycling Men's Road Race, Individual": 3,
           "Cycling Men's Team Pursuit, 4,000 metres": 4,
           "Gymnastics Men's Team All-Around": 5,
           "Rowing Men's Quadruple Sculls": 6},
 'Medal': {'Bronze': 0, 'NONE': 1},
 'NOC': {'AUS': 0, 'GBR': 1, 'ITA': 2, 'MEX': 3, 'ROU': 4, 'RUS': 5, 'UZB': 6},
 'Name': {'Aleksandr Kolobnev': 0,
          'Alessandro Corona': 1,
          'Aurelian Georgescu': 2,
          'Charles King': 3,
          'Ignacio Porchas': 4,
          'Robert Lee': 5,
          'Roman Poltoratsky': 6},
 'Season': {'Summer': 0},
 'Sex': {'M': 0},
 'Sport': {'Athletics': 0,
           'Basketball': 1,
           'Canoeing': 2,
           'Cycling': 3,
           'Gymnastics': 4,
           'Rowing': 5},
 'Team': {'Australia': 0,
          'Italy': 1,
          'Mexico': 2,
          'Romania': 3,
          'Russia': 4,
          'UK': 5,
          'Uzbekistan': 6}}

Your Encoded DataFrame:

(The encoded DataFrame itself is displayed as the table shown above.)

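For reference, an encoding map like the one above can be derived by mapping each categorical column's sorted unique values to consecutive integers. This is a minimal sketch, not the graded solution; the column names are illustrative, and details such as NaN handling are governed by the exercise statement itself.

```python
import pandas as pd

def sketch_label_encoding(df: pd.DataFrame) -> tuple:
    """Map each object-dtype column's sorted unique values to 0..n-1."""
    encoding_map = {}
    encoded = df.copy()
    for col in df.columns:
        if df[col].dtype == object:
            # Sorted order yields alphabetical codes, matching the demo map above.
            encoding_map[col] = {v: i for i, v in enumerate(sorted(df[col].unique()))}
            encoded[col] = encoded[col].map(encoding_map[col])
    return encoding_map, encoded

demo = pd.DataFrame({'City': ['Sydney', 'Atlanta', 'Berlin'], 'Year': [2000, 1996, 1936]})
mapping, enc = sketch_label_encoding(demo)
# mapping['City'] == {'Atlanta': 0, 'Berlin': 1, 'Sydney': 2}; 'Year' is numeric and untouched.
```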
The cell below will test your solution for label_encoding (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [27]:
### Test Cell - Exercise 8  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=label_encoding,
              ex_name='label_encoding',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to label_encoding did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=label_encoding,
              ex_name='label_encoding',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to label_encoding did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 7.95 seconds
memory after test: 0.52 MB
memory peak during test: 38.66 MB
initial memory usage: 0.00 MB
Test duration: 7.95 seconds
memory after test: 0.52 MB
memory peak during test: 38.64 MB
Passed! Please submit.

Exercise 9: (2 points)

build_traintest

Your task: define build_traintest as follows:

Now that we've encoded our Pandas DataFrame in the prior exercise, we can split our data into random training and testing subsets.

Input:

  • encoded_df: The result of Exercise 8 - A label encoded Pandas DataFrame (provided in solution/demo below)
  • featurecol: a string representing which column contains the response variable
  • testsize: a float representing the proportion of the dataset to allocate to the test split
  • random_state: an integer representing a random seed

Return: A tuple containing the following:

  • A Pandas DataFrame representing the independent (aka feature) variables (Xs) for the trainset
  • A Pandas DataFrame representing the independent (aka feature) variables (Xs) for the testset
  • A Python list representing the dependent (aka response) variable (Y) values for the trainset
  • A Python list representing the dependent (aka response) variable (Y) values for the testset

Requirements:

  • Use the train_test_split function within sklearn.model_selection module
  • The column named by featurecol must be excluded from the independent variables (Xs) and used as the dependent variable (Ys)
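To see how the testsize and random_state parameters behave, here is a minimal sketch on a tiny illustrative frame (hypothetical data, not the graded inputs or the graded solution):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 10 rows, with 'Medal' playing the role of featurecol.
df = pd.DataFrame({'Age': range(10), 'Height': range(10, 20), 'Medal': [0, 1] * 5})

X = df.drop('Medal', axis=1)   # the featurecol column is excluded from the Xs
Y = df['Medal']                # and becomes the dependent variable

# test_size=0.2 holds out ceil(10 * 0.2) = 2 rows; random_state fixes the shuffle.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=6040)
# len(X_train) == 8, len(X_test) == 2
```

Re-running with the same random_state reproduces the same split, which is what makes the exercise's output deterministic.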
In [28]:
### Solution - Exercise 9  
def build_traintest(encoded_df: pd.DataFrame, featurecol: str, testsize: float = .2, random_state: int = 6040) -> tuple:
    from sklearn.model_selection import train_test_split
    ### BEGIN SOLUTION
    X = encoded_df.drop(featurecol, axis=1)
    Y = encoded_df[featurecol]
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=testsize, random_state=random_state)
    return (X_train, X_test, list(Y_train), list(Y_test))
    ### END SOLUTION

### Demo function call
build_traintest_encoded_df_demo, featurecol_demo, testsize_demo, random_state_demo = utils.load_object_from_publicdata('build_traintest_demo.dill')
(X_train, X_test, Y_train, Y_test) = build_traintest(build_traintest_encoded_df_demo, featurecol_demo, testsize_demo, random_state_demo)
print('Your X_train: ')
display(X_train)
print('\nYour X_test: ')
display(X_test)
print('\nYour Y_train: ')
print(Y_train)
print('\nYour Y_test: ')
print(Y_test)
Your X_train: 
Sex Age Height Weight NOC Year Season City Sport Event
171438 0 31.0 180.0 72.0 69 1996 0 4 43 455
72946 1 23.0 179.0 66.0 90 1964 0 39 22 324
115541 1 24.0 180.0 84.0 90 1980 0 22 43 437
261108 0 23.0 171.0 60.0 69 2016 0 27 54 677
180003 1 25.0 198.0 89.0 216 1996 0 4 63 712
32650 1 25.0 NaN NaN 79 1936 0 7 28 390
200531 1 30.0 173.0 68.0 104 2002 1 29 11 168
57171 1 26.0 178.0 67.0 122 1956 0 19 5 81
178775 0 24.0 172.0 65.0 192 1996 0 4 14 222
119342 1 20.0 182.0 81.0 227 1980 0 22 43 449
107599 1 28.0 177.0 77.0 152 1976 0 21 14 210
95677 1 NaN 186.0 75.0 125 1972 0 23 24 348
71075 1 14.0 173.0 68.0 46 1964 0 39 54 650
54540 1 25.0 179.0 83.0 12 1956 0 19 63 712
203924 0 19.0 147.0 51.0 29 2004 0 3 27 389
228995 0 19.0 170.0 44.0 129 2008 0 6 5 141
Your X_test: 
Sex Age Height Weight NOC Year Season City Sport Event
84269 0 21.0 164.0 54.0 109 1968 0 20 62 711
89431 1 26.0 167.0 68.0 193 1968 1 12 1 2
122072 0 21.0 163.0 48.0 22 1984 0 18 3 38
103730 1 30.0 184.0 82.0 55 1976 0 21 46 491
Your Y_train: 
[0, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

Your Y_test: 
[2, 2, 2, 1]

Example: A correct implementation should produce, for the demo, the following output. Run the cell below to view the expected output.

In [29]:
build_traintest_demo_CORRECT=utils.load_object_from_publicdata('build_traintest_demo_CORRECT.dill')
# print('Demo Correct X_train: ')
# display(build_traintest_demo_CORRECT[0])
# print('\nDemo Correct X_test: ')
# display(build_traintest_demo_CORRECT[1])
# print('\nDemo Correct Y_train: ')
# print(build_traintest_demo_CORRECT[2])
# print('\nDemo Correct Y_test: ')
# print(build_traintest_demo_CORRECT[3])

The demo should display this printed output.

Your X_train: 
        Sex   Age  Height  Weight  NOC  Year  Season  City  Sport  Event
171438    0  31.0   180.0    72.0   69  1996       0     4     43    455
72946     1  23.0   179.0    66.0   90  1964       0    39     22    324
115541    1  24.0   180.0    84.0   90  1980       0    22     43    437
261108    0  23.0   171.0    60.0   69  2016       0    27     54    677
180003    1  25.0   198.0    89.0  216  1996       0     4     63    712
32650     1  25.0     NaN     NaN   79  1936       0     7     28    390
200531    1  30.0   173.0    68.0  104  2002       1    29     11    168
57171     1  26.0   178.0    67.0  122  1956       0    19      5     81
178775    0  24.0   172.0    65.0  192  1996       0     4     14    222
119342    1  20.0   182.0    81.0  227  1980       0    22     43    449
107599    1  28.0   177.0    77.0  152  1976       0    21     14    210
95677     1   NaN   186.0    75.0  125  1972       0    23     24    348
71075     1  14.0   173.0    68.0   46  1964       0    39     54    650
54540     1  25.0   179.0    83.0   12  1956       0    19     63    712
203924    0  19.0   147.0    51.0   29  2004       0     3     27    389
228995    0  19.0   170.0    44.0  129  2008       0     6      5    141

Your X_test: 
        Sex   Age  Height  Weight  NOC  Year  Season  City  Sport  Event
84269     0  21.0   164.0    54.0  109  1968       0    20     62    711
89431     1  26.0   167.0    68.0  193  1968       1    12      1      2
122072    0  21.0   163.0    48.0   22  1984       0    18      3     38
103730    1  30.0   184.0    82.0   55  1976       0    21     46    491

Your Y_train: 
[0, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

Your Y_test: 
[2, 2, 2, 1]


The cell below will test your solution for build_traintest (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [30]:
### Test Cell - Exercise 9  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=build_traintest,
              ex_name='build_traintest',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to build_traintest did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=build_traintest,
              ex_name='build_traintest',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to build_traintest did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 6.99 seconds
memory after test: 0.61 MB
memory peak during test: 73.83 MB
initial memory usage: 0.00 MB
Test duration: 7.22 seconds
memory after test: 0.73 MB
memory peak during test: 73.83 MB
Passed! Please submit.

Exercise 10: (1 points)

evaluation_results

Your task: define evaluation_results as follows:

Input:

  • Y_test: a list object which is one of the results from Exercise 9 (provided in solution/demo below)
  • Y_pred: a list object which contains the results from a model in the form of 0s and 1s

Return: A dictionary {key:value} containing the following:

  • accuracy: numpy.float64, Rounded to 3 decimal places
  • precision: numpy.float64, Weighted average, Zero division set to 0, Rounded to 3 decimal places
  • recall: numpy.float64, Weighted average, Rounded to 3 decimal places
  • f1_score: numpy.float64, Weighted average, Rounded to 3 decimal places
  • confusion_matrix: Converted to a list

Requirements:

  • From sklearn.metrics, return the evaluation results mentioned above
  • You must convert the confusion_matrix to a list
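To make the "weighted average" requirement concrete, here is a minimal sketch on illustrative labels (not the demo data): the 'weighted' option averages per-class scores, weighting each class by its support in the true labels.

```python
from sklearn.metrics import precision_score, recall_score, confusion_matrix

# Illustrative labels: class 0 has support 3, class 1 has support 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

# Class 0 recall = 2/3, class 1 recall = 1/1, so weighted recall = (3*(2/3) + 1*1)/4 = 0.75.
recall_w = round(recall_score(y_true, y_pred, average='weighted'), 3)

# Class 0 precision = 2/2, class 1 precision = 1/2, so weighted precision = (3*1.0 + 1*0.5)/4 = 0.875.
# zero_division=0 only matters when a class is never predicted; it avoids a warning/NaN.
precision_w = round(precision_score(y_true, y_pred, average='weighted', zero_division=0), 3)

# Rows are true labels, columns are predictions; .tolist() converts the numpy array to a list.
cm = confusion_matrix(y_true, y_pred).tolist()  # [[2, 1], [0, 1]]
```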
In [31]:
### Solution - Exercise 10  
def evaluation_results(ytest: list, ypred: list) -> dict:
    Y_test=pd.Series(ytest)
    Y_pred=pd.Series(ypred)
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

    ### BEGIN SOLUTION
    test_accuracy = round(accuracy_score(Y_test, Y_pred),3)
    precision = round(precision_score(Y_test, Y_pred, average='weighted', zero_division=0),3)
    recall = round(recall_score(Y_test, Y_pred, average='weighted'),3)
    f1 = round(f1_score(Y_test, Y_pred, average='weighted'),3)
    conf_matrix = confusion_matrix(Y_test, Y_pred).tolist()
    return {'accuracy':test_accuracy,'precision':precision,'recall':recall,'f1_score':f1,'confusion_matrix':conf_matrix}
    ### END SOLUTION

### Demo function call
ypred, ytest = utils.load_object_from_publicdata('evaluation_results_demo.dill')
demo_results = evaluation_results(ytest, ypred)
pprint(demo_results)
{'accuracy': 0.8,
 'confusion_matrix': [[6, 1], [1, 2]],
 'f1_score': 0.8,
 'precision': 0.8,
 'recall': 0.8}
In [32]:
evaluation_results_demo_CORRECT=utils.load_object_from_publicdata('evaluation_results_demo_CORRECT.dill')
#pprint(evaluation_results_demo_CORRECT)

The demo should display this printed output.

{'accuracy': 0.8,
 'confusion_matrix': [[6, 1], [1, 2]],
 'f1_score': 0.8,
 'precision': 0.8,
 'recall': 0.8}


The cell below will test your solution for evaluation_results (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [33]:
### Test Cell - Exercise 10  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=evaluation_results,
              ex_name='evaluation_results',
              key=b'KO0oYn6hHN95j_MEy1en0APR6T34clbtyG3dPYA30GI=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to evaluation_results did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=evaluation_results,
              ex_name='evaluation_results',
              key=b'tMhX5OazHXD9h8KreWYQpiL3U_KGme-WDdBzhytfWwk=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to evaluation_results did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 0.38 seconds
memory after test: 0.04 MB
memory peak during test: 1.44 MB
initial memory usage: 0.00 MB
Test duration: 0.38 seconds
memory after test: 0.03 MB
memory peak during test: 1.43 MB
Passed! Please submit.

FIN

Congratulations. Remember to submit your notebook.

Postscript

We wrote all these functions, so what? Many predictions are possible, but let's try to predict who would win medals at the 2016 Olympics using the data we've manipulated. In particular, we're going to use the columns ['Sex','Age','Height','Weight','NOC','Year','Season','City','Sport','Event'] as our independent (feature) variables, and the type of medal as the dependent (response) variable. Code inspiration was drawn from two Kaggle notebooks.

Like above, we start from cleansed data. Using data from 1992-2016, we one-hot encoded the categorical variables (somewhat similar to the function you wrote above). Next, we created train/test sets from 1992-2014, imputed missing values using sklearn's IterativeImputer, and scaled the data using sklearn's StandardScaler. We then ran the data through a robust suite of algorithms: Decision Trees, Random Forests, XGBoost, LightGBM, Logistic Regression, K-Nearest Neighbors, Naive Bayes, AdaBoost, and Gradient Boosting.
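As a point of reference, the one-hot encoding step (distinct from the label encoding you implemented above) can be sketched with pandas' get_dummies. The columns here are illustrative, not the actual modeling frame:

```python
import pandas as pd

# Illustrative frame: one categorical column plus one numeric column.
demo = pd.DataFrame({'Sex': ['M', 'F', 'M'], 'Age': [24, 21, 27]})

# get_dummies expands each listed categorical column into one indicator column per level,
# e.g. 'Sex' becomes 'Sex_F' and 'Sex_M'; numeric columns pass through unchanged.
encoded = pd.get_dummies(demo, columns=['Sex'])
# encoded.columns: Age, Sex_F, Sex_M
```

Unlike label encoding, this introduces no artificial ordering between category levels, at the cost of widening the frame by one column per level.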

Random Forests and LightGBM appear to have performed the best.

[Figure: model evaluation metrics on the 1992-2014 test set]

The 2016 Olympics served as our true holdout test set. Once the models were properly tuned on data from 1992 to 2014, we used them to make predictions on the 2016 data. Encouragingly, the evaluation metrics deviate little from those on the earlier test set, suggesting our models may be robust. Once again, Random Forests and LightGBM appear to perform best.

[Figure: model evaluation metrics on the 2016 holdout set]

The next step is to tune the models' hyperparameters; we'll leave that for future work. Most of the code for this analysis appears below, with the One-Hot Encoding, Train/Test split, and Evaluation Metrics functionality omitted.

modeldata=sorted_df.copy(deep=True)
keepcols=['Sex','Age','Height','Weight','NOC','Year','Season','City','Sport','Event','Medal']
modeldata=modeldata[keepcols]
modeldata=modeldata[(modeldata['Year']>=1992)]
#########################################
#One-Hot Encoding code HERE
###

#########################################
#Train/Test sets
olympics_non2016_medal_winners=alldata[alldata['Year']!=2016].copy(deep=True)
X = olympics_non2016_medal_winners.drop('Medal', axis=1)
Y = olympics_non2016_medal_winners['Medal']
###
#Train/Test Set code HERE

#########################################
#Data Imputation 1992-2014
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
import numpy as np
import pandas as pd
from tqdm import tqdm

# **Check for NaN values before imputation**
print("NaN values in X_train before imputation:", np.isnan(X_train).sum())
print("NaN values in X_test before imputation:", np.isnan(X_test).sum())

# **Advanced Imputation using IterativeImputer**
imputer = IterativeImputer(random_state=6040)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# **Check for NaN values after imputation**
print("NaN values in X_train after imputation:", np.isnan(X_train_imputed).sum())
print("NaN values in X_test after imputation:", np.isnan(X_test_imputed).sum())

# Scale the data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

#########################################
#Data Imputation and 2016 split
olympics_2016_medal_winners=alldata[alldata['Year']==2016].copy(deep=True)
olympics_2016_medal_winners.head()
# Define the categorical columns
categorical_columns = ['Sex', 'NOC', 'Season', 'City', 'Sport', 'Event']

# Filter the 2016 data for prediction
X_test_2016 = olympics_2016_medal_winners.drop('Medal', axis=1)
actual_medals_2016 = olympics_2016_medal_winners['Medal']

# **Check for NaN values before imputation**
print("NaN values in X_test_2016 before imputation:", np.isnan(X_test_2016).sum())

# **Advanced Imputation using IterativeImputer**
imputer = IterativeImputer(random_state=6040)
a = imputer.fit_transform(X_train)
X_test_2016_imputed = imputer.transform(X_test_2016)

# **Check for NaN values after imputation**
print("NaN values in X_test_2016 after imputation:", np.isnan(X_test_2016_imputed).sum())

# Scale the data using StandardScaler
scaler = StandardScaler()
b = scaler.fit_transform(X_train_imputed)
X_test_2016_scaled = scaler.transform(X_test_2016_imputed) 

#########################################
# Modeling
#########################################
# Initialize the models with hyperparameters to be tuned
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=6040, class_weight='balanced'),
    "Random Forest": RandomForestClassifier(random_state=6040, class_weight='balanced'),
    "XGBoost": XGBClassifier(random_state=6040, use_label_encoder=False, eval_metric='logloss'),
    "LightGBM": lgb.LGBMClassifier(random_state=6040),
    "Logistic Regression": LogisticRegression(random_state=6040, max_iter=1000, class_weight='balanced'),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(random_state=6040),
    "Gradient Boosting": GradientBoostingClassifier(random_state=6040)
}

# Hyperparameter grids for GridSearchCV
param_grids = {
    "Decision Tree": {
        'max_depth': [3, 5],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 4]
    },
    "Random Forest": {
        'n_estimators': [100],
        'max_depth': [None, 10],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1]
    },
    "XGBoost": {
        'n_estimators': [100],
        'max_depth': [3, 5],
        'learning_rate': [0.01, 0.1]
    },
    "LightGBM": {
        'n_estimators': [100],
        'num_leaves': [31, 50],
        'learning_rate': [0.01, 0.1]
    },
    "Logistic Regression": {
        'C': [0.1, 1.0],
        'solver': ['lbfgs', 'liblinear']
    },
    "K-Nearest Neighbors": {
        'n_neighbors': [3, 5],
        'weights': ['uniform', 'distance']
    },
    "Neural Network": {
        'hidden_layer_sizes': [(50, 50)],
        'activation': ['relu', 'tanh'],
        'solver': ['adam'],
        'alpha': [0.001]
    },
    "AdaBoost": {
        'n_estimators': [50],
        'learning_rate': [0.01, 0.1]
    },
    "Gradient Boosting": {
        'n_estimators': [100],
        'max_depth': [3, 5],
        'learning_rate': [0.01, 0.1]
    }
}

# Cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=6040)

#########################################
#Store Evaluation Metrics as lists
# The display code at the bottom expects a dict of lists; a possible (hypothetical) initialization:
evaluation_metrics = {'Model': [], 'Test Accuracy': [], 'Precision': [], 'Recall': [], 'F1-Score': []}

# Training the models with hyperparameter tuning using GridSearchCV
for model_name, model in models.items():
    print(f"\nTuning {model_name}...")
    with tqdm(total=1, desc=model_name) as pbar:
        grid_search = GridSearchCV(model, param_grids.get(model_name, {}), cv=cv, scoring='accuracy')
        grid_search.fit(X_train_scaled, Y_train)

        # Get the best model after hyperparameter tuning
        best_model = grid_search.best_estimator_

        # Evaluate train accuracy with cross-validation
        train_accuracy = grid_search.best_score_

        # Predict on the test data
        Y_pred = best_model.predict(X_test_scaled)

        #########################################
        # Calculate and store evaluation metrics for 1992-2014 data as well as 2016 holdout data.
        # YOUR CODE from evaluation metrics function would go HERE
        # Append to lists initialized above

        # Update progress bar
        pbar.update(1)


#Show Evaluation Metrics as Pandas Dataframe
dd={}
for i in range(len(evaluation_metrics['Model'])):
    dd[i]={v:evaluation_metrics[v][i] for v in ['Model','Test Accuracy','Precision','Recall','F1-Score']}
d=pd.DataFrame.from_dict(dd,orient='index')
display(d)