Final Exam, Spring 2022: March (through May) Madness

Version 1.0

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of Pandas, SQL, and numeric computation. It has 9 exercises, numbered 0 to 8. There are 19 available points. However, to earn 100% the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head, or a chunk of rows at a time).

Exercise point breakdown:

  • Exercise 0: 1 point
  • Exercise 1: 3 points
  • Exercise 2: 1 point
  • Exercise 3: 3 points
  • Exercise 4: 2 points
  • Exercise 5: 3 points
  • Exercise 6: 3 points
  • Exercise 7: 1 point
  • Exercise 8: 2 points

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed, and take short breaks at your leisure

Basketball basics

In this notebook we want to predict which team will win a basketball game based on their past performance in a given season. You do not have to know anything about basketball aside from the background below to complete this notebook.

  • Games. A basketball game is played between two teams.
  • In most games, there is a "home" team and an "away" team. However, sometimes games are played at a neutral site, in which case neither team is "home" or "away."
  • Scoring and winning. In a game, the team that scores more points wins. There are no ties.
  • Possessions. A possession is an event where one team continuously controls the ball.
In [1]:
import sqlite3
import numpy as np
import pandas as pd
import re
conn = sqlite3.connect('resource/asnlib/publicdata/basketball_db.sqlite')

Exercise 0 - (1 Point):

We have a big collection of real data from several seasons of men's NCAA basketball. We are most interested in one table of this database, named MRegularSeasonDetailedResults.

Complete the function get_cols(conn) to return a list of the columns in the MRegularSeasonDetailedResults table in the db connection conn. The order of the columns in the list should be the same as the order they appear in the database table.

In [2]:
### Define get_cols
def get_cols(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''select * from MRegularSeasonDetailedResults limit 1''', conn)\
        .columns\
        .tolist()
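
For reference, SQLite can also report a table's columns directly. Below is a hedged alternative sketch using PRAGMA table_info (not the exam's required approach; the function name is illustrative). PRAGMA table_info returns one row of metadata per column, in table order:

# Alternative sketch: ask SQLite for the column metadata directly
def get_cols_pragma(conn):
    info = pd.read_sql("PRAGMA table_info('MRegularSeasonDetailedResults')", conn)
    return info['name'].tolist()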

The demo cell below should display the following output:

['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']
In [3]:
### call demo function
print(get_cols(conn))
['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']

The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [4]:
### test_cell_ex0
from tester_fw.testers import Tester_ex0
tester = Tester_ex0()
for _ in range(20):
    try:
        tester.run_test(get_cols)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

Abridged data dictionary

In the MRegularSeasonDetailedResults table, each record is one game. All columns prefixed by 'W' refer to the winning team, and columns prefixed by 'L' refer to the losing team. We are primarily interested in the following columns:

| Column | Type    | Has Prefix | Description |
|--------|---------|------------|-------------|
| Season | Integer | No         | Identifies the calendar year when a season ends. |
| DayNum | Integer | No         | Identifies the day in a season when a game occurred. |
| TeamID | Integer | W or L     | Identifies the teams participating in a game. |
| Score  | Integer | W or L     | Number of points a team scored in a game. |
| Loc    | Char    | W Only     | Identifies whether the winning team was home ('H'), away ('A'), or the game was played at a neutral site ('N'). |
| FGA    | Integer | W or L     | Number of field goal attempts a team had in a game. |
| FTA    | Integer | W or L     | Number of free throw attempts a team had in a game. |
| TO     | Integer | W or L     | Number of turnovers a team had in a game. |
| OR     | Integer | W or L     | Number of offensive rebounds a team had in a game. |

For example, the column with the winning team's ID will be WTeamID, and the column with the losing team's free-throw attempts will be LFTA.

Exercise 1 - (3 Points):

We want to extract some information about each team involved in a game from that team's perspective. We will use the term "primary team" to refer to this team and "primary team's opponent" to refer to the other team in the game. Do not infer any other meaning from this term. Since there are two participants in each game, there will be a record from the perspective of both teams after the extraction (i.e., the result will contain one record where the winning team is the primary team and one record where the losing team is the primary team).

Complete the function get_pace_adjusted(conn) to query the table MRegularSeasonDetailedResults using the db connection conn and return a Pandas DataFrame with the columns outlined below. The "Winning Team" and "Losing Team" headers in the chart below indicate which columns in MRegularSeasonDetailedResults to use in your computations.

| Column Name   | dtype   | description | Special Considerations | Winning Team | Losing Team |
|---------------|---------|-------------|-------------------------|--------------|-------------|
| Won           | int64   | 1 if the primary team won the game, 0 if the primary team lost | | 1 | 0 |
| Season        | int64   | Current Season | | | |
| DayNum        | int64   | Current DayNum | | | |
| TeamID        | int64   | Team id for primary team | | WTeamID | LTeamID |
| OppID         | int64   | Team id for primary team's opponent | | LTeamID | WTeamID |
| Home          | int64   | 1 if the primary team is at home for the game, 0 otherwise | | WLoc is 'H' | WLoc is 'A' |
| Pos           | float64 | Estimated number of possessions for primary team | Round to 5 decimal places | W prefix | L prefix |
| OppPos        | float64 | Estimated number of possessions for primary team's opponent | Round to 5 decimal places | L prefix | W prefix |
| PtsForPer100  | float64 | 100 × primary team's score ÷ estimate of primary team's possessions | Round to 5 decimal places (only round after division) | W prefix | L prefix |
| PtsAgstPer100 | float64 | 100 × primary team's opponent's score ÷ estimate of primary team's opponent's possessions | Round to 5 decimal places (only round after division) | L prefix | W prefix |

There is no column for possessions in our source data. We will need to estimate it! The formula below is widely used in sports analytics to estimate the number of possessions a team had in a basketball game. (POS is # of possessions):

POS = FGA + 0.44(FTA) + TO - OR
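
As a quick sanity check, plugging in the winning team's numbers from Exercise 1's demo input below (FGA=58, FTA=18, TO=23, OR=14):

58 + 0.44*18 + 23 - 14   # = 74.92, matching the demo's Pos value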

You can derive all required results from these columns: Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO.

Notes:

  • Each record in the database table MRegularSeasonDetailedResults will correspond to two records in the result. One record where the winning team is the primary team and one record where the losing team is the primary team.
  • For neutral site games, neither team will be home or away. The WLoc column will have an 'N' entry.
  • The columns should be in the exact order given above, but the records can be sorted any way you like.
  • The dtypes attribute of your result must match exactly.
  • This question can be answered using either SQL or Pandas.
    • The UNION ALL phrase may be helpful if you are using SQL.
    • The pd.concat() method may be helpful if you are using Pandas.
  • One strategy that will work is making a table/df for all the winning teams and another for all the losing teams then vertically "stacking" them.
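
For reference, here is a minimal Pandas sketch of that stacking strategy (a hedged illustration; the helper name one_side is not part of the exam):

# Sketch: build one frame per perspective, then stack them with pd.concat
def get_pace_adjusted_pandas(conn):
    df = pd.read_sql('select * from MRegularSeasonDetailedResults', conn)
    def one_side(df, us, them, won):
        # possession estimates for the primary team and its opponent
        pos = df[us+'FGA'] + 0.44*df[us+'FTA'] + df[us+'TO'] - df[us+'OR']
        opp_pos = df[them+'FGA'] + 0.44*df[them+'FTA'] + df[them+'TO'] - df[them+'OR']
        return pd.DataFrame({
            'Won': won,
            'Season': df['Season'], 'DayNum': df['DayNum'],
            'TeamID': df[us+'TeamID'], 'OppID': df[them+'TeamID'],
            'Home': (df['WLoc'] == ('H' if won else 'A')).astype('int64'),
            'Pos': pos.round(5), 'OppPos': opp_pos.round(5),
            'PtsForPer100': (100*df[us+'Score']/pos).round(5),   # round only after dividing
            'PtsAgstPer100': (100*df[them+'Score']/opp_pos).round(5),
        })
    return pd.concat([one_side(df, 'W', 'L', 1), one_side(df, 'L', 'W', 0)],
                     ignore_index=True)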
In [5]:
### Define get_pace_adjusted
def get_pace_adjusted(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
    -- winning team
    select
        1 as Won,
        season, DayNum,
        wteamid as TeamID,
        lteamid as OppID,
        wloc='H' as Home,
        wFGA + 0.44*(wFTA) + wTO - wOR as Pos,
        lFGA + 0.44*(lFTA) + lTO - lOR as OppPos,
        100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsForPer100,
        100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsAgstPer100
    from
        mregularseasondetailedresults
    
    union all
    
    -- losing team
    select
        0 as Won,
        season, DayNum,
        lteamid as TeamID,
        wteamid as OppID,
        wloc='A' as Home,
        lFGA + 0.44*(lFTA) + lTO - lOR as Pos,
        wFGA + 0.44*(wFTA) + wTO - wOR as OppPos,
        100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsForPer100, 
        100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsAgstPer100
        
    from
        mregularseasondetailedresults
    
    ''', conn).round(5)

This is the demo input:

   Season  DayNum  WTeamID  WScore  LTeamID  LScore WLoc  WFGA  WFTA  WOR  WTO  LFGA  LFTA  LOR  LTO 
0    2003      10     1104      68     1328      62    N    58    18   14   23    53    22   10   18

The demo cell below should display the following output:

      Won  Season  DayNum  TeamID  OppID  Home    Pos  OppPos  PtsForPer100     PtsAgstPer100  
   0    1    2003      10    1104   1328     0  74.92   70.68      90.76348          87.71930
   1    0    2003      10    1328   1104     0  70.68   74.92      87.71930          90.76348

Note that there are two rows, but they come from a single input row.

In [6]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_1 = pd.read_sql('select Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO from MRegularSeasonDetailedResults limit 1', conn)
demo_conn_1 = dfs_to_conn({'MRegularSeasonDetailedResults': demo_df_1})
# demo_df_1
In [7]:
### call demo function
print(get_pace_adjusted(demo_conn_1))
   Won  Season  DayNum  TeamID  OppID  Home    Pos  OppPos  PtsForPer100  \
0    1    2003      10    1104   1328     0  74.92   70.68      90.76348   
1    0    2003      10    1328   1104     0  70.68   74.92      87.71930   

   PtsAgstPer100  
0       87.71930  
1       90.76348  

The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [8]:
### test_cell_ex1
from tester_fw.testers import Tester_ex1
tester = Tester_ex1()
for _ in range(20):
    try:
        tester.run_test(get_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 2 - (1 Point):

Our next task is to compute daily totals of Pos, OppPos, PtsForPer100, and PtsAgstPer100 as well as a count of the number of teams participating in games that day.

Complete the function daily_totals(conn) to query the table PaceAdjusted (structured the same as the output of the previous exercise) in the db connection conn and return a Pandas DataFrame with the following columns:

| Column Name | dtype   | description |
|-------------|---------|-------------|
| Season      | int64   | |
| DayNum      | int64   | |
| Pos         | float64 | Sum of the Pos values in PaceAdjusted occurring on each unique combination of Season and DayNum |
| Pph         | float64 | Sum of the PtsForPer100 values in PaceAdjusted occurring on each unique combination of Season and DayNum |
| Count       | int64   | Count of rows in PaceAdjusted with a unique combination of Season and DayNum |

Notes:

  • The columns should be in the exact order given above, but the records can be sorted any way you like.
  • The dtypes attribute of your result must match exactly.
  • This question can be answered using either SQL or Pandas.
In [9]:
### Define daily_totals
def daily_totals(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
        select
            Season, DayNum,
            sum(pos) as Pos,
            sum(ptsforper100) as Pph,
            count(*) as Count
        from 
            paceadjusted
        group by
            season, daynum
    ''', conn)

The demo cell below should display the following output:

   Season  DayNum     Pos        Pph  Count
0    2010      94  277.28  433.33581      4
1    2010     108  281.64  415.62023      4
2    2010     122  265.88  395.26418      4
3    2016      49  365.04  467.93858      5
4    2016     115  204.08  309.66380      3
In [10]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_2 = pd.read_sql('select * from PaceAdjustedSample', conn)
demo_conn_2 = dfs_to_conn({'PaceAdjusted': demo_df_2})
# demo_df_2.sort_values(['Season', 'DayNum']).reset_index(drop=True)
In [11]:
### call demo function
print(daily_totals(demo_conn_2))
   Season  DayNum     Pos        Pph  Count
0    2010      94  277.28  433.33581      4
1    2010     108  281.64  415.62023      4
2    2010     122  265.88  395.26418      4
3    2016      49  365.04  467.93858      5
4    2016     115  204.08  309.66380      3

The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [12]:
### test_cell_ex2
from tester_fw.testers import Tester_ex2
tester = Tester_ex2()
for _ in range(20):
    try:
        tester.run_test(daily_totals)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

On Window Functions...

This is useful information

Window functions define a slice of data relative to each row in a dataset and then perform an aggregate or order-based calculation on the slice. For example, say you want to perform the following calculation for each row i in DailyTotals:

SumPos_i := sum of the Pos column over all rows j where Season_j = Season_i and DayNum_j ≤ DayNum_i

Window functions can do this. Below are examples of this computation in Pandas and SQL. First, we will load some data to demonstrate.

In [13]:
# define demo data
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
daily_totals_df = demo_conn_dfs_ex3['DailyTotals']

The code below will return the sample data with a new column "SumPos".

In [14]:
### Pandas Example
def pandas_window_helper(group):
    group = group.sort_values('DayNum')
    group['SumPos'] = group['Pos'].expanding(1).sum()
    return group
pandas_ex = daily_totals_df.groupby('Season', as_index=False).apply(pandas_window_helper)

The code below computes the same "SumPos" column using SQL.

In [15]:
### SQL Example
query = '''
    SELECT 
        *,
        SUM(pos) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as SumPos
    FROM
        DailyTotals
    ORDER BY
        Season, DayNum
'''
sql_ex = pd.read_sql(query, demo_conn_ex3)

The code below will print the result and verify that the two approaches give equivalent results.

In [16]:
display(sql_ex)
print(f'''SQL and Pandas results match? {
((sql_ex.reset_index(drop=True) - pandas_ex.reset_index(drop=True)).abs() <= 0.00001).all().all()}''')
   Season  DayNum       Pos          Pph  Count    SumPos
0    2014      12   8725.84  13482.43561    128   8725.84
1    2014      15   6146.96   9299.04948     88  14872.80
2    2014      18   7301.64  11764.86247    108  22174.44
3    2014      41   4452.52   6702.86892     66  26626.96
4    2014      44   5381.96   8625.92495     82  32008.92
5    2014      59   8673.96  13265.97248    128  40682.88
6    2014     102   1400.72   2062.01746     20  42083.60
7    2014     127   1696.20   2684.90264     26  43779.80
8    2021      29   4150.56   5600.57650     58   4150.56
9    2021      96  13485.76  19887.61275    194  17636.32
SQL and Pandas results match? True

Exercise 3 - (3 Points):

Our next task is to compute the weighted running average for Pos and Pph for all games played in a Season up to and including each DayNum. We want a snapshot of these averages as if it were that particular day and we had no knowledge of events occurring after that day.

Complete the function get_running_avg_pace(conn) to query the table DailyTotals (structured the same as the output of the previous exercise) using the db connection conn and return a Pandas DataFrame with the columns described below. You should calculate the intermediate values first, then divide to get the end result.

Intermediate values

  • SumPos - Running sum of the "Pos" column for current "Season" up to and including the current "DayNum".
  • SumPph - Running sum of the "Pph" column for current "Season" up to and including the current "DayNum".
  • SumCount - Running sum of the "Count" column for current "Season" up to and including the current "DayNum".

| Column Name | dtype   | description | Special Considerations |
|-------------|---------|-------------|-------------------------|
| Season      | int64   | Current Season | |
| DayNum      | int64   | Current DayNum | |
| AvgPace     | float64 | SumPos ÷ SumCount | Round to 5 decimal places |
| AvgPPH      | float64 | SumPph ÷ SumCount | Round to 5 decimal places |

Notes:

  • The columns should be in the exact order given above, but the records can be sorted any way you like.
  • The dtypes attribute of your result must match exactly.
  • This question can be answered using either SQL or Pandas.
    • See "On Window Functions..." above this exercise for more info on computing a running sum.
    • For Pandas pd.DataFrame.groupby().apply() and pd.Series.expanding() may be useful in calculating intermediate values.
    • For SQL the "window function" syntax may be useful in calculating intermediate values.
      • SUM(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
In [17]:
### Define get_running_avg_pace
def get_running_avg_pace(conn):
    def window_helper(agg_col, partition_col, sort_col):
        return f'SUM({agg_col}) OVER(PARTITION BY {partition_col} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)'
    intermediate_result = pd.read_sql(f'''
    select
        Season, DayNum,
        {window_helper('Pos', 'Season', 'DayNum')} SumPos,
        {window_helper('Pph', 'Season', 'DayNum')} SumPph,
        {window_helper('Count', 'Season', 'DayNum')} SumCount
    from 
        DailyTotals
    ''', conn)
    df = intermediate_result
    df['AvgPace'] = (df['SumPos'] / df['SumCount']).round(5)
    df['AvgPPH'] = (df['SumPph'] / df['SumCount']).round(5)
    return df.drop(columns=['SumPos', 'SumPph', 'SumCount'])
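
For comparison, a hedged Pandas sketch of the same computation using a grouped cumulative sum (the function name is illustrative):

# Sketch: running sums via groupby(...).cumsum(), then divide
def get_running_avg_pace_pandas(conn):
    df = pd.read_sql('select * from DailyTotals', conn).sort_values(['Season', 'DayNum'])
    sums = df.groupby('Season')[['Pos', 'Pph', 'Count']].cumsum()  # running totals per season
    out = df[['Season', 'DayNum']].copy()
    out['AvgPace'] = (sums['Pos'] / sums['Count']).round(5)
    out['AvgPPH'] = (sums['Pph'] / sums['Count']).round(5)
    return out.reset_index(drop=True)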

The demo cell below should display the following output:

   Season  DayNum   AvgPace     AvgPPH
0    2014      12  68.17063  105.33153
1    2014      15  68.85556  105.46984
2    2014      18  68.43963  106.62453
3    2014      41  68.27426  105.76722
4    2014      44  67.81551  105.66767
5    2014      59  67.80480  105.23519
6    2014     102  67.87677  105.16634
7    2014     127  67.77059  105.08984
8    2021      29  71.56138   96.56166
9    2021      96  69.98540  101.14361
In [18]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
# demo_conn_dfs_ex3['DailyTotals'].sort_values(['Season', 'DayNum']).reset_index(drop=True)
In [19]:
### call demo function
print(get_running_avg_pace(demo_conn_ex3))
   Season  DayNum   AvgPace     AvgPPH
0    2014      12  68.17063  105.33153
1    2014      15  68.85556  105.46984
2    2014      18  68.43963  106.62453
3    2014      41  68.27426  105.76722
4    2014      44  67.81551  105.66767
5    2014      59  67.80480  105.23519
6    2014     102  67.87677  105.16634
7    2014     127  67.77059  105.08984
8    2021      29  71.56138   96.56166
9    2021      96  69.98540  101.14361

The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [20]:
### test_cell_ex3
from tester_fw.testers import Tester_ex3
tester = Tester_ex3()
for _ in range(20):
    try:
        tester.run_test(get_running_avg_pace)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 4 - (2 Points):

Suppose you are given two SQL tables:

  1. PaceAdjusted, structured the same as the output of Exercise 1.
  2. RunningAvgPace, structured the same as the output of Exercise 3.

Complete the function get_center_pace_adjusted(conn) to query tables RunningAvgPace and PaceAdjusted and return a Pandas DataFrame with the following columns:

| Column Name   | dtype   | description |
|---------------|---------|-------------|
| Won           | int64   | Unchanged from PaceAdjusted |
| Season        | int64   | Unchanged from PaceAdjusted |
| DayNum        | int64   | Unchanged from PaceAdjusted |
| TeamID        | int64   | Unchanged from PaceAdjusted |
| OppID         | int64   | Unchanged from PaceAdjusted |
| Home          | int64   | Unchanged from PaceAdjusted |
| Pos           | float64 | PaceAdjusted.Pos - RunningAvgPace.AvgPace for the corresponding Season and DayNum |
| OppPos        | float64 | PaceAdjusted.OppPos - RunningAvgPace.AvgPace for the corresponding Season and DayNum |
| PtsAgstPer100 | float64 | PaceAdjusted.PtsAgstPer100 - RunningAvgPace.AvgPPH for the corresponding Season and DayNum |
| PtsForPer100  | float64 | PaceAdjusted.PtsForPer100 - RunningAvgPace.AvgPPH for the corresponding Season and DayNum |

Notes:

  • The columns should be in the exact order given above, but the records can be sorted any way you like.
  • The dtypes attribute of your result must match exactly.
  • This question can be answered using either SQL or Pandas.
In [21]:
### Define get_center_pace_adjusted
def get_center_pace_adjusted(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
    select
        pa.Won, pa.Season, pa.DayNum, pa.TeamID, pa.OppID, pa.Home,
        pa.pos - rap.avgpace as Pos,
        pa.opppos - rap.avgpace as OppPos,
        pa.ptsagstper100 - rap.avgpph as PtsAgstPer100,
        pa.ptsforper100 - rap.avgpph as PtsForPer100
    from
        PaceAdjusted pa
        join
        RunningAvgPace rap
        on
            pa.season = rap.season and pa.daynum = rap.daynum
    ''', conn)
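
For reference, a hedged Pandas sketch of the same join (the function name is illustrative):

# Sketch: merge on (Season, DayNum), then subtract the daily running averages
def get_center_pace_adjusted_pandas(conn):
    pa = pd.read_sql('select * from PaceAdjusted', conn)
    rap = pd.read_sql('select * from RunningAvgPace', conn)
    df = pa.merge(rap, on=['Season', 'DayNum'])
    df['Pos'] = df['Pos'] - df['AvgPace']
    df['OppPos'] = df['OppPos'] - df['AvgPace']
    df['PtsAgstPer100'] = df['PtsAgstPer100'] - df['AvgPPH']
    df['PtsForPer100'] = df['PtsForPer100'] - df['AvgPPH']
    return df[['Won', 'Season', 'DayNum', 'TeamID', 'OppID', 'Home',
               'Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100']]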

The demo cell below should display the following output:

   Won  Season  DayNum  TeamID  OppID  Home       Pos    OppPos  \
0    1    2014      94    1374   1396     1  -3.86907  -2.18907   
1    0    2016     119    1224   1313     0   2.65129   5.57129   
2    0    2016      90    1353   1277     0   4.05371   2.13371   
3    1    2016     119    1300   1366     1  -8.78871  -9.30871   
4    1    2014      94    1383   1384     1  -2.30907  -4.66907   
5    1    2014      94    1190   1297     0  -9.42907  -9.62907   
6    0    2014      94    1223   1101     0   0.29093   0.69093   
7    0    2014      94    1218   1364     1  -1.98907  -3.58907   
8    0    2014      92    1159   1221     1 -13.28500 -12.96500   
9    0    2014      81    1297   1252     1   1.52248   1.76248   

   PtsAgstPer100  PtsForPer100  
0      -25.32271      13.22592  
1        7.78985     -12.71682  
2       29.99420     -19.84044  
3      -31.11146      -3.56274  
4      -22.13333      14.99904  
5      -18.41211       5.59099  
6       22.95005      14.80074  
7       12.70182      -7.11914  
8       22.07960      20.97894  
9        3.17351       2.09897
In [22]:
### define demo inputs
### use naming convention demo_varname_ex_* to name demo variables
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex4 = {}
demo_conn_dfs_ex4['PaceAdjusted'] = pd.read_sql('select * from PaceAdjustedSampleTwo', conn)
demo_conn_dfs_ex4['RunningAvgPace'] = pd.read_sql('select * from RunningAvgPaceSample', conn)
demo_conn_ex4 = dfs_to_conn(demo_conn_dfs_ex4)
# print('PaceAdjusted')
# display(demo_conn_dfs_ex4['PaceAdjusted'])
# print('RunningAvgPace')
# display(demo_conn_dfs_ex4['RunningAvgPace'])
In [23]:
### call demo function
print(get_center_pace_adjusted(demo_conn_ex4))
   Won  Season  DayNum  TeamID  OppID  Home       Pos    OppPos  \
0    1    2014      94    1374   1396     1  -3.86907  -2.18907   
1    0    2016     119    1224   1313     0   2.65129   5.57129   
2    0    2016      90    1353   1277     0   4.05371   2.13371   
3    1    2016     119    1300   1366     1  -8.78871  -9.30871   
4    1    2014      94    1383   1384     1  -2.30907  -4.66907   
5    1    2014      94    1190   1297     0  -9.42907  -9.62907   
6    0    2014      94    1223   1101     0   0.29093   0.69093   
7    0    2014      94    1218   1364     1  -1.98907  -3.58907   
8    0    2014      92    1159   1221     1 -13.28500 -12.96500   
9    0    2014      81    1297   1252     1   1.52248   1.76248   

   PtsAgstPer100  PtsForPer100  
0      -25.32271      13.22592  
1        7.78985     -12.71682  
2       29.99420     -19.84044  
3      -31.11146      -3.56274  
4      -22.13333      14.99904  
5      -18.41211       5.59099  
6       22.95005      14.80074  
7       12.70182      -7.11914  
8       22.07960      20.97894  
9        3.17351       2.09897  

The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [24]:
### test_cell_ex4
from tester_fw.testers import Tester_ex4
tester = Tester_ex4()
for _ in range(20):
    try:
        tester.run_test(get_center_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 5 - (3 Points):

Now that we have centered, pace-adjusted stats for each Season, DayNum, and TeamID, it's time to compute running averages to use as inputs to our model. Since we are using these stats as inputs to a predictive model, we need to introduce a lag of 1 game so that the results of the game we are trying to predict are not part of these inputs.

Complete the function get_team_stats(conn) to query the PaceAdjustedCentered (structured the same as the output of the previous exercise) table with the db connection conn and return a Pandas DataFrame with the following columns:

| Column Name   | dtype   | description | Special Considerations |
|---------------|---------|-------------|-------------------------|
| Won           | int64   | Unchanged from PaceAdjustedCentered | |
| Season        | int64   | Current Season | Unchanged from PaceAdjustedCentered |
| DayNum        | int64   | Current DayNum | Unchanged from PaceAdjustedCentered |
| TeamID        | int64   | Current TeamID | Unchanged from PaceAdjustedCentered |
| OppID         | int64   | Unchanged from PaceAdjustedCentered | |
| Home          | int64   | Unchanged from PaceAdjustedCentered | |
| Pos           | float64 | Running average of PaceAdjustedCentered.Pos over all rows with the current Season and TeamID and PaceAdjustedCentered.DayNum < the current DayNum | Round to 5 decimal places |
| OppPos        | float64 | Running average of PaceAdjustedCentered.OppPos over all rows with the current Season and TeamID and PaceAdjustedCentered.DayNum < the current DayNum | Round to 5 decimal places |
| PtsAgstPer100 | float64 | Running average of PaceAdjustedCentered.PtsAgstPer100 over all rows with the current Season and TeamID and PaceAdjustedCentered.DayNum < the current DayNum | Round to 5 decimal places |
| PtsForPer100  | float64 | Running average of PaceAdjustedCentered.PtsForPer100 over all rows with the current Season and TeamID and PaceAdjustedCentered.DayNum < the current DayNum | Round to 5 decimal places |
| WinPct        | float64 | Running average of PaceAdjustedCentered.Won over all rows with the current Season and TeamID and PaceAdjustedCentered.DayNum < the current DayNum | Round to 5 decimal places |

Notes:

  • The columns should be in the exact order given above, but the records can be sorted any way you like.
  • The dtypes attribute of your result must match exactly.
  • This question can be answered using either SQL or Pandas.
    • See "On Window Functions..." above Exercise 3 for more info on computing a running average.
    • For Pandas pd.DataFrame.groupby().apply() and pd.Series.expanding() may be useful.
    • For SQL the "window function" syntax may be useful.
      • AVG(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
In [25]:
### Define get_team_stats
def get_team_stats(conn):
    def window_helper(agg_col, partition_cols, sort_col):
        return f'AVG({agg_col}) OVER(PARTITION BY {partition_cols} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)'
        
    return pd.read_sql(f'''
    select
        Won, Season, DayNum, TeamID, OppID, Home,
        {window_helper('pos', 'season, teamid', 'daynum')} Pos,
        {window_helper('OppPos', 'season, teamid', 'daynum')} OppPos,
        {window_helper('PtsAgstPer100', 'season, teamid', 'daynum')} PtsAgstPer100,
        {window_helper('PtsForPer100', 'season, teamid', 'daynum')} PtsForPer100,
        {window_helper('won', 'season, teamid', 'daynum')} WinPct
    from
        PaceAdjustedCentered
    ''', conn).round(5)
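
For comparison, a hedged Pandas sketch of the same lagged running average (the function name is illustrative). Here .shift(1) plays the role of the SQL "1 PRECEDING" bound by excluding the current game:

# Sketch: expanding mean per (Season, TeamID), shifted by one game
def get_team_stats_pandas(conn):
    df = pd.read_sql('select * from PaceAdjustedCentered', conn)
    df = df.sort_values(['Season', 'TeamID', 'DayNum'])
    out = df[['Won', 'Season', 'DayNum', 'TeamID', 'OppID', 'Home']].copy()
    grouped = df.groupby(['Season', 'TeamID'])
    for col in ['Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100', 'Won']:
        target = 'WinPct' if col == 'Won' else col
        # running average over all *prior* games only
        out[target] = grouped[col].transform(
            lambda s: s.expanding().mean().shift(1)).round(5)
    return out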

The demo cell below should display the following output:

    Won  Season  DayNum  TeamID  OppID  Home      Pos    OppPos  \
0     0    2012      17    1108   1452     0      NaN       NaN   
1     0    2012      25    1108   1408     0  8.25697   8.53697   
2     0    2012      26    1108   1360     0  5.65314   5.93314   
3     1    2012     105    1108   1115     1  2.76368   2.09702   
4     0    2012      16    1355   1208     0      NaN       NaN   
5     1    2012      31    1355   1237     0 -5.81233  -5.33233   
6     0    2012      45    1355   1315     0  5.07403   4.81403   
7     1    2012     124    1355   1237     0  4.12049   4.84049   
8     0    2012      29    1431   1332     0      NaN       NaN   
9     1    2012      52    1431   1155     0 -4.58353  -5.26353   
10    0    2012      72    1431   1409     0 -5.65785  -7.11785   
11    1    2012      79    1431   1187     1 -6.81332  -8.17332   
12    1    2012     103    1431   1408     1 -6.71588  -7.96588   
13    1    2017      47    1252   1399     0      NaN       NaN   
14    1    2017      50    1252   1122     0  9.78901  10.22901   
15    1    2017      68    1252   1244     1  7.09400   8.63400   
16    0    2017      82    1252   1316     0  6.62264   8.36930   
17    1    2017      86    1252   1367     1  7.55313   8.32313   
18    0    2017     108    1252   1316     1  6.32893   6.93693   
19    0    2017     122    1252   1316     1  6.91365   7.39365   

    PtsAgstPer100  PtsForPer100   WinPct  
0             NaN           NaN      NaN  
1        26.44115     -18.48766  0.00000  
2        21.51181     -30.28407  0.00000  
3        16.34005     -20.09929  0.00000  
4             NaN           NaN      NaN  
5        14.27057      -2.29566  0.00000  
6        13.42293       6.86026  0.50000  
7        16.60255       4.68752  0.33333  
8             NaN           NaN      NaN  
9         2.30485      -6.73367  0.00000  
10       -7.95942      -3.23226  0.50000  
11       -4.29258      -8.20323  0.33333  
12       -5.03189      -2.51522  0.50000  
13            NaN           NaN      NaN  
14       -5.62638      -2.59159  1.00000  
15        0.33659      13.16336  1.00000  
16       -0.57178      10.26290  1.00000  
17        2.37803       7.85893  0.75000  
18        2.68629       9.02187  0.80000  
19        4.37327       7.30692  0.66667
In [26]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex5 = {}
demo_conn_dfs_ex5['PaceAdjustedCentered'] = pd.read_sql('select * from PaceAdjustedCenteredSample', conn).sort_values(['Season', 'TeamID', 'DayNum'])
demo_conn_ex5 = dfs_to_conn(demo_conn_dfs_ex5)
# display(demo_conn_dfs_ex5['PaceAdjustedCentered'])
In [27]:
### call demo function
print(get_team_stats(demo_conn_ex5))
    Won  Season  DayNum  TeamID  OppID  Home      Pos    OppPos  \
0     0    2012      17    1108   1452     0      NaN       NaN   
1     0    2012      25    1108   1408     0  8.25697   8.53697   
2     0    2012      26    1108   1360     0  5.65314   5.93314   
3     1    2012     105    1108   1115     1  2.76368   2.09702   
4     0    2012      16    1355   1208     0      NaN       NaN   
5     1    2012      31    1355   1237     0 -5.81233  -5.33233   
6     0    2012      45    1355   1315     0  5.07403   4.81403   
7     1    2012     124    1355   1237     0  4.12049   4.84049   
8     0    2012      29    1431   1332     0      NaN       NaN   
9     1    2012      52    1431   1155     0 -4.58353  -5.26353   
10    0    2012      72    1431   1409     0 -5.65785  -7.11785   
11    1    2012      79    1431   1187     1 -6.81332  -8.17332   
12    1    2012     103    1431   1408     1 -6.71588  -7.96588   
13    1    2017      47    1252   1399     0      NaN       NaN   
14    1    2017      50    1252   1122     0  9.78901  10.22901   
15    1    2017      68    1252   1244     1  7.09400   8.63400   
16    0    2017      82    1252   1316     0  6.62264   8.36930   
17    1    2017      86    1252   1367     1  7.55313   8.32313   
18    0    2017     108    1252   1316     1  6.32893   6.93693   
19    0    2017     122    1252   1316     1  6.91365   7.39365   

    PtsAgstPer100  PtsForPer100   WinPct  
0             NaN           NaN      NaN  
1        26.44115     -18.48766  0.00000  
2        21.51181     -30.28407  0.00000  
3        16.34005     -20.09929  0.00000  
4             NaN           NaN      NaN  
5        14.27057      -2.29566  0.00000  
6        13.42293       6.86026  0.50000  
7        16.60255       4.68752  0.33333  
8             NaN           NaN      NaN  
9         2.30485      -6.73367  0.00000  
10       -7.95942      -3.23226  0.50000  
11       -4.29258      -8.20323  0.33333  
12       -5.03189      -2.51522  0.50000  
13            NaN           NaN      NaN  
14       -5.62638      -2.59159  1.00000  
15        0.33659      13.16336  1.00000  
16       -0.57178      10.26290  1.00000  
17        2.37803       7.85893  0.75000  
18        2.68629       9.02187  0.80000  
19        4.37327       7.30692  0.66667  

The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [28]:
### test_cell_ex5
from tester_fw.testers import Tester_ex5
tester = Tester_ex5()
for _ in range(20):
    try:
        tester.run_test(get_team_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
In [29]:
returned_output_vars['df'].dtypes
Out[29]:
Won                int64
Season             int64
DayNum             int64
TeamID             int64
OppID              int64
Home               int64
Pos              float64
OppPos           float64
PtsAgstPer100    float64
PtsForPer100     float64
WinPct           float64
dtype: object
In [30]:
true_output_vars['df'].dtypes
Out[30]:
Won                int64
Season             int64
DayNum             int64
TeamID             int64
OppID              int64
Home               int64
Pos              float64
OppPos           float64
PtsAgstPer100    float64
PtsForPer100     float64
WinPct           float64
dtype: object

Exercise 6 - (3 Points):

Up to this point we have calculated our metrics for each team before they play each game. We're almost ready to start modeling! There is still one last task. Remember that there were two records in PaceAdjusted for each game. This is also true of LaggedTeamStats, which is a running average of that table. If "Team A" and "Team B" played a game in some Season on some DayNum, then there will be one record where "Team A" is the primary team and "Team B" is the opponent team, and one record where "Team B" is the primary team and "Team A" is the opponent team. We need to join these records together so that we will have one record identifying each game.

Complete the function get_matchup_stats(conn) to query the LaggedTeamStats table using the db connection conn and return a Pandas DataFrame with the following columns:

| Column Name     | dtype   | description |
|-----------------|---------|-------------|
| Won             | int64   | 1 if the primary team (_x suffix) won the matchup, 0 otherwise |
| Season          | int64   | |
| DayNum          | int64   | |
| TeamID          | int64   | Primary team - associated with _x suffixes |
| OppID           | int64   | Opponent team - associated with _y suffixes |
| Home_x          | int64   | Stats associated with primary team |
| Pos_x           | float64 | Stats associated with primary team |
| OppPos_x        | float64 | Stats associated with primary team |
| PtsAgstPer100_x | float64 | Stats associated with primary team |
| PtsForPer100_x  | float64 | Stats associated with primary team |
| WinPct_x        | float64 | Stats associated with primary team |
| Home_y          | int64   | Stats associated with opponent team |
| Pos_y           | float64 | Stats associated with opponent team |
| OppPos_y        | float64 | Stats associated with opponent team |
| PtsAgstPer100_y | float64 | Stats associated with opponent team |
| PtsForPer100_y  | float64 | Stats associated with opponent team |
| WinPct_y        | float64 | Stats associated with opponent team |

Notes:

  • The columns should be in the exact order given above, but the records can be sorted any way you like.
  • The dtypes attribute of your result must match exactly.
  • This question can be answered using either SQL or Pandas. Note that a Pandas merge supplies the required _x/_y suffixes by default.

Implementation Strategy

  • You should start by joining/merging LaggedTeamStats to itself where the OppID in the left table is the same as the TeamID in the right table. The Season and DayNum should be the same for both tables.
  • After joining/merging:
    • Remove any rows where the "left" TeamID is larger than the "left" OppID. This will avoid duplicating the data.
    • Rows occurring before DayNum 56 should not be included in the result to make sure there is a reasonable sample of games so that the stats are meaningful.
    • Drop any unnecessary columns.
    • Rename the columns as necessary to meet the requirements above.
    • Any rows containing NULL or NaN values should be dropped.
In [31]:
### Define get_matchup_stats
def get_matchup_stats(conn):
    df = pd.read_sql('select * from laggedteamstats', conn)
    # Self-merge pairs each primary-team row with the matching opponent row for
    # the same game (Pandas supplies the _x/_y suffixes). The queries then keep
    # one row per game (TeamID_x < OppID_x) and drop games before DayNum 56.
    return df.merge(df,
                   left_on=['Season', 'DayNum', 'OppID'],
                   right_on=['Season', 'DayNum', 'TeamID'])\
                .query('TeamID_x < OppID_x')\
                .query('DayNum >= 56')\
                .drop(columns=['Won_y', 'TeamID_y', 'OppID_y'])\
                .rename(columns={'Won_x':'Won', 'TeamID_x':'TeamID', 'OppID_x':'OppID'})\
                .dropna()

The demo cell below should display the following output:

   Won  Season  DayNum  TeamID  OppID  Home_x    Pos_x  OppPos_x  \
2    1    2019      78    1210   1323       1 -0.90244  -1.03355   
4    1    2011      72    1253   1413       1  2.31956   1.98356   

   PtsAgstPer100_x  PtsForPer100_x  WinPct_x  Home_y    Pos_y  OppPos_y  \
2        -11.59260        -3.70151   0.55556       0 -2.65178  -2.09178   
4          4.31931        -0.18928   0.46667       0 -2.84824  -2.20252   

   PtsAgstPer100_y  PtsForPer100_y  WinPct_y  
2         -2.28041         6.55518   0.61111  
4          3.37227        -0.80974   0.28571
In [32]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex6 = {}
demo_conn_dfs_ex6['LaggedTeamStats'] = pd.read_sql('select * from LaggedTeamStatsSample', conn)
demo_conn_ex6 = dfs_to_conn(demo_conn_dfs_ex6)
# display(pd.read_sql('select * from LaggedTeamStats', demo_conn_ex6))
In [33]:
### call demo function
print(get_matchup_stats(demo_conn_ex6))
   Won  Season  DayNum  TeamID  OppID  Home_x    Pos_x  OppPos_x  \
2    1    2019      78    1210   1323       1 -0.90244  -1.03355   
4    1    2011      72    1253   1413       1  2.31956   1.98356   

   PtsAgstPer100_x  PtsForPer100_x  WinPct_x  Home_y    Pos_y  OppPos_y  \
2        -11.59260        -3.70151   0.55556       0 -2.65178  -2.09178   
4          4.31931        -0.18928   0.46667       0 -2.84824  -2.20252   

   PtsAgstPer100_y  PtsForPer100_y  WinPct_y  
2         -2.28041         6.55518   0.61111  
4          3.37227        -0.80974   0.28571  

The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [34]:
### test_cell_ex6
from tester_fw.testers import Tester_ex6
tester = Tester_ex6()
for _ in range(20):
    try:
        tester.run_test(get_matchup_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

Aside - Random Forest classification model

Feel free to skip to the next exercise.

Logistic regression relies on a few assumptions about the predictive variables which are not met here (namely normality and each variable individually having a significant effect on the response variable). Instead, we will use off-the-shelf Scikit-Learn modules to implement the random forest modeling technique. We're building the model for you, but you will have to do some analysis of the predictions it makes.

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
import pickle
import os

### Example code showing how we got X and Y from the output of `get_matchup_stats` 

# stats_df = get_matchup_stats(conn)
# # Horizontally partition our stats into predictive (we hope!) variables and the response variable
# X = stats_df.drop(columns=['Won', 'Season', 'DayNum', 'TeamID', 'OppID']) # Remove the response and identification data
# y = stats_df['Won'] # response variable. There are two classes, 0 and 1, which correspond to a loss and a win for the primary team.

# load pre-computed X and y
with open('resource/asnlib/publicdata/X.pkl', 'rb') as f:
    X = pickle.load(f)
with open('resource/asnlib/publicdata/y.pkl', 'rb') as f:
    y = pickle.load(f)

# Vertically partition data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, random_state=6040)

# Check to see if we already saved the model
if not os.path.exists('resource/asnlib/publicdata/clf.pkl'): # if not - train and save the model
    clf = GridSearchCV(RandomForestClassifier(),
        scoring='roc_auc',
        param_grid={
            'max_depth': [2,4,6,8,9,10]
        },
        n_jobs=-1,
        verbose=2).fit(X_train,y_train)
    with open('resource/asnlib/publicdata/clf.pkl', 'wb') as f:
        pickle.dump(clf, f)
# open the saved model from a pickle file
with open('resource/asnlib/publicdata/clf.pkl', 'rb') as f:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        clf = pickle.load(f)

# this is how you predict the probability that the response variable belongs to each class
# probs[:, 0] is the estimated probabilities of losing and probs[:, 1] is the estimated probabilities of winning.
probs = clf.predict_proba(X_test) 

# this is how you predict which class each record belongs to
preds = clf.predict(X_test)

# printing some metrics about how well the model fits the test set
print(f'''
Log-Loss:\t\t{round(log_loss(y_test, probs),4)} - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve:\t{round(roc_auc_score(y_test, probs[:, 1]),4)} - 1 is perfect and 0.5 is considered "non-informative".
''')
Log-Loss:		0.5674 - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve:	0.7775 - 1 is perfect and 0.5 is considered "non-informative".

Exercise 7 - (1 Point):

Without advanced statistical training, these fit metrics would probably not make sense. We see that they are both somewhere in between "perfect" and "non-informative", but what is that telling us? One understandable metric we can calculate is the prediction accuracy.

Complete the function pred_accuracy(obs, preds) to calculate the relative frequency (# matches ÷ # observations) with which the predicted value preds is the same as the observed value obs. You can assume that both will be array-like, 1-dimensional, and the same length, and that they only contain the integers 0 and 1. Round your result to 5 decimal places.

Note: There will be no type check. If your number matches our number you will pass.

In [36]:
### Define pred_accuracy
def pred_accuracy(obs, preds):
    # obs == preds gives a boolean array; its mean is the fraction of matches
    return round(np.mean(obs == preds), 5)

The demo cell below should display the following output:

0.53333
In [37]:
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex7 = rng.integers(0, 2, 15)
demo_preds_ex7 = rng.integers(0, 2, 15)
# print('obs:  ', demo_obs_ex7)
# print('preds:', demo_preds_ex7)
In [38]:
### call demo function
pred_accuracy(demo_obs_ex7, demo_preds_ex7)
Out[38]:
0.53333

The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [39]:
### test_cell_ex7
from tester_fw.testers import Tester_ex7
tester = Tester_ex7()
for _ in range(20):
    try:
        tester.run_test(pred_accuracy)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 8 - (2 Points):

Another way to evaluate our model is to check the accuracy of the probabilities it generates. Suppose you have a list of the observations obs (0's and 1's indicating whether the primary team won (1) or not (0)) and a list of the predicted probabilities probs that the primary team won for each observation. We can bucket the predictions based on the probabilities and then calculate the accuracy of our model within each bucket by taking the mean of the observations. If the observed win % for each bucket is between or close to its bounds, we have a decent model.

Complete the function bucket_evaluation(obs, prob, n_buckets) to partition the interval [0,1) into n_buckets equal intervals and compute the observed relative win percentage for the events in that bucket. You should return your result as a Pandas DataFrame with the following columns:

| Column Name | dtype   | description | Special Considerations |
|-------------|---------|-------------|-------------------------|
| lower       | float64 | lower bound of bucket | round to 2 decimal places |
| upper       | float64 | upper bound of bucket | round to 2 decimal places |
| obs_win_pct | float64 | win % for observations where lower ≤ estimated win probability < upper | round to 2 decimal places (as a percent) |
| n           | int64   | number of observations where lower ≤ estimated win probability < upper | |

Notes:

  • The columns should be in the exact order given above, and the records should be sorted with the 'lower' column in ascending order and a reset index.
  • The dtypes attribute of your result must match exactly.
  • The bounds should be rounded before deciding which bucket each observation belongs in.
  • The "obs_win_pct" column should have NaN entries for any "empty" buckets which have no observations.
    • Your solution raising a warning will not cause it to fail the test cell.

Implementation Strategy

  • np.linspace() can be helpful for calculating the bounds for each bucket. Don't forget to round!
  • The function zip(bounds[:-1], bounds[1:]) can be iterated over to sequentially examine each pair of bounds (bounds would be the output of linspace after rounding).
    • Use each pair of bounds to filter the obs to only include observations where the prob is between the lower and upper bounds. You can calculate the observed relative frequency and count from the filtered results.
    • Keep track of the endpoints, relative frequency, and count for each interval.
  • After evaluating for all intervals, construct a DataFrame using the stored results for each interval.
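
Alternatively, here is a hedged sketch that assigns buckets in one pass with pd.cut (names are illustrative; it assumes obs and prob are 1-D NumPy arrays):

# Sketch: label each probability with its [lower, upper) bucket, then aggregate
def bucket_evaluation_cut(obs, prob, n_buckets):
    bounds = np.round(np.linspace(0, 1, n_buckets + 1), 2)
    labels = pd.cut(prob, bins=bounds, right=False)  # [lower, upper) intervals
    g = pd.Series(obs).groupby(labels, observed=False)  # keep empty buckets
    return pd.DataFrame({
        'lower': bounds[:-1],
        'upper': bounds[1:],
        'obs_win_pct': (100 * g.mean()).round(2).to_numpy(),  # NaN for empty buckets
        'n': g.size().to_numpy(),
    })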
In [49]:
### Define bucket_evaluation
def bucket_evaluation(obs, prob, n_buckets):
    # bucket bounds, rounded *before* deciding bucket membership
    bounds = np.round(np.linspace(0, 1, n_buckets+1), 2)
    lod = []  # one dict per bucket
    for lower, upper in zip(bounds[:-1], bounds[1:]):
        # indices of observations whose probability lands in [lower, upper)
        inds = np.argwhere((prob >= lower) & (prob < upper)).reshape((-1,))
        n = inds.shape[0]
        win_pct = 100*np.mean(obs[inds])  # NaN (with a warning) for empty buckets
        lod.append({'lower': lower, 'upper': upper, 'obs_win_pct': win_pct, 'n': n})
    return pd.DataFrame(lod).apply(lambda f: round(f, 2))

The demo cell below should display the following output:

   lower  upper  obs_win_pct   n
0   0.00   0.25        25.00   4
1   0.25   0.50        23.08  13
2   0.50   0.75        66.67   3
3   0.75   1.00        40.00   5

Unlike previous exercises, the order and index must match (in addition to the usual requirements) to produce an easily understood result.

In [50]:
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex8 = rng.integers(0, 2, 25)
demo_prob_ex8 = rng.random(25)
demo_n_buckets_ex8 = 4
In [51]:
### call demo function
print(bucket_evaluation(demo_obs_ex8, demo_prob_ex8, demo_n_buckets_ex8))
   lower  upper  obs_win_pct   n
0   0.00   0.25        25.00   4
1   0.25   0.50        23.08  13
2   0.50   0.75        66.67   3
3   0.75   1.00        40.00   5

The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [43]:
### test_cell_ex8
from tester_fw.testers import Tester_ex8
tester = Tester_ex8()
for _ in range(20):
    try:
        tester.run_test(bucket_evaluation)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
In [44]:
from pandas.testing import assert_frame_equal
assert_frame_equal(returned_output_vars["df"], true_output_vars["df"])
In [138]:
returned_output_vars["df"]
Out[138]:
    lower  upper  obs_win_pct  n
0    0.00   0.08        33.33  3
1    0.08   0.17        75.00  4
2    0.17   0.25        55.56  9
3    0.25   0.33       100.00  2
4    0.33   0.42        37.50  8
5    0.42   0.50        50.00  2
6    0.50   0.58        66.67  3
7    0.58   0.67         0.00  3
8    0.67   0.75       100.00  2
9    0.75   0.83        25.00  4
10   0.83   0.92        57.14  7
11   0.92   1.00        33.33  3
In [139]:
true_output_vars["df"]
Out[139]:
    lower  upper  obs_win_pct  n
0    0.00   0.08         0.00  2
1    0.08   0.17        66.67  6
2    0.17   0.25        62.50  8
3    0.25   0.33       100.00  2
4    0.33   0.42        37.50  8
5    0.42   0.50        50.00  2
6    0.50   0.58        66.67  3
7    0.58   0.67         0.00  3
8    0.67   0.75       100.00  2
9    0.75   0.83        25.00  4
10   0.83   0.92        62.50  8
11   0.92   1.00         0.00  2

Fin. If you have made it this far, congratulations on completing the semester. Don't forget to submit!