Version 1.0
All of the header information is important. Please read it..
Topics, number of exercises: This problem builds on your knowledge of Pandas, SQL, and numeric computation. It has 9 exercises, numbered 0 to 8. There are 19 available points. However, to earn 100% the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs
load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
Debugging you code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (careful when printing large objects, you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Final reminders:
In this notebook we want to predict which team will win a basketball game based on their past performance in a given season. You do not have to know anything about basketball aside from the background below to complete this notebook.
import sqlite3
import numpy as np
import pandas as pd
import re
conn = sqlite3.connect('resource/asnlib/publicdata/basketball_db.sqlite')
We have a big collection of real data from several seasons of men's NCAA basketball. We are most interested in one table of this database, named MRegularSeasonDetailedResults
.
Complete the function get_cols(conn)
to return a list
of the columns in the MRegularSeasonDetailedResults
table in the db connection conn
. The order of the columns in the list should be the same as the order they appear in the database table.
### Define get_cols
def get_cols(conn):
###
### YOUR CODE HERE
###
return pd.read_sql('''select * from MRegularSeasonDetailedResults limit 1''', conn)\
.columns\
.tolist()
The demo cell below should display the following output:
['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']
### call demo funtion
print(get_cols(conn))
The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex0
from tester_fw.testers import Tester_ex0
tester = Tester_ex0()
for _ in range(20):
try:
tester.run_test(get_cols)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
In the MRegularSeasonDetailedResults
table, each record is one game. All columns prefixed by 'W' refer to the winning team, and columns prefixed by 'L' refer to the losing team. We are primarily interested in the following columns:
Column | Type | Has Prefix | Description |
---|---|---|---|
Season | Integer | No | Identifies the calendar year when a season ends. |
DayNum | Integer | No | Identifies the day in a season when a game occurred. |
TeamID | Integer | W or L | Identifies the teams participating in a game. |
Score | Integer | W or L | Number of points a team scored in a game. |
Loc | Char | W Only | Identifies whether the winning team was home ('H'), away ('A'), or the game was played at a neutral site ('N'). |
FGA | Integer | W or L | Number of field goal attempts occurring when a team had in a game |
FTA | Integer | W or L | Number of free throw attempts a team had in a game. |
TO | Integer | W or L | Number of turnovers a team had in a game. |
OR | Integer | W or L | Number of offensive rebounds a team had in a game. |
For example, the column with the winning team's ID will be WTeamID, and the column with the losing team's free-throw attempts will be LFTA.
We want to extract some information about each team involved in a game from that team's perspective. We will use the term "primary team" to refer to this team and "primary team's opponent" to refer to the other team in the game. Do not infer any other meaning from this term. Since there are two participants in each game there will be a record from the perspective of both teams after the extraction. (i.e. one record where the winning team is the primary team and one record where the losing team is the primary team will be in the result.
Complete the function get_pace_adjusted(conn)
to query the table MRegularSeasonDetailedResults
using the db connection conn
and return a Pandas DataFrame with the columns outlined below. The "Winning Team" and "Losing Team" headers in the chart below indicate which columns in MRegularSeasonDetailedResults
to use in your computations.
Column Name | dtype | description | Special Considerations | Winning Team | Losing Team |
---|---|---|---|---|---|
Won | int64 | 1 if the primary team won the game, 0 if the primary team lost | 1 | 0 | |
Season | int64 | Current Season |
|||
DayNum | int64 | Current DayNum |
|||
TeamID | int64 | Team id for primary team | WTeamID | LTeamID | |
OppID | int64 | Team id for primary team's opponent | LTeamID | WTeamID | |
Home | int64 | 1 if the primary team is at home for the game, 0 otherwise. | WLoc is 'H' |
WLoc is 'A' |
|
Pos | float64 | Estimated number of possessions for primary team | Round to 5 decimal places | W prefix |
L prefix |
OppPos | float64 | Estimated number of possessions for primary team's opponent | Round to 5 decimal places | L prefix |
W prefix |
PtsForPer100 | float64 | 100 primary team's score estimate of primary team's possessions | Round to 5 decimal places (only round after division) | W prefix |
L prefix |
PtsAgstPer100 | float64 | 100 primary team's opponent's score estimate of primary team's opponent's possessions | Round to 5 decimal places (only round after division) | L prefix |
W prefix |
There is no column for possessions in our source data. We will need to estimate it! The formula below is widely used in sports analytics to estimate the number of possessions a team had in a basketball game. ( is # of possessions):
= + 0.44() + -
You can derive all required results from these columns: Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO
.
Notes:
MRegularSeasonDetailedResults
will correspond to two records in the result. One record where the winning team is the primary team and one record where the losing team is the primary team.WLoc
column will have an 'N'
entry.dtypes
attribute of your result must match exactly.UNION ALL
phrase may be helpful if you are using SQL.pd.concat()
method may be helpful if you are using Pandas.### Define get_pace_adjusted
def get_pace_adjusted(conn):
###
### YOUR CODE HERE
###
return pd.read_sql('''
-- winning team
select
1 as Won,
season, DayNum,
wteamid as TeamID,
lteamid as OppID,
wloc='H' as Home,
wFGA + 0.44*(wFTA) + wTO - wOR as Pos,
lFGA + 0.44*(lFTA) + lTO - lOR as OppPos,
100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsForPer100,
100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsAgstPer100
from
mregularseasondetailedresults
union all
-- losing team
select
0 as Won,
season, DayNum,
lteamid as TeamID,
wteamid as OppID,
wloc='A' as Home,
lFGA + 0.44*(lFTA) + lTO - lOR as Pos,
wFGA + 0.44*(wFTA) + wTO - wOR as OppPos,
100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsForPer100,
100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsAgstPer100
from
mregularseasondetailedresults
''', conn).apply(lambda f: round(f, 5))
This is the demo input:
Season DayNum WTeamID WScore LTeamID LScore WLoc WFGA WFTA WOR WTO LFGA LFTA LOR LTO
0 2003 10 1104 68 1328 62 N 58 18 14 23 53 22 10 18
The demo cell below should display the following output:
Won Season DayNum TeamID OppID Home Pos OppPos PtsForPer100 PtsAgstPer100
0 1 2003 10 1104 1328 0 74.92 70.68 90.76348 87.71930
1 0 2003 10 1328 1104 0 70.68 74.92 87.71930 90.76348
Note that there are two rows, but they come from a single input row.
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_1 = pd.read_sql('select Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO from MRegularSeasonDetailedResults limit 1', conn)
demo_conn_1 = dfs_to_conn({'MRegularSeasonDetailedResults': demo_df_1})
# demo_df_1
### call demo funtion
print(get_pace_adjusted(demo_conn_1))
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex1
from tester_fw.testers import Tester_ex1
tester = Tester_ex1()
for _ in range(20):
try:
tester.run_test(get_pace_adjusted)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Our next task is to compute daily totals of Pos
, OppPos
, PtsForPer100
, and PtsAgstPer100
as well as a count of the number of teams participating in games that day.
Complete the function daily_totals(conn)
to query the table PaceAdjusted
(structured the same as the output of the previous exercise) in the db connection conn
and return a Pandas DataFrame with the following columns:
Column Name | dtype | description |
---|---|---|
Season | int64 | |
DayNum | int64 | |
Pos | float64 | Sum of values in Pos column of PaceAdjusted occuring on a unique combination of Season and DayNum |
Pph | float64 | Sum of values in PtsForPer100 column of PaceAdjusted occuring on a unique combination of Season and DayNum |
Count | int64 | Count of rows in PaceAdjusted with a unique combination of Season and DayNum |
Notes:
dtypes
attribute of your result must match exactly.### Define daily_totals
def daily_totals(conn):
###
### YOUR CODE HERE
###
return pd.read_sql('''
select
Season, DayNum,
sum(pos) as Pos,
sum(ptsforper100) as Pph,
count(*) as Count
from
paceadjusted
group by
season, daynum
''', conn)
The demo cell below should display the following output:
Season DayNum Pos Pph Count
0 2010 94 277.28 433.33581 4
1 2010 108 281.64 415.62023 4
2 2010 122 265.88 395.26418 4
3 2016 49 365.04 467.93858 5
4 2016 115 204.08 309.66380 3
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_2 = pd.read_sql('select * from PaceAdjustedSample', conn)
demo_conn_2 = dfs_to_conn({'PaceAdjusted': demo_df_2})
# demo_df_2.sort_values(['Season', 'DayNum']).reset_index(drop=True)
### call demo funtion
print(daily_totals(demo_conn_2))
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex2
from tester_fw.testers import Tester_ex2
tester = Tester_ex2()
for _ in range(20):
try:
tester.run_test(daily_totals)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
This is useful information
Window functions define a slice of data relative to each row in a dataset and then perform an aggregate or order-based calculation on the slice. For example, say you want to perform the following calculation for each row in DailyTotals
:
SumPos
Sum of the Pos
column for all rows where Season
= Season
and DayNum
DayNum
Window functions can do this. Below are examples of this computation in Pandas and SQL. First, we will load some data to demonstrate.
# define demo data
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
daily_totals_df = demo_conn_dfs_ex3['DailyTotals']
The code below will return the sample data with a new column "SumPos".
### Pandas Example
def pandas_window_helper(group):
group = group.sort_values('DayNum')
group['SumPos'] = group['Pos'].expanding(1).sum()
return group
pandas_ex = daily_totals_df.groupby('Season', as_index=False).apply(pandas_window_helper)
The code below will return the sample data with a new column "SumPos".
### SQL Example
query = '''
SELECT
*,
SUM(pos) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as SumPos
FROM
DailyTotals
ORDER BY
Season, DayNum
'''
sql_ex = pd.read_sql(query, demo_conn_ex3)
The code below will print the result and verify that the two approaches give equivalent results.
display(sql_ex)
print(f'''SQL and Pandas results match? {
((sql_ex.reset_index(drop=True) - pandas_ex.reset_index(drop=True)).abs() <= 0.00001).all().all()}''')
Our next task is to compute the weighted running average for Pos
and Pph
for all games played in a Season
up to and including each DayNum
. We want a snapshot of these averages as if it were that particular day and we had no knowledge of events occurring after that day.
Complete the function get_running_avg_pace(conn)
to query the table DailyTotals
(structured the same as the output of the previous exercise) using the db connection conn
and return a Pandas DataFrame with the columns mentioned below. You should calculate the intermediate values then use division to get the end result.
Intermediate values
SumPos
- Running sum of the "Pos" column for current "Season" up to and including the current "DayNum".SumPph
- Running sum of the "Pph" column for current "Season" up to and including the current "DayNum".SumCount
- Running sum of the "Count" column for current "Season" up to and including the current "DayNum".Column Name | dtype | description | Special Considerations |
---|---|---|---|
Season | int64 | Current Season |
|
DayNum | int64 | Current DayNum |
|
AvgPace | float64 | SumPos SumCount |
Round to 5 decimal places |
AvgPPH | float64 | SumPph SumCount |
Round to 5 decimal places |
Notes:
dtypes
attribute of your result must match exactly.pd.DataFrame.groupby().apply()
and pd.Series.expanding()
may be useful in calculating intermediate values.SUM(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
### Define def get_running_avg_pace
def get_running_avg_pace(conn):
def window_helper(agg_col, partition_col, sort_col):
return f'SUM({agg_col}) OVER(PARTITION BY {partition_col} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)'
# return window_helper('Pos', 'Season', 'DayNum')
intermediate_result = pd.read_sql(f'''
select
Season, DayNum,
{window_helper('Pos', 'Season', 'DayNum')} SumPos,
{window_helper('Pph', 'Season', 'DayNum')} SumPph,
{window_helper('Count', 'Season', 'DayNum')} SumCount
from
DailyTotals
''', conn)
df = intermediate_result
df['AvgPace'] = df['SumPos'] / df['SumCount']
df['AvgPPH'] = df['SumPph'] / df['SumCount']
return df.drop(columns=['SumPos', 'SumPph', 'SumCount'])
The demo cell below should display the following output:
Season DayNum AvgPace AvgPPH
0 2014 12 68.17063 105.33153
1 2014 15 68.85556 105.46984
2 2014 18 68.43963 106.62453
3 2014 41 68.27426 105.76722
4 2014 44 67.81551 105.66767
5 2014 59 67.80480 105.23519
6 2014 102 67.87677 105.16634
7 2014 127 67.77059 105.08984
8 2021 29 71.56138 96.56166
9 2021 96 69.98540 101.14361
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
# demo_conn_dfs_ex3['DailyTotals'].sort_values(['Season', 'DayNum']).reset_index(drop=True)
### call demo funtion
print(get_running_avg_pace(demo_conn_ex3))
The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex3
from tester_fw.testers import Tester_ex3
tester = Tester_ex3()
for _ in range(20):
try:
tester.run_test(get_running_avg_pace)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Suppose you are given two SQL tables:
PaceAdjusted
, structured the same as the output of Exercise 1.RunningAvgPace
, structured the same as the output of Exercise 3.Complete the function get_center_pace_adjusted(conn)
to query tables RunningAvgPace
and PaceAdjusted
and return a Pandas DataFrame with the following columns:
Column Name | dtype | description |
---|---|---|
Won | int64 | Unchanged from PaceAdjusted |
Season | int64 | Unchanged from PaceAdjusted |
DayNum | int64 | Unchanged from PaceAdjusted |
TeamID | int64 | Unchanged from PaceAdjusted |
OppID | int64 | Unchanged from PaceAdjusted |
Home | int64 | Unchanged from PaceAdjusted |
Pos | float64 | PaceAdjusted.Pos - RunningAvgPace.AvgPace for corresponding Season and DayNum |
OppPos | float64 | PaceAdjusted.OppPos - RunningAvgPace.AvgPace for corresponding Season and DayNum |
PtsAgstPer100 | float64 | PaceAdjusted.PtsAgstPer100 - RunningAvgPace.AvgPPH for corresponding Season and DayNum |
PtsForPer100 | float64 | PaceAdjusted.PtsForPer100 - RunningAvgPace.AvgPPH for corresponding Season and DayNum |
Notes:
dtypes
attribute of your result must match exactly.### Define get_center_pace_adjusted
def get_center_pace_adjusted(conn):
###
### YOUR CODE HERE
###
return pd.read_sql('''
select
pa.Won, pa.Season, pa.DayNum, pa.TeamID, pa.OppID, pa.Home,
pa.pos - rap.avgpace as Pos,
pa.opppos - rap.avgpace as OppPos,
pa.ptsagstper100 - rap.avgpph as PtsAgstPer100,
pa.ptsforper100 - rap.avgpph PtsForPer100
from
PaceAdjusted pa
join
RunningAvgPace rap
on
pa.season = rap.season and pa.daynum = rap.daynum
''', conn)
The demo cell below should display the following output:
Won Season DayNum TeamID OppID Home Pos OppPos \
0 1 2014 94 1374 1396 1 -3.86907 -2.18907
1 0 2016 119 1224 1313 0 2.65129 5.57129
2 0 2016 90 1353 1277 0 4.05371 2.13371
3 1 2016 119 1300 1366 1 -8.78871 -9.30871
4 1 2014 94 1383 1384 1 -2.30907 -4.66907
5 1 2014 94 1190 1297 0 -9.42907 -9.62907
6 0 2014 94 1223 1101 0 0.29093 0.69093
7 0 2014 94 1218 1364 1 -1.98907 -3.58907
8 0 2014 92 1159 1221 1 -13.28500 -12.96500
9 0 2014 81 1297 1252 1 1.52248 1.76248
PtsAgstPer100 PtsForPer100
0 -25.32271 13.22592
1 7.78985 -12.71682
2 29.99420 -19.84044
3 -31.11146 -3.56274
4 -22.13333 14.99904
5 -18.41211 5.59099
6 22.95005 14.80074
7 12.70182 -7.11914
8 22.07960 20.97894
9 3.17351 2.09897
### define demo inputs
### use naming convention demo_varname_ex_* to name demo variables
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex4 = {}
demo_conn_dfs_ex4['PaceAdjusted'] = pd.read_sql('select * from PaceAdjustedSampleTwo', conn)
demo_conn_dfs_ex4['RunningAvgPace'] = pd.read_sql('select * from RunningAvgPaceSample', conn)
demo_conn_ex4 = dfs_to_conn(demo_conn_dfs_ex4)
# print('PaceAdjusted')
# display(demo_conn_dfs_ex4['PaceAdjusted'])
# print('RunningAvgPace')
# display(demo_conn_dfs_ex4['RunningAvgPace'])
### call demo funtion
print(get_center_pace_adjusted(demo_conn_ex4))
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex4
from tester_fw.testers import Tester_ex4
tester = Tester_ex4()
for _ in range(20):
try:
tester.run_test(get_center_pace_adjusted)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Now we have centered and pace adjusted stats for each Season, DayNum, and TeamID, it's time to compute running averages to set as inputs to our model. Since we are using these stats as inputs to a predictive model, we need to introduce a lag of 1 game so that the results of the game we are trying to predict are not part of these inputs.
Complete the function get_team_stats(conn)
to query the PaceAdjustedCentered
(structured the same as the output of the previous exercise) table with the db connection conn
and return a Pandas DataFrame with the following columns:
Column Name | dtype | description | Special Considerations |
---|---|---|---|
Won | int64 | Unchanged from PaceAdjustedCentered |
|
Season | int64 | Current Season | Unchanged from PaceAdjustedCentered |
DayNum | int64 | Current Day | Unchanged from PaceAdjustedCentered |
TeamID | int64 | Current TeamID | Unchanged from PaceAdjustedCentered |
OppID | int64 | Unchanged from PaceAdjustedCentered |
|
Home | int64 | Unchanged from PaceAdjustedCentered |
|
Pos | float64 | Running Average of PaceAdjustedCentered.Pos including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum |
Round to 5 decimal places |
OppPos | float64 | Running Average of PaceAdjustedCentered.OppPos including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum |
Round to 5 decimal places |
PtsAgstPer100 | float64 | Running Average of PaceAdjustedCentered.PtsAgstPer100 including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum |
Round to 5 decimal places |
PtsForPer100 | float64 | Running Average of PaceAdjustedCentered.PtsForPer100 including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum |
Round to 5 decimal places |
WinPct | float64 | Running Average of PaceAdjustedCentered.Won including all rows with the current Season and TeamID with DayNum < PaceAdjustedCentered.DayNum |
Round to 5 decimal places |
Notes:
dtypes
attribute of your result must match exactly.pd.DataFrame.groupby().apply()
and pd.Series.expanding()
may be useful.AVG(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
### Define get_team_stats
def get_team_stats(conn):
def window_helper(agg_col, partition_cols, sort_col):
return f'AVG({agg_col}) OVER(PARTITION BY {partition_cols} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND 1 preceding)'
return pd.read_sql(f'''
select
Won, Season, DayNum, TeamID, OppID, Home,
{window_helper('pos', 'season, teamid', 'daynum')} Pos,
{window_helper('OppPos', 'season, teamid', 'daynum')} OppPos,
{window_helper('PtsAgstPer100', 'season, teamid', 'daynum')} PtsAgstPer100,
{window_helper('PtsForPer100', 'season, teamid', 'daynum')} PtsForPer100,
{window_helper('won', 'season, teamid', 'daynum')} WinPct
from
PaceAdjustedCentered
''', conn)
The demo cell below should display the following output:
Won Season DayNum TeamID OppID Home Pos OppPos \
0 0 2012 17 1108 1452 0 NaN NaN
1 0 2012 25 1108 1408 0 8.25697 8.53697
2 0 2012 26 1108 1360 0 5.65314 5.93314
3 1 2012 105 1108 1115 1 2.76368 2.09702
4 0 2012 16 1355 1208 0 NaN NaN
5 1 2012 31 1355 1237 0 -5.81233 -5.33233
6 0 2012 45 1355 1315 0 5.07403 4.81403
7 1 2012 124 1355 1237 0 4.12049 4.84049
8 0 2012 29 1431 1332 0 NaN NaN
9 1 2012 52 1431 1155 0 -4.58353 -5.26353
10 0 2012 72 1431 1409 0 -5.65785 -7.11785
11 1 2012 79 1431 1187 1 -6.81332 -8.17332
12 1 2012 103 1431 1408 1 -6.71588 -7.96588
13 1 2017 47 1252 1399 0 NaN NaN
14 1 2017 50 1252 1122 0 9.78901 10.22901
15 1 2017 68 1252 1244 1 7.09400 8.63400
16 0 2017 82 1252 1316 0 6.62264 8.36930
17 1 2017 86 1252 1367 1 7.55313 8.32313
18 0 2017 108 1252 1316 1 6.32893 6.93693
19 0 2017 122 1252 1316 1 6.91365 7.39365
PtsAgstPer100 PtsForPer100 WinPct
0 NaN NaN NaN
1 26.44115 -18.48766 0.00000
2 21.51181 -30.28407 0.00000
3 16.34005 -20.09929 0.00000
4 NaN NaN NaN
5 14.27057 -2.29566 0.00000
6 13.42293 6.86026 0.50000
7 16.60255 4.68752 0.33333
8 NaN NaN NaN
9 2.30485 -6.73367 0.00000
10 -7.95942 -3.23226 0.50000
11 -4.29258 -8.20323 0.33333
12 -5.03189 -2.51522 0.50000
13 NaN NaN NaN
14 -5.62638 -2.59159 1.00000
15 0.33659 13.16336 1.00000
16 -0.57178 10.26290 1.00000
17 2.37803 7.85893 0.75000
18 2.68629 9.02187 0.80000
19 4.37327 7.30692 0.66667
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex5 = {}
demo_conn_dfs_ex5['PaceAdjustedCentered'] = pd.read_sql('select * from PaceAdjustedCenteredSample', conn).sort_values(['Season', 'TeamID', 'DayNum'])
demo_conn_ex5 = dfs_to_conn(demo_conn_dfs_ex5)
# display(demo_conn_dfs_ex5['PaceAdjustedCentered'])
### call demo funtion
print(get_team_stats(demo_conn_ex5))
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex5
from tester_fw.testers import Tester_ex5
tester = Tester_ex5()
for _ in range(20):
try:
tester.run_test(get_team_stats)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
returned_output_vars['df'].dtypes
true_output_vars['df'].dtypes
Up to this point we have calculated our metrics for each team before they play each game. We're almost ready to start modeling! There is still one last task. Remember that there were two records in PaceAdjusted
for each game. This is also true of LaggedTeamStats
, which is a running average of that table. If "Team A" and "Team B" played a game on in some Season
on some DayNum
then there will be one record where "Team A" is the primary team and "Team B" is the opponent team and one record where "Team B" is the pirmary team and "Team A" is the opponent team. We need to join these records together so that we will have one record identifying each game.
Complete the function get_matchup_stats(conn)
to query the LaggedTeamStats
table using the db connection conn
and return a Pandas DataFrame with the following columns:
Column Name | dtype | description |
---|---|---|
Won | int64 | 1 if primary team (_x suffix) won the matchup 0 otherwise |
Season | int64 | |
DayNum | int64 | |
TeamID | int64 | Primary team - associated with _x suffixes |
OppID | int64 | Opponent team - associated with _y suffixes |
Home_x | int64 | Stats associated with primary team |
Pos_x | float64 | Stats associated with primary team |
OppPos_x | float64 | Stats associated with primary team |
PtsAgstPer100_x | float64 | Stats associated with primary team |
PtsForPer100_x | float64 | Stats associated with primary team |
WinPct_x | float64 | Stats associated with primary team |
Home_y | int64 | Stats associated with opponent team |
Pos_y | float64 | Stats associated with opponent team |
OppPos_y | float64 | Stats associated with opponent team |
PtsAgstPer100_y | float64 | Stats associated with opponent team |
PtsForPer100_y | float64 | Stats associated with opponent team |
WinPct_y | float64 | Stats associated with opponent team |
Notes:
dtypes
attribute of your result must match exactly.Implementation Strategy
LaggedTeamStats
to itself where the OppID
in the left table is the same as the TeamID
in the right table. The Season
and DayNum
should be the same for both tables.TeamID
is larger than the "left" OppID
. This will avoid duplicating the data.DayNum
56 should not be included in the result to make sure there is a reasonable sample of games so that the stats are meaningful.NULL
or NaN
values should be dropped.### Define get_matchup_stats
def get_matchup_stats(conn):
df = pd.read_sql('select * from laggedteamstats', conn)
return df.merge(df,
left_on=['Season', 'DayNum', 'OppID'],
right_on=['Season', 'DayNum', 'TeamID'])\
.query('TeamID_x < OppID_x')\
.query('DayNum >= 56')\
.drop(columns=['Won_y', 'TeamID_y', 'OppID_y'])\
.rename(columns={'Won_x':'Won', 'TeamID_x':'TeamID', 'OppID_x':'OppID'})\
.dropna()
The demo cell below should display the following output:
Won Season DayNum TeamID OppID Home_x Pos_x OppPos_x \
2 1 2019 78 1210 1323 1 -0.90244 -1.03355
4 1 2011 72 1253 1413 1 2.31956 1.98356
PtsAgstPer100_x PtsForPer100_x WinPct_x Home_y Pos_y OppPos_y \
2 -11.59260 -3.70151 0.55556 0 -2.65178 -2.09178
4 4.31931 -0.18928 0.46667 0 -2.84824 -2.20252
PtsAgstPer100_y PtsForPer100_y WinPct_y
2 -2.28041 6.55518 0.61111
4 3.37227 -0.80974 0.28571
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex6 = {}
demo_conn_dfs_ex6['LaggedTeamStats'] = pd.read_sql('select * from LaggedTeamStatsSample', conn)
demo_conn_ex6 = dfs_to_conn(demo_conn_dfs_ex6)
# display(pd.read_sql('select * from LaggedTeamStats', demo_conn_ex6))
### call demo funtion
print(get_matchup_stats(demo_conn_ex6))
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex6
from tester_fw.testers import Tester_ex6
tester = Tester_ex6()
for _ in range(20):
try:
tester.run_test(get_matchup_stats)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Feel free to skip to the next exercise.
Logistic regression relies on a few assumptions about the predictive variables which are not met here (namely normality and each one individually having a significant effect on the response variable). Instead, we will use the off the shelf Scikit-Learn modules to implement the random forest modeling technique. We're building the model for you, but you will have to do some analysis of the predictions it makes.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
import pickle
import os
### Example code showing how we got X and Y from the output of `get_matchup_stats`
# stats_df = get_matchup_stats(conn)
# # Horizontally partition our stats into predictive (we hope!) variables and the response variable
# X = stats_df.drop(columns=['Won', 'Season', 'DayNum', 'TeamID', 'OppID']) # Remove the response and identification data
# y = stats_df['Won'] # response variable. There are two classes, 0 and 1, which correspond to a loss and a win for the primary team.
# load pre-computed X and y
with open('resource/asnlib/publicdata/X.pkl', 'rb') as f:
X = pickle.load(f)
with open('resource/asnlib/publicdata/y.pkl', 'rb') as f:
y = pickle.load(f)
# Vertically partition data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.3, random_state=6040)
# Check to see if we already saved the model
if not os.path.exists('resource/asnlib/publicdata/clf.pkl'): # if not - train and save the model
clf = GridSearchCV(RandomForestClassifier(),
scoring='roc_auc',
param_grid={
'max_depth': [2,4,6,8,9,10]
},
n_jobs=-1,
verbose=2).fit(X_train,y_train)
with open('resource/asnlib/publicdata/clf.pkl', 'wb') as f:
pickle.dump(clf, f)
# open the saved model from a pickle file
with open('resource/asnlib/publicdata/clf.pkl', 'rb') as f:
with warnings.catch_warnings():
warnings.simplefilter('ignore')
clf = pickle.load(f)
# this is how you predict the probability that the response variable belongs to each class
# probs[:, 0] is the estimated probabilites of losing and probs[:, 1] is the estimated probabilities of winning.
probs = clf.predict_proba(X_test)
# this is how you predict which class each record belongs to
preds = clf.predict(X_test)
# printing some metrics about how well the model fits the test set
print(f'''
Log-Loss:\t\t{round(log_loss(y_test, probs),4)} - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve:\t{round(roc_auc_score(y_test, probs[:, 1]),4)} - 1 is perfect and 0.5 is considered "non-informative".
''')
Without advanced statistical training, these fit metrics would probably not make sense. We see that they are both somewhere in between "perfect" and "non-informative", but what is that telling us? One understandable metric we can calculate is the prediction accuracy.
Complete the function pred_accuracy(obs, preds)
to calculate the relative frequency that the predicted value preds
is the same as the ovserved value obs
. You can assume that both will be array-like, 1-dimensional, the same length and that they only contain the integers 0 and 1. Round your result to 5 decimal places.
Note: There will be no type check. If your number matches our number you will pass.
### Define pred_accuracy
def pred_accuracy(obs, preds):
return round(np.mean(obs==preds), 5)
The demo cell below should display the following output:
0.53333
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex7 = rng.integers(0, 2, 15)
demo_preds_ex7 = rng.integers(0, 2, 15)
# print('obs: ', demo_obs_ex7)
# print('preds:', demo_preds_ex7)
### call demo funtion
pred_accuracy(demo_obs_ex7, demo_preds_ex7)
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex7
from tester_fw.testers import Tester_ex7
tester = Tester_ex7()
for _ in range(20):
try:
tester.run_test(pred_accuracy)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Another way to evaluate our model is to check how accuracy of the probabilities it generates. Suppose you have a list of the observations obs
(0's and 1's indicating whether the primary team wins (1) or not(0)) and a list of the predicted probability probs
that the primary team won for each observation. We can bucket the predictions based on the probabilities and then calculate the accuracy of our model within that bucket by taking the mean of the observations. If the observed win % for each bucket is between or close to its bounds, we have a decent model.
Complete the function bucket_evaluation(obs, prob, n_buckets)
to partition the interval into n_buckets
equal intervals and compute the observed relative win percentage for the events in that bucket. You should return your result as a Pandas DataFrame with the following columns:
Column Name | dtype | description | Special Considerations |
---|---|---|---|
lower | float64 | lower bound of bucket | round to 2 decimal places |
upper | float64 | upper bound of bucket | round to 2 decimal places |
obs_win_pct | float64 | win % for observations where lower estimated win probability < upper | round to 2 decimal places (as a percent) |
n | int64 | number of observations where lower estimated win probability < upper |
Notes:
'lower'
column in ascending order and a reset index.dtypes
attribute of your result must match exactly.NaN
entries for any "empty" buckets which have no observations.Implementation Strategy
np.linspace()
can be helpful for calculating the bounds for each bucket. Don't forget to round!zip(bounds[:-1], bounds[1:]
can be iterated over to sequentially examine each pair of bounds (bounds
would be the output of linspace
after rounding).obs
to only include observations where the prob
is between the lower and upper bounds. You can calculate the observed relative frequency and count from the filtered results.### Define bucket_evaluation
def bucket_evaluation(obs, prob, n_buckets):
bounds = np.round(np.linspace(0,1, n_buckets+1),2)
lod = []
for lower, upper in zip(bounds[:-1], bounds[1:]):
inds = np.argwhere((prob >= lower) & (prob < upper)).reshape((-1,))
n = inds.shape[0]
win_pct = 100*np.mean(obs[inds])
d = {'lower': lower, 'upper': upper, 'obs_win_pct': win_pct, 'n': n}
lod.append(d)
return pd.DataFrame(lod).apply(lambda f: round(f, 2))
The demo cell below should display the following output:
lower upper obs_win_pct n
0 0.00 0.25 25.00 4
1 0.25 0.50 23.08 13
2 0.50 0.75 66.67 3
3 0.75 1.00 40.00 5
Unlike previous exercises the order and index must match (in addition to the usual requirements) to produce an easily understood result.
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex8 = rng.integers(0, 2, 25)
demo_prob_ex8 = rng.random(25)
demo_n_buckets_ex8 = 4
### call demo funtion
print(bucket_evaluation(demo_obs_ex8, demo_prob_ex8, demo_n_buckets_ex8))
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex8
from tester_fw.testers import Tester_ex8
tester = Tester_ex8()
for _ in range(20):
try:
tester.run_test(bucket_evaluation)
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
from pandas.testing import assert_frame_equal
assert_frame_equal(returned_output_vars["df"], true_output_vars["df"])
returned_output_vars["df"]
true_output_vars["df"]
Fin. If you have made it this far, congratulations on completing the semester. Don't forget to submit!