Version 1.0
All of the header information is important. Please read it..
Topics, number of exercises: This problem builds on your knowledge of Pandas, SQL, and numeric computation. It has 9 exercises, numbered 0 to 8. There are 19 available points. However, to earn 100% the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
Debugging you code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (careful when printing large objects, you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Final reminders:
In this notebook we want to predict which team will win a basketball game based on their past performance in a given season. You do not have to know anything about basketball aside from the background below to complete this notebook.
import sqlite3
import numpy as np
import pandas as pd
import re
conn = sqlite3.connect('resource/asnlib/publicdata/basketball_db.sqlite')
We have a big collection of real data from several seasons of men's NCAA basketball. We are most interested in one table of this database, named MRegularSeasonDetailedResults.
Complete the function get_cols(conn) to return a list of the columns in the MRegularSeasonDetailedResults table in the db connection conn. The order of the columns in the list should be the same as the order they appear in the database table.
### Define get_cols
def get_cols(conn):
    ### BEGIN SOLUTION
    return pd.read_sql('select * from MRegularSeasonDetailedResults limit 1', conn).columns.tolist()
    ### END SOLUTION
The demo cell below should display the following output:
['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']
### call demo funtion
print(get_cols(conn))
The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex0
from tester_fw.testers import Tester_ex0
tester = Tester_ex0()
for _ in range(20):
    try:
        tester.run_test(get_cols)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex0(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(get_cols)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
In the MRegularSeasonDetailedResults table, each record is one game. All columns prefixed by 'W' refer to the winning team, and columns prefixed by 'L' refer to the losing team. We are primarily interested in the following columns:
| Column | Type | Has Prefix | Description | 
|---|---|---|---|
| Season | Integer | No | Identifies the calendar year when a season ends. | 
| DayNum | Integer | No | Identifies the day in a season when a game occurred. | 
| TeamID | Integer | W or L | Identifies the teams participating in a game. | 
| Score | Integer | W or L | Number of points a team scored in a game. | 
| Loc | Char | W Only | Identifies whether the winning team was home ('H'), away ('A'), or the game was played at a neutral site ('N'). | 
| FGA | Integer | W or L | Number of field goal attempts occurring when a team had in a game | 
| FTA | Integer | W or L | Number of free throw attempts a team had in a game. | 
| TO | Integer | W or L | Number of turnovers a team had in a game. | 
| OR | Integer | W or L | Number of offensive rebounds a team had in a game. | 
For example, the column with the winning team's ID will be WTeamID, and the column with the losing team's free-throw attempts will be LFTA.
We want to extract some information about each team involved in a game from that team's perspective. We will use the term "primary team" to refer to this team and "primary team's opponent" to refer to the other team in the game. Do not infer any other meaning from this term. Since there are two participants in each game there will be a record from the perspective of both teams after the extraction. (i.e. one record where the winning team is the primary team and one record where the losing team is the primary team will be in the result.
Complete the function get_pace_adjusted(conn) to query the table MRegularSeasonDetailedResults using the db connection conn and return a Pandas DataFrame with the columns outlined below. The "Winning Team" and "Losing Team" headers in the chart below indicate which columns in MRegularSeasonDetailedResults to use in your computations.
| Column Name | dtype | description | Special Considerations | Winning Team | Losing Team | 
|---|---|---|---|---|---|
| Won | int64 | 1 if the primary team won the game, 0 if the primary team lost | 1 | 0 | |
| Season | int64 | Current Season | 
|||
| DayNum | int64 | Current DayNum | 
|||
| TeamID | int64 | Team id for primary team | WTeamID | LTeamID | |
| OppID | int64 | Team id for primary team's opponent | LTeamID | WTeamID | |
| Home | int64 | 1 if the primary team is at home for the game, 0 otherwise. | WLoc is 'H' | 
WLoc is 'A' | 
|
| Pos | float64 | Estimated number of possessions for primary team | Round to 5 decimal places | W prefix | 
L prefix | 
| OppPos | float64 | Estimated number of possessions for primary team's opponent | Round to 5 decimal places | L prefix | 
W prefix | 
| PtsForPer100 | float64 | 100 $\times$ primary team's score $\div$ estimate of primary team's possessions | Round to 5 decimal places (only round after division) | W prefix | 
L prefix | 
| PtsAgstPer100 | float64 | 100 $\times$ primary team's opponent's score $\div$ estimate of primary team's opponent's possessions | Round to 5 decimal places (only round after division) | L prefix | 
W prefix | 
There is no column for possessions in our source data. We will need to estimate it! The formula below is widely used in sports analytics to estimate the number of possessions a team had in a basketball game. ($POS$ is # of possessions):
$POS$ = $FGA$ + 0.44($FTA$) + $TO$ - $OR$
You can derive all required results from these columns: Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO.
Notes:
MRegularSeasonDetailedResults will correspond to two records in the result. One record where the winning team is the primary team and one record where the losing team is the primary team.WLoc column will have an 'N' entry.dtypes attribute of your result must match exactly.UNION ALL phrase may be helpful if you are using SQL.pd.concat() method may be helpful if you are using Pandas.### Define get_pace_adjusted
def get_pace_adjusted(conn):
    ### BEGIN SOLUTION
    query = '''
    select
        1 as Won,
        Season, DayNum,
        WTeamID as TeamID,
        LTeamID as OppID,
        WLoc='H' as Home,
        WFGA + 0.44*WFTA + WTO - WOR as Pos,
        LFGA + 0.44*LFTA + LTO - LOR as OppPos,
        100*WScore/(WFGA + 0.44*WFTA + WTO - WOR) as PtsForPer100,
        100*LScore/(LFGA + 0.44*LFTA + LTO - LOR) as PtsAgstPer100
    from
        MRegularSeasonDetailedResults
    UNION ALL
    select
        0 as Won,
        Season, DayNum,
        LTeamID as TeamID,
        WTeamID as OppID,    
        WLoc='A' as Home,
        LFGA + 0.44*LFTA + LTO - LOR as Pos,
        WFGA + 0.44*WFTA + WTO - WOR as OppPos,    
        100*LScore/(LFGA + 0.44*LFTA + LTO - LOR) as PtsForPer100,
        100*WScore/(WFGA + 0.44*WFTA + WTO - WOR) as PtsAgstPer100
    from
        MRegularSeasonDetailedResults
    '''
    return pd.read_sql(query, conn).apply(lambda x: round(x, 5))
    ### END SOLUTION
This is the demo input:
   Season  DayNum  WTeamID  WScore  LTeamID  LScore WLoc  WFGA  WFTA  WOR  WTO  LFGA  LFTA  LOR  LTO 
0    2003      10     1104      68     1328      62    N    58    18   14   23    53    22   10   18
The demo cell below should display the following output:
      Won  Season  DayNum  TeamID  OppID  Home    Pos  OppPos  PtsForPer100     PtsAgstPer100  
   0    1    2003      10    1104   1328     0  74.92   70.68      90.76348          87.71930
   1    0    2003      10    1328   1104     0  70.68   74.92      87.71930          90.76348
Note that there are two rows, but they come from a single input row.
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_1 = pd.read_sql('select Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO from MRegularSeasonDetailedResults limit 1', conn)
demo_conn_1 = dfs_to_conn({'MRegularSeasonDetailedResults': demo_df_1})
# demo_df_1
### call demo funtion
print(get_pace_adjusted(demo_conn_1))
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex1
from tester_fw.testers import Tester_ex1
tester = Tester_ex1()
for _ in range(20):
    try:
        tester.run_test(get_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex1(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(get_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Our next task is to compute daily totals of Pos, OppPos, PtsForPer100, and PtsAgstPer100 as well as a count of the number of teams participating in games that day.
Complete the function daily_totals(conn) to query the table PaceAdjusted (structured the same as the output of the previous exercise) in the db connection conn and return a Pandas DataFrame with the following columns:
| Column Name | dtype | description | 
|---|---|---|
| Season | int64 | |
| DayNum | int64 | |
| Pos | float64 | Sum of values in Pos column of PaceAdjusted occuring on a unique combination of Season and DayNum | 
| Pph | float64 | Sum of values in PtsForPer100 column of PaceAdjusted occuring on a unique combination of Season and DayNum | 
| Count | int64 | Count of rows in PaceAdjusted with a unique combination of Season and DayNum | 
Notes:
dtypes attribute of your result must match exactly.### Define daily_totals
def daily_totals(conn):
    ### BEGIN SOLUTION
    query = '''
     select
        Season, DayNum,
        sum(pos) as Pos,
        sum(PtsForPer100) as Pph,
        count() as Count
    from 
        paceadjusted
    group by
        season, daynum
    '''
    return pd.read_sql(query, conn)
    ### END SOLUTION
The demo cell below should display the following output:
   Season  DayNum     Pos        Pph  Count
0    2010      94  277.28  433.33581      4
1    2010     108  281.64  415.62023      4
2    2010     122  265.88  395.26418      4
3    2016      49  365.04  467.93858      5
4    2016     115  204.08  309.66380      3
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_2 = pd.read_sql('select * from PaceAdjustedSample', conn)
demo_conn_2 = dfs_to_conn({'PaceAdjusted': demo_df_2})
# demo_df_2.sort_values(['Season', 'DayNum']).reset_index(drop=True)
### call demo funtion
print(daily_totals(demo_conn_2))
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex2
from tester_fw.testers import Tester_ex2
tester = Tester_ex2()
for _ in range(20):
    try:
        tester.run_test(daily_totals)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex2(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(daily_totals)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
This is useful information
Window functions define a slice of data relative to each row in a dataset and then perform an aggregate or order-based calculation on the slice. For example, say you want to perform the following calculation for each row $i$ in DailyTotals:
SumPos$_i$ $:=$ Sum of the Pos column for all rows where Season = Season$_i$ and DayNum $\le$ DayNum$_i$
Window functions can do this. Below are examples of this computation in Pandas and SQL. First, we will load some data to demonstrate.
# define demo data
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
daily_totals_df = demo_conn_dfs_ex3['DailyTotals']
The code below will return the sample data with a new column "SumPos".
### Pandas Example
def pandas_window_helper(group):
    group = group.sort_values('DayNum')
    group['SumPos'] = group['Pos'].expanding(1).sum()
    return group
pandas_ex = daily_totals_df.groupby('Season', as_index=False).apply(pandas_window_helper)
The code below will return the sample data with a new column "SumPos".
### SQL Example
query = '''
    SELECT 
        *,
        SUM(pos) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as SumPos
    FROM
        DailyTotals
    ORDER BY
        Season, DayNum
'''
sql_ex = pd.read_sql(query, demo_conn_ex3)
The code below will print the result and verify that the two approaches give equivalent results.
display(sql_ex)
print(f'''SQL and Pandas results match? {
((sql_ex.reset_index(drop=True) - pandas_ex.reset_index(drop=True)).abs() <= 0.00001).all().all()}''')
Our next task is to compute the weighted running average for Pos and Pph for all games played in a Season up to and including each DayNum. We want a snapshot of these averages as if it were that particular day and we had no knowledge of events occurring after that day.
Complete the function get_running_avg_pace(conn) to query the table DailyTotals (structured the same as the output of the previous exercise) using the db connection conn and return a Pandas DataFrame with the columns mentioned below. You should calculate the intermediate values then use division to get the end result.
Intermediate values
SumPos - Running sum of the "Pos" column for current "Season" up to and including the current "DayNum".SumPph - Running sum of the "Pph" column for current "Season" up to and including the current "DayNum".SumCount - Running sum of the "Count" column for current "Season" up to and including the current "DayNum".| Column Name | dtype | description | Special Considerations | 
|---|---|---|---|
| Season | int64 | Current Season | 
|
| DayNum | int64 | Current DayNum | 
|
| AvgPace | float64 | SumPos $\div$ SumCount | 
Round to 5 decimal places | 
| AvgPPH | float64 | SumPph $\div$ SumCount | 
Round to 5 decimal places | 
Notes:
dtypes attribute of your result must match exactly.pd.DataFrame.groupby().apply() and pd.Series.expanding() may be useful in calculating intermediate values.SUM(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ### Define def get_running_avg_pace
def get_running_avg_pace(conn):
    ### BEGIN SOLUTION
    query = '''
    SELECT
        Season, DayNum,
        SUM(pos) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total_pos,
        SUM(pph) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total_pph,
        SUM(count) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_count
    FROM
        DailyTotals
    '''
    df = pd.read_sql_query(query, conn)
    df['AvgPace'] = df['running_total_pos']/df['running_count']
    df['AvgPPH'] = df['running_total_pph']/df['running_count']
    return df[['Season', 'DayNum', 'AvgPace', 'AvgPPH']].apply(lambda x: round(x, 5))
    ### END SOLUTION
The demo cell below should display the following output:
   Season  DayNum   AvgPace     AvgPPH
0    2014      12  68.17063  105.33153
1    2014      15  68.85556  105.46984
2    2014      18  68.43963  106.62453
3    2014      41  68.27426  105.76722
4    2014      44  67.81551  105.66767
5    2014      59  67.80480  105.23519
6    2014     102  67.87677  105.16634
7    2014     127  67.77059  105.08984
8    2021      29  71.56138   96.56166
9    2021      96  69.98540  101.14361
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
# demo_conn_dfs_ex3['DailyTotals'].sort_values(['Season', 'DayNum']).reset_index(drop=True)
### call demo funtion
print(get_running_avg_pace(demo_conn_ex3))
The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex3
from tester_fw.testers import Tester_ex3
tester = Tester_ex3()
for _ in range(20):
    try:
        tester.run_test(get_running_avg_pace)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex3(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(get_running_avg_pace)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Suppose you are given two SQL tables:
PaceAdjusted, structured the same as the output of Exercise 1.RunningAvgPace, structured the same as the output of Exercise 3.Complete the function get_center_pace_adjusted(conn) to query tables RunningAvgPace  and PaceAdjusted  and return a Pandas DataFrame with the following columns:
| Column Name | dtype | description | 
|---|---|---|
| Won | int64 | Unchanged from PaceAdjusted | 
| Season | int64 | Unchanged from PaceAdjusted | 
| DayNum | int64 | Unchanged from PaceAdjusted | 
| TeamID | int64 | Unchanged from PaceAdjusted | 
| OppID | int64 | Unchanged from PaceAdjusted | 
| Home | int64 | Unchanged from PaceAdjusted | 
| Pos | float64 | PaceAdjusted.Pos - RunningAvgPace.AvgPace for corresponding Season and DayNum | 
| OppPos | float64 | PaceAdjusted.OppPos - RunningAvgPace.AvgPace for corresponding Season and DayNum | 
| PtsAgstPer100 | float64 | PaceAdjusted.PtsAgstPer100 - RunningAvgPace.AvgPPH for corresponding Season and DayNum | 
| PtsForPer100 | float64 | PaceAdjusted.PtsForPer100 - RunningAvgPace.AvgPPH for corresponding Season and DayNum | 
Notes:
dtypes attribute of your result must match exactly.### Define get_center_pace_adjusted
def get_center_pace_adjusted(conn):
    ### BEGIN SOLUTION
    query = '''
    select
        Won, PA.Season, PA.DayNum, TeamID, OppID, Home,
        Pos - AvgPace as Pos,
        OppPos - AvgPace as OppPos,
        PtsAgstPer100 - AvgPPH as PtsAgstPer100,
        PtsForPer100 - AvgPPH as PtsForPer100
    from
        PaceAdjusted as PA 
    left join 
        RunningAvgPace as RAP 
    on
        PA.Season = RAP.Season
        and PA.DayNum = RAP.DayNum
    '''
    return pd.read_sql(query, conn)
    ### END SOLUTION
The demo cell below should display the following output:
   Won  Season  DayNum  TeamID  OppID  Home       Pos    OppPos  \
0    1    2014      94    1374   1396     1  -3.86907  -2.18907   
1    0    2016     119    1224   1313     0   2.65129   5.57129   
2    0    2016      90    1353   1277     0   4.05371   2.13371   
3    1    2016     119    1300   1366     1  -8.78871  -9.30871   
4    1    2014      94    1383   1384     1  -2.30907  -4.66907   
5    1    2014      94    1190   1297     0  -9.42907  -9.62907   
6    0    2014      94    1223   1101     0   0.29093   0.69093   
7    0    2014      94    1218   1364     1  -1.98907  -3.58907   
8    0    2014      92    1159   1221     1 -13.28500 -12.96500   
9    0    2014      81    1297   1252     1   1.52248   1.76248   
   PtsAgstPer100  PtsForPer100  
0      -25.32271      13.22592  
1        7.78985     -12.71682  
2       29.99420     -19.84044  
3      -31.11146      -3.56274  
4      -22.13333      14.99904  
5      -18.41211       5.59099  
6       22.95005      14.80074  
7       12.70182      -7.11914  
8       22.07960      20.97894  
9        3.17351       2.09897
### define demo inputs
### use naming convention demo_varname_ex_* to name demo variables
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex4 = {}
demo_conn_dfs_ex4['PaceAdjusted'] = pd.read_sql('select * from PaceAdjustedSampleTwo', conn)
demo_conn_dfs_ex4['RunningAvgPace'] = pd.read_sql('select * from RunningAvgPaceSample', conn)
demo_conn_ex4 = dfs_to_conn(demo_conn_dfs_ex4)
# print('PaceAdjusted')
# display(demo_conn_dfs_ex4['PaceAdjusted'])
# print('RunningAvgPace')
# display(demo_conn_dfs_ex4['RunningAvgPace'])
### call demo funtion
print(get_center_pace_adjusted(demo_conn_ex4))
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex4
from tester_fw.testers import Tester_ex4
tester = Tester_ex4()
for _ in range(20):
    try:
        tester.run_test(get_center_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex4(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(get_center_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Now we have centered and pace adjusted stats for each Season, DayNum, and TeamID, it's time to compute running averages to set as inputs to our model. Since we are using these stats as inputs to a predictive model, we need to introduce a lag of 1 game so that the results of the game we are trying to predict are not part of these inputs.
Complete the function get_team_stats(conn) to query the PaceAdjustedCentered (structured the same as the output of the previous exercise) table with the db connection conn and return a Pandas DataFrame with the following columns:
| Column Name | dtype | description | Special Considerations | 
|---|---|---|---|
| Won | int64 | Unchanged from PaceAdjustedCentered | 
|
| Season | int64 | Current Season | Unchanged from PaceAdjustedCentered | 
| DayNum | int64 | Current Day | Unchanged from PaceAdjustedCentered | 
| TeamID | int64 | Current TeamID | Unchanged from PaceAdjustedCentered | 
| OppID | int64 | Unchanged from PaceAdjustedCentered | 
|
| Home | int64 | Unchanged from PaceAdjustedCentered | 
|
| Pos | float64 | Running Average of PaceAdjustedCentered.Pos including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | 
Round to 5 decimal places | 
| OppPos | float64 | Running Average of PaceAdjustedCentered.OppPos including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | 
Round to 5 decimal places | 
| PtsAgstPer100 | float64 | Running Average of PaceAdjustedCentered.PtsAgstPer100 including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | 
Round to 5 decimal places | 
| PtsForPer100 | float64 | Running Average of PaceAdjustedCentered.PtsForPer100 including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | 
Round to 5 decimal places | 
| WinPct | float64 | Running Average of PaceAdjustedCentered.Won including all rows with the current Season and TeamID with DayNum < PaceAdjustedCentered.DayNum | 
Round to 5 decimal places | 
Notes:
dtypes attribute of your result must match exactly.pd.DataFrame.groupby().apply() and pd.Series.expanding() may be useful.AVG(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) ### Define get_team_stats
def get_team_stats(conn):
    ### BEGIN SOLUTION
    query = '''
    select
        Won, Season, DayNum, TeamID, OppID, Home, 
        avg(Pos) over(partition by Season, TeamID order by DayNum rows between unbounded preceding and 1 preceding) as Pos,
        avg(OppPos) over(partition by Season, TeamID order by DayNum rows between unbounded preceding and 1 preceding) as OppPos,
        avg(PtsAgstPer100) over(partition by Season, TeamID order by DayNum rows between unbounded preceding and 1 preceding) as PtsAgstPer100,
        avg(PtsForPer100) over(partition by Season, TeamID order by DayNum rows between unbounded preceding and 1 preceding) as PtsForPer100,
        avg(Won) over(partition by Season, TeamID order by DayNum rows between unbounded preceding and 1 preceding) as WinPct
    from
        PaceAdjustedCentered
    '''
    return pd.read_sql_query(query, conn).apply(lambda x: round(x, 5), axis=0)
    ### END SOLUTION
The demo cell below should display the following output:
    Won  Season  DayNum  TeamID  OppID  Home      Pos    OppPos  \
0     0    2012      17    1108   1452     0      NaN       NaN   
1     0    2012      25    1108   1408     0  8.25697   8.53697   
2     0    2012      26    1108   1360     0  5.65314   5.93314   
3     1    2012     105    1108   1115     1  2.76368   2.09702   
4     0    2012      16    1355   1208     0      NaN       NaN   
5     1    2012      31    1355   1237     0 -5.81233  -5.33233   
6     0    2012      45    1355   1315     0  5.07403   4.81403   
7     1    2012     124    1355   1237     0  4.12049   4.84049   
8     0    2012      29    1431   1332     0      NaN       NaN   
9     1    2012      52    1431   1155     0 -4.58353  -5.26353   
10    0    2012      72    1431   1409     0 -5.65785  -7.11785   
11    1    2012      79    1431   1187     1 -6.81332  -8.17332   
12    1    2012     103    1431   1408     1 -6.71588  -7.96588   
13    1    2017      47    1252   1399     0      NaN       NaN   
14    1    2017      50    1252   1122     0  9.78901  10.22901   
15    1    2017      68    1252   1244     1  7.09400   8.63400   
16    0    2017      82    1252   1316     0  6.62264   8.36930   
17    1    2017      86    1252   1367     1  7.55313   8.32313   
18    0    2017     108    1252   1316     1  6.32893   6.93693   
19    0    2017     122    1252   1316     1  6.91365   7.39365   
    PtsAgstPer100  PtsForPer100   WinPct  
0             NaN           NaN      NaN  
1        26.44115     -18.48766  0.00000  
2        21.51181     -30.28407  0.00000  
3        16.34005     -20.09929  0.00000  
4             NaN           NaN      NaN  
5        14.27057      -2.29566  0.00000  
6        13.42293       6.86026  0.50000  
7        16.60255       4.68752  0.33333  
8             NaN           NaN      NaN  
9         2.30485      -6.73367  0.00000  
10       -7.95942      -3.23226  0.50000  
11       -4.29258      -8.20323  0.33333  
12       -5.03189      -2.51522  0.50000  
13            NaN           NaN      NaN  
14       -5.62638      -2.59159  1.00000  
15        0.33659      13.16336  1.00000  
16       -0.57178      10.26290  1.00000  
17        2.37803       7.85893  0.75000  
18        2.68629       9.02187  0.80000  
19        4.37327       7.30692  0.66667
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex5 = {}
demo_conn_dfs_ex5['PaceAdjustedCentered'] = pd.read_sql('select * from PaceAdjustedCenteredSample', conn).sort_values(['Season', 'TeamID', 'DayNum'])
demo_conn_ex5 = dfs_to_conn(demo_conn_dfs_ex5)
# display(demo_conn_dfs_ex5['PaceAdjustedCentered'])
### call demo funtion
print(get_team_stats(demo_conn_ex5))
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex5
from tester_fw.testers import Tester_ex5
tester = Tester_ex5()
for _ in range(20):
    try:
        tester.run_test(get_team_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex5(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(get_team_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Up to this point we have calculated our metrics for each team before they play each game. We're almost ready to start modeling! There is still one last task. Remember that there were two records in PaceAdjusted for each game. This is also true of LaggedTeamStats, which is a running average of that table. If "Team A" and "Team B" played a game on in some Season on some DayNum then there will be one record where "Team A" is the primary team and "Team B" is the opponent team and one record where "Team B" is the pirmary team and "Team A" is the opponent team. We need to join these records together so that we will have one record identifying each game.
Complete the function get_matchup_stats(conn) to query the LaggedTeamStats table using the db connection conn and return a Pandas DataFrame with the following columns:
| Column Name | dtype | description | 
|---|---|---|
| Won | int64 | 1 if primary team (_x suffix) won the matchup 0 otherwise | 
| Season | int64 | |
| DayNum | int64 | |
| TeamID | int64 | Primary team - associated with _x suffixes | 
| OppID | int64 | Opponent team - associated with _y suffixes | 
| Home_x | int64 | Stats associated with primary team | 
| Pos_x | float64 | Stats associated with primary team | 
| OppPos_x | float64 | Stats associated with primary team | 
| PtsAgstPer100_x | float64 | Stats associated with primary team | 
| PtsForPer100_x | float64 | Stats associated with primary team | 
| WinPct_x | float64 | Stats associated with primary team | 
| Home_y | int64 | Stats associated with opponent team | 
| Pos_y | float64 | Stats associated with opponent team | 
| OppPos_y | float64 | Stats associated with opponent team | 
| PtsAgstPer100_y | float64 | Stats associated with opponent team | 
| PtsForPer100_y | float64 | Stats associated with opponent team | 
| WinPct_y | float64 | Stats associated with opponent team | 
Notes:
dtypes attribute of your result must match exactly.Implementation Strategy
LaggedTeamStats to itself where the OppID in the left table is the same as the TeamID in the right table. The Season and DayNum should be the same for both tables.TeamID is larger than the "left" OppID. This will avoid duplicating the data.DayNum 56 should not be included in the result to make sure there is a reasonable sample of games so that the stats are meaningful.NULL or NaN values should be dropped.### Define get_matchup_stats
def get_matchup_stats(conn):
    ### BEGIN SOLUTION
    lagged_team_stats_df = pd.read_sql_query('select * from LaggedTeamStats', conn)
    return lagged_team_stats_df\
        .merge(lagged_team_stats_df, left_on=['Season', 'DayNum', 'OppID'], right_on=['Season', 'DayNum', 'TeamID'])\
        .query('TeamID_x < TeamID_y')\
        .query('DayNum >= 56')\
        .drop(columns=['Won_y',	'TeamID_y',	'OppID_y'])\
        .rename(columns={'Won_x': 'Won', 'TeamID_x': 'TeamID', 'OppID_x':'OppID'})\
        .dropna()
    ### END SOLUTION
The demo cell below should display the following output:
   Won  Season  DayNum  TeamID  OppID  Home_x    Pos_x  OppPos_x  \
2    1    2019      78    1210   1323       1 -0.90244  -1.03355   
4    1    2011      72    1253   1413       1  2.31956   1.98356   
   PtsAgstPer100_x  PtsForPer100_x  WinPct_x  Home_y    Pos_y  OppPos_y  \
2        -11.59260        -3.70151   0.55556       0 -2.65178  -2.09178   
4          4.31931        -0.18928   0.46667       0 -2.84824  -2.20252   
   PtsAgstPer100_y  PtsForPer100_y  WinPct_y  
2         -2.28041         6.55518   0.61111  
4          3.37227        -0.80974   0.28571
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex6 = {}
demo_conn_dfs_ex6['LaggedTeamStats'] = pd.read_sql('select * from LaggedTeamStatsSample', conn)
demo_conn_ex6 = dfs_to_conn(demo_conn_dfs_ex6)
# display(pd.read_sql('select * from LaggedTeamStats', demo_conn_ex6))
### call demo funtion
print(get_matchup_stats(demo_conn_ex6))
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex6
from tester_fw.testers import Tester_ex6
tester = Tester_ex6()
for _ in range(20):
    try:
        tester.run_test(get_matchup_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex6(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(get_matchup_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Feel free to skip to the next exercise.
Logistic regression relies on a few assumptions about the predictive variables which are not met here (namely normality and each one individually having a significant effect on the response variable). Instead, we will use the off the shelf Scikit-Learn modules to implement the random forest modeling technique. We're building the model for you, but you will have to do some analysis of the predictions it makes.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
import pickle
import os
### Example code showing how we got X and Y from the output of `get_matchup_stats` 
# stats_df = get_matchup_stats(conn)
# # Horizontally partition our stats into predictive (we hope!) variables and the response variable
# X = stats_df.drop(columns=['Won', 'Season', 'DayNum', 'TeamID', 'OppID']) # Remove the response and identification data
# y = stats_df['Won'] # response variable. There are two classes, 0 and 1, which correspond to a loss and a win for the primary team.
# load pre-computed X and y
with open('resource/asnlib/publicdata/X.pkl', 'rb') as f:
    X = pickle.load(f)
with open('resource/asnlib/publicdata/y.pkl', 'rb') as f:
    y = pickle.load(f)
# Vertically partition data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, random_state=6040)
# Check to see if we already saved the model
if not os.path.exists('resource/asnlib/publicdata/clf.pkl'): # if not - train and save the model
    clf = GridSearchCV(RandomForestClassifier(),
        scoring='roc_auc',
        param_grid={
            'max_depth': [2,4,6,8,9,10]
        },
        n_jobs=-1,
        verbose=2).fit(X_train,y_train)
    with open('resource/asnlib/publicdata/clf.pkl', 'wb') as f:
        pickle.dump(clf, f)
# open the saved model from a pickle file
with open('resource/asnlib/publicdata/clf.pkl', 'rb') as f:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        clf = pickle.load(f)
# this is how you predict the probability that the response variable belongs to each class
# probs[:, 0] is the estimated probabilites of losing and probs[:, 1] is the estimated probabilities of winning.
probs = clf.predict_proba(X_test) 
# this is how you predict which class each record belongs to
preds = clf.predict(X_test)
# printing some metrics about how well the model fits the test set
print(f'''
Log-Loss:\t\t{round(log_loss(y_test, probs),4)} - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve:\t{round(roc_auc_score(y_test, probs[:, 1]),4)} - 1 is perfect and 0.5 is considered "non-informative".
''')
Without advanced statistical training, these fit metrics would probably not make sense. We see that they are both somewhere in between "perfect" and "non-informative", but what is that telling us? One understandable metric we can calculate is the prediction accuracy.
Complete the function pred_accuracy(obs, preds) to calculate the relative frequency $\left (\frac{\# matches}{\# observations} \right )$ that the predicted value preds is the same as the ovserved value obs. You can assume that both will be array-like, 1-dimensional, the same length and that they only contain the integers 0 and 1. Round your result to 5 decimal places.
Note: There will be no type check. If your number matches our number you will pass.
### Define pred_accuracy
def pred_accuracy(obs, preds):
    ### BEGIN SOLUTION
    return round(np.mean(obs == preds), 5)
    ### END SOLUTION
The demo cell below should display the following output:
0.53333
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex7 = rng.integers(0, 2, 15)
demo_preds_ex7 = rng.integers(0, 2, 15)
# print('obs:  ', demo_obs_ex7)
# print('preds:', demo_preds_ex7)
### call demo funtion
pred_accuracy(demo_obs_ex7, demo_preds_ex7)
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex7
from tester_fw.testers import Tester_ex7
tester = Tester_ex7()
for _ in range(20):
    try:
        tester.run_test(pred_accuracy)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex7(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(pred_accuracy)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Another way to evaluate our model is to check how accuracy of the probabilities it generates. Suppose you have a list of the observations obs (0's and 1's indicating whether the primary team wins (1) or not(0)) and a list of the predicted probability probs that the primary team won for each observation. We can bucket the predictions based on the probabilities and then calculate the accuracy of our model within that bucket by taking the mean of the observations. If the observed win % for each bucket is between or close to its bounds, we have a decent model.
Complete the function bucket_evaluation(obs, prob, n_buckets) to partition the interval $[0, 1)$ into n_buckets equal intervals and compute the observed relative win percentage for the events in that bucket. You should return your result as a Pandas DataFrame with the following columns:
| Column Name | dtype | description | Special Considerations | 
|---|---|---|---|
| lower | float64 | lower bound of bucket | round to 2 decimal places | 
| upper | float64 | upper bound of bucket | round to 2 decimal places | 
| obs_win_pct | float64 | win % for observations where lower $\le$ estimated win probability < upper | round to 2 decimal places (as a percent) | 
| n | int64 | number of observations where lower $\le$ estimated win probability < upper | 
Notes:
'lower' column in ascending order and a reset index.dtypes attribute of your result must match exactly.NaN entries for any "empty" buckets which have no observations.Implementation Strategy
np.linspace() can be helpful for calculating the bounds for each bucket. Don't forget to round!zip(bounds[:-1], bounds[1:] can be iterated over to sequentially examine each pair of bounds (bounds would be the output of linspace after rounding).obs to only include observations where the prob is between the lower and upper bounds. You can calculate the observed relative frequency and count from the filtered results.### Define bucket_evaluation
def bucket_evaluation(obs, prob, n_buckets):
    ### BEGIN SOLUTION
    eval_df = pd.DataFrame({'true': obs, 'prob': prob})
    bounds = np.round(np.linspace(0, 1, n_buckets+1), 2)
    lod = []
    for lower, upper in zip(bounds[:-1], bounds[1:]):
        mask = (eval_df['prob'] >= lower) & (eval_df['prob'] < upper)
        lod.append({
            'lower': round(lower,2),
            'upper': round(upper,2),
            'obs_win_pct': round(100*eval_df.loc[mask, 'true'].mean(),2),
            'n': eval_df[mask].shape[0]
        })
    return pd.DataFrame(lod)
    ### END SOLUTION
The demo cell below should display the following output:
   lower  upper  obs_win_pct   n
0   0.00   0.25        25.00   4
1   0.25   0.50        23.08  13
2   0.50   0.75        66.67   3
3   0.75   1.00        40.00   5
Unlike previous exercises the order and index must match (in addition to the usual requirements) to produce an easily understood result.
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex8 = rng.integers(0, 2, 25)
demo_prob_ex8 = rng.random(25)
demo_n_buckets_ex8 = 4
### call demo funtion
print(bucket_evaluation(demo_obs_ex8, demo_prob_ex8, demo_n_buckets_ex8))
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.returned_output_vars - Outputs returned by your solution.true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex8
from tester_fw.testers import Tester_ex8
tester = Tester_ex8()
for _ in range(20):
    try:
        tester.run_test(bucket_evaluation)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### BEGIN HIDDEN TESTS
tester = Tester_ex8(key=b'BdInv5kO6lGx1qlRaFQt1DLlOiuSYqii48WbnGGiKk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test(bucket_evaluation)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Fin. If you have made it this far, congratulations on completing the semester. Don't forget to submit!