Golf Performance Analytics: Strokes Gained Calculations

Version 1.0.3

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of string manipulation, regular expressions, JSON processing, data validation, and statistical analysis. It has 11 exercises, numbered 0 to 10. There are 19 available points. However, to earn 100% the threshold is 15 points. (Therefore, once you hit 15 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them but we may not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed. Be careful when printing large objects; you may want to print the head or chunks of rows at a time.

Exercise point breakdown:

  • Exercise 0 - : 2 point(s)

  • Exercise 1 - : 2 point(s)

  • Exercise 2 - : 3 point(s)

  • Exercise 3 - : 1 point(s)

  • Exercise 4 - : 0 point(s)

  • Exercise 5 - : 3 point(s)

  • Exercise 6 - : 1 point(s)

  • Exercise 7 - : 2 point(s)

  • Exercise 8 - : 2 point(s)

  • Exercise 9 - : 1 point(s)

  • Exercise 10 - : 2 point(s)

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

Golf 101

Golf is a sport where players hit a ball with a club into a series of holes using as few attempts as possible.

Key Golf Concepts

  • Stroke/Shot: Each attempt to hit the ball.
  • Hole: Both the target (a physical hole in the ground) and a section of the course with a marked starting point leading up to each target.
  • Round: A complete game of golf, consisting of a sequence of holes.
  • Lie: The surface the ball is on at the start of a shot. (Tee, Fairway, Rough, Sand, Green, etc.)
  • Distance: How far the ball is from the hole.

The Problem - how "good" are the professional golfers on the PGA TOUR?

This exam analyzes PGA Tour shot-level data to understand professional golf performance using a metric called "strokes gained".

What is "strokes gained"?

Strokes gained measures how much a shot reduces the expected number of strokes needed to put the ball in the hole, based on the lie and distance at the start and end of the shot. It gives a more complete picture of a golfer's performance in different aspects of the sport than the total score alone.
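As a rough illustration, a widely used formulation of strokes gained is the expected strokes at the start of a shot, minus the expected strokes at the end, minus one for the stroke just taken. The sketch below assumes that formulation. The baseline numbers and the name strokes_gained_sketch are made up for illustration and are not derived from the exam data; the formula you will implement is specified in a later exercise.

```python
# Hypothetical baseline: expected strokes to hole out from each (lie, distance).
# These values are illustrative only, not computed from the exam data.
baseline = {
    ("Tee Box", 450): 4.1,
    ("Fairway", 150): 2.9,
    ("Green", 10): 1.6,
    ("Hole", 0): 0.0,
}

def strokes_gained_sketch(start, end):
    """Expected strokes before the shot, minus expected strokes after, minus the stroke taken."""
    return baseline[start] - baseline[end] - 1

# One hole played in three shots: tee -> fairway -> green -> hole.
shots = [
    (("Tee Box", 450), ("Fairway", 150)),
    (("Fairway", 150), ("Green", 10)),
    (("Green", 10), ("Hole", 0)),
]
per_shot = [strokes_gained_sketch(start, end) for start, end in shots]
total = sum(per_shot)
print([round(sg, 2) for sg in per_shot], round(total, 2))
```

Under this formulation the per-shot values telescope: the total for the hole equals the expected strokes from the tee minus the number of strokes actually taken, while the individual values show which shots contributed the gain.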

The central limit theorem

The analysis you are about to complete hinges on the central limit theorem (CLT). In short the theorem states:

  • Given a sufficiently large sample of $n$ independent observations of some random variable $X$ with finite variance
  • The mean of the sample $\sum_{i=1}^n \frac{x_i}{n}$, where $x_i$ is an individual observation, will be approximately normally distributed.
  • The distribution will have mean $\mu$ and variance $\frac{\sigma^2}{n}$, where $\mu$ and $\sigma^2$ are the true expected value and variance of $X$.
  • This implies that with a large sample we can accurately estimate the expected value of $X$, because the variance of the sample mean shrinks as $n$ grows.

For the purposes of this analysis the CLT allows us to accurately estimate the expected value of all possible shots as well as each player's skill by taking sample means over a large sample.
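A small simulation can make this concrete. The sketch below is not part of the exam data; it assumes $X$ is a fair six-sided die (so $\mu = 3.5$ and $\sigma^2 = 35/12$) and uses an arbitrary seed. It draws many samples of size $n$ and compares the mean and variance of the sample means against $\mu$ and $\frac{\sigma^2}{n}$.

```python
import random
from statistics import mean, pvariance

random.seed(6040)  # arbitrary seed so the run is repeatable

def sample_mean(n):
    # X is assumed to be a fair six-sided die: mu = 3.5, sigma^2 = 35/12
    return mean(random.randint(1, 6) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(5000)]

observed_mean = mean(means)
observed_var = pvariance(means)
expected_var = (35 / 12) / n

print(f"mean of sample means:     {observed_mean:.3f} (mu = 3.5)")
print(f"variance of sample means: {observed_var:.4f} (sigma^2 / n = {expected_var:.4f})")
```

Increasing n should shrink the variance of the sample means roughly in proportion, which is why the baseline is only computed for lie/distance combinations with enough observations.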

Your Overall Task

You will process and analyze golf shot data through these phases:

  1. Data preprocessing - transform the data into a form where it can be analyzed.
    • Identify and filter problematic data.
    • Re-structure the data into a form for convenient analysis.
    • Standardize categorical data (lies and distances).
  2. Baseline calculation - calculate the expected number of strokes to the hole for each lie and distance.
    • Determine the mean for each lie/distance combination with enough observations.
    • Fill in missing values by interpolation.
    • Smooth the baseline.
  3. Strokes gained calculation - use the baseline to calculate the strokes gained metric for each shot.
    • Implement the strokes gained formula
    • Apply the formula to all the shots in the data set.
  4. Analysis - calculate each player's average strokes gained for different shot categories.
    • Categorize the shots
    • Aggregate strokes gained for each player and shot category

Data Structure

In our analysis we will be starting with a dataset called raw_hole_details. This dataset contains information about each hole played by players across multiple professional golf tournaments, including details about the individual strokes taken on each hole.

The variable raw_hole_details is a list where each element is a dictionary representing a single golf hole played by a player in a tournament round.

Example Structure

[
    {
        "player_id": 30926,
        "tournament_id": "R2024016",
        "round_number": 1,
        "hole_details": {
            "hole_number": 1,
            "hole_score": "4",
            "hole_yardage": 532,
            "stroke_details": [
                {
                    "strokeNumber": 1,
                    "finalStroke": false,
                    "distanceRemaining": "156 yds",
                    "playByPlay": "373 yds to right rough, 156 yds to hole",
                    "toLocationCode": "ERR",
                    "fromLocationCode": "OTB",
                    "toLocation": "Right Rough",
                    "fromLocation": "Tee Box"
                },
                {
                    "strokeNumber": 2,
                    "finalStroke": false,
                    "distanceRemaining": "50 yds",
                    "playByPlay": "106 yds to fairway, 50 yds to hole",
                    "toLocationCode": "FWY",
                    "fromLocationCode": "ERR",
                    "toLocation": "Fairway",
                    "fromLocation": "Right Rough"
                }
                // ... more strokes
            ]
        }
    }
    // ... more hole records
]
  • Top-level keys: player_id, tournament_id, round_number, hole_details
  • hole_details: Contains hole_number, hole_score, hole_yardage, and a list of stroke_details
  • stroke_details: Each stroke is a dictionary with information about the shot, locations, and play-by-play description

This list of dictionaries contains all the data needed to perform our analysis.
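To get a feel for the nesting, here is a minimal sketch that walks one record and pulls out a few stroke-level fields. The record is a trimmed copy of the example above, and the variable names (record, summary) are only for illustration.

```python
# A trimmed copy of the example record shown above.
record = {
    "player_id": 30926,
    "tournament_id": "R2024016",
    "round_number": 1,
    "hole_details": {
        "hole_number": 1,
        "hole_score": "4",
        "hole_yardage": 532,
        "stroke_details": [
            {"strokeNumber": 1, "distanceRemaining": "156 yds",
             "fromLocation": "Tee Box", "toLocation": "Right Rough"},
            {"strokeNumber": 2, "distanceRemaining": "50 yds",
             "fromLocation": "Right Rough", "toLocation": "Fairway"},
        ],
    },
}

# Hole-level values sit one level down; strokes sit in a list below that.
hole = record["hole_details"]
summary = [
    (s["strokeNumber"], s["fromLocation"], s["toLocation"], s["distanceRemaining"])
    for s in hole["stroke_details"]
]

print(f"player {record['player_id']}, hole {hole['hole_number']}, score {hole['hole_score']}")
for row in summary:
    print(row)
```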

In [ ]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
import re 
from collections import defaultdict
from statistics import mean
from statsmodels.nonparametric.smoothers_lowess import lowess
import matplotlib.pyplot as plt
from pprint import pprint

utils.add_from_file('defaultdict_to_dict_recursive', utils)

Data Preprocessing (identify and filter problematic data)

The cell below loads raw_hole_details as described above. There are some problems with the data which need to be addressed. Many of the holes are missing key pieces of information. Additionally, many of the holes involve scenarios which would require more detailed knowledge of the rules of golf than what we described in the primer.

In the exercise below, you will write some code to identify these holes so we can filter them out and not consider them in the analysis.

In [ ]:
### Run Me!!!
raw_hole_details = utils.load_object_from_publicdata('raw_hole_details')

Exercise 0: (2 points)

identify_complex_hole

Your task: define identify_complex_hole as follows:

Analyze the details of a golf hole to identify specific conditions and data completeness.

Args:

  • hole_details (dict): A dictionary containing information about a golf hole, including 'hole_score' and a list of 'stroke_details'.
    • 'hole_score' (str): The score for the hole, or an empty string if not available.
    • 'stroke_details' (list): A list of dictionaries, each representing a stroke with keys:
      • 'strokeNumber' (int): The sequential number of the stroke.
      • 'distanceRemaining' (str): The remaining distance after the stroke, or an empty string.
      • 'playByPlay' (str, optional): A description of the stroke, may include keywords like 'penalty', 'drop', or 'provisional'.

Returns:

  • dict: A dictionary with boolean flags indicating the presence of specific conditions:
    • 'penalty' (bool): True if any stroke 'playByPlay' mentions a "penalty" (case-insensitive). Otherwise, False.
    • 'drop' (bool): True if any stroke 'playByPlay' mentions a "drop" (case-insensitive). Otherwise, False.
    • 'provisional' (bool): True if any stroke 'playByPlay' mentions a "provisional" (case-insensitive). Otherwise, False.
    • 'has_distances' (bool): True if any stroke has a non-empty 'distanceRemaining'. Otherwise, False.
    • 'strokes_in_sequence' (bool): True if the 'strokeNumber' values run 1, 2, 3, ... in list order with no gaps. Otherwise, False.
    • 'no_score' (bool): True if 'hole_score' is an empty string. Otherwise, False.

Note:

  • When 'stroke_details' is an empty list:
    • 'penalty', 'drop', 'provisional', 'has_distances', and 'strokes_in_sequence' are all False.

Implementation Notes

The starter code has a return statement with the correct format and sets all of the values to False. It is up to you to update them based on the content of hole_details.
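Two of the flags can be expressed compactly, as the sketch below suggests. It is one possible approach, not the required one. The helper names mentions and numbered_in_sequence are hypothetical, and the sequence check assumes that "sequential" means the stroke numbers run 1, 2, 3, ... in list order.

```python
def mentions(strokes, keyword):
    """True if any stroke's playByPlay contains keyword, ignoring case."""
    # 'playByPlay' is optional, so fall back to an empty string.
    return any(keyword in s.get("playByPlay", "").lower() for s in strokes)

def numbered_in_sequence(strokes):
    """True if stroke numbers are exactly 1..len(strokes), in order."""
    numbers = [s["strokeNumber"] for s in strokes]
    # An empty list is reported as False, matching the note above.
    return bool(numbers) and numbers == list(range(1, len(numbers) + 1))

strokes = [
    {"strokeNumber": 1, "distanceRemaining": "", "playByPlay": "First shot"},
    {"strokeNumber": 3, "distanceRemaining": "", "playByPlay": "Provisional ball needed"},
]
print(mentions(strokes, "provisional"), mentions(strokes, "penalty"))
print(numbered_in_sequence(strokes), numbered_in_sequence([]))
```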

In [ ]:
### Solution - Exercise 0  
def identify_complex_hole(hole_details):
    penalty = False
    drop = False
    provisional = False
    has_distances = False
    strokes_in_sequence = False
    no_score = False
    ### BEGIN SOLUTION
    if hole_details.get('stroke_details'):
        strokes_in_sequence = True
    no_score = hole_details.get('hole_score', '') == ''
    for i, stroke in enumerate(hole_details['stroke_details']):
        if stroke['strokeNumber'] != i + 1:
            strokes_in_sequence = False
        if stroke['distanceRemaining'] != '':
            has_distances = True
        play_by_play = stroke.get('playByPlay', '').lower()
        if 'penalty' in play_by_play:
            penalty = True
        if 'drop' in play_by_play:
            drop = True
        if 'provisional' in play_by_play:
            provisional = True
    ### END SOLUTION
    return {
        'penalty': penalty,
        'drop': drop,
        'provisional': provisional,
        'has_distances': has_distances,
        'strokes_in_sequence': strokes_in_sequence,
        'no_score': no_score
    }

### Demo function call
test_holes = [
    {
        'hole_score': '4',
        'stroke_details': [
            {'strokeNumber': 1, 'distanceRemaining': '156 yds', 'playByPlay': '373 yds to fairway'},
            {'strokeNumber': 2, 'distanceRemaining': '33 ft', 'playByPlay': '165 yds to green'},
            {'strokeNumber': 3, 'distanceRemaining': '', 'playByPlay': 'In the hole'}
        ]
    },
    {
        'hole_score': '5',
        'stroke_details': [
            {'strokeNumber': 1, 'distanceRemaining': '', 'playByPlay': 'Tee shot with penalty'},
            {'strokeNumber': 2, 'distanceRemaining': '180 yds', 'playByPlay': 'After drop to fairway'}
        ]
    },
    {
        'hole_score': '',
        'stroke_details': [
            {'strokeNumber': 1, 'distanceRemaining': '', 'playByPlay': 'First shot'},
            {'strokeNumber': 3, 'distanceRemaining': '', 'playByPlay': 'Provisional ball needed'}
        ]
    }
]

results = []
for i, hole in enumerate(test_holes):
    result = identify_complex_hole(hole)
    print(f"identify_complex_hole(test_holes[{i}])")
    print(f"--> {result}")
    results.append(result)

The demo should display this printed output.

identify_complex_hole(test_holes[0])
--> {'penalty': False, 'drop': False, 'provisional': False, 'has_distances': True, 'strokes_in_sequence': True, 'no_score': False}
identify_complex_hole(test_holes[1])
--> {'penalty': True, 'drop': True, 'provisional': False, 'has_distances': True, 'strokes_in_sequence': True, 'no_score': False}
identify_complex_hole(test_holes[2])
--> {'penalty': False, 'drop': False, 'provisional': True, 'has_distances': False, 'strokes_in_sequence': False, 'no_score': True}


The cell below will test your solution for identify_complex_hole (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 0  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=identify_complex_hole,
              ex_name='identify_complex_hole',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=103)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to identify_complex_hole did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=identify_complex_hole,
              ex_name='identify_complex_hole',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=103,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to identify_complex_hole did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Application of identify_complex_hole

We used a correct implementation of identify_complex_hole in this code snippet to produce simple_hole_details:

simple_hole_details = []
for hole in raw_hole_details:
    hole_analysis = identify_complex_hole(hole['hole_details'])
    keep_hole = (
        not hole_analysis['penalty'] and
        not hole_analysis['drop'] and
        not hole_analysis['no_score'] and
        hole_analysis['has_distances'] and
        hole_analysis['strokes_in_sequence'] and
        not hole_analysis['provisional']
    )
    if keep_hole:
        simple_hole_details.append(hole)

The pre-computed result is loaded in the cell below.

In [ ]:
### Run Me!!!
simple_hole_details = utils.load_object_from_publicdata('simple_hole_details')

Data preprocessing (re-structure data)

The data in simple_hole_details has the same structure as raw_hole_details - a nested structure where the information about each individual shot is contained within a "hole-level" dictionary. See the example below:

{
    'player_id': 30926,
    'tournament_id': 'R2024016',
    'round_number': 1,
    'hole_details': {
        'hole_number': 1,
        'hole_score': '4',
        'hole_yardage': 532,
        'stroke_details': [
            {
                'strokeNumber': 1,
                'finalStroke': False,
                'distanceRemaining': '156 yds',
                'playByPlay': '373 yds to right rough, 156 yds to hole',
                'toLocationCode': 'ERR',
                'fromLocationCode': 'OTB',
                'toLocation': 'Right Rough',
                'fromLocation': 'Tee Box'
            },
            ... # three more shot dictionaries for the same player, tournament, round, and hole
        ]
    }
}

We want to "flatten" it into a list of simple (non-nested) dictionaries, where each dictionary represents a single shot. Each individual shot will contribute to the baseline and eventually have a strokes gained value calculated.

Exercise 1: (2 points)

extract_shot_records

Your task: define extract_shot_records as follows:

Flattens the nested dictionary structure of individual_hole_data into a list of simple dictionaries.

Args:

  • individual_hole_data (dict): A dictionary with the following keys:
    • 'player_id' (int): Unique identifier for the player.
    • 'tournament_id' (str): Unique identifier for the tournament.
    • 'round_number' (int): The round number within the tournament.
    • 'hole_details' (dict): Contains:
      • 'hole_number' (int): The number of the hole.
      • 'hole_score' (str): The score for the hole.
      • 'hole_yardage' (int): The yardage of the hole.
      • 'stroke_details' (list of dict): Each dict represents one stroke and contains the following keys:
        • 'strokeNumber' (int): The stroke number.
        • 'fromLocation' (str): The starting lie/location of the shot.
        • 'toLocation' (str): The ending lie/location of the shot.
        • 'distanceRemaining' (str): Distance remaining after the shot.
        • 'finalStroke' (bool): Whether this stroke finished the hole.

Returns:

  • list of dict: A list where each element is a dictionary representing a shot record with the following keys:
    • 'player_id' (int) - comes directly from individual_hole_data['player_id']
    • 'tournament_id' (str) - comes directly from individual_hole_data['tournament_id']
    • 'round_number' (int) - comes directly from individual_hole_data['round_number']
    • 'hole_number' (int) - comes directly from hole_details['hole_number']
    • 'score' (str) - comes directly from hole_details['hole_score']
    • 'yardage' (str) - hole_details['hole_yardage'] converted to a string
    • 'stroke_number' (int) - comes from stroke['strokeNumber']
    • 'strokes_to_hole' (int) - calculated as the score (converted to an integer) minus the stroke_number plus one
    • 'start_distance' (str) - the starting distance for the stroke:
      • for the first stroke, it is the hole's yardage
      • for subsequent strokes, it is the 'distanceRemaining' value from the previous stroke
    • 'start_lie' (str) - comes directly from stroke['fromLocation']
    • 'end_lie' (str) - directly from stroke['toLocation'] if available, otherwise 'Hole'
      • In other words, if stroke['toLocation'] is an empty string (''), the 'end_lie' value is 'Hole'.
    • 'end_distance' (str) - stroke['distanceRemaining'] if available, otherwise '0'
      • In other words, if stroke['distanceRemaining'] is an empty string (''), the 'end_distance' value is '0'.

Implementation Notes

  • The input individual_hole_data represents one golfer playing one hole a single time.
  • The result will have one dictionary for each stroke in individual_hole_data['hole_details']['stroke_details'].
  • Each result dictionary's "stroke-level" values (those derived from a stroke) will vary with each stroke dictionary in the stroke details.
  • Each result dictionary's "hole-level" values (those not derived from a stroke) will be constant.
  • Empty strings as the end lie and distance indicate that the shot ends with the ball inside the hole.
  • The starter code initializes/returns shot_records and extracts the hole-level values. It's up to you to populate shot_records.
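The two derived values, 'start_distance' and 'strokes_to_hole', are the parts most likely to trip you up. The sketch below isolates them; it is an illustration under the rules stated above, and the helper names start_distances and strokes_to_hole are hypothetical.

```python
def start_distances(hole_yardage, stroke_details):
    """Return the starting distance string for each stroke, in order."""
    ends = [s["distanceRemaining"] for s in stroke_details]
    # The first shot starts at the hole yardage; each later shot starts
    # where the previous shot ended. The final end distance is never a start.
    starts = [str(hole_yardage)] + ends
    return starts[:len(ends)]

def strokes_to_hole(hole_score, stroke_number):
    """Strokes still needed, counting the current one; hole_score is a string."""
    return int(hole_score) - stroke_number + 1

strokes = [
    {"strokeNumber": 1, "distanceRemaining": "156 yds"},
    {"strokeNumber": 2, "distanceRemaining": "33 ft 7 in."},
    {"strokeNumber": 3, "distanceRemaining": "7 in"},
    {"strokeNumber": 4, "distanceRemaining": ""},
]
print(start_distances(532, strokes))
print([strokes_to_hole("4", s["strokeNumber"]) for s in strokes])
```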
In [ ]:
### Solution - Exercise 1  
def extract_shot_records(individual_hole_data):
    shot_records = []
    hole_details = individual_hole_data['hole_details']
    stroke_details = hole_details['stroke_details']
    hole_level_values = {
            'player_id': individual_hole_data['player_id'],
            'tournament_id': individual_hole_data['tournament_id'],
            'round_number': individual_hole_data['round_number'],
            'hole_number': hole_details['hole_number'],
            'score': hole_details['hole_score'],
            'yardage': str(hole_details['hole_yardage']),
    }
    ### BEGIN SOLUTION
    for i, stroke in enumerate(stroke_details):
        if i == 0:
            start_distance = str(hole_details['hole_yardage'])
        else:
            start_distance = stroke_details[i-1]['distanceRemaining']
        
        end_distance = stroke['distanceRemaining'] if stroke['distanceRemaining'] else '0'
        end_lie = stroke['toLocation'] if stroke['toLocation'] else 'Hole'
        
        shot_records.append({
            **hole_level_values,
            'stroke_number': stroke['strokeNumber'],
            'strokes_to_hole': int(hole_details['hole_score']) - stroke['strokeNumber'] + 1,
            'start_distance': start_distance,
            'start_lie': stroke.get('fromLocation', ''),
            'end_lie': end_lie,
            'end_distance': end_distance,
        })
    ### END SOLUTION
    return shot_records

### Demo function call
sample_individual_hole_data = {
    'hole_details': {
        'hole_number': 1,
        'hole_score': '4',
        'hole_yardage': 532,
        'stroke_details': [
            {'distanceRemaining': '156 yds', 'strokeNumber': 1, 'fromLocation': 'Tee Box', 'toLocation': 'Right Rough'},
            {'distanceRemaining': '33 ft 7 in.', 'strokeNumber': 2, 'fromLocation': 'Primary Rough', 'toLocation': 'Right Intermediate'},
            {'distanceRemaining': '7 in', 'strokeNumber': 3, 'fromLocation': 'Intermediate Rough', 'toLocation': 'Green'},
            {'distanceRemaining': '', 'strokeNumber': 4, 'fromLocation': 'Green', 'toLocation': ''}
        ]
    },
    'player_id': 30926,
    'round_number': 1,
    'tournament_id': 'R2024016'
}

result = extract_shot_records(sample_individual_hole_data)
pprint(result)

The demo should display this printed output.

[{'end_distance': '156 yds',
  'end_lie': 'Right Rough',
  'hole_number': 1,
  'player_id': 30926,
  'round_number': 1,
  'score': '4',
  'start_distance': '532',
  'start_lie': 'Tee Box',
  'stroke_number': 1,
  'strokes_to_hole': 4,
  'tournament_id': 'R2024016',
  'yardage': '532'},
 {'end_distance': '33 ft 7 in.',
  'end_lie': 'Right Intermediate',
  'hole_number': 1,
  'player_id': 30926,
  'round_number': 1,
  'score': '4',
  'start_distance': '156 yds',
  'start_lie': 'Primary Rough',
  'stroke_number': 2,
  'strokes_to_hole': 3,
  'tournament_id': 'R2024016',
  'yardage': '532'},
 {'end_distance': '7 in',
  'end_lie': 'Green',
  'hole_number': 1,
  'player_id': 30926,
  'round_number': 1,
  'score': '4',
  'start_distance': '33 ft 7 in.',
  'start_lie': 'Intermediate Rough',
  'stroke_number': 3,
  'strokes_to_hole': 2,
  'tournament_id': 'R2024016',
  'yardage': '532'},
 {'end_distance': '0',
  'end_lie': 'Hole',
  'hole_number': 1,
  'player_id': 30926,
  'round_number': 1,
  'score': '4',
  'start_distance': '7 in',
  'start_lie': 'Green',
  'stroke_number': 4,
  'strokes_to_hole': 1,
  'tournament_id': 'R2024016',
  'yardage': '532'}]


The cell below will test your solution for extract_shot_records (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 1  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=extract_shot_records,
              ex_name='extract_shot_records',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to extract_shot_records did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=extract_shot_records,
              ex_name='extract_shot_records',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to extract_shot_records did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Application of extract_shot_records

We used a correct implementation of extract_shot_records to build raw_shot_records with the code snippet below.

raw_shot_records = []
for individual_hole_data in simple_hole_details:
    hole_records = extract_shot_records(individual_hole_data)
    raw_shot_records.extend(hole_records)

It is loaded in a code cell later in the exam, close to where it is used in the analysis.

Data preprocessing (standardize distances)

Distance is a continuous measure, but we want to compute expected values based on these distances. In golf, there is not much difference in difficulty between a 220 yard shot and a 221 yard shot. It makes sense to treat shots within a small distance interval as having the same distance (e.g., treat all shots where $220$ yards $\lt$ distance $\le 230$ yards as being 230 yards from the hole, since values are rounded up). Pooling shots this way gives each mean more observations, which reduces noise in the estimates.

Additionally, the distances in all the data we have seen so far are given as strings containing a value and a unit. We can't do math with strings, so we will need to separate the value and unit.

In the exercise below you will parse a distance string into its value and unit, and use rounding to "bin" the values.
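Binning by rounding up can be sketched in a couple of lines. The helper name bin_up is hypothetical (the exercise's starter code provides its own round_up helper); the point is only to show which distances land in which bin when the interval is 10.

```python
from math import ceil

def bin_up(value, interval):
    """Round value up to the next multiple of interval."""
    return ceil(value / interval) * interval

# A value already on a multiple of the interval stays where it is;
# anything above it moves up to the next multiple.
bins = {d: bin_up(d, 10) for d in (220, 221, 229, 230, 231)}
print(bins)
```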

Exercise 2: (3 points)

parse_distance

Your task: define parse_distance as follows:

Parses a distance string and rounds it up to the nearest interval in yards or feet.

The function supports the following input formats:

  • "X yds":
    • Yards (e.g., "10 yds")
  • "X":
    • Yards without unit (e.g., "10")
  • "X ft Y in":
    • Feet and inches (e.g., "5 ft 6 in")
  • "X ft":
    • Feet only (e.g., "12 ft")
  • "X in":
    • Inches only (e.g., "18 in")
  • "0" or "":
    • Treated as 0 feet

The parsed value is rounded up to the nearest multiple of the specified interval.

Args:

  • distance_str (str): The distance string to parse.
  • yards_interval (int): The interval to which yards values are rounded up.
  • feet_interval (int): The interval to which feet values are rounded up.

Returns:

  • tuple: A tuple (value, unit), where value is the normalized distance (int)
      and unit is either 'yds' or 'ft'.

Raises:

  • ValueError: If the input string does not match any supported format.

Implementation Notes

  • X and Y in the patterns above are whole numbers.
  • 1 foot = 12 inches.
  • The starter code handles cases where distance_str is empty or '0'.
  • The starter code provides a function round_up. This can be used to round up to the next interval as required by this exercise.
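The feet-and-inches format is the only one that needs arithmetic before rounding. The sketch below shows one way to handle just that format with named groups. The pattern, the helper name feet_and_inches_to_feet, and the use of fullmatch are illustrative choices, not requirements of the exercise.

```python
import re

# Whole-number feet and inches, e.g. "5 ft 6 in".
FT_IN = re.compile(r"(?P<feet>\d+)\s*ft\s*(?P<inches>\d+)\s*in")

def feet_and_inches_to_feet(text):
    """Convert an 'X ft Y in' string to a (possibly fractional) number of feet."""
    m = FT_IN.fullmatch(text)  # fullmatch rejects leading or trailing extras
    if m is None:
        raise ValueError(f"Invalid distance format: {text}")
    return int(m["feet"]) + int(m["inches"]) / 12  # 1 foot = 12 inches

print(feet_and_inches_to_feet("5 ft 6 in"))
print(feet_and_inches_to_feet("30 ft 6 in"))
```

The fractional result would then be rounded up to the requested feet interval, which is why an input such as '30 ft 6 in' with an interval of 1 ends up as 31 feet.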
In [ ]:
### Solution - Exercise 2  
def parse_distance(distance_str, yards_interval, feet_interval):
    from math import ceil
    
    if distance_str in ('0', ''):
        return 0, 'ft'

    def round_up(value, interval):
        if isinstance(value, str):
            value = int(value)
        return ceil(value / interval) * interval
    ### BEGIN SOLUTION

    yds_match = re.match(r'^(?P<yards>\d+)\s*yds$', distance_str)
    no_unit_match = re.match(r'^(?P<yards>\d+)$', distance_str)
    ft_in_match = re.match(r'^(?P<feet>\d+)\s*ft\s*(?P<inches>\d+)\s*in$', distance_str)
    ft_match = re.match(r'^(?P<feet>\d+)\s*ft$', distance_str)
    in_match = re.match(r'^(?P<inches>\d+)\s*in$', distance_str)

    value, unit = None, None
    if yds_match:
        value, unit = yds_match.group('yards'), 'yds'
    elif no_unit_match:
        value, unit = no_unit_match.group('yards'), 'yds'
    elif ft_in_match:
        value, unit = int(ft_in_match.group('feet')) + int(ft_in_match.group('inches'))/12, 'ft'
    elif ft_match:
        value, unit = ft_match.group('feet'), 'ft'
    elif in_match:
        value, unit = int(in_match.group('inches'))/12, 'ft'
    else:
        raise ValueError(f"Invalid distance format: {distance_str}")

    if unit == 'yds':
        interval = yards_interval
    elif unit == 'ft':
        interval = feet_interval

    return round_up(value, interval), unit
    ### END SOLUTION

### Demo function call
test_cases = [
    ('167 yds', 10, 1),
    ('30 ft 6 in', 10, 1), 
    ('3 ft 4 in', 5, 2),
    ('422', 25, 5),
    ('15 ft', 10, 3),
    ('8 in', 10, 1),
    ('150 meters', 10, 1),
    ('16.5 yds', 10, 1)
]

results = []
for i, (distance_str, yards_interval, feet_interval) in enumerate(test_cases):
    try:
        result = parse_distance(distance_str, yards_interval, feet_interval)
        print(f"parse_distance(test_cases[{i}][0], test_cases[{i}][1], test_cases[{i}][2])")
        print(f"--> {result}")
        results.append(result)
    except ValueError as e:
        print(f"parse_distance(test_cases[{i}][0], test_cases[{i}][1], test_cases[{i}][2])")
        print(f"--> ValueError: {e}")
        results.append(f"ValueError: {e}")

The demo should display this printed output.

parse_distance(test_cases[0][0], test_cases[0][1], test_cases[0][2])
--> (170, 'yds')
parse_distance(test_cases[1][0], test_cases[1][1], test_cases[1][2])
--> (31, 'ft')
parse_distance(test_cases[2][0], test_cases[2][1], test_cases[2][2])
--> (4, 'ft')
parse_distance(test_cases[3][0], test_cases[3][1], test_cases[3][2])
--> (425, 'yds')
parse_distance(test_cases[4][0], test_cases[4][1], test_cases[4][2])
--> (15, 'ft')
parse_distance(test_cases[5][0], test_cases[5][1], test_cases[5][2])
--> (1, 'ft')
parse_distance(test_cases[6][0], test_cases[6][1], test_cases[6][2])
--> ValueError: Invalid distance format: 150 meters
parse_distance(test_cases[7][0], test_cases[7][1], test_cases[7][2])
--> ValueError: Invalid distance format: 16.5 yds


The cell below will test your solution for parse_distance (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 2  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.error_handler(parse_distance),
              ex_name='parse_distance',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to parse_distance did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.error_handler(parse_distance),
              ex_name='parse_distance',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to parse_distance did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Data preprocessing (standardize lies)

Our shot data has many diverse start and end lies. These include "right/left" designations, descriptive names for various bad lies particular to a specific course, etc. We want to simplify these into only a few categories for our analysis (Tee, Fairway, Rough, Bunker, Green, Recovery, and Hole).

The cell below loads mappings for all observed start and end lies in the data to one of the seven standardized lies mentioned above.

It also loads the data prepared with the logic shown in the earlier cell, Application of extract_shot_records.

In the next exercise you will use provided mappings to standardize the start and end lies for a single shot.

In [ ]:
### Run Me!!!
start_lie_map = utils.load_object_from_publicdata('start_lie_map')
end_lie_map = utils.load_object_from_publicdata('end_lie_map')
raw_shot_records = utils.load_object_from_publicdata('raw_shot_records')

Exercise 3: (1 point)

standardize_lie

Your task: define standardize_lie as follows:

Standardizes the lie of a golf shot based on the provided mappings.

Args:

  • record (dict): A dictionary containing the 'start_lie' and 'end_lie' keys. It may contain other keys as well.
  • start_lie_map (dict): A mapping of starting lie values to standardized lie values.
  • end_lie_map (dict): A mapping of ending lie values to standardized lie values.

Returns:

  • dict: The record with 'start_lie' and 'end_lie' replaced with their standardized values. A new dictionary is returned, leaving the original record unchanged.

Note:

  • You can assume that the 'start_lie' and 'end_lie' values in the record will always be present in the provided maps.
In [ ]:
### Solution - Exercise 3  
def standardize_lie(record, start_lie_map, end_lie_map):
    ### BEGIN SOLUTION
    start_lie = start_lie_map[record['start_lie']]
    end_lie = end_lie_map[record['end_lie']]

    return {
        **record,
        'start_lie': start_lie,
        'end_lie': end_lie
    }
    ### END SOLUTION

### Demo function call
test_records = [
    {'start_lie': 'Tee Box', 'end_lie': 'Right Fairway', 'foo': 'bar'},
    {'start_lie': 'Primary Rough', 'end_lie': 'Green', 'foo': 'bar'},
    {'start_lie': 'Fairway', 'end_lie': 'Hole', 'foo': 'bar'},
    {'start_lie': 'Green', 'end_lie': 'Hole', 'foo': 'bar'}
]

sample_start_lie_map = {
    'Tee Box': 'Tee',
    'Primary Rough': 'Rough',
    'Fairway': 'Fairway',
    'Green': 'Green'
}

sample_end_lie_map = {
    'Right Fairway': 'Fairway',
    'Green': 'Green',
    'Hole': 'Hole'
}

results = []
for i, record in enumerate(test_records):
    result = standardize_lie(record, sample_start_lie_map, sample_end_lie_map)
    print(f"standardize_lie(test_records[{i}], sample_start_lie_map, sample_end_lie_map)")
    print(f"--> {result}")
    results.append(result)

The demo should display this printed output.

standardize_lie(test_records[0], sample_start_lie_map, sample_end_lie_map)
--> {'start_lie': 'Tee', 'end_lie': 'Fairway', 'foo': 'bar'}
standardize_lie(test_records[1], sample_start_lie_map, sample_end_lie_map)
--> {'start_lie': 'Rough', 'end_lie': 'Green', 'foo': 'bar'}
standardize_lie(test_records[2], sample_start_lie_map, sample_end_lie_map)
--> {'start_lie': 'Fairway', 'end_lie': 'Hole', 'foo': 'bar'}
standardize_lie(test_records[3], sample_start_lie_map, sample_end_lie_map)
--> {'start_lie': 'Green', 'end_lie': 'Hole', 'foo': 'bar'}


The cell below will test your solution for standardize_lie (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 3  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=standardize_lie,
              ex_name='standardize_lie',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to standardize_lie did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=standardize_lie,
              ex_name='standardize_lie',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to standardize_lie did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Application of parse_distance and standardize_lie

We used correct implementations of parse_distance and standardize_lie to standardize raw_shot_records with this code snippet.

standardized_shot_records = []
for record in raw_shot_records:
    standardized_record = standardize_lie(record, start_lie_map, end_lie_map)
    standardized_record['start_distance'], standardized_record['start_unit'] = parse_distance(standardized_record['start_distance'], 10, 1)
    standardized_record['end_distance'], standardized_record['end_unit'] = parse_distance(standardized_record['end_distance'], 10, 1)
    standardized_shot_records.append(standardized_record)

The code cell below loads the result into the environment.

In [ ]:
### Run Me!!!
standardized_shot_records = utils.load_object_from_publicdata('standardized_shot_records')

Baseline calculation (calculate means)

The standardized_shot_records is ready to work with, and it's time to calculate the expected value (mean strokes to hole) for each distance and lie.

There may be some anomalies where only a few observations occurred for a particular combination. We do not want to include these in our calculations because they may not illustrate the true difficulty of those shots.

Exercise 4: (0 points)

calculate_baseline

Example: we have defined calculate_baseline as follows:

This is an example. You do not need to implement anything here.

Calculates baseline average strokes to hole for each unique combination of start lie, distance unit, and distance. Groups shots by their starting lie, distance unit, and distance, then computes the mean strokes to hole for each group that meets or exceeds the specified minimum count. The result is a nested dictionary structure with rounded mean values.

Args:

  • shots_lod (list of dict): List of shot dictionaries, each containing at least the following keys:
    • 'start_lie' (str): The starting lie of the shot (e.g., "Fairway").
    • 'start_unit' (str): The unit of measurement for the starting distance (e.g., "yds" or "ft").
    • 'start_distance' (int): The starting distance to the hole (e.g., 150).
    • 'strokes_to_hole' (int): The number of strokes taken to complete the hole from the starting position.
  • min_count (int): Minimum number of shots required in a group with the same starting lie, distance unit, and distance to compute a baseline value.

Returns:

  • dict: Nested dictionary with structure baseline[lie][unit][distance] = mean strokes to hole (float, rounded to 3 decimals) for each lie, distance unit, and distance meeting the minimum count criterion.

Implementation Notes

In [ ]:
### Solution - Exercise 4  
def calculate_baseline(shots_lod, min_count):
    observations = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    # observations[lie][unit][distance] = list of strokes_to_hole observations
    # e.g., observations['Fairway']['yds'][170] = [3, 4, 2, 5, 3]
    # This is used to accumulate all strokes_to_hole values for each unique (lie, unit, distance) combination.

    baseline = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    # baseline[lie][unit][distance] = mean strokes_to_hole (rounded to 3 decimals)
    # e.g., baseline['Fairway']['yds'][170] = 3.5
    # This will store the final computed mean values for each (lie, unit, distance) combination.

    # populate observations
    for shot in shots_lod:
        lie = shot['start_lie']
        unit = shot['start_unit']
        dist = shot['start_distance']
        observations[lie][unit][dist].append(shot['strokes_to_hole'])

    # compute baseline means for groups meeting min_count
    for lie, unit_dict in observations.items():
        for unit, dist_dict in unit_dict.items():
            for dist, strokes in dist_dict.items():
                n = len(strokes)
                if n >= min_count:
                    baseline[lie][unit][dist] = round(float(mean(strokes)), 3)

    # Convert defaultdicts to regular dicts for the final output
    # You won't need to do this if you prefer to return defaultdicts.
    baseline = utils.defaultdict_to_dict_recursive(baseline)
    return baseline

### Demo function call
test_scenarios = [
    {
        'shots': [
            {'start_lie': 'Fairway', 'start_unit': 'yds', 'start_distance': 150, 'strokes_to_hole': 3},
            {'start_lie': 'Fairway', 'start_unit': 'yds', 'start_distance': 150, 'strokes_to_hole': 4},
            {'start_lie': 'Fairway', 'start_unit': 'yds', 'start_distance': 150, 'strokes_to_hole': 2},
            {'start_lie': 'Fairway', 'start_unit': 'yds', 'start_distance': 150, 'strokes_to_hole': 3},
            {'start_lie': 'Fairway', 'start_unit': 'yds', 'start_distance': 150, 'strokes_to_hole': 4}
        ],
        'min_count': 3
    },
    {
        'shots': [
            {'start_lie': 'Green', 'start_unit': 'ft', 'start_distance': 10, 'strokes_to_hole': 1},
            {'start_lie': 'Green', 'start_unit': 'ft', 'start_distance': 10, 'strokes_to_hole': 2},
            {'start_lie': 'Green', 'start_unit': 'ft', 'start_distance': 10, 'strokes_to_hole': 1},
            {'start_lie': 'Green', 'start_unit': 'ft', 'start_distance': 10, 'strokes_to_hole': 1},
            {'start_lie': 'Tee', 'start_unit': 'yds', 'start_distance': 400, 'strokes_to_hole': 4},
            {'start_lie': 'Tee', 'start_unit': 'yds', 'start_distance': 400, 'strokes_to_hole': 5},
            {'start_lie': 'Rough', 'start_unit': 'yds', 'start_distance': 100, 'strokes_to_hole': 3},
            {'start_lie': 'Rough', 'start_unit': 'yds', 'start_distance': 100, 'strokes_to_hole': 4},
            {'start_lie': 'Rough', 'start_unit': 'yds', 'start_distance': 100, 'strokes_to_hole': 3},
            {'start_lie': 'Rough', 'start_unit': 'yds', 'start_distance': 100, 'strokes_to_hole': 2}
        ],
        'min_count': 4
    }
]

results = []
for i, scenario in enumerate(test_scenarios):
    result = calculate_baseline(scenario['shots'], scenario['min_count'])
    print(f"calculate_baseline(test_scenarios[{i}]['shots'], test_scenarios[{i}]['min_count'])")
    print(f"--> {result}")
    results.append(result)

pprint(results)


The test cell below will always pass. Please submit to collect your free points for calculate_baseline (exercise 4).

In [ ]:
### Test Cell - Exercise 4  


print('Passed! Please submit.')

Application of calculate_baseline

A correct implementation of calculate_baseline was used in the following code snippet.

raw_baseline = calculate_baseline(standardized_shot_records, 40)
In [ ]:
### Run Me!!!
raw_baseline = utils.load_object_from_publicdata('raw_baseline')

Baseline Calculation (interpolate missing values)

Since we filtered out shot/lie combinations without enough observations, there will be gaps in our baseline. As it stands, we can't calculate strokes gained for those shots, or for any new shots that fall in the gaps. We will resolve this with linear interpolation.

More formally,

  • let $x_k$ and $y_k$ be the distance and expected strokes at the $k$-th interval.
  • let $d$ be the length of the distance interval

If there are $n-1$ missing observations between the observation, $(x_k, y_k)$, and the next observation $(x_{k+n}, y_{k+n})$, then:

  • calculate the slope, $m = \frac{y_{k+n} - y_k}{x_{k+n} - x_k}$
  • calculate the interpolated distance and expected strokes for every $i$ between $1$ and $n-1$ as $(x_{k+i}, y_{k+i}) = (x_k + i \cdot d, y_k + m \cdot i \cdot d)$
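
As a worked instance of the formulas above, the sketch below fills the gap between two observations at distances 10 and 30 with interval $d = 5$. The numbers and variable names are illustrative assumptions, not values from the data set.

```python
# Two neighbouring observations (x_k, y_k) and (x_{k+n}, y_{k+n}); values are made up.
x_k, y_k = 10, 1.5
x_kn, y_kn = 30, 2.8
d = 5  # length of the distance interval

n = (x_kn - x_k) // d            # 4 intervals, so n - 1 = 3 missing observations
m = (y_kn - y_k) / (x_kn - x_k)  # slope: about 0.065 strokes per unit of distance

# Interpolate only the missing points, i = 1 .. n - 1.
filled = [(x_k + i * d, round(y_k + m * i * d, 3)) for i in range(1, n)]
print(filled)  # [(15, 1.825), (20, 2.15), (25, 2.475)]
```

The two observed endpoints are left untouched; only the three interior distances receive interpolated values.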

Exercise 5: (3 points)

interpolate_distances

Your task: define interpolate_distances as follows:

Interpolates values between given distances at a specified interval. Given a dictionary mapping distances to values, this function generates a list of (distance, value) pairs, including interpolated values at regular intervals between the original distances. The interpolation is linear between each pair of consecutive distances. All values are rounded to three decimal places.

Example:

  • Given distances {10: 1.5, 30: 2.8} with interval=5, the gap between 10 and 30 is 20 units.
  • Since 20 > 5, interpolation occurs at distances 15, 20, and 25 (creating 3 new points).
  • If the gap were ≤5, no interpolation would occur between those points.

Args:

  • distance_dict (dict): Dictionary mapping distances (int) to expected strokes to hole values (float).
  • interval (int): The interval at which to interpolate values between distances.

Returns:

  • list of tuple: List of (distance (int), value (float)) pairs, including original and interpolated points.

Implementation Notes:

  • Given two points, $(x_0, y_0)$ and $(x_1, y_1)$, you can calculate the slope, $m$, of the line connecting them using the formula:
$$m = \frac{y_1-y_0}{x_1-x_0}$$
  • Given an interval, $d$, you can calculate a list of points, $(x, y)$, on the line between them with:
    • $x = (x_0 + i \cdot d)$
    • $y = (y_0 + m \cdot i \cdot d)$
    • Where $i$ is an integer such that $x_0 \le x \lt x_1$. Including $x_0$ and excluding $x_1$ is a crucial detail.
  • If you iterate over the consecutive pairs of distance/expected_strokes "points"
    • You can calculate the list of points for every pair. Even if the only valid $i$ is $i=0$.
    • You can combine them together to get the full list of observed and interpolated points.
  • After iterating, make sure the last observation is in the result.
  • The grading accounts for floating point error. Mathematically correct implementations will pass.
In [ ]:
### Solution - Exercise 5  
def interpolate_distances(distance_dict, interval):
    ### BEGIN SOLUTION
    interpolated = []
    distances = sorted(distance_dict.keys())

    for x0, x1 in zip(distances[:-1], distances[1:]):
        y0 = distance_dict[x0]
        y1 = distance_dict[x1]
        
        n_steps = (x1 - x0) // interval
        for i in range(n_steps): 
            x = x0 + i * interval
            y = y0 + ((y1 - y0) / (x1 - x0)) * i * interval
            interpolated.append((x, round(y, 3))) 

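    # Ensure the last observed point is included.
    # Indexing `distances` (rather than reusing loop variables) also handles a single-point input.
    if distances:
        x_last = distances[-1]
        interpolated.append((x_last, round(distance_dict[x_last], 3)))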

    return interpolated
    ### END SOLUTION

### Demo function call
test_scenarios = [
    {
        'distance_dict': {0: 0, 5: 10, 15: 30},
        'interval': 5
    },
    {
        'distance_dict': {28: 2.538, 30: 2.308, 31: 2.384, 32: 2.414, 33: 2.444, 34: 2.455, 35: 2.44, 36: 2.385, 37: 2.542, 38: 2.487},
        'interval': 1
    },
    {
        'distance_dict': {10: 1.5, 30: 2.8, 60: 4.2},
        'interval': 10
    }
]

results = []
for i, scenario in enumerate(test_scenarios):
    result = interpolate_distances(scenario['distance_dict'], scenario['interval'])
    print(f"interpolate_distances(test_scenarios[{i}]['distance_dict'], test_scenarios[{i}]['interval'])")
    print(f"--> {result}")
    results.append(result)

The demo should display this printed output.

interpolate_distances(test_scenarios[0]['distance_dict'], test_scenarios[0]['interval'])
--> [(0, 0.0), (5, 10.0), (10, 20.0), (15, 30)]
interpolate_distances(test_scenarios[1]['distance_dict'], test_scenarios[1]['interval'])
--> [(28, 2.538), (29, 2.423), (30, 2.308), (31, 2.384), (32, 2.414), (33, 2.444), (34, 2.455), (35, 2.44), (36, 2.385), (37, 2.542), (38, 2.487)]
interpolate_distances(test_scenarios[2]['distance_dict'], test_scenarios[2]['interval'])
--> [(10, 1.5), (20, 2.15), (30, 2.8), (40, 3.267), (50, 3.733), (60, 4.2)]


The cell below will test your solution for interpolate_distances (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 5  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=interpolate_distances,
              ex_name='interpolate_distances',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to interpolate_distances did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=interpolate_distances,
              ex_name='interpolate_distances',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to interpolate_distances did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Application of interpolate_distances

We used a correct implementation of interpolate_distances in the code snippet below to create interpolated_baseline.

interpolated_baseline = {}
for lie, distance_dict in raw_baseline.items():
    interpolated_baseline[lie] = {}
    for unit, distances in distance_dict.items():
        if unit == 'yds':
            interval = 10
        else:
            interval = 1
        interpolated_distances = interpolate_distances(distances, interval)
        interpolated_baseline[lie][unit] = dict(interpolated_distances)

Now there are no gaps, but the data is somewhat noisy. Run the code below to see it on scatterplots.

In [ ]:
interpolated_baseline = utils.load_object_from_publicdata('interpolated_baseline')

# Plot for yards units (Interpolated Baseline)
plt.figure(figsize=(12, 6))
for lie, units in interpolated_baseline.items():
    if 'yds' in units:
        distances, values = zip(*sorted(units['yds'].items()))
        plt.scatter(distances, values, s=20, alpha=0.7, label=f"{lie} (points)")
plt.title('Interpolated Baseline by Lie (Yards)')
plt.xlabel('Distance (yds)')
plt.ylabel('Interpolated Baseline')
plt.legend()
plt.show()

# Plot for feet units (Interpolated Baseline)
plt.figure(figsize=(12, 6))
for lie, units in interpolated_baseline.items():
    if 'ft' in units:
        distances, values = zip(*sorted(units['ft'].items()))
        plt.scatter(distances, values, s=20, alpha=0.7, label=f"{lie} (points)")
plt.title('Interpolated Baseline by Lie (Feet)')
plt.xlabel('Distance (ft)')
plt.ylabel('Interpolated Baseline')
plt.legend()
plt.show()

Baseline calculation (smoothing)

Generally, we want the baseline to indicate that a shot is "easier" the closer it is to the hole with all else equal. (i.e. longer shots have higher expected strokes than shorter shots.) That's not the case for our baseline so far because it's still noisy. We want to apply an algorithm to smooth out our baseline.
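
One way to quantify the noise is to count adjacent distance pairs where the longer distance has a lower expected value. The helper name `count_monotonic_violations` is an assumption made for illustration and is not part of the notebook; the sample values are borrowed from the Exercise 5 demo output.

```python
def count_monotonic_violations(distance_dict):
    """Count adjacent distance pairs where expected strokes decrease as distance grows."""
    points = sorted(distance_dict.items())
    return sum(1 for (_, y0), (_, y1) in zip(points, points[1:]) if y1 < y0)

# Shaped like interpolated_baseline[lie][unit]: distance -> expected strokes.
sample = {28: 2.538, 29: 2.423, 30: 2.308, 31: 2.384, 32: 2.414}
print(count_monotonic_violations(sample))  # 2
```

Here the drops from 28 to 29 and from 29 to 30 are the two violations; a well-smoothed baseline should have few or none.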

This exercise is beyond the scope of this exam, but is included for completeness (and FREE!!!).

Exercise 6: (1 point)

lowess_smooth

Example: we have defined lowess_smooth as follows:

This is a FREE exercise, the solution is provided for you!

In [ ]:
### Solution - Exercise 6  
def lowess_smooth(data, frac):
    x_vals, y_vals = zip(*data)
    smoothed = lowess(y_vals, x_vals, frac=frac)
    return list((k, round(v, 3)) for k, v in smoothed)


The test cell below will always pass. Please submit to collect your free points for lowess_smooth (exercise 6).

In [ ]:
### Test Cell - Exercise 6  


print('Passed! Please submit.')

Application of lowess_smooth

The implementation of lowess_smooth above was used to create smoothed_baseline.

smoothed_baseline = {}
for lie, distance_dict in interpolated_baseline.items():
    smoothed_baseline[lie] = {}
    for unit, distances in distance_dict.items():
        # lowess_smooth expects (distance, value) pairs, so pass the dict's items
        smoothed_distances = lowess_smooth(distances.items(), frac=0.3)
        # We must ensure that the smoothed values are at least 1.0.
        # Logically a start position outside of the hole must take at least 1 stroke to get to the hole.
        smoothed_distances = [(k, max((v, 1.0))) for k, v in smoothed_distances]
        smoothed_baseline[lie][unit] = dict(smoothed_distances)

The precomputed smoothed_baseline is loaded in the cell below. In the subsequent cell, it is shown on a scatterplot.

In [ ]:
### Run Me!!!
smoothed_baseline = utils.load_object_from_publicdata('smoothed_baseline')
In [ ]:
# Plot for yards units (Smoothed Baseline)
plt.figure(figsize=(12, 6))
for lie, units in smoothed_baseline.items():
    if 'yds' in units:
        distances, values = zip(*sorted(units['yds'].items()))
        plt.scatter(distances, values, s=20, alpha=0.7, label=f"{lie} (points)")
plt.title('Smoothed Baseline by Lie (Yards)')
plt.xlabel('Distance (yds)')
plt.ylabel('Smoothed Baseline')
plt.legend()
plt.show()

# Plot for feet units (Smoothed Baseline)
plt.figure(figsize=(12, 6))
for lie, units in smoothed_baseline.items():
    if 'ft' in units:
        distances, values = zip(*sorted(units['ft'].items()))
        plt.scatter(distances, values, s=20, alpha=0.7, label=f"{lie} (points)")
plt.title('Smoothed Baseline by Lie (Feet)')
plt.xlabel('Distance (ft)')
plt.ylabel('Smoothed Baseline')
plt.legend()
plt.show()

Strokes gained calculation (implement the formula)

Strokes gained for a single shot is the expected strokes at the start, minus the stroke for the shot itself, minus the expected strokes at the end of the shot.

Mathematically:

Let $E(\text{lie}, \text{unit}, \text{distance})$ be the expected strokes to hole for a given lie, distance, and unit from the baseline. Then, strokes gained is calculated as follows:

$$SG = E(\text{lie}_{\text{start}}, \text{unit}_{\text{start}}, \text{distance}_{\text{start}}) - 1 - E(\text{lie}_{\text{end}}, \text{unit}_{\text{end}}, \text{distance}_{\text{end}})$$
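
As a quick arithmetic check of the formula, the sketch below plugs in illustrative expected-stroke values (the same numbers appear in the Exercise 7 demo). The variable names are assumptions.

```python
# Illustrative expected strokes; in practice these come from the baseline lookup.
e_start = 4.25   # E(Tee, yds, 430)
e_end = 3.278    # E(Fairway, yds, 170)

sg = round(e_start - 1 - e_end, 3)
print(sg)  # -0.028

# A holed shot has no remaining expected strokes, so E(end) is taken as 0.
sg_holed = round(1.5 - 1 - 0, 3)
print(sg_holed)  # 0.5
```

A negative value means the shot left the player slightly worse off than the baseline expects; a positive value means the shot beat the baseline.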

In the exercise below, you will implement this formula.

Exercise 7: (2 points)

calc_strokes_gained_shot

Your task: define calc_strokes_gained_shot as follows:

Calculates the strokes gained for a single golf shot based on baseline expected strokes.

Args:

  • shot (dict): A dictionary containing information about the shot, including:
    • start_lie (str): The lie type at the start of the shot (e.g., 'Fairway', 'Rough').
    • start_distance (int): The distance from the hole at the start of the shot.
    • start_unit (str): The unit of distance for the start position (e.g., 'yds', 'ft').
    • end_lie (str): The lie type at the end of the shot (e.g., 'Green', 'Hole').
    • end_distance (int): The distance from the hole at the end of the shot.
    • end_unit (str): The unit of distance for the end position.
  • baseline (dict): A nested dictionary mapping starting lies (str), units (str), and distances (int) to expected strokes (float).
    • i.e., baseline[lie][unit][distance] gives the expected strokes from that position.

Returns:

  • float: The strokes gained value for the shot, rounded to three decimal places.
    • Calculated as: strokes_gained = start_expected_strokes - end_expected_strokes - 1
    • If the shot ends in the hole (end_lie == 'Hole'), then end_expected_strokes is considered 0.
    • If a KeyError occurs due to missing baseline data, the error is handled by returning None.
In [ ]:
### Solution - Exercise 7  
def calc_strokes_gained_shot(shot, baseline):
    ### BEGIN SOLUTION
    start_lie = shot['start_lie']
    start_distance = shot['start_distance']
    start_unit = shot['start_unit']
    end_lie = shot['end_lie']
    end_distance = shot['end_distance']
    end_unit = shot['end_unit']

    try:
        start_expected_strokes = baseline[start_lie][start_unit][start_distance]
        if end_lie == 'Hole':
            return round(start_expected_strokes - 1, 3)
        end_expected_strokes = baseline[end_lie][end_unit][end_distance]
        strokes_gained = start_expected_strokes - (end_expected_strokes + 1)
        return round(strokes_gained, 3)
    except KeyError:
        return None
    ### END SOLUTION

### Demo function call
test_scenarios = [
    {
        'shot': {'end_distance': 170, 'end_lie': 'Fairway', 'end_unit': 'yds', 'start_distance': 430, 'start_lie': 'Tee', 'start_unit': 'yds'},
        'baseline': {
            'Tee': {'yds': {430: 4.25}},
            'Fairway': {'yds': {170: 3.278}}
        }
    },
    {
        'shot': {'end_distance': 0, 'end_lie': 'Hole', 'end_unit': 'ft', 'start_distance': 10, 'start_lie': 'Green', 'start_unit': 'ft'},
        'baseline': {
            'Green': {'ft': {10: 1.5}}
        }
    },
    {
        'shot': {'end_distance': 50, 'end_lie': 'Rough', 'end_unit': 'yds', 'start_distance': 200, 'start_lie': 'Fairway', 'start_unit': 'yds'},
        'baseline': {
            'Tee': {'yds': {430: 4.25}}
        }
    }
]

results = []
for i, scenario in enumerate(test_scenarios):
    result = calc_strokes_gained_shot(scenario['shot'], scenario['baseline'])
    print(f"calc_strokes_gained_shot(test_scenarios[{i}]['shot'], test_scenarios[{i}]['baseline'])")
    print(f"--> {result}")
    results.append(result)

pprint(results)

The demo should display this printed output.

calc_strokes_gained_shot(test_scenarios[0]['shot'], test_scenarios[0]['baseline'])
--> -0.028
calc_strokes_gained_shot(test_scenarios[1]['shot'], test_scenarios[1]['baseline'])
--> 0.5
calc_strokes_gained_shot(test_scenarios[2]['shot'], test_scenarios[2]['baseline'])
--> None
[-0.028, 0.5, None]


The cell below will test your solution for calc_strokes_gained_shot (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
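
One possible way to use these debugging variables is sketched below. It assumes that input_vars and original_input_vars are plain dictionaries keyed by argument name and that their values support `!=` comparison (true for built-in dicts and lists). The helper name find_modified_inputs and the sample data are hypothetical and not part of the test framework.

```python
def find_modified_inputs(original_input_vars, input_vars):
    """Return the names of inputs whose values changed (hypothetical helper)."""
    return [
        name
        for name, original_value in original_input_vars.items()
        if name not in input_vars or input_vars[name] != original_value
    ]

# Made-up stand-ins for the debugging dictionaries described above.
original_input_vars = {'shot': {'start_lie': 'Tee'}, 'baseline': {'Tee': {}}}
input_vars = {
    'shot': {'start_lie': 'Tee', 'strokes_gained': 0.1},  # a key was added in place
    'baseline': {'Tee': {}},
}

print(find_modified_inputs(original_input_vars, input_vars))  # ['shot']
```

An empty list suggests the solution left its inputs alone; any name in the list points to an argument that was changed in place.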
In [ ]:
### Test Cell - Exercise 7  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=calc_strokes_gained_shot,
              ex_name='calc_strokes_gained_shot',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=102)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to calc_strokes_gained_shot did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=calc_strokes_gained_shot,
              ex_name='calc_strokes_gained_shot',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=102,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to calc_strokes_gained_shot did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Strokes gained calculation (apply formula)

No "application" snippet is given for calc_strokes_gained_shot, because applying it is your next task: the formula must be applied to every one of our shot records.

In [ ]:
### Run Me!!!
dummy_calc_strokes_gained_shot = utils.load_object_from_publicdata('dummy_calc_strokes_gained_shot')

Exercise 8: (2 points)

calc_strokes_gained_all

Your task: define calc_strokes_gained_all as follows:

Calculates strokes gained for a list of shots and identifies invalid shots.

Args:

  • shot_records (list of dict): List of shot records, where each shot is represented as a dictionary.
  • baseline (dict): Baseline data used for strokes gained calculation. Maps starting lies (str), units (str), and distances (int) to expected strokes (float).
  • sg_calc_func (Callable): Function that calculates strokes gained for a shot given the baseline. sg_calc_func(shot_record, baseline) should return a numeric value or None if the shot is invalid.

Returns:

  • tuple: A tuple containing two elements:
    • strokes_gained (list of dict): List of shot records with strokes gained values
      • Each dictionary contains all original shot fields plus a new strokes_gained field
      • Only includes shots where sg_calc_func returned a valid numeric value
    • invalid_shots (list of dict): List of original shot records that could not be processed
      • Contains original shot dictionaries unchanged
      • Only includes shots where sg_calc_func returned None
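
The split described above can be sketched as follows. The function stub_sg_calc and the flat baseline are hypothetical simplifications made for illustration (the real baseline is nested by lie, unit, and distance). The sketch highlights two details: a result of 0.0 is a valid value, so the check is `is None` rather than truthiness, and valid shots are copied so that the input records are left unmodified.

```python
def stub_sg_calc(shot, baseline):
    """Hypothetical stand-in for sg_calc_func: None when the lie is unknown."""
    return baseline.get(shot['start_lie'])

shot_records = [
    {'start_lie': 'Tee', 'player_id': 1},
    {'start_lie': 'Fairway', 'player_id': 1},
    {'start_lie': 'Water', 'player_id': 1},
]
baseline = {'Tee': 0.25, 'Fairway': 0.0}  # simplified on purpose

valid, invalid = [], []
for shot in shot_records:
    value = stub_sg_calc(shot, baseline)
    if value is None:
        invalid.append(shot)  # original record, unchanged
    else:
        # Build a new dict so the caller's record is not mutated.
        valid.append({**shot, 'strokes_gained': value})

print(len(valid), len(invalid))  # 2 1
print(shot_records[0])           # {'start_lie': 'Tee', 'player_id': 1}
```

Writing `if value:` instead would wrongly send the Fairway shot, whose value is 0.0, to the invalid list.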
In [ ]:
### Solution - Exercise 8  
def calc_strokes_gained_all(shot_records, baseline, sg_calc_func):
    ### BEGIN SOLUTION
    strokes_gained = []
    invalid_shots = []
    
    for shot in shot_records:
        sg_value = sg_calc_func(shot, baseline)
        if sg_value is not None:
            shot_with_sg = shot.copy()
            shot_with_sg['strokes_gained'] = sg_value
            strokes_gained.append(shot_with_sg)
        else:
            invalid_shots.append(shot)
    
    return strokes_gained, invalid_shots
    ### END SOLUTION

### Demo function call
test_scenarios = [
    {
        'shots': [
            {'end_distance': 170.0, 'end_lie': 'Fairway', 'end_unit': 'yds', 'start_distance': 430.0, 'start_lie': 'Tee', 'start_unit': 'yds', 'player_id': 45609},
            {'end_distance': 0, 'end_lie': 'Hole', 'end_unit': 'ft', 'start_distance': 10.0, 'start_lie': 'Green', 'start_unit': 'ft', 'player_id': 45609},
            {'end_distance': 20, 'end_lie': 'Green', 'end_unit': 'ft', 'start_distance': 150, 'start_lie': 'Fairway', 'start_unit': 'yds', 'player_id': 45609},
            {'end_distance': 999.0, 'end_lie': 'NonExistentLie', 'end_unit': 'yds', 'start_distance': 999.0, 'start_lie': 'InvalidLie', 'start_unit': 'yds', 'player_id': 45609},
        ],
        'baseline': {
            'Tee': {'yds': {430.0: 4.25}},
            'Green': {'ft': {10.0: 1.5, 20: 2.0}},
            'Fairway': {'yds': {150: 3.0}}
        }
    },
    {
        'shots': [],
        'baseline': {'Tee': {'yds': {200: 3.5}}}
    }
]

results = []
for i, scenario in enumerate(test_scenarios):
    result = calc_strokes_gained_all(scenario['shots'], scenario['baseline'], dummy_calc_strokes_gained_shot)
    print(f"calc_strokes_gained_all(test_scenarios[{i}]['shots'], test_scenarios[{i}]['baseline'], dummy_calc_strokes_gained_shot)")
    print(f"--> {result}")
    results.append(result)

The demo should display this printed output.

calc_strokes_gained_all(test_scenarios[0]['shots'], test_scenarios[0]['baseline'], dummy_calc_strokes_gained_shot)
--> ([{'end_distance': 170.0, 'end_lie': 'Fairway', 'end_unit': 'yds', 'start_distance': 430.0, 'start_lie': 'Tee', 'start_unit': 'yds', 'player_id': 45609, 'strokes_gained': 0.25}, {'end_distance': 0, 'end_lie': 'Hole', 'end_unit': 'ft', 'start_distance': 10.0, 'start_lie': 'Green', 'start_unit': 'ft', 'player_id': 45609, 'strokes_gained': 0.5}, {'end_distance': 20, 'end_lie': 'Green', 'end_unit': 'ft', 'start_distance': 150, 'start_lie': 'Fairway', 'start_unit': 'yds', 'player_id': 45609, 'strokes_gained': 0.0}], [{'end_distance': 999.0, 'end_lie': 'NonExistentLie', 'end_unit': 'yds', 'start_distance': 999.0, 'start_lie': 'InvalidLie', 'start_unit': 'yds', 'player_id': 45609}])
calc_strokes_gained_all(test_scenarios[1]['shots'], test_scenarios[1]['baseline'], dummy_calc_strokes_gained_shot)
--> ([], [])


The cell below will test your solution for calc_strokes_gained_all (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 8  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=calc_strokes_gained_all,
              ex_name='calc_strokes_gained_all',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to calc_strokes_gained_all did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=calc_strokes_gained_all,
              ex_name='calc_strokes_gained_all',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to calc_strokes_gained_all did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Application of calc_strokes_gained_all

We used correct implementations of calc_strokes_gained_shot and calc_strokes_gained_all in the code snippet below to create strokes_gained_records - adding the strokes gained metric to the data for each shot.

strokes_gained_records, invalid_shots = calc_strokes_gained_all(
    standardized_shot_records, smoothed_baseline, calc_strokes_gained_shot
)

There were a few invalid shots, which we did not load. The cell below loads strokes_gained_records into the environment.

In [ ]:
### Run Me!!!
strokes_gained_records = utils.load_object_from_publicdata('strokes_gained_records')

Analysis (categorize shots)

In golf there are high-level categories for shot scenarios based on the starting lie and distance from the hole:

  • Off the Tee - this category includes the initial shot for each hole on the course. (The starting area is called the "tee" and in this scenario players may place their ball on a tee instead of playing it off of the ground).
  • Putting - this category includes shots which start from the putting green, a relatively flat and closely mowed area surrounding the hole. For these shots players use a special club to roll the ball on the ground instead of hitting it up into the air.
  • Approach - this category includes shots where the player is attempting to hit shots from the fairway, rough, or bunker (but not the tee or other "bad" lies) onto the putting green from a relatively long distance.
  • Around the green - this category includes shots where the player is attempting to hit a shot from relatively close to the putting green onto the green.

In the next exercise you will categorize shots into these categories to eventually determine how good or bad a player is in each scenario based on strokes gained.

Exercise 9: (1 point)

classify_shot

Your task: define categorize_shot as follows:

Categorizes a golf shot based on its starting lie and distance unit.

Args:

  • shot (dict): A dictionary containing information about the shot.
    • Expected keys are 'start_lie' (str) and 'start_unit' (str).

Returns:

  • str: The category of the shot. Possible values are:
    • 'Off The Tee' if the shot starts from the tee.
    • 'Putting' if the shot starts from the green.
    • 'Approach' if the shot's distance unit is yards ('yds').
    • 'Around The Green' if the shot's distance unit is feet ('ft').

Note:

  • If the 'start_lie' is 'Tee', it categorizes the shot as 'Off The Tee'. (Regardless of the unit)
  • If the 'start_lie' is 'Green', it categorizes the shot as 'Putting'. (Regardless of the unit)
  • If the distance unit is 'yds', it categorizes the shot as 'Approach'. (If the lie is not 'Tee' or 'Green')
  • If the distance unit is 'ft', it categorizes the shot as 'Around The Green'. (If the lie is not 'Tee' or 'Green')
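
The precedence in these notes (lie first, unit second) can be sketched with two lookup tables. The names LIE_CATEGORIES, UNIT_CATEGORIES, and categorize_by_lookup are hypothetical, and this is one possible arrangement rather than the reference solution.

```python
# Hypothetical lookup tables expressing the rules above.
LIE_CATEGORIES = {'Tee': 'Off The Tee', 'Green': 'Putting'}
UNIT_CATEGORIES = {'yds': 'Approach', 'ft': 'Around The Green'}

def categorize_by_lookup(shot):
    """Check the starting lie first, then fall back to the distance unit."""
    lie_category = LIE_CATEGORIES.get(shot['start_lie'])
    if lie_category is not None:
        return lie_category
    # Returns None for a unit that is in neither table.
    return UNIT_CATEGORIES.get(shot['start_unit'])

print(categorize_by_lookup({'start_lie': 'Green', 'start_unit': 'yds'}))   # Putting
print(categorize_by_lookup({'start_lie': 'Tee', 'start_unit': 'ft'}))      # Off The Tee
print(categorize_by_lookup({'start_lie': 'Bunker', 'start_unit': 'ft'}))   # Around The Green
```

The first two calls show that the unit is ignored whenever the lie is 'Tee' or 'Green'.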
In [ ]:
### Solution - Exercise 9  
def categorize_shot(shot):
    ### BEGIN SOLUTION

    start_lie = shot['start_lie']
    start_unit = shot['start_unit']
    if start_lie == 'Tee':
        return 'Off The Tee'
    elif start_lie == 'Green':
        return 'Putting'
    elif start_unit == 'yds':
        return 'Approach'
    elif start_unit == 'ft':
        return 'Around The Green'
    ### END SOLUTION

### Demo function call
sample_shots = [
    {'start_lie': 'Tee', 'start_unit': 'yds'},
    {'start_lie': 'Fairway', 'start_unit': 'yds'},
    {'start_lie': 'Green', 'start_unit': 'ft'},
    {'start_lie': 'Rough', 'start_unit': 'ft'}
]

results = []
for i, shot in enumerate(sample_shots):
    category = categorize_shot(shot)
    print(f"categorize_shot(sample_shots[{i}])")
    print(f"--> {category}")
    results.append(category)

The demo should display this printed output.

categorize_shot(sample_shots[0])
--> Off The Tee
categorize_shot(sample_shots[1])
--> Approach
categorize_shot(sample_shots[2])
--> Putting
categorize_shot(sample_shots[3])
--> Around The Green


The cell below will test your solution for categorize_shot (exercise 9; the test framework refers to it as classify_shot). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 9  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=categorize_shot,
              ex_name='classify_shot',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to classify_shot did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=categorize_shot,
              ex_name='classify_shot',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to classify_shot did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Application of categorize_shot

The code snippet below uses a correct implementation of categorize_shot to produce categorized_strokes_gained_records - adding a category to each shot.

categorized_strokes_gained_records = []
for shot in strokes_gained_records:
    category = categorize_shot(shot)
    categorized_strokes_gained_records.append({
        **shot,
        'category': category
    })

Analysis (aggregation by player and shot category)

Many variables besides the player's skill affect an individual golf shot. Averaging strokes gained over all of a player's shots in a category smooths out much of that shot-to-shot noise, so the mean is a reasonable estimate of the player's skill in that scenario.

In the next exercise you will compute the mean strokes gained for each player in each category (as well as the mean across categories) to reveal each player's strengths and weaknesses relative to the other PGA TOUR players.

In [ ]:
### Run Me!!!
categorized_strokes_gained_records = utils.load_object_from_publicdata('categorized_strokes_gained_records')

Exercise 10: (2 points)

player_strokes_gained_summary

Your task: define summarize_player_strokes_gained as follows:

Summarizes strokes gained statistics for each player and category. Aggregates total strokes gained and number of shots per player and category, then computes the average strokes gained for each category and overall ("Total"). Returns a nested dictionary mapping player IDs to their category-wise average strokes gained.

Args:

  • categorized_strokes_gained_records (list of dict): List of shot records, each containing 'player_id' (int), 'category' (str), and 'strokes_gained' (float).

Returns:

  • dict: Nested dictionary where keys are player IDs, values are dictionaries mapping category names (including "Total") to average strokes gained (float) rounded to three decimal places.

Note:

  • The "Total" category aggregates all strokes gained for each player, regardless of the category. It is not the average of the category averages.
  • Each category's average is computed as total strokes gained divided by the number of shots in that category.
  • If a player has no shots in a category, the average for that category is not included in the result for that player.
  • The expected demo and test results are dictionaries. The test will also accept defaultdict results.
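
To see why "Total" differs from the average of the category averages, the sketch below works through player 45609's five shots from the demo input for this exercise. The variable names are illustrative only.

```python
# Player 45609's shots from the demo input for this exercise.
shots = [
    ('Off The Tee', 0.041),
    ('Approach', -0.037),
    ('Putting', -0.21),
    ('Off The Tee', 0.15),
    ('Approach', 0.08),
]

# "Total": pool every shot, then divide by the number of shots.
total_mean = sum(sg for _, sg in shots) / len(shots)

# Mean of category means: each category counts once, however many shots it has.
by_category = {}
for category, sg in shots:
    by_category.setdefault(category, []).append(sg)
category_means = [sum(values) / len(values) for values in by_category.values()]
mean_of_means = sum(category_means) / len(category_means)

print(round(total_mean, 3))     # 0.005
print(round(mean_of_means, 3))  # -0.031
```

The single putt carries one fifth of the weight in "Total" but one third of the weight in the mean of means, which is why the two numbers disagree.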

Implementation Notes

  • A similar strategy to the one used in the example code in exercise 4 can be used to solve this exercise.
In [ ]:
### Solution - Exercise 10  
def summarize_player_strokes_gained(categorized_strokes_gained_records):
    ### BEGIN SOLUTION
    summary = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for shot in categorized_strokes_gained_records:
        player_id = shot['player_id']
        category = shot['category']
        summary[player_id][category]['total_strokes_gained'] += shot['strokes_gained']
        summary[player_id][category]['num_shots'] += 1
        summary[player_id]['Total']['total_strokes_gained'] += shot['strokes_gained']
        summary[player_id]['Total']['num_shots'] += 1
    # Collapse each {'total_strokes_gained', 'num_shots'} accumulator into its rounded mean.
    # Replacing the values of existing keys is safe while iterating over items().
    for player_id, categories in summary.items():
        for category, stats in categories.items():
            # Every accumulator was created by at least one shot, so num_shots >= 1.
            categories[category] = round(
                stats['total_strokes_gained'] / stats['num_shots'], 3)
    return utils.defaultdict_to_dict_recursive(summary)
    ### END SOLUTION

### Demo function call
sample_categorized_records = [
    {'player_id': 45609, 'category': 'Off The Tee', 'strokes_gained': 0.041},
    {'player_id': 45609, 'category': 'Approach', 'strokes_gained': -0.037},
    {'player_id': 45609, 'category': 'Putting', 'strokes_gained': -0.21},
    {'player_id': 45609, 'category': 'Off The Tee', 'strokes_gained': 0.15},
    {'player_id': 45609, 'category': 'Approach', 'strokes_gained': 0.08},
    {'player_id': 12345, 'category': 'Putting', 'strokes_gained': 0.25},
    {'player_id': 12345, 'category': 'Around The Green', 'strokes_gained': -0.10},
    {'player_id': 12345, 'category': 'Putting', 'strokes_gained': 0.18},
    {'player_id': 67890, 'category': 'Off The Tee', 'strokes_gained': 0.12},
    {'player_id': 67890, 'category': 'Approach', 'strokes_gained': 0.05},
    {'player_id': 67890, 'category': 'Around The Green', 'strokes_gained': -0.03}
]

result = summarize_player_strokes_gained(sample_categorized_records)

pprint(result)

The demo should display this printed output.

{12345: {'Around The Green': -0.1, 'Putting': 0.215, 'Total': 0.11},
 45609: {'Approach': 0.022,
         'Off The Tee': 0.096,
         'Putting': -0.21,
         'Total': 0.005},
 67890: {'Approach': 0.05,
         'Around The Green': -0.03,
         'Off The Tee': 0.12,
         'Total': 0.047}}


The cell below will test your solution for summarize_player_strokes_gained (exercise 10; the test framework refers to it as player_strokes_gained_summary). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 10  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=summarize_player_strokes_gained,
              ex_name='player_strokes_gained_summary',
              key=b'Xu3iSVjUVUiK2GstlArLkir4gmMFaLsb37QrwkeA1vE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to player_strokes_gained_summary did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=summarize_player_strokes_gained,
              ex_name='player_strokes_gained_summary',
              key=b'n2DFq7sQKymR55EWuGJD3DJTpo_CW7Hw0fqiTGQ9x_Y=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to player_strokes_gained_summary did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Application of summarize_player_strokes_gained

We used a correct implementation of summarize_player_strokes_gained to create player_strokes_gained_summary, which contains the overall mean strokes gained and the mean within each shot category for every player in the data set.

player_strokes_gained_summary = summarize_player_strokes_gained(categorized_strokes_gained_records)
In [ ]:
### Run Me!!!
player_strokes_gained_summary = utils.load_object_from_publicdata('player_strokes_gained_summary')
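
One way to explore a summary of this shape is to rank players within a category. The sketch below uses the small summary from the Exercise 10 demo output rather than the full loaded object, and rank_players is a hypothetical helper, not part of the notebook's utilities.

```python
# Sample summary copied from the Exercise 10 demo output.
summary = {
    12345: {'Around The Green': -0.1, 'Putting': 0.215, 'Total': 0.11},
    45609: {'Approach': 0.022, 'Off The Tee': 0.096, 'Putting': -0.21, 'Total': 0.005},
    67890: {'Approach': 0.05, 'Around The Green': -0.03, 'Off The Tee': 0.12, 'Total': 0.047},
}

def rank_players(summary, category):
    """Hypothetical helper: (player_id, mean) pairs, best first.

    Players with no shots in the category have no entry and are skipped.
    """
    ranked = [
        (player_id, means[category])
        for player_id, means in summary.items()
        if category in means
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

print(rank_players(summary, 'Total'))    # [(12345, 0.11), (67890, 0.047), (45609, 0.005)]
print(rank_players(summary, 'Putting'))  # [(12345, 0.215), (45609, -0.21)]
```

Player 67890 is absent from the putting ranking because that player has no putting shots in the sample, which matches the rule that empty categories are omitted from the summary.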

Conclusion

You have successfully completed a comprehensive golf analytics pipeline, transforming raw PGA TOUR data into meaningful performance insights. You've implemented key techniques used by professional golf analysts to evaluate player performance and identify strategic patterns.

In addition to evaluating elite professionals, the same techniques are used in paid software subscriptions and by PGA professionals to evaluate golfers at all levels as part of the multi-billion dollar golfing industry.

Fin.

If you have made it this far, congratulations! Remember to submit the exam to ensure you receive all the points you have earned!