Final Exam, Fall 2024: Predicting Student College Enrollment

Version 1.0.1

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of Basic Python, Pandas, Numpy, math as code, data cleaning, feature engineering, logistic regression, and model evaluation. It has 9 exercises, numbered 0 to 8, worth 17 points in total. However, the threshold to earn 100% is 13 points. (Therefore, once you hit 13 points you can stop; there is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on the previous ones, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty; higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire dataset and use them to build demo inputs. These cells must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore it, but we may not print it in the starter code.

Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head, or chunks of rows at a time).

Exercise point breakdown:

  • Exercise 0: 1 point

  • Exercise 1: 2 points

  • Exercise 2: 3 points

  • Exercise 3: 1 point

  • Exercise 4: 2 points

  • Exercise 5: 1 point

  • Exercise 6: 3 points

  • Exercise 7: 2 points

  • Exercise 8: 2 points

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

Exam Introduction

Your overall task. In this exam, you will work with a dataset containing metadata for students admitted to a small liberal arts college. The college seeks to predict which students are likely to commit to enrolling. This is critical for meeting enrollment targets and allocating resources effectively. The target variable in this dataset is the Gross Commit Indicator, which has two possible values: 1 if the student commits and 0 if the student does not commit. Your goal is to develop a logistic regression model to predict these outcomes.

Overview: You will follow a structured workflow to process the data, engineer features, and build a logistic regression model. The notebook is organized into three main sections:

  1. Data Exploration and Cleaning:

    • Explore the dataset to understand its structure, key features, and any potential issues such as missing or inconsistent data.
    • Clean the dataset by filling in missing values and standardizing feature formats to prepare it for analysis.
  2. Feature Engineering:

    • Add meaningful derived features, such as geographic distances, that can improve the predictive power of the model.
    • Transform categorical variables and other features into formats suitable for model building.
  3. Model Building and Evaluation:

    • Implement key functions for building a logistic regression model.
    • These functions will be integrated into the broader logistic regression pipeline to demonstrate the structure of a complete logistic regression model.

By the end of this exercise, you will understand how to calculate and apply key components of logistic regression, and see the implementation of the training loop provided for you.
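The model-building section revolves around two standard building blocks of logistic regression: the sigmoid function and the log-loss. As a preview, they can be sketched as follows. These are the textbook formulas only; the exam's later exercises define their own graded function signatures, so treat the names here as illustrative.

```python
import numpy as np

def sigmoid(z):
    # Map any real number into (0, 1); interpreted here as P(commit = 1).
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_pred, eps=1e-15):
    # Average cross-entropy between 0/1 labels and predicted probabilities.
    # Clipping avoids taking log(0) for extreme predictions.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

For example, predicting 0.9 for a true label of 1 incurs a loss of about 0.105 (that is, -ln 0.9), while confident wrong predictions are penalized much more heavily.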

In [ ]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
import numpy as np
import pandas as pd
from collections import defaultdict

utils.add_from_file('series_handler', plugins)
In [ ]:
admission_df = utils.load_object_from_publicdata('admission_df.dill')

Exercise 0: (1 point)

get_dataframe_FREE

Example: we have defined get_dataframe_FREE for you; it appears in the solution cell below.

This exercise is meant to introduce you to the data used throughout the exam. You will receive a free point for completing this task. Run the provided code cell to explore the dataset and familiarize yourself with its structure.

Input:

  • df: A Pandas DataFrame containing student admission data. This dataset includes features such as GPA, financial aid information, and extracurricular interests. (For this exercise, the input df is the admission_df provided by the exam.)

Output:

  • Displays the first few rows of the DataFrame and the data types for each column.

The dataset: The dataset comes from the admissions office and includes several features related to students. The goal is to predict the Gross Commit Indicator, which indicates whether a student accepted the admission offer.

Dataset Features: Below are the key features included in the dataset:

  • Gross Commit Indicator: Indicates if the student accepted the offer (1) or not (0)
  • Financial Aid Intent: Indicates if the student applied for financial aid
  • Scholarship: Type of scholarship received, if any
  • Direct Legacy? (parent): Indicates if the student is a direct legacy
  • Ethnic Background: Student's ethnic background
  • First Gen to College: Indicates if the student is the first generation to attend college
  • Permanent Region: Student's permanent region
  • GPA: Student's GPA
  • HS Class Size: Size of the student's high school class
  • Campus Visit Indicator?: Indicates if the student visited the campus
  • Interview?: Indicates if the student had an interview
  • Sex: Student's sex
  • Level of Financial Need: Indicates the student's level of financial need
  • Reader Academic Rating: Rating given by the admissions reader

Instructions for this exercise:

  1. Run the provided test cell, which will execute the get_dataframe_FREE function using the predefined admission_df.
  2. Review the dataset by examining the printed output. Pay close attention to:
    • The features (columns) available in the dataset.
    • The presence of missing values or unexpected data types.
  3. Use this information to guide your understanding of subsequent exercises, which involve cleaning, transforming, and analyzing this dataset.
  4. No further action is required; completing this step awards you a free point!
In [ ]:
### Solution - Exercise 0  
def get_dataframe_FREE(df):
    return df.head()

### Demo function call
df_head = get_dataframe_FREE(admission_df)
display(df_head)
display(df_head.dtypes)


The test cell below will always pass. Please submit to collect your free point for get_dataframe_FREE (exercise 0).

In [ ]:
### Test Cell - Exercise 0  


print('Passed! Please submit.')

Exercise 1: (2 points)

data_cleaning_and_standardization

Your task: define data_cleaning_and_standardization as follows:

Process the input DataFrame by:

  • Filling missing values in specific columns based on predefined strategies.
  • Standardizing the categories in the Level of Financial Need column for consistency.

Input:

  • df: A Pandas DataFrame containing student admission data. (For this exercise, the input df is the provided admission_df dataframe.)

Output:

  • A modified Pandas DataFrame where missing values are filled and the Level of Financial Need column is standardized.

Requirements/steps:

  1. Fill Missing Values:

    • Replace missing values in Scholarship with the string "No Scholarship".
    • Replace missing values in Permanent Region with the string "Unknown".
    • Replace missing values in GPA with the median of non-missing GPA values.
    • Replace missing values in HS Class Size with the median of non-missing class sizes.
    • Replace missing values in Level of Financial Need with the string "Unknown".
  2. Standardize Level of Financial Need:

    • Replace the existing labels in Level of Financial Need with the following simplified categories:
      • No OxyS - $0 → No
      • Low OxyS - $1 to $19,999 → Low
      • Medium OxyS - $20,000 - $29,999 → Medium
      • High OxyS - $30,000 to $45,999 → High
      • Very High OxyS - $46,000 + → Very High
      • Unknown - In Progress → Unknown
In [ ]:
### Solution - Exercise 1  
def data_cleaning_and_standardization(df: pd.DataFrame) -> pd.DataFrame:

    ### BEGIN SOLUTION
    df = df.copy(deep=True)
    # Assign the filled columns back rather than using inplace=True on a
    # column slice, which is deprecated in recent versions of pandas.
    df['Scholarship'] = df['Scholarship'].fillna("No Scholarship")
    df['Permanent Region'] = df['Permanent Region'].fillna("Unknown")
    df['GPA'] = df['GPA'].fillna(df['GPA'].median())
    df['HS Class Size'] = df['HS Class Size'].fillna(df['HS Class Size'].median())
    fin_levels_norm = {'No OxyS - $0': "No",
                       'Low OxyS - $1 to $19,999': "Low",
                       'Medium OxyS - $20,000 - $29,999': "Medium",
                       'High OxyS - $30,000 to $45,999': "High",
                       'Very High OxyS - $46,000 +': "Very High",
                       'Unknown - In Progress': "Unknown"}
    df['Level of Financial Need'] = \
        df['Level of Financial Need'].map(fin_levels_norm) \
        .fillna("Unknown")
    return df
    ### END SOLUTION

### Demo function call
demo_df_data_cleaning_and_standardization = pd.DataFrame({
    'Scholarship': [np.nan, 'UPBW', np.nan, 'DUN', np.nan],
    'Permanent Region': ['CA', 'NY', np.nan, 'CA', np.nan],
    'GPA': [3.8, 3.6, 3.9, np.nan, 3.5],
    'HS Class Size': [np.nan, 250, 150, np.nan, 300],
    'Level of Financial Need': ['Low OxyS - $1 to $19,999', 'Medium OxyS - $20,000 - $29,999', 'Very High OxyS - $46,000 +', 'Unknown - In Progress', 'No OxyS - $0']
})

demo_output_data_cleaning_and_standardization = data_cleaning_and_standardization(demo_df_data_cleaning_and_standardization)
display(demo_output_data_cleaning_and_standardization)

Example: A correct implementation should produce the following output for the provided demo DataFrame:

Scholarship Permanent Region GPA HS Class Size Level of Financial Need
0 No Scholarship CA 3.8 250.0 Low
1 UPBW NY 3.6 250.0 Medium
2 No Scholarship Unknown 3.9 150.0 Very High
3 DUN CA 3.7 250.0 Unknown
4 No Scholarship Unknown 3.5 300.0 No


The cell below will test your solution for data_cleaning_and_standardization (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 1  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=data_cleaning_and_standardization,
              ex_name='data_cleaning_and_standardization',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to data_cleaning_and_standardization did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=data_cleaning_and_standardization,
              ex_name='data_cleaning_and_standardization',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to data_cleaning_and_standardization did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 2: (3 points)

state_coordinates

Your task: define state_coordinates as follows:

We need to tie together state data that exists in separate dictionaries. abbr_dict maps each state's abbreviation to its state name. coor_dict maps each state name to its coordinates. This data needs to be joined so that we have a single data structure holding each state's name, coordinates, and abbreviation.

Input:

  • abbr_dict: A dictionary with each key the state abbreviation and each value the respective state name, each in lowercase.
  • coor_dict: A nested dictionary with the outer keys of latitude and longitude. Each inner dictionary holds the full state name (for example 'California') as the key and the respective coordinate float as the value.

Output:

  • state_data: A new list of dictionaries, where each dictionary holds the data for one state, sorted by the state's full name. Each dictionary contains the following key-value pairs:
    • state: The full name of the respective state (for example, 'California') as a string
    • latitude: The latitude as a float
    • longitude: The longitude as a float
    • abbr: The upper-case abbreviation of the respective state (for example, 'CA') as a string

Additional Notes:

  • coor_dict is guaranteed to have the keys latitude and longitude.
  • The latitude and longitude values in coor_dict are expected to be floats. If there are data type inconsistencies, they should be handled appropriately to maintain the expected format.
In [ ]:
### Solution - Exercise 2  
def state_coordinates(abbr_dict: dict, coor_dict: dict) -> list:

    ### BEGIN SOLUTION
    # Step 1: format abbr_dict -- uppercase the abbreviation and title-case
    # the state name (e.g., 'north carolina' -> 'North Carolina').
    abbr_dict_formatted = {abbr.upper(): name.title()
                           for abbr, name in abbr_dict.items()}

    # Step 2: look up each state's coordinates in coor_dict and build one
    # dictionary per state.
    state_data = []
    for abbr, name in abbr_dict_formatted.items():
        state_data.append({
            'state': name,
            'latitude': coor_dict['latitude'][name],
            'longitude': coor_dict['longitude'][name],
            'abbr': abbr,
        })

    # Step 3: sort the list by full state name.
    state_data.sort(key=lambda d: d['state'])

    # Step 4: return state_data.
    return state_data
    ### END SOLUTION

### Demo function call
demo_abbr_dict_state_coordinates = {'nc': 'north carolina', 'ca': 'california'}
demo_coor_dict_state_coordinates = {'latitude': {'California': 36.17, 'North Carolina': 35.6411}, 
                                    'longitude': {'California': -119.7462, 'North Carolina': -79.8431}}

demo_output_state_coordinates = state_coordinates(demo_abbr_dict_state_coordinates, demo_coor_dict_state_coordinates)
display(demo_output_state_coordinates)

Example: A correct implementation should produce the following output for the provided demo inputs:

[{'state': 'California',
  'latitude': 36.17,
  'longitude': -119.7462,
  'abbr': 'CA'},
 {'state': 'North Carolina',
  'latitude': 35.6411,
  'longitude': -79.8431,
  'abbr': 'NC'}]


The cell below will test your solution for state_coordinates (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 2  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=state_coordinates,
              ex_name='state_coordinates',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to state_coordinates did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=state_coordinates,
              ex_name='state_coordinates',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to state_coordinates did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 3: (1 point)

add_lat_long

Your task: define add_lat_long as follows:

Add two new columns to the input DataFrame (latitude and longitude), based on the Permanent Region column. Use the provided state_data dictionary to map state abbreviations to their corresponding geographic coordinates.

Input:

  • df: A Pandas DataFrame. It must include the column Permanent Region, which contains state abbreviations (e.g., "CA", "NY"). (For this exercise, the input df is the admission_df modified through prior steps.)
  • state_data: A list of dictionaries, where each dictionary represents a state and includes the following fields:
    • "abbr": State abbreviation (e.g., "CA", "NY")
    • "latitude": Latitude of the state (float)
    • "longitude": Longitude of the state (float)

Output:

  • A new Pandas DataFrame with two new columns added:
    • latitude: Latitude of the state based on Permanent Region
    • longitude: Longitude of the state based on Permanent Region

Requirements/steps:

  1. Create a mapping from state abbreviations to their corresponding latitude and longitude using state_data. For example, the dictionary { "CA": {"latitude": 36.7783, "longitude": -119.4179} } maps "CA" to its coordinates.
  2. Use this mapping to populate the latitude and longitude columns based on the Permanent Region column in df.
  3. If a value in Permanent Region doesn't match any key in the mapping, set latitude and longitude to pd.NA.
In [ ]:
### Solution - Exercise 3  
def add_lat_long(df: pd.DataFrame, state_data) -> pd.DataFrame:
    ### BEGIN SOLUTION
    state_mapping = {item['abbr']: {'latitude': item['latitude'], 'longitude': item['longitude']} for item in state_data}
    
    df = df.copy(deep=True)
    df['latitude'] = pd.NA  
    df['longitude'] = pd.NA
    
    for abbr, coords in state_mapping.items():
        df.loc[df['Permanent Region'] == abbr, 'latitude'] = coords['latitude']
        df.loc[df['Permanent Region'] == abbr, 'longitude'] = coords['longitude']
    
    return df
    ### END SOLUTION

### Demo function call
demo_df_add_lat_long = pd.DataFrame({"Permanent Region": ["FL", "NY", "CA", "GA", "FR"]})
demo_state_data = [
    {"state": "Florida", "latitude": 27.8333, "longitude": -81.717, "abbr": "FL"},
    {"state": "New York", "latitude": 42.1497, "longitude": -74.9384, "abbr": "NY"},
    {"state": "California", "latitude": 36.17, "longitude": -119.7462, "abbr": "CA"},
    {"state": "Georgia", "latitude": 32.9866, "longitude": -83.6487, "abbr": "GA"}
]
demo_output_add_lat_long = add_lat_long(demo_df_add_lat_long, demo_state_data)
display(demo_output_add_lat_long)

Example: A correct implementation should produce the following output:

Permanent Region latitude longitude
FL 27.8333 -81.717
NY 42.1497 -74.9384
CA 36.17 -119.7462
GA 32.9866 -83.6487
FR <NA> <NA>


The cell below will test your solution for add_lat_long (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 3  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=add_lat_long,
              ex_name='add_lat_long',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to add_lat_long did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=add_lat_long,
              ex_name='add_lat_long',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to add_lat_long did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 4: (2 points)

calculate_distance

Your task: define calculate_distance as follows:

Calculate the distance in miles from each student's location to the school. The location of each student is determined by the latitude and longitude columns in the input DataFrame, while the school's location is specified by the coordinates latitude 34.1271 and longitude -118.2109.

Input:

  • df: A Pandas DataFrame containing two columns:
    • latitude: Latitude of the student's location in decimal degrees.
    • longitude: Longitude of the student's location in decimal degrees.

Output:

  • A new Pandas DataFrame that includes all columns from the input df, with an additional column:
    • distance_from_school: The distance (in miles) from each student's location to the school.

Requirements/steps:

  1. Convert Columns to Numeric:
    • Use the pd.to_numeric function with errors='coerce' to convert the latitude and longitude columns to numeric values. This ensures that non-numeric or missing values become NaN.
  2. Convert Coordinates to Radians:
    • Use the np.radians function to convert all latitude and longitude values (both student and school) from degrees to radians.
  3. Apply the Haversine Formula:

    • Compute the distance between each student's location and the school's location using the following formula:

      $$d = 2r \cdot \arcsin \left( \sqrt{\sin^2 \left( \frac{\Delta \phi}{2} \right) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2 \left( \frac{\Delta \lambda}{2} \right)} \right)$$

    • Definitions:

      • $\phi_1$ and $\phi_2$ are the latitudes (in radians) of the school and the student, respectively.
      • $\lambda_1$ and $\lambda_2$ are the longitudes (in radians) of the school and the student, respectively.
      • $\Delta \phi = \phi_2 - \phi_1$ (difference in latitudes).
      • $\Delta \lambda = \lambda_2 - \lambda_1$ (difference in longitudes).
      • $r$ is the Earth's radius (mean radius = 3,959 miles).
  4. Add the New Column:
    • Create a new column, distance_from_school, to store the calculated distances.
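As a sanity check on the formula above, the haversine computation can be sketched on its own (this helper, haversine_miles, is illustrative and not the graded function). Using the school's coordinates and the CA coordinates from the demo below, it should give roughly 165.67 miles:

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2, r=3959.0):
    # Convert degrees to radians, then apply the haversine formula.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# School (34.1271, -118.2109) to the CA coordinates (36.17, -119.7462):
d = haversine_miles(34.1271, -118.2109, 36.17, -119.7462)  # ~165.67 miles
```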
In [ ]:
### Solution - Exercise 4  
def calculate_distance(df: pd.DataFrame, school_lat: float, school_lon: float) -> pd.DataFrame:

    ### BEGIN SOLUTION
    import numpy as np

    def haversine(lat1, lon1, lat2, lon2, earth_radius_miles=3959.0):
        lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
        c = 2 * np.arcsin(np.sqrt(a))
        distance = earth_radius_miles * c
        return distance
    
    # Ensure latitude and longitude are numeric
    df = df.copy(deep=True)
    df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
    df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')
    
    lat1 = df['latitude'].to_numpy()
    lon1 = df['longitude'].to_numpy()
    lat2 = np.full_like(lat1, school_lat)
    lon2 = np.full_like(lon1, school_lon)
    
    distances = np.where(
        np.isnan(lat1) | np.isnan(lon1),
        np.nan,
        haversine(lat1, lon1, lat2, lon2)
    )
    
    df['distance_from_school'] = distances
    return df
    ### END SOLUTION

### Demo function call
demo_df_calculate_distance = pd.DataFrame({
    "Permanent Region": ["FL", "NY", "CA", "GA", "TX"],
    "latitude": [27.8333, 42.1497, 36.17, 32.9866, "invalid"],
    "longitude": [-81.717, -74.9384, -119.7462, -83.6487, "missing"]
})

demo_output_calculate_distance = calculate_distance(
    demo_df_calculate_distance, 
    school_lat=34.1271, 
    school_lon=-118.2109
)
demo_output_calculate_distance

Example: A correct implementation should produce the following output:

Permanent Region latitude longitude distance_from_school
0 FL 27.8333 -81.717 2193.210034
1 NY 42.1497 -74.9384 2389.334519
2 CA 36.17 -119.7462 165.674547
3 GA 32.9866 -83.6487 1982.193883
4 TX NaN NaN NaN


The cell below will test your solution for calculate_distance (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 4  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=calculate_distance,
              ex_name='calculate_distance',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_distance did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=calculate_distance,
              ex_name='calculate_distance',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_distance did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 5: (1 point)

one_hot_encode

Your task: define one_hot_encode as follows:

Perform one-hot encoding on a dataset using pandas.get_dummies.

Input:

  • df: A pandas DataFrame containing the columns to be one-hot encoded.

Output:

  • encoded_df: A pandas DataFrame containing:
    • The specified columns one-hot encoded, with new columns for each unique category in the original columns.
    • All other columns from the input df retained unmodified.

Requirements/steps:

  1. One-hot encode the following columns:
    • "Financial Aid Intent"
    • "Scholarship"
    • "Ethnic Background"
    • "Permanent Region"
    • "Sex"
    • "Level of Financial Need"
  2. Ensure that the resulting one-hot encoded columns are encoded as floats.
  3. Retain all other columns in the original DataFrame without modification.

Additional Notes:

  • Refer to the pandas.get_dummies documentation for more information.
In [ ]:
### Solution - Exercise 5  
def one_hot_encode(df: pd.DataFrame):

    ### BEGIN SOLUTION
    categorical_cols = [
        'Financial Aid Intent', 'Scholarship', 'Ethnic Background', 
        'Permanent Region', 'Sex', 'Level of Financial Need'
    ]
    encoded_df = pd.get_dummies(
        df, columns=categorical_cols, dtype=float
    )
    return encoded_df
    ### END SOLUTION

### Demo function call
demo_df_one_hot_encode = pd.DataFrame({
    'Financial Aid Intent': ['FAY', 'FAY', 'FAN', 'FAY', 'FAY', 'FAN', 'FAN', 'FAY', 'FAY', 'FAN'],
    'Scholarship': ['NO', 'LDRS', 'NO', 'NO', 'No Scholarship', 'DIR', 'DIR', 'NO', 'NO', 'DIR'],
    'Ethnic Background': ['White', 'Black', 'Asian', 'White', 'Multiracial', 'White', 'White', 'Black', 'Asian', 'White'],
    'Permanent Region': ['CA', 'NY', 'CA', 'CA', 'CA', 'WA', 'WA', 'NY', 'CA', 'WA'],
    'Sex': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M'],
    'Level of Financial Need': ['Low', 'Very High', 'Medium', 'Low', 'High', 'Low', 'Low', 'Medium', 'High', 'Very High']
})

demo_output_one_hot_encode = one_hot_encode(demo_df_one_hot_encode)
display(demo_output_one_hot_encode)

Example: A correct implementation should produce the following output:

Financial Aid Intent_FAN Financial Aid Intent_FAY Scholarship_DIR Scholarship_LDRS Scholarship_NO Scholarship_No Scholarship Ethnic Background_Asian Ethnic Background_Black Ethnic Background_Multiracial Ethnic Background_White Permanent Region_CA Permanent Region_NY Permanent Region_WA Sex_F Sex_M Level of Financial Need_High Level of Financial Need_Low Level of Financial Need_Medium Level of Financial Need_Very High
0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0


The cell below will test your solution for one_hot_encode (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 5  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=one_hot_encode,
              ex_name='one_hot_encode',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to one_hot_encode did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=one_hot_encode,
              ex_name='one_hot_encode',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to one_hot_encode did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 6: (3 points)

balance_split_data

Your task: define balance_split_data as follows:

Balance the dataset by oversampling the minority class, then split the data into training and test datasets. You will use the train_test_split and resample functions from sklearn to complete this task. The functions have been imported for you. Refer to the documentation below for more information.

Input:

  • df: A pandas DataFrame containing the student admissions data.
  • ind_col: The indicator column containing the dependent variable encoded as binary: 0 or 1 (Gross Commit Indicator in our data).
  • test_size: A float between 0.0 and 1.0 representing the proportion of the dataset to include in the test split.
  • random_state: An integer representing the random state for reproducibility.

Output:

  • X_train: A pandas DataFrame containing the features for the training dataset.
  • X_test: A pandas DataFrame containing the features for the test dataset.
  • y_train: A pandas Series containing the target variable for the training dataset.
  • y_test: A pandas Series containing the target variable for the test dataset.

Each object represents a part of the split balanced dataset, with X_train and X_test containing the independent variables (features), and y_train and y_test containing the dependent variable (target). The features in X_train and X_test retain the same column names as the input DataFrame, excluding the indicator column (ind_col) used as the target.

Requirements/steps:

  1. Oversample the Minority Class:
    • Separate the DataFrame into two subsets: one for the minority class (ind_col = 1) and one for the majority class (ind_col = 0).
    • Use sklearn.utils.resample to oversample the minority class to match the majority class size.
    • Pass the random_state parameter to the resampling function for reproducibility.
    • Combine the two subsets to create a balanced dataset.
  2. Split the Balanced Data:
    • Use train_test_split to split the balanced data into training and test sets.
    • Pass the random_state parameter to the split function for reproducibility.
    • Set the stratify parameter to maintain the class balance in both splits.
  3. Return the features (X_train, X_test) and target variables (y_train, y_test) for both sets.

Notes:

  • The same random state value (from the function call) should be used for both the resampling and the split functions.
In [ ]:
### Solution - Exercise 6  
def balance_split_data(df: pd.DataFrame, ind_col: str, test_size: float, random_state: int):
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample

    ### BEGIN SOLUTION
    # Separate majority and minority classes
    df_majority = df[df[ind_col] == 0]
    df_minority = df[df[ind_col] == 1]

    # Oversample the minority class
    df_minority_oversampled = resample(
        df_minority,
        replace=True,
        n_samples=len(df_majority),
        random_state=random_state
    )

    # Combine the majority and oversampled minority classes
    df_balanced = pd.concat([df_majority, df_minority_oversampled])

    # Split the balanced dataset
    X = df_balanced.drop(columns=[ind_col])
    y = df_balanced[ind_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    return X_train, X_test, y_train, y_test
    ### END SOLUTION

### Demo function call

demo_df_balance_encode_split_data = pd.DataFrame({
    'Gross Commit Indicator': [0, 0, 0, 0, 1, 1, 1, 1, 0, 1],
    "Financial Aid Intent_FAN": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
    "Financial Aid Intent_FAY": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "Scholarship_DIR": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "Scholarship_LDRS": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Scholarship_NO": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    "Scholarship_No Scholarship": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Ethnic Background_Asian": [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "Ethnic Background_Black": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    "Ethnic Background_White": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'GPA': [3.67, 3.76, 3.58, 4.0, 3.9, 3.72, 3.94, 3.76, 3.58, 3.67],
    'HS Class Size': [80.0, 26.0, 642.0, 303.0, 288.0, 288.0, 77.0, 77.0, 303.0, 288.0],
    "Ethnic Background_Multiracial": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Permanent Region_CA": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    "Permanent Region_NY": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    "Permanent Region_WA": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Sex_F": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    "Sex_M": [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Level of Financial Need_High": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Level of Financial Need_Low": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "Level of Financial Need_Medium": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "Level of Financial Need_Very High": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
})

X_train, X_test, y_train, y_test = balance_split_data(
    demo_df_balance_encode_split_data, 
    ind_col='Gross Commit Indicator', 
    test_size=0.2,
    random_state=42
)

print(f"X_train.shape: {X_train.shape}")
print(f"X_test.shape:  {X_test.shape}")
print(f"y_train.shape: {y_train.shape}")
print(f"y_test.shape:  {y_test.shape}")

Example: A correct implementation should produce the following output:

X_train.shape: (8, 21)
X_test.shape:  (2, 21)
y_train.shape: (8,)
y_test.shape:  (2,)
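As an optional sanity check of the balancing-and-stratification logic, the sketch below applies the same resample-then-stratified-split recipe to a tiny hypothetical frame (not the exam data) and confirms that both splits preserve the 50/50 class balance:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Tiny hypothetical frame: 6 majority-class (0) rows and 2 minority-class (1) rows
df = pd.DataFrame({'x': range(8), 'y': [0, 0, 0, 0, 0, 0, 1, 1]})

# Oversample the minority class up to the majority class size
majority = df[df['y'] == 0]
minority = df[df['y'] == 1]
oversampled = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, oversampled])  # now 6 + 6 rows

# A stratified split keeps the 50/50 balance in both pieces
X_tr, X_te, y_tr, y_te = train_test_split(
    balanced.drop(columns=['y']), balanced['y'],
    test_size=0.5, random_state=0, stratify=balanced['y'])

print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```

Without the stratify argument, a small split like this one could easily end up with unequal class counts in the test set even after balancing.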


The cell below will test your solution for balance_split_data (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 6  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.series_handler(balance_split_data),
              ex_name='balance_split_data',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=30)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to balance_split_data did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.series_handler(balance_split_data),
              ex_name='balance_split_data',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to balance_split_data did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 7: (2 points)

calculate_f1_score

Your task: define calculate_f1_score as follows:

Calculate the F1 score for a classification model's predictions.

The F1 score is a metric that balances precision and recall, making it particularly useful for evaluating models on imbalanced datasets.

Input:

  • y_true: A list or NumPy array of true class labels (must contain only 0 or 1).
  • y_pred: A list or NumPy array of predicted class labels (must contain only 0 or 1).

Output:

  • f1_score: A float representing the F1 score, rounded to 3 decimal places.

Definitions:

  • True Positives (TP): The number of correctly predicted positive class instances.
  • False Positives (FP): The number of instances incorrectly predicted as positive (but actually negative).
  • False Negatives (FN): The number of instances incorrectly predicted as negative (but actually positive).

Formulas:

  1. Precision ($P$): $$P = \frac{TP}{TP + FP}$$ Precision measures the proportion of positive predictions that are correct.

  2. Recall ($R$): $$R = \frac{TP}{TP + FN}$$ Recall measures the proportion of actual positive instances that are correctly predicted.

  3. F1 Score ($F1$): $$F1 = 2 \cdot \frac{P \cdot R}{P + R}$$ The F1 score is the harmonic mean of precision and recall.

Special Cases:

  • If $TP + FP = 0$ (no positive predictions), precision is undefined.
  • If $TP + FN = 0$ (no actual positives), recall is undefined.
  • In both cases, set the F1 score to 0 to handle division by zero.

Hints:

  • Use NumPy's vectorized operations (e.g., np.sum) to efficiently calculate $TP$, $FP$, and $FN$.
In [ ]:
### Solution - Exercise 7  
def calculate_f1_score(y_true, y_pred):

    ### BEGIN SOLUTION
    if isinstance(y_true, pd.Series):
        y_true = y_true.values
    if isinstance(y_pred, pd.Series):
        y_pred = y_pred.values

    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0

    f1_score = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    return round(f1_score, 3)
    ### END SOLUTION

### Demo function call
demo_y_true_f1_score = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
demo_y_pred_f1_score = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])

demo_output_f1_score = calculate_f1_score(demo_y_true_f1_score, demo_y_pred_f1_score)
print(demo_output_f1_score) 

Example: A correct implementation should produce the following output:

# Example Input
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

# Example Output
F1 Score: 0.8
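The worked example above can be checked by hand using the formulas from this exercise; as an optional cross-check (illustrative only, not required by the exam), the result can also be compared against sklearn.metrics.f1_score:

```python
import numpy as np
from sklearn.metrics import f1_score as sk_f1

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])

# TP = 4 (indices 1, 2, 4, 6), FP = 1 (index 9), FN = 1 (index 7)
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

precision = tp / (tp + fp)  # 4/5 = 0.8
recall = tp / (tp + fn)     # 4/5 = 0.8
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))                      # 0.8
print(round(sk_f1(y_true, y_pred), 3))   # 0.8, matches
```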


The cell below will test your solution for calculate_f1_score (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 7  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=calculate_f1_score,
              ex_name='calculate_f1_score',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_f1_score did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=calculate_f1_score,
              ex_name='calculate_f1_score',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_f1_score did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 8: (2 points)

compute_logistic_metrics

Your task: define compute_logistic_metrics as follows:

Calculate the predictions and gradients required for training a logistic regression model.

Input:

  • X: A NumPy array of shape (m, n+1) containing the input features. The first column must be all ones, representing the bias term. Each row represents a single observation, and each column represents a feature.
  • y_true: A NumPy array of shape (m,) containing the true class labels (0 or 1) for each observation.
  • weights: A NumPy array of shape (n+1,) representing the model weights, including the bias term.

Output:

  • sigmoid_predictions: A NumPy array of shape (m,) containing the predicted probabilities for each observation. These values are computed using the sigmoid function.
  • gradients_w: A NumPy array of shape (n+1,) containing the gradient of the cost with respect to each weight, including the bias term.

Requirements/steps:

  1. Compute the Linear Combination of Inputs:
    • Calculate $z$, the linear combination of inputs, using the formula: $$z = X \cdot weights$$
  2. Compute Sigmoid Predictions:
    • Apply the sigmoid function to $z$ to compute predicted probabilities: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
    • Note: These predicted probabilities, $\sigma(z)$, are referred to as $\hat{y}$ in logistic regression.
  3. Compute Gradients:
    • Calculate the gradient of the cost with respect to the weights: $$\frac{\partial J}{\partial w} = \frac{1}{m} X^T (\hat{y} - y)$$
    • Here, $\hat{y}$ represents the predicted probabilities (sigmoid predictions), and $m$ is the number of samples in the dataset.

Hints:

  • Use NumPy's vectorized operations for efficient computation of gradients.
  • Note that $X$ includes the bias term as the first column.
In [ ]:
### Solution - Exercise 8  
def compute_logistic_metrics(X, y_true, weights):

    ### BEGIN SOLUTION

    # Compute linear combination
    z = np.dot(X, weights)

    # Compute sigmoid predictions
    preds = 1 / (1 + np.exp(-z))

    # Compute gradients
    m = len(y_true)
    errors = preds - y_true
    gradients_w = np.dot(X.T, errors) / m

    return preds, gradients_w
    ### END SOLUTION

### Demo function call
# Example Input
demo_X_logistic_metrics = np.array([[1, 1, 2], [1, 3, 4], [1, 5, 6]])  # Includes the bias column
demo_y_true_logistic_metrics = np.array([0, 1, 1])
demo_weights_logistic_metrics = np.array([0.5, 0.1, 0.2])  # Includes bias as the first weight

# Compute Outputs
demo_sigmoid_predictions, demo_gradients_w = compute_logistic_metrics(
    demo_X_logistic_metrics, demo_y_true_logistic_metrics, demo_weights_logistic_metrics
)

# Print Outputs
print(f"Sigmoid Predictions: {demo_sigmoid_predictions}")
print(f"Gradients (Weights): {demo_gradients_w}")

Example: A correct implementation should produce the following output:

# Example Input
   X = np.array([[1, 1, 2], [1, 3, 4], [1, 5, 6]])  # Includes the bias column as the first column
   y_true = np.array([0, 1, 1])
   weights = np.array([0.5, 0.1, 0.2])

   # Example Output
   Sigmoid Predictions: array([0.73105858 0.83201839 0.90024951])
   Gradients (Weights): array([ 0.15444216 -0.09054624  0.06389592])
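To see how these outputs would typically be consumed, the sketch below recomputes the demo values with the formulas from this exercise and then applies a single hypothetical gradient-descent update. The learning rate of 0.1 is an illustrative assumption, not part of the exercise:

```python
import numpy as np

X = np.array([[1, 1, 2], [1, 3, 4], [1, 5, 6]], dtype=float)  # bias column first
y_true = np.array([0, 1, 1], dtype=float)
weights = np.array([0.5, 0.1, 0.2])

# Sigmoid predictions and gradients, as in the exercise formulas
z = X @ weights
preds = 1 / (1 + np.exp(-z))
gradients_w = X.T @ (preds - y_true) / len(y_true)

# One gradient-descent step with a hypothetical learning rate of 0.1
weights_new = weights - 0.1 * gradients_w
print(np.round(weights_new, 4))
```

Repeating this update until the gradients are near zero is the essence of the training loop used in the epilogue.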


The cell below will test your solution for compute_logistic_metrics (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 8  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=compute_logistic_metrics,
              ex_name='compute_logistic_metrics',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to compute_logistic_metrics did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=compute_logistic_metrics,
              ex_name='compute_logistic_metrics',
              key=b'wOfiv2gVFgYsRRji9-GeTQLK4g6T5KSDRZIe7HkMwR8=', 
              n_iter=1,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to compute_logistic_metrics did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Fin

If you have made it this far, congratulations! You are done. Please submit your exam!

The remainder of this notebook combines the work you have completed above with a few additional steps to build a working logistic regression model.

Epilogue: It's Time to Build a Model

## Data Exploration and Cleaning

# Step 0: Fill in missing values and standardize feature formats
cleaned_df = data_cleaning_and_standardization(admission_df)

# Step 1: Create clean state_data dictionary
abbr_dict = utils.load_object_from_publicdata('abbr_dict.dill')
coor_dict = utils.load_object_from_publicdata('coor_dict.dill')
state_data = (abbr_dict, coor_dict)


## Feature Engineering

# Step 2: Use state_data to feature engineer new columns in cleaned_df
lat_long_df = add_lat_long(cleaned_df, state_data)

# Step 3: Add student distance from school
distance_df = calculate_distance(lat_long_df, 34.1271, -118.2109)

# Step 4: One-hot encode categorical features
one_hot_df = one_hot_encode(distance_df)

## Model Building and Evaluation

# Step 5: Split data into train and test while fixing class imbalance
X_train, X_test, y_train, y_test = balance_split_data(one_hot_df, 'Gross Commit Indicator', 0.2, 42)

# Step 6: Train the logistic regression model (implementation is left to the reader)
def train_logistic_regression(X_train, y_train):
    # Initialize the weights and bias to zero

    # Implement the predictions and gradients from the previous exercises

    # Use gradient descent to update the weights and bias

    return weights, bias


# Step 7: Make predictions
def predict_logistic_regression(X_train, y_train, X_test):
    weights, bias = train_logistic_regression(X_train, y_train)

    # Apply sigmoid function to compute predicted probabilities
    y_pred = 1 / (1 + np.exp(-(np.dot(X_test, weights) + bias)))

    y_pred_class = (y_pred >= 0.5).astype(int)
    return y_pred_class
y_pred = predict_logistic_regression(X_train, y_train, X_test)

# Step 8: Evaluate the model
f1_score = calculate_f1_score(y_test, y_pred)
print(f"Model F1 Score: {f1_score}")

Model F1 Score: 0.595

Our model converges!
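For readers who want to fill in the training function left as an exercise above, here is one minimal sketch. The learning rate, iteration count, and toy data are illustrative assumptions; the gradient step itself is the formula from Exercise 8:

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Gradient-descent sketch; lr and n_iter are illustrative choices."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend the bias column
    w = np.zeros(Xb.shape[1])                   # weights (incl. bias) start at zero
    for _ in range(n_iter):
        preds = 1 / (1 + np.exp(-(Xb @ w)))     # sigmoid predictions
        w -= lr * Xb.T @ (preds - y) / len(y)   # gradient step from Exercise 8
    return w[1:], w[0]                          # (weights, bias)

# Hypothetical toy data: the label is 1 when the single feature exceeds 2
X_toy = np.array([[0.0], [1.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])

weights, bias = train_logistic_regression(X_toy, y_toy)
probs = 1 / (1 + np.exp(-(X_toy @ weights + bias)))
print((probs >= 0.5).astype(int))  # should recover the toy labels
```

On real data one would also monitor the cost between iterations and stop once it plateaus, rather than running a fixed number of steps.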