Final Exam, Fall 2024: Predicting Student College Enrollment
Version 1.0.1
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of basic Python, pandas, NumPy, math as code, data cleaning, feature engineering, logistic regression, and model evaluation. It has 9 exercises, numbered 0 to 8. There are 17 available points; however, to earn 100% the threshold is 13 points. (Therefore, once you hit 13 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on the previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
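For example, a large DataFrame can be inspected a few rows at a time; `big_df` below is a hypothetical stand-in for whatever object you are debugging:

```python
import pandas as pd

# A hypothetical large object to inspect (stand-in for the exam's data).
big_df = pd.DataFrame({'GPA': [3.5 + 0.01 * i for i in range(100)]})

# Print only the first few rows rather than the whole object.
print(big_df.head(3))

# Or step through it one chunk of rows at a time.
chunk_size = 25
for start in range(0, len(big_df), chunk_size):
    chunk = big_df.iloc[start:start + chunk_size]
    print(f"rows {start}..{start + len(chunk) - 1}")  # print(chunk) to see the data
```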
Exercise point breakdown:
- Exercise 0: 1 point(s)
- Exercise 1: 2 point(s)
- Exercise 2: 3 point(s)
- Exercise 3: 1 point(s)
- Exercise 4: 2 point(s)
- Exercise 5: 1 point(s)
- Exercise 6: 3 point(s)
- Exercise 7: 2 point(s)
- Exercise 8: 2 point(s)
Final reminders:
Your overall task. In this exam, you will work with a dataset containing metadata for students admitted to a small liberal arts college. The college seeks to predict which students are likely to commit to enrolling. This is critical for meeting enrollment targets and allocating resources effectively. The target variable in this dataset is the Gross Commit Indicator, which has two possible values: 1 if the student commits and 0 if the student does not commit. Your goal is to develop a logistic regression model to predict these outcomes.
Overview: You will follow a structured workflow to process the data, engineer features, and build a logistic regression model. The notebook is organized into three main sections:
Data Exploration and Cleaning:
Feature Engineering:
Model Building and Evaluation:
By the end of this exercise, you will understand how to calculate and apply key components of logistic regression, and see the implementation of the training loop provided for you.
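As a preview of the model-building section, the key component of logistic regression is the sigmoid (logistic) function, which maps a linear score to a probability. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 corresponds to a predicted probability of exactly 0.5;
# large positive scores approach 1 and large negative scores approach 0.
p = sigmoid(0.0)
```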
### Global imports
import dill
from cse6040_devkit import plugins, utils
import numpy as np
import pandas as pd
from collections import defaultdict
utils.add_from_file('series_handler', plugins)
admission_df = utils.load_object_from_publicdata('admission_df.dill')
get_dataframe_FREE
Example: we have defined get_dataframe_FREE as follows:
The exercise is meant to introduce you to the data used throughout the exam. You will receive a free point for completing this task. Run the provided code cell to explore the dataset and familiarize yourself with its structure.
Input:
- df: A Pandas DataFrame containing student admission data. This dataset includes features such as GPA, financial aid information, and extracurricular interests. (For this exercise, the input df is the admission_df provided by the exam.)
Output:
- The first five rows of df (via df.head()).
The dataset: The dataset comes from the admissions office and includes several features related to students. The goal is to predict the Gross Commit Indicator, which indicates whether a student accepted the admission offer.
Dataset Features: Below are the key features included in the dataset:
| Column Name | Description |
|---|---|
| Gross Commit Indicator | Indicates if the student accepted the offer (1) or not (0) |
| Financial Aid Intent | Indicates if the student applied for financial aid |
| Scholarship | Type of scholarship received, if any |
| Direct Legacy? (parent) | Indicates if the student is a direct legacy |
| Ethnic Background | Student's ethnic background |
| First Gen to College | Indicates if the student is the first generation to attend college |
| Permanent Region | Student's permanent region |
| GPA | Student's GPA |
| HS Class Size | Size of the student's high school class |
| Campus Visit Indicator? | Indicates if the student visited the campus |
| Interview? | Indicates if the student had an interview |
| Sex | Student's sex |
| Level of Financial Need | Indicates the student's level of financial need |
| Reader Academic Rating | Rating given by the admissions reader |
Instructions for this exercise: Run the get_dataframe_FREE function using the predefined admission_df.
### Solution - Exercise 0
def get_dataframe_FREE(df):
    return df.head()
### Demo function call
df_head = get_dataframe_FREE(admission_df)
display(df_head)
display(df_head.dtypes)
The test cell below will always pass. Please submit to collect your free points for get_dataframe_FREE (exercise 0).
### Test Cell - Exercise 0
print('Passed! Please submit.')
data_cleaning_and_standardization
Your task: define data_cleaning_and_standardization as follows:
Process the input DataFrame by:
- Filling missing values.
- Standardizing the Level of Financial Need column for consistency.
Input:
- df: A Pandas DataFrame containing student admission data. (For this exercise, the input df is the provided admission_df dataframe.)
Output:
- A new Pandas DataFrame in which missing values are filled and the Level of Financial Need column is standardized.
Requirements/steps:
Fill Missing Values:
- Fill missing values in Scholarship with the string "No Scholarship".
- Fill missing values in Permanent Region with the string "Unknown".
- Fill missing values in GPA with the median of non-missing GPA values.
- Fill missing values in HS Class Size with the median of non-missing class sizes.
- Fill missing values in Level of Financial Need with the string "Unknown".
Standardize Level of Financial Need:
Replace the existing labels in Level of Financial Need with the following simplified categories:

| Existing Label | Simplified Label |
|---|---|
| No OxyS - $0 | No |
| Low OxyS - $1 to $19,999 | Low |
| Medium OxyS - $20,000 - $29,999 | Medium |
| High OxyS - $30,000 to $45,999 | High |
| Very High OxyS - $46,000 + | Very High |
| Unknown - In Progress | Unknown |
### Solution - Exercise 1
def data_cleaning_and_standardization(df: pd.DataFrame) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    df_copy = df.copy()
    # Fill missing categorical values with fixed strings and numeric values with medians
    df_copy['Scholarship'] = df_copy['Scholarship'].fillna('No Scholarship')
    df_copy['Permanent Region'] = df_copy['Permanent Region'].fillna('Unknown')
    df_copy['GPA'] = df_copy['GPA'].fillna(df['GPA'].median())
    df_copy['HS Class Size'] = df_copy['HS Class Size'].fillna(df['HS Class Size'].median())
    # Mapping for standardization of Level of Financial Need
    new = {
        'No OxyS - $0': 'No',
        'Low OxyS - $1 to $19,999': 'Low',
        'Medium OxyS - $20,000 - $29,999': 'Medium',
        'High OxyS - $30,000 to $45,999': 'High',
        'Very High OxyS - $46,000 +': 'Very High',
        'Unknown - In Progress': 'Unknown'
    }
    df_copy['Level of Financial Need'] = df_copy['Level of Financial Need'].replace(new)
    df_copy['Level of Financial Need'] = df_copy['Level of Financial Need'].fillna('Unknown')
    return df_copy
### Demo function call
demo_df_data_cleaning_and_standardization = pd.DataFrame({
'Scholarship': [np.nan, 'UPBW', np.nan, 'DUN', np.nan],
'Permanent Region': ['CA', 'NY', np.nan, 'CA', np.nan],
'GPA': [3.8, 3.6, 3.9, np.nan, 3.5],
'HS Class Size': [np.nan, 250, 150, np.nan, 300],
'Level of Financial Need': ['Low OxyS - $1 to $19,999', 'Medium OxyS - $20,000 - $29,999', 'Very High OxyS - $46,000 +', 'Unknown - In Progress', 'No OxyS - $0']
})
demo_output_data_cleaning_and_standardization = data_cleaning_and_standardization(demo_df_data_cleaning_and_standardization)
display(demo_output_data_cleaning_and_standardization)
Example: A correct implementation should produce the following output for the provided demo DataFrame:
|  | Scholarship | Permanent Region | GPA | HS Class Size | Level of Financial Need |
|---|---|---|---|---|---|
| 0 | No Scholarship | CA | 3.8 | 250.0 | Low |
| 1 | UPBW | NY | 3.6 | 250.0 | Medium |
| 2 | No Scholarship | Unknown | 3.9 | 150.0 | Very High |
| 3 | DUN | CA | 3.7 | 250.0 | Unknown |
| 4 | No Scholarship | Unknown | 3.5 | 300.0 | No |
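To sanity-check the two trickier steps yourself, the snippet below (an illustration, not part of the graded solution) confirms that `Series.median()` skips missing values by default and that `Series.replace` maps labels via a dict:

```python
import numpy as np
import pandas as pd

# The GPA column from the demo DataFrame; median() ignores the NaN,
# so the fill value is the median of the four observed GPAs (3.7).
gpa = pd.Series([3.8, 3.6, 3.9, np.nan, 3.5])
filled = gpa.fillna(gpa.median())

# replace() with a dict maps each matching label to its simplified form.
need = pd.Series(['No OxyS - $0', 'Unknown - In Progress'])
simplified = need.replace({'No OxyS - $0': 'No', 'Unknown - In Progress': 'Unknown'})
```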
The cell below will test your solution for data_cleaning_and_standardization (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 1
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=data_cleaning_and_standardization,
                                       ex_name='data_cleaning_and_standardization',
                                       key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
                                       n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to data_cleaning_and_standardization did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
state_coordinates
Your task: define state_coordinates as follows:
We need to tie together state data that exists in separate dictionaries. abbr_dict holds each state's abbreviation and state name. coor_dict holds each state name and the coordinates. This data needs to be joined so we have a single data structure that holds each state name, the coordinates, and the state abbreviation.
Input:
- abbr_dict: A dictionary with each key the state abbreviation and each value the respective state name, each in lowercase.
- coor_dict: A nested dictionary with the outer keys latitude and longitude. Each inner dictionary holds the full state name (for example 'California') as the key and the respective coordinate float as the value.
Output:
- state_data: A new list of dictionaries where each dictionary holds the data for one state, sorted by the state's full name. Within each dictionary will be the following key-value pairs:
  - state: The full name of the respective state (for example, 'California') as a string.
  - latitude: The latitude as a float.
  - longitude: The longitude as a float.
  - abbr: The upper-case abbreviation of the respective state (for example 'CA') as a string.
Additional Notes:
- coor_dict is guaranteed to have the keys latitude and longitude.
- The latitude and longitude values in coor_dict are expected to be floats. If there are data type inconsistencies, they should be handled appropriately to maintain the expected format.
### Solution - Exercise 2
def state_coordinates(abbr_dict: dict, coor_dict: dict) -> list:
    ###
    ### YOUR CODE HERE
    ###
    state_data = []
    # Invert abbr_dict: map each full state name to its upper-case abbreviation
    state_map = {v: k.upper() for k, v in abbr_dict.items()}
    for state, latitude in coor_dict['latitude'].items():
        longitude = coor_dict['longitude'].get(state)
        abbr = state_map.get(state.lower())
        state_data.append({
            'state': state,
            'latitude': float(latitude),   # coerce in case the value arrived as a string
            'longitude': float(longitude),
            'abbr': abbr
        })
    # Sort the records by the full state name
    state_data.sort(key=lambda x: x['state'])
    return state_data
### Demo function call
demo_abbr_dict_state_coordinates = {'nc': 'north carolina', 'ca': 'california'}
demo_coor_dict_state_coordinates = {'latitude': {'California': 36.17, 'North Carolina': 35.6411},
                                    'longitude': {'California': -119.7462, 'North Carolina': -79.8431}}
demo_output_state_coordinates = state_coordinates(demo_abbr_dict_state_coordinates, demo_coor_dict_state_coordinates)
display(demo_output_state_coordinates)
Example: A correct implementation should produce the following output for the provided demo dictionaries:
[{'state': 'California',
'latitude': 36.17,
'longitude': -119.7462,
'abbr': 'CA'},
{'state': 'North Carolina',
'latitude': 35.6411,
'longitude': -79.8431,
'abbr': 'NC'}]
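The key trick in this exercise is inverting the abbreviation dictionary so it can be looked up by state name; the snippet below (illustrative only) shows that inversion plus the `float` coercion mentioned in the notes:

```python
# Invert a {abbr: name} dict into a {name: ABBR} lookup table.
abbr_dict = {'nc': 'north carolina', 'ca': 'california'}
state_map = {name: abbr.upper() for abbr, name in abbr_dict.items()}

# Coordinates may arrive as strings; float() coerces them to the expected type.
lat = float('36.17')
```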
The cell below will test your solution for state_coordinates (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 2
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=state_coordinates,
                                       ex_name='state_coordinates',
                                       key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
                                       n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to state_coordinates did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
add_lat_long
Your task: define add_lat_long as follows:
Add two new columns to the input DataFrame (latitude and longitude), based on the Permanent Region column. Use the provided state_data list to map state abbreviations to their corresponding geographic coordinates.
Input:
- df: A Pandas DataFrame. It must include the column Permanent Region, which contains state abbreviations (e.g., "CA", "NY"). (For this exercise, the input df is the admission_df modified through prior steps.)
- state_data: A list of dictionaries, where each dictionary represents a state and includes the following fields:
  - "abbr": State abbreviation (e.g., "CA", "NY")
  - "latitude": Latitude of the state (float)
  - "longitude": Longitude of the state (float)
Output:
A new Pandas DataFrame that includes all columns from the input df, plus two new columns:
- latitude: Latitude of the state based on Permanent Region
- longitude: Longitude of the state based on Permanent Region
Requirements/steps:
- Build a mapping from state abbreviations to coordinates using state_data. For example, the dictionary { "CA": {"latitude": 36.7783, "longitude": -119.4179} } maps "CA" to its coordinates.
- Populate the latitude and longitude columns based on the Permanent Region column in df.
- If a Permanent Region doesn't match any key in the mapping, set latitude and longitude to pd.NA.
### Solution - Exercise 3
def add_lat_long(df: pd.DataFrame, state_data) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    # Turn the list of state dictionaries into a DataFrame keyed on the abbreviation
    state_df = pd.DataFrame(state_data)
    state_df = state_df.rename(columns={'abbr': 'Permanent Region'})
    # Left merge keeps every student row; unmatched regions get missing values
    merged_df = pd.merge(df, state_df, on='Permanent Region', how='left')
    merged_df['latitude'] = merged_df['latitude'].fillna(pd.NA)
    merged_df['longitude'] = merged_df['longitude'].fillna(pd.NA)
    # Drop the helper 'state' column brought in by the merge
    final = merged_df.drop('state', axis=1)
    return final
### Demo function call
demo_df_add_lat_long = pd.DataFrame({"Permanent Region": ["FL", "NY", "CA", "GA", "FR"]})
demo_state_data = [
{"state": "Florida", "latitude": 27.8333, "longitude": -81.717, "abbr": "FL"},
{"state": "New York", "latitude": 42.1497, "longitude": -74.9384, "abbr": "NY"},
{"state": "California", "latitude": 36.17, "longitude": -119.7462, "abbr": "CA"},
{"state": "Georgia", "latitude": 32.9866, "longitude": -83.6487, "abbr": "GA"}
]
demo_output_add_lat_long = add_lat_long(demo_df_add_lat_long, demo_state_data)
display(demo_output_add_lat_long)
Example: A correct implementation should produce the following output:
| Permanent Region | latitude | longitude |
|---|---|---|
| FL | 27.8333 | -81.717 |
| NY | 42.1497 | -74.9384 |
| CA | 36.17 | -119.7462 |
| GA | 32.9866 | -83.6487 |
| FR | <NA> | <NA> |
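The missing values in the final row come from the left merge: every input row is kept, and keys with no match in the lookup table receive missing coordinates. A minimal illustration (not part of the graded solution):

```python
import pandas as pd

# Two student regions, but the lookup table only knows 'CA'.
df = pd.DataFrame({'Permanent Region': ['CA', 'FR']})
state_df = pd.DataFrame({'Permanent Region': ['CA'], 'latitude': [36.17]})

# how='left' keeps both rows; the unmatched 'FR' row gets a missing latitude.
merged = pd.merge(df, state_df, on='Permanent Region', how='left')
```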
The cell below will test your solution for add_lat_long (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=add_lat_long,
                                       ex_name='add_lat_long',
                                       key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
                                       n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to add_lat_long did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calculate_distance
Your task: define calculate_distance as follows:
Calculate the distance in miles from each student's location to the school. The location of each student is determined by the latitude and longitude columns in the input DataFrame, while the school's location is specified by the coordinates latitude 34.1271 and longitude -118.2109.
Input:
- df: A Pandas DataFrame containing two columns:
  - latitude: Latitude of the student's location in decimal degrees.
  - longitude: Longitude of the student's location in decimal degrees.
Output:
- A new Pandas DataFrame that includes all columns from the input df, with an additional column:
  - distance_from_school: The distance (in miles) from each student's location to the school.
Requirements/steps:
1. Convert Columns to Numeric: Use the pd.to_numeric function to convert the latitude and longitude columns to numeric values and coerce any errors. This step ensures that non-numeric or missing values are converted to NaN.
2. Convert Coordinates to Radians: Use the np.radians function to convert all latitude and longitude values (both student and school) from degrees to radians.
3. Apply the Haversine Formula: Compute the distance between each student's location and the school's location using the following formula:
$$d = 2r \cdot \arcsin \left( \sqrt{\sin^2 \left( \frac{\Delta \phi}{2} \right) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2 \left( \frac{\Delta \lambda}{2} \right)} \right)$$
Definitions:
- $\phi_1$ and $\phi_2$ are the latitudes (in radians) of the school and the student, respectively.
- $\lambda_1$ and $\lambda_2$ are the longitudes (in radians) of the school and the student, respectively.
- $\Delta \phi = \phi_2 - \phi_1$ (difference in latitudes).
- $\Delta \lambda = \lambda_2 - \lambda_1$ (difference in longitudes).
- $r$ is the Earth's radius (mean radius = 3,959 miles).
4. Add the New Column: Create a new column, distance_from_school, to store the calculated distances.
### Solution - Exercise 4
def calculate_distance(df: pd.DataFrame, school_lat: float, school_lon: float) -> pd.DataFrame:
    r"""Calculate the distance in miles from each student's location
    to the school. The location of each student is determined by the `latitude` and `longitude` columns in the input DataFrame,
    while the school's location is specified by the coordinates latitude `34.1271` and longitude `-118.2109`.

    **Input:**
    - `df`: A Pandas DataFrame containing two columns:
        - `latitude`: Latitude of the student's location in decimal degrees.
        - `longitude`: Longitude of the student's location in decimal degrees.

    **Output:**
    - A new Pandas DataFrame that includes all columns from the input `df`, with an additional column:
        - `distance_from_school`: The distance (in miles) from each student's location to the school.

    **Requirements/steps:**
    1. **Convert Columns to Numeric:**
        - Use the `pd.to_numeric` function to convert the `latitude` and `longitude` columns to numeric values and coerce any errors. This step ensures that non-numeric or missing values are converted to `NaN`.
    2. **Convert Coordinates to Radians:**
        - Use the `np.radians` function to convert all latitude and longitude values (both student and school) from degrees to radians.
    3. **Apply the Haversine Formula:**
        - Compute the distance between each student's location and the school's location using the following formula:
          $$d = 2r \cdot \arcsin \left( \sqrt{\sin^2 \left( \frac{\Delta \phi}{2} \right) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2 \left( \frac{\Delta \lambda}{2} \right)} \right)$$
        - Definitions:
            - $\phi_1$ and $\phi_2$ are the latitudes (in radians) of the school and the student, respectively.
            - $\lambda_1$ and $\lambda_2$ are the longitudes (in radians) of the school and the student, respectively.
            - $\Delta \phi = \phi_2 - \phi_1$ (difference in latitudes).
            - $\Delta \lambda = \lambda_2 - \lambda_1$ (difference in longitudes).
            - $r$ is the Earth's radius (mean radius = 3,959 miles).
    4. **Add the New Column:**
        - Create a new column, `distance_from_school`, to store the calculated distances.
    """
    ###
    ### YOUR CODE HERE
    ###
    school_lat_rad = np.radians(school_lat)
    school_lon_rad = np.radians(school_lon)
    df_copy = df.copy()
    # Coerce any non-numeric coordinates to NaN
    df_copy['latitude'] = pd.to_numeric(df_copy['latitude'], errors='coerce')
    df_copy['longitude'] = pd.to_numeric(df_copy['longitude'], errors='coerce')
    # Convert to radians and compute the coordinate deltas
    df_copy['latitude_rad'] = np.radians(df_copy['latitude'])
    df_copy['longitude_rad'] = np.radians(df_copy['longitude'])
    df_copy['delta_lat'] = df_copy['latitude_rad'] - school_lat_rad
    df_copy['delta_lon'] = df_copy['longitude_rad'] - school_lon_rad
    # Haversine formula
    a = (np.sin(df_copy['delta_lat'] / 2) ** 2 +
         np.cos(school_lat_rad) * np.cos(df_copy['latitude_rad']) * np.sin(df_copy['delta_lon'] / 2) ** 2)
    c = 2 * np.arcsin(np.sqrt(a))
    r = 3959  # Earth's mean radius in miles
    df_copy['distance_from_school'] = r * c
    # Drop the intermediate helper columns
    final = df_copy.drop(columns=['latitude_rad', 'longitude_rad', 'delta_lat', 'delta_lon'])
    return final
### Demo function call
demo_df_calculate_distance = pd.DataFrame({
"Permanent Region": ["FL", "NY", "CA", "GA", "TX"],
"latitude": [27.8333, 42.1497, 36.17, 32.9866, "invalid"],
"longitude": [-81.717, -74.9384, -119.7462, -83.6487, "missing"]
})
demo_output_calculate_distance = calculate_distance(
demo_df_calculate_distance,
school_lat=34.1271,
school_lon=-118.2109
)
demo_output_calculate_distance
Example: A correct implementation should produce the following output:
|  | Permanent Region | latitude | longitude | distance_from_school |
|---|---|---|---|---|
| 0 | FL | 27.8333 | -81.717 | 2193.210034 |
| 1 | NY | 42.1497 | -74.9384 | 2389.334519 |
| 2 | CA | 36.17 | -119.7462 | 165.674547 |
| 3 | GA | 32.9866 | -83.6487 | 1982.193883 |
| 4 | TX | NaN | NaN | NaN |
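As a sanity check on the formula, the scalar helper below (an illustrative re-implementation, not the graded one) should return zero for the school's own coordinates and roughly the CA value shown above:

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2, r=3959.0):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlam = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Distance from the school to itself is zero; the CA row is near 165.7 miles.
d_self = haversine_miles(34.1271, -118.2109, 34.1271, -118.2109)
d_ca = haversine_miles(34.1271, -118.2109, 36.17, -119.7462)
```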
The cell below will test your solution for calculate_distance (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 4
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=calculate_distance,
                                       ex_name='calculate_distance',
                                       key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
                                       n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_distance did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
one_hot_encode
Your task: define one_hot_encode as follows:
Perform one-hot encoding on a dataset using pandas.get_dummies.
Input:
- df: A pandas DataFrame containing the columns to be one-hot encoded.
Output:
- encoded_df: A pandas DataFrame containing:
  - One-hot encoded versions of the categorical columns.
  - Any remaining columns of df retained unmodified.
Requirements/steps:
- One-hot encode the categorical columns: Financial Aid Intent, Scholarship, Ethnic Background, Permanent Region, Sex, and Level of Financial Need, with the indicator columns encoded as floats.
Additional Notes:
- Refer to the pandas.get_dummies documentation for more information.
### Solution - Exercise 5
def one_hot_encode(df: pd.DataFrame):
    ###
    ### YOUR CODE HERE
    ###
    # Categorical columns to one-hot encode; all other columns pass through unchanged
    cols = [
        "Financial Aid Intent",
        "Scholarship",
        "Ethnic Background",
        "Permanent Region",
        "Sex",
        "Level of Financial Need"]
    one_hot_df = pd.get_dummies(df, columns=cols, dtype=float)
    return one_hot_df
### Demo function call
demo_df_one_hot_encode = pd.DataFrame({
'Financial Aid Intent': ['FAY', 'FAY', 'FAN', 'FAY', 'FAY', 'FAN', 'FAN', 'FAY', 'FAY', 'FAN'],
'Scholarship': ['NO', 'LDRS', 'NO', 'NO', 'No Scholarship', 'DIR', 'DIR', 'NO', 'NO', 'DIR'],
'Ethnic Background': ['White', 'Black', 'Asian', 'White', 'Multiracial', 'White', 'White', 'Black', 'Asian', 'White'],
'Permanent Region': ['CA', 'NY', 'CA', 'CA', 'CA', 'WA', 'WA', 'NY', 'CA', 'WA'],
'Sex': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M'],
'Level of Financial Need': ['Low', 'Very High', 'Medium', 'Low', 'High', 'Low', 'Low', 'Medium', 'High', 'Very High']
})
demo_output_one_hot_encode = one_hot_encode(demo_df_one_hot_encode)
display(demo_output_one_hot_encode)
Example: A correct implementation should produce the following output:
Financial Aid Intent_FAN | Financial Aid Intent_FAY | Scholarship_DIR | Scholarship_LDRS | Scholarship_NO | Scholarship_No Scholarship | Ethnic Background_Asian | Ethnic Background_Black | Ethnic Background_Multiracial | Ethnic Background_White | Permanent Region_CA | Permanent Region_NY | Permanent Region_WA | Sex_F | Sex_M | Level of Financial Need_High | Level of Financial Need_Low | Level of Financial Need_Medium | Level of Financial Need_Very High |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
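Note that `get_dummies` expands only the columns you list and carries any other columns through unchanged, placing them before the new indicator columns. A minimal illustration on an invented two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F'], 'GPA': [3.8, 3.6]})

# Only 'Sex' is expanded; 'GPA' passes through unmodified.
# dtype=float gives 0.0/1.0 indicators instead of booleans.
encoded = pd.get_dummies(df, columns=['Sex'], dtype=float)
```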
The cell below will test your solution for one_hot_encode (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 5
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=one_hot_encode,
                                       ex_name='one_hot_encode',
                                       key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
                                       n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to one_hot_encode did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
balance_split_data
Your task: define balance_split_data as follows:
Balance the dataset by oversampling the minority class and split the data into training and test datasets. You will use the train_test_split and resample functions from sklearn to complete this task. The functions have been imported for you; refer to their documentation for more information.
Input:
- df: A pandas DataFrame containing the student admissions data.
- ind_col: The indicator column containing the dependent variable encoded as binary: 0 or 1 (Gross Commit Indicator in our data).
- test_size: A float between 0.0 and 1.0 representing the proportion of the dataset to include in the test split.
- random_state: An integer representing the random state for reproducibility.
Output:
- X_train: A pandas DataFrame containing the features for the training dataset.
- X_test: A pandas DataFrame containing the features for the test dataset.
- y_train: A pandas Series containing the target variable for the training dataset.
- y_test: A pandas Series containing the target variable for the test dataset.
Each object represents a part of the split balanced dataset, with X_train and X_test containing the independent variables (features), and y_train and y_test containing the dependent variable (target). The features in X_train and X_test retain the same column names as the input DataFrame, excluding the indicator column (ind_col) used as the target.
Requirements/steps:
- Separate the data into two DataFrames: one for the minority class (ind_col = 1) and one for the majority class (ind_col = 0).
- Use sklearn.utils.resample to oversample the minority class to match the majority class size, passing the random_state parameter to the resampling function for reproducibility.
- Use train_test_split to split the balanced data into training and test sets, passing the random_state parameter to the split function for reproducibility and the stratify parameter to maintain the class balance in both splits.
- Separate the features (X_train, X_test) and target variables (y_train, y_test) for both sets.
Notes:
### Solution - Exercise 6
def balance_split_data(df: pd.DataFrame, ind_col: str, test_size: float, random_state: int):
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample
    ###
    ### YOUR CODE HERE
    ###
    # First, balance the data by oversampling the minority class
    majority_class = df[df[ind_col] == 0]
    minority_class = df[df[ind_col] == 1]
    minority_oversampled = resample(minority_class,
                                    replace=True,  # sample with replacement so the class can grow
                                    n_samples=len(majority_class),
                                    random_state=random_state)
    balanced_df = pd.concat([majority_class, minority_oversampled])
    # Then split the balanced data, stratifying to preserve the 50/50 class balance
    X = balanced_df.drop(columns=[ind_col])
    y = balanced_df[ind_col]
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=test_size,
                                                        random_state=random_state,
                                                        stratify=y)
    return X_train, X_test, y_train, y_test
### Demo function call
demo_df_balance_encode_split_data = pd.DataFrame({
'Gross Commit Indicator': [0, 0, 0, 0, 1, 1, 1, 1, 0, 1],
"Financial Aid Intent_FAN": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
"Financial Aid Intent_FAY": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
"Scholarship_DIR": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
"Scholarship_LDRS": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
"Scholarship_NO": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
"Scholarship_No Scholarship": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
"Ethnic Background_Asian": [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
"Ethnic Background_Black": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
"Ethnic Background_White": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
'GPA': [3.67, 3.76, 3.58, 4.0, 3.9, 3.72, 3.94, 3.76, 3.58, 3.67],
'HS Class Size': [80.0, 26.0, 642.0, 303.0, 288.0, 288.0, 77.0, 77.0, 303.0, 288.0],
"Ethnic Background_Multiracial": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
"Permanent Region_CA": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
"Permanent Region_NY": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
"Permanent Region_WA": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
"Sex_F": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0],
"Sex_M": [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
"Level of Financial Need_High": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
"Level of Financial Need_Low": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
"Level of Financial Need_Medium": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
"Level of Financial Need_Very High": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
})
X_train, X_test, y_train, y_test = balance_split_data(
demo_df_balance_encode_split_data,
ind_col='Gross Commit Indicator',
test_size=0.2,
random_state=42
)
print(f"X_train.shape: {X_train.shape}")
print(f"X_test.shape: {X_test.shape}")
print(f"y_train.shape: {y_train.shape}")
print(f"y_test.shape: {y_test.shape}")
Example: A correct implementation should produce the following output:
X_train.shape: (8, 21)
X_test.shape: (2, 21)
y_train.shape: (8,)
y_test.shape: (2,)
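As an optional sanity check (not part of the exam), the oversample-then-stratified-split recipe can be exercised on a tiny synthetic frame; the toy data and column names below are invented for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced toy frame: 8 negatives, 2 positives (invented for illustration)
toy = pd.DataFrame({'y': [0] * 8 + [1] * 2, 'x': range(10)})

# Oversample the minority class up to the majority count
minority_oversampled = resample(toy[toy['y'] == 1], replace=True,
                                n_samples=8, random_state=0)
balanced = pd.concat([toy[toy['y'] == 0], minority_oversampled])

# A stratified split preserves the 50/50 balance in both pieces
X_tr, X_te, y_tr, y_te = train_test_split(
    balanced.drop(columns=['y']), balanced['y'],
    test_size=0.25, random_state=0, stratify=balanced['y'])

print(y_tr.mean(), y_te.mean())  # both 0.5
```

Without stratify=..., small splits can easily end up with skewed class proportions even after balancing, which is why the exercise asks for it explicitly.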
The cell below will test your solution for balance_split_data (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.series_handler(balance_split_data),
ex_name='balance_split_data',
key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
n_iter=30)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to balance_split_data did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calculate_f1_score
Your task: define calculate_f1_score as follows:
Calculate the F1 score for a classification model's predictions.
The F1 score is a metric that balances precision and recall, making it particularly useful for evaluating models on imbalanced datasets.
Input:
- y_true: A list or NumPy array of true class labels (must contain only 0 or 1).
- y_pred: A list or NumPy array of predicted class labels (must contain only 0 or 1).
Output:
- f1_score: A float representing the F1 score, rounded to 3 decimal places.
Definitions:
- True Positives (TP): the number of correctly predicted positive class instances.
- False Positives (FP): the number of instances incorrectly predicted as positive (but actually negative).
- False Negatives (FN): the number of instances incorrectly predicted as negative (but actually positive).
Formulas:
- Precision ($P$): $P = \frac{TP}{TP + FP}$
- Recall ($R$): $R = \frac{TP}{TP + FN}$
- F1 Score ($F1$): $F1 = 2 \cdot \frac{P \cdot R}{P + R}$
Special Cases:
- If $TP + FP = 0$ (no positive predictions), precision is undefined.
- If $TP + FN = 0$ (no actual positives), recall is undefined.
- In both cases, set the F1 score to 0 to handle division by zero.
Hints:
- Use NumPy's vectorized operations (e.g., np.sum) to efficiently calculate $TP$, $FP$, and $FN$.
### Solution - Exercise 7
def calculate_f1_score(y_true, y_pred):
r"""
Calculate the F1 score for a classification model's predictions.
The F1 score is a metric that balances precision and recall, making it particularly useful for evaluating models on imbalanced datasets.
**Input:**
- `y_true`: A list or NumPy array of true class labels (must contain only 0 or 1).
- `y_pred`: A list or NumPy array of predicted class labels (must contain only 0 or 1).
**Output:**
- `f1_score`: A float representing the F1 score, rounded to 3 decimal places.
**Definitions:**
- **True Positives (TP):** The number of correctly predicted positive class instances.
- **False Positives (FP):** The number of instances incorrectly predicted as positive (but actually negative).
- **False Negatives (FN):** The number of instances incorrectly predicted as negative (but actually positive).
**Formulas:**
1. **Precision ($P$):**
$$P = \frac{TP}{TP + FP}$$
Precision measures the proportion of positive predictions that are correct.
2. **Recall ($R$):**
$$R = \frac{TP}{TP + FN}$$
Recall measures the proportion of actual positive instances that are correctly predicted.
3. **F1 Score ($F1$):**
$$F1 = 2 \cdot \frac{P \cdot R}{P + R}$$
The F1 score is the harmonic mean of precision and recall.
**Special Cases:**
- If $TP + FP = 0$ (no positive predictions), precision is undefined.
- If $TP + FN = 0$ (no actual positives), recall is undefined.
- In both cases, set the F1 score to 0 to handle division by zero.
**Hints:**
- Use NumPy's vectorized operations (e.g., `np.sum`) to efficiently calculate $TP$, $FP$, and $FN$.
"""
    ###
    ### YOUR CODE HERE
    ###
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    TP = np.sum((y_true == 1) & (y_pred == 1))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    # Special cases: undefined precision or recall => F1 is defined as 0
    if TP + FP == 0 or TP + FN == 0:
        return 0.0
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    # Guard against precision = recall = 0 (TP = 0 with both FP and FN present)
    if precision + recall == 0:
        return 0.0
    f1_score = 2 * (precision * recall) / (precision + recall)
    return round(f1_score, 3)
### Demo function call
demo_y_true_f1_score = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
demo_y_pred_f1_score = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])
demo_output_f1_score = calculate_f1_score(demo_y_true_f1_score, demo_y_pred_f1_score)
print(demo_output_f1_score)
Example: A correct implementation should produce the following output:
# Example Input
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]
# Example Output
F1 Score: 0.8
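If scikit-learn is available, its reference implementation makes a convenient cross-check for the hand-rolled formula (this comparison is not required by the exercise):

```python
from sklearn.metrics import f1_score as sk_f1

y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

# TP = 4, FP = 1, FN = 1  ->  P = R = 0.8  ->  F1 = 0.8
print(round(sk_f1(y_true, y_pred), 3))  # 0.8
```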
The cell below will test your solution for calculate_f1_score (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=calculate_f1_score,
ex_name='calculate_f1_score',
key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_f1_score did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
compute_logistic_metrics
Your task: define compute_logistic_metrics as follows:
Calculate the predictions and gradients required for training a logistic regression model.
Input:
- X: A NumPy array of shape (m, n+1) containing the input features. The first column must be all ones, representing the bias term. Each row represents a single observation, and each column represents a feature.
- y_true: A NumPy array of shape (m,) containing the true class labels (0 or 1) for each observation.
- weights: A NumPy array of shape (n+1,) representing the model weights, including the bias term.
Output:
- sigmoid_predictions: A NumPy array of shape (m,) containing the predicted probabilities for each observation, computed using the sigmoid function.
- gradients_w: A NumPy array of shape (n+1,) containing the gradient of the cost with respect to each weight, including the bias term.
Requirements/steps:
1. Compute the linear combination of inputs: $z = X \cdot weights$.
2. Apply the sigmoid function to $z$ to compute predicted probabilities: $\sigma(z) = \frac{1}{1 + e^{-z}}$. These are referred to as $\hat{y}$ in logistic regression.
3. Compute the gradient of the cost with respect to the weights: $\frac{\partial J}{\partial w} = \frac{1}{m} X^T (\hat{y} - y)$, where $m$ is the number of samples.
Hints:
- Use NumPy's vectorized operations for efficient computation of the gradients.
- Note that X includes the bias term as the first column.
### Solution - Exercise 8
def compute_logistic_metrics(X, y_true, weights):
    r"""
    Calculate the predictions and gradients required for training a logistic regression model.
    **Input:**
    - `X`: A NumPy array of shape (m, n+1) containing the input features. The first column must be all ones, representing the bias term. Each row represents a single observation, and each column represents a feature.
    - `y_true`: A NumPy array of shape (m,) containing the true class labels (0 or 1) for each observation.
    - `weights`: A NumPy array of shape (n+1,) representing the model weights, including the bias term.
    **Output:**
    - `sigmoid_predictions`: A NumPy array of shape (m,) containing the predicted probabilities for each observation. These values are computed using the sigmoid function.
    - `gradients_w`: A NumPy array of shape (n+1,) containing the gradient of the cost with respect to each weight, including the bias term.
    **Requirements/steps:**
    1. **Compute the Linear Combination of Inputs**:
       - Calculate $z$, the linear combination of inputs, using the formula:
         $$z = X \cdot weights$$
    2. **Compute Sigmoid Predictions**:
       - Apply the sigmoid function to $z$ to compute predicted probabilities:
         $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
       - Note: These predicted probabilities, $\sigma(z)$, are referred to as $\hat{y}$ in logistic regression.
    3. **Compute Gradients**:
       - Calculate the gradient of the cost with respect to the weights:
         $$\frac{\partial J}{\partial w} = \frac{1}{m} X^T (\hat{y} - y)$$
       - Here, $\hat{y}$ represents the predicted probabilities (sigmoid predictions), and $m$ is the number of samples in the dataset.
    **Hints:**
    - Use NumPy's vectorized operations for efficient computation of gradients.
    - Note that $X$ includes the bias term as the first column.
    """
    ###
    ### YOUR CODE HERE
    ###
    z = np.dot(X, weights)
    sigmoid_predictions = 1 / (1 + np.exp(-z))
    m = X.shape[0]
    error = sigmoid_predictions - y_true
    gradients_w = (1 / m) * np.dot(X.T, error)
    return sigmoid_predictions, gradients_w
### Demo function call
# Example Input
demo_X_logistic_metrics = np.array([[1, 1, 2], [1, 3, 4], [1, 5, 6]]) # Includes the bias column
demo_y_true_logistic_metrics = np.array([0, 1, 1])
demo_weights_logistic_metrics = np.array([0.5, 0.1, 0.2]) # Includes bias as the first weight
# Compute Outputs
demo_sigmoid_predictions, demo_gradients_w = compute_logistic_metrics(
demo_X_logistic_metrics, demo_y_true_logistic_metrics, demo_weights_logistic_metrics
)
# Print Outputs
print(f"Sigmoid Predictions: {demo_sigmoid_predictions}")
print(f"Gradients (Weights): {demo_gradients_w}")
Example: A correct implementation should produce the following output:
# Example Input
X = np.array([[1, 1, 2], [1, 3, 4], [1, 5, 6]]) # Includes the bias column as the first column
y_true = np.array([0, 1, 1])
weights = np.array([0.5, 0.1, 0.2])
# Example Output
Sigmoid Predictions: [0.73105858 0.83201839 0.90024951]
Gradients (Weights): [ 0.15444216 -0.09054624  0.06389592]
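A standard way to gain confidence in the analytic gradient is a central finite-difference check against the mean cross-entropy loss; the small loss helper below is our own scaffolding, not part of the exam:

```python
import numpy as np

def logistic_loss(X, y, w):
    # Mean cross-entropy loss for logistic regression
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1., 1., 2.], [1., 3., 4.], [1., 5., 6.]])
y = np.array([0., 1., 1.])
w = np.array([0.5, 0.1, 0.2])

# Analytic gradient: (1/m) X^T (sigmoid(Xw) - y)
p = 1 / (1 + np.exp(-X @ w))
analytic = X.T @ (p - y) / len(y)

# Numerical gradient via central differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(logistic_loss(X, y, w + eps * e)
                     - logistic_loss(X, y, w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

print(np.max(np.abs(analytic - numeric)))  # tiny, on the order of 1e-10
```

If the two disagree by more than roughly 1e-6, the gradient formula (or its implementation) is suspect.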
The cell below will test your solution for compute_logistic_metrics (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=compute_logistic_metrics,
ex_name='compute_logistic_metrics',
key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to compute_logistic_metrics did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
If you have made it this far, congratulations! You are done. Please submit your exam!
The remainder of this notebook combines the work you have completed above with a few additional steps to build a working logistic regression model.
## Data Exploration and Cleaning
# Step 0: Fill in missing values and standardize feature formats
cleaned_df = data_cleaning_and_standardization(admission_df)
# Step 1: Create clean state_data dictionary
abbr_dict = utils.load_object_from_publicdata('abbr_dict.dill')
coor_dict = utils.load_object_from_publicdata('coor_dict.dill')
state_data = (abbr_dict, coor_dict)
## Feature Engineering
# Step 2: Use state_data to feature engineer new columns in cleaned_df
lat_long_df = add_lat_long(cleaned_df, state_data)
# Step 3: Add student distance from school
distance_df = calculate_distance(lat_long_df, 34.1271, -118.2109)
# Step 4: One-hot encode categorical features
one_hot_df = one_hot_encode(distance_df)
## Model Building and Evaluation
# Step 5: Split data into train and test while fixing class imbalance
X_train, X_test, y_train, y_test = balance_split_data(one_hot_df, 'Gross Commit Indicator', 0.2, 42)
# Step 6: Train a logistic regression model (this exercise is left to the reader)
def train_logistic_regression(X_train, y_train):
    # Initialize the weights and bias to zero
    # Implement the cost and gradients from the previous exercises
    # Use gradient descent to update the weights and bias
    return weights, bias
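One possible way to fill in this step is plain batch gradient descent, reusing the gradient formula from Exercise 8 but with a separate bias term. The sketch below is illustrative only: the learning rate (0.1), iteration count (1000), and the tiny demo dataset are our own choices, not values prescribed by the exam.

```python
import numpy as np

def train_logistic_regression_sketch(X_train, y_train, lr=0.1, n_iter=1000):
    # Batch gradient descent; lr and n_iter are illustrative defaults
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    m, n = X.shape
    weights = np.zeros(n)  # initialize weights to zero
    bias = 0.0             # initialize bias to zero
    for _ in range(n_iter):
        y_hat = 1 / (1 + np.exp(-(X @ weights + bias)))  # sigmoid predictions
        error = y_hat - y
        weights -= lr * (X.T @ error) / m  # same gradient as Exercise 8
        bias -= lr * error.mean()          # bias gradient is the mean error
    return weights, bias

# Tiny separable example: label is 1 when the single feature is positive
Xd = np.array([[-2.0], [-1.0], [1.0], [2.0]])
yd = np.array([0, 0, 1, 1])
w, b = train_logistic_regression_sketch(Xd, yd)
preds = (1 / (1 + np.exp(-(Xd @ w + b))) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```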
# Step 7: Make predictions
def predict_logistic_regression(X_train, y_train, X_test):
    weights, bias = train_logistic_regression(X_train, y_train)
    # Apply the sigmoid function to compute predicted probabilities
    y_pred = 1 / (1 + np.exp(-(np.dot(X_test, weights) + bias)))
    y_pred_class = (y_pred >= 0.5).astype(int)
    return y_pred_class
y_pred = predict_logistic_regression(X_train, y_train, X_test)
# Step 8: Evaluate the model
f1_score = calculate_f1_score(y_test, y_pred)
print(f"Model F1 Score: {f1_score}")
Model F1 Score: 0.595
Our model converges!