Final Exam, Fall 2024: Predicting Student College Enrollment

Version 1.0.1

All of the header information is important. Please read it..

Topics number of exercises: This problem builds on your knowledge of ['Basic Python', 'Pandas', 'Numpy', 'math as code', 'data cleaning', 'feature engineering', 'logistic regression', 'model evaluation']. It has 9 exercises numbered 0 to 8. There are 17 available points. However to earn 100% the threshold is 13 points. (Therefore once you hit 13 points you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises but you may solve them in any order. That is if you can't solve an exercise you can still move on and try the next one. Use this to your advantage as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them but we may not print them in the starter code.

Debugging your code: Right before each exercise test cell there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (careful when printing large objects you may want to print the head or chunks of rows at a time).

Exercise point breakdown:

  • Exercise 0 - : 1 point(s)

  • Exercise 1 - : 2 point(s)

  • Exercise 2 - : 3 point(s)

  • Exercise 3 - : 1 point(s)

  • Exercise 4 - : 2 point(s)

  • Exercise 5 - : 1 point(s)

  • Exercise 6 - : 3 point(s)

  • Exercise 7 - : 2 point(s)

  • Exercise 8 - : 2 point(s)

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

Exam Introduction

Your overall task. In this exam, you will work with a dataset containing metadata for students admitted to a small liberal arts college. The college seeks to predict which students are likely to commit to enrolling. This is critical for meeting enrollment targets and allocating resources effectively. The target variable in this dataset is the Gross Commit Indicator, which has two possible values: 1 if the student commits and 0 if the student does not commit. Your goal is to develop a logistic regression model to predict these outcomes.

Overview: You will follow a structured workflow to process the data, engineer features, and build a logistic regression model. The notebook is organized into three main sections:

  1. Data Exploration and Cleaning:

    • Explore the dataset to understand its structure, key features, and any potential issues such as missing or inconsistent data.
    • Clean the dataset by filling in missing values and standardizing feature formats to prepare it for analysis.
  2. Feature Engineering:

    • Add meaningful derived features, such as geographic distances, that can improve the predictive power of the model.
    • Transform categorical variables and other features into formats suitable for model building.
  3. Model Building and Evaluation:

    • Implement key functions for building a logistic regression model.
    • These functions will be integrated into the broader logistic regression pipeline to demonstrate the structure of a complete logistic regression model.

By the end of this exercise, you will understand how to calculate and apply key components of logistic regression, and see the implementation of the training loop provided for you.

In [1]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
import numpy as np
import pandas as pd
from collections import defaultdict

utils.add_from_file('series_handler', plugins)
cse6040_devkit.plugins
In [2]:
admission_df = utils.load_object_from_publicdata('admission_df.dill')
Successfully loaded admission_df.dill.

Exercise 0: (1 points)

get_dataframe_FREE

Example: we have defined get_dataframe_FREE as follows:

The exercise is meant to introduce you to the data used throughout the exam. You will receive a free point for completing this task. Run the provided code cell to explore the dataset and familiarize yourself with its structure.

Input:

  • df: A Pandas DataFrame containing student admission data. This dataset includes features such as GPA, financial aid information, and extracurricular interests. (For this exercise, the input df is the admission_df provided by the exam.)

Output:

  • Displays the first few rows of the DataFrame and the data types for each column.

The dataset: The dataset comes from the admissions office and includes several features related to students. The goal is to predict the Gross Commit Indicator, which indicates whether a student accepted the admission offer.

Dataset Features: Below are the key features included in the dataset:

Column Name Description
Gross Commit Indicator Indicates if the student accepted the offer (1) or not (0)
Financial Aid Intent Indicates if the student applied for financial aid
Scholarship Type of scholarship received, if any
Direct Legacy? (parent) Indicates if the student is a direct legacy
Ethnic Background Student's ethnic background
First Gen to College Indicates if the student is the first generation to attend college
Permanent Region Student's permanent region
GPA Student's GPA
HS Class Size Size of the student's high school class
Campus Visit Indicator? Indicates if the student visited the campus
Interview? Indicates if the student had an interview
Sex Student's sex
Level of Financial Need Indicates the student's level of financial need
Reader Academic Rating Rating given by the admissions reader

Instructions for this exercise:

  1. Run the provided test cell, which will execute the get_dataframe_FREE function using the predefined admission_df.
  2. Review the dataset by examining the printed output. Pay close attention to:
    • The features (columns) available in the dataset.
    • The presence of missing values or unexpected data types.
  3. Use this information to guide your understanding of subsequent exercises, which involve cleaning, transforming, and analyzing this dataset.
  4. No further action is required; completing this step awards you a free point!
In [3]:
### Solution - Exercise 0  
def get_dataframe_FREE(df):
    return df.head()

### Demo function call
df_head = get_dataframe_FREE(admission_df)
display(df_head)
display(df_head.dtypes)
Gross Commit Indicator Financial Aid Intent Scholarship Direct Legacy? (parent) Ethnic Background First Gen to College Permanent Region GPA HS Class Size Campus Visit Indicator? Interview? Sex Level of Financial Need Reader Academic Rating
0 0 FAY NO 0 White 0 CA 3.67 80.0 1 1 M Low OxyS - 1to19,999 3
1 0 FAY LDRS 0 Black 0 NY 3.76 26.0 1 0 M Very High OxyS - $46,000 + 2
2 0 FAY NO 0 Asian 0 CA 3.58 642.0 1 0 F Medium OxyS - 20,00029,999 3
3 0 FAY NO 0 White 0 CA 4.00 303.0 1 0 F Low OxyS - 1to19,999 2
4 0 FAY NO 0 White 0 CA 3.57 386.0 1 0 M Low OxyS - 1to19,999 4
Gross Commit Indicator       int64
Financial Aid Intent        string
Scholarship                 string
Direct Legacy? (parent)      int64
Ethnic Background           string
First Gen to College         int64
Permanent Region            string
GPA                        float64
HS Class Size              float64
Campus Visit Indicator?      int64
Interview?                   int64
Sex                         string
Level of Financial Need     string
Reader Academic Rating       int64
dtype: object


The test cell below will always pass. Please submit to collect your free points for get_dataframe_FREE (exercise 0).

In [4]:
### Test Cell - Exercise 0  


print('Passed! Please submit.')
Passed! Please submit.

Exercise 1: (2 points)

data_cleaning_and_standardization

Your task: define data_cleaning_and_standardization as follows:

Process the input DataFrame by:

  • Filling missing values in specific columns based on predefined strategies.
  • Standardizing the categories in the Level of Financial Need column for consistency.

Input:

  • df: A Pandas DataFrame containing student admission data. (For this exercise, the input df is the provided admission_df dataframe.)

Output:

  • A modified Pandas DataFrame where missing values are filled and the Level of Financial Need column is standardized.

Requirements/steps:

  1. Fill Missing Values:

    • Replace missing values in Scholarship with the string "No Scholarship".
    • Replace missing values in Permanent Region with the string "Unknown".
    • Replace missing values in GPA with the median of non-missing GPA values.
    • Replace missing values in HS Class Size with the median of non-missing class sizes.
    • Replace missing values in Level of Financial Need with the string "Unknown".
  2. Standardize Level of Financial Need:

    • Replace the existing labels in Level of Financial Need with the following simplified categories:
Existing Label Simplified Label
No OxyS - $0 No
Low OxyS - $1 to $19,999 Low
Medium OxyS - $20,000 - $29,999 Medium
High OxyS - $30,000 to $45,999 High
Very High OxyS - $46,000 + Very High
Unknown - In Progress Unknown
In [5]:
### Solution - Exercise 1  
def data_cleaning_and_standardization(df: pd.DataFrame) -> pd.DataFrame:

    ###
    ### YOUR CODE HERE
    ###
    
    df_copy = df.copy()

    df_copy['Scholarship'] = df_copy['Scholarship'].fillna('No Scholarship')
    df_copy['Permanent Region'] = df_copy['Permanent Region'].fillna('Unknown')
    df_copy['GPA'] = df_copy['GPA'].fillna(df['GPA'].median())
    df_copy['HS Class Size'] = df_copy['HS Class Size'].fillna(df['HS Class Size'].median())
    
    # Create dictionary for standardization of Level of Financial Need
    
    new = {
    'No OxyS - $0': 'No',
    'Low OxyS - $1 to $19,999': 'Low',
    'Medium OxyS - $20,000 - $29,999': 'Medium',
    'High OxyS - $30,000 to $45,999': 'High',
    'Very High OxyS - $46,000 +': 'Very High',
    'Unknown - In Progress': 'Unknown'
    }

    df_copy['Level of Financial Need'] = df_copy['Level of Financial Need'].replace(new)
    
    df_copy['Level of Financial Need'] = df_copy['Level of Financial Need'].fillna('Unknown')

    return df_copy

### Demo function call
demo_df_data_cleaning_and_standardization = pd.DataFrame({
    'Scholarship': [np.nan, 'UPBW', np.nan, 'DUN', np.nan],
    'Permanent Region': ['CA', 'NY', np.nan, 'CA', np.nan],
    'GPA': [3.8, 3.6, 3.9, np.nan, 3.5],
    'HS Class Size': [np.nan, 250, 150, np.nan, 300],
    'Level of Financial Need': ['Low OxyS - $1 to $19,999', 'Medium OxyS - $20,000 - $29,999', 'Very High OxyS - $46,000 +', 'Unknown - In Progress', 'No OxyS - $0']
})

demo_output_data_cleaning_and_standardization = data_cleaning_and_standardization(demo_df_data_cleaning_and_standardization)
display(demo_output_data_cleaning_and_standardization)
Scholarship Permanent Region GPA HS Class Size Level of Financial Need
0 No Scholarship CA 3.8 250.0 Low
1 UPBW NY 3.6 250.0 Medium
2 No Scholarship Unknown 3.9 150.0 Very High
3 DUN CA 3.7 250.0 Unknown
4 No Scholarship Unknown 3.5 300.0 No

Example: A correct implementation should produce the following output for the provided demo DataFrame:

Scholarship Permanent Region GPA HS Class Size Level of Financial Need
0 No Scholarship CA 3.8 250.0 Low
1 UPBW NY 3.6 250.0 Medium
2 No Scholarship Unknown 3.9 150.0 Very High
3 DUN CA 3.7 250.0 Unknown
4 No Scholarship Unknown 3.5 300.0 No


The cell below will test your solution for data_cleaning_and_standardization (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [6]:
### Test Cell - Exercise 1  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=data_cleaning_and_standardization,
              ex_name='data_cleaning_and_standardization',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to data_cleaning_and_standardization did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
data_cleaning_and_standardization test ran 50 iterations in 8.46 seconds
Passed! Please submit.

Exercise 2: (3 points)

state_coordinates

Your task: define state_coordinates as follows:

We need to tie together state data that exists in separate dictionaries. abbr_dict holds each state's abbreviation and state name. coor_dict hold each state name and the coordinates. This data needs to be joined so we have a single data structure that holds each state name, the coordinates, and the state abbreviation.

Input:

  • abbr_dict: A dictionary with each key the state abbreviation and each value the respective state name, each in lowercase.
  • coor_dict: A nested dictionary with the outer keys of latitude and longitude. Each inner dictionary holds the full state name (for example 'California') as the key and the respective coordinate float as the value.

Output:

  • state_data: A new list of dictionaries where each dictionary holds the data for 1 state, sorted by the state full name. Within each dictionary will be the following key-value pairs:
    • state: The value of this key is the full name of the respective state (for example, 'California') as a string
    • latitude: The value of this key is the latitude as a float
    • longtitude: The value of this key is the longtitude as a float
    • abbr: The value of this key is the upper-case abbreviation of the respective state (for example 'CA') as a string

Additional Notes:

  • coor_dict is guaranteed to have the keys latitude and longitude.
  • The latitude and longitude values in coor_dict are expected to be floats. If there are data type inconsistencies, they should be handled appropriately to maintain the expected format.
In [7]:
### Solution - Exercise 2  
def state_coordinates(abbr_dict: dict, coor_dict: dict) -> dict:

    ###
    ### YOUR CODE HERE
    ###
    
    print(abbr_dict)
    #print(coor_dict)
    
    state_data = []
    
    # Create new dictionary with the state as key, and abbr as value

    state_map = {v: k.upper() for k, v in abbr_dict.items()}
    
    #print(state_map)
    
    #print(coor_dict['latitude'].items())
    
    for state, latitude, in coor_dict['latitude'].items():
        
        #print(state)
        #print(latitude)
        
        longitude = coor_dict['longitude'].get(state)
  
        abbr = state_map.get(state.lower())
      
        state_data.append({
              'state': state,
              'latitude': float(latitude),
              'longitude': float(longitude),
              'abbr': abbr
              })
        
    state_data.sort(key=lambda x: x['state'])
        
    return state_data

### Demo function call
demo_abbr_dict_state_coordinates = {'nc': 'north carolina', 'ca': 'california'}
demo_door_dict_state_coordinates = {'latitude': {'California': 36.17, 'North Carolina': 35.6411}, 
                                    'longitude': {'California': -119.7462, 'North Carolina': -79.8431}}

demo_output_state_coordinates = state_coordinates(demo_abbr_dict_state_coordinates, demo_door_dict_state_coordinates)
display(demo_output_state_coordinates)
{'nc': 'north carolina', 'ca': 'california'}
[{'state': 'California',
  'latitude': 36.17,
  'longitude': -119.7462,
  'abbr': 'CA'},
 {'state': 'North Carolina',
  'latitude': 35.6411,
  'longitude': -79.8431,
  'abbr': 'NC'}]

Example: A correct implementation should produce the following output for the provided demo DataFrame:

[{'state': 'California',
  'latitude': 36.17,
  'longitude': -119.7462,
  'abbr': 'CA'},
  {'state': 'North Carolina',
  'latitude': 35.6411,
  'longitude': -79.8431,
  'abbr': 'NC'}]


The cell below will test your solution for state_coordinates (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [8]:
### Test Cell - Exercise 2  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=state_coordinates,
              ex_name='state_coordinates',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to state_coordinates did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
{'ak': 'alaska', 'wv': 'west virginia', 'wy': 'wyoming', 'la': 'louisiana', 'hi': 'hawaii', 'ga': 'georgia', 'nv': 'nevada', 'ia': 'iowa', 'nm': 'new mexico', 'ms': 'mississippi', 'ne': 'nebraska', 'mt': 'montana', 'il': 'illinois', 'mn': 'minnesota', 'ca': 'california', 'sc': 'south carolina', 'ut': 'utah', 'ks': 'kansas', 'sd': 'south dakota', 'wa': 'washington'}
{'pr': 'puerto rico', 'de': 'delaware', 'al': 'alabama', 'ri': 'rhode island', 'fl': 'florida', 'ms': 'mississippi', 'wy': 'wyoming', 'ak': 'alaska', 'il': 'illinois', 'wa': 'washington', 'va': 'virginia', 'mt': 'montana', 'mo': 'missouri', 'ok': 'oklahoma', 'nh': 'new hampshire', 'mi': 'michigan', 'id': 'idaho', 'nd': 'north dakota', 'ky': 'kentucky', 'la': 'louisiana'}
{'sd': 'south dakota', 'nj': 'new jersey', 'fl': 'florida', 'ok': 'oklahoma', 'ga': 'georgia', 'nv': 'nevada', 'mo': 'missouri', 'oh': 'ohio', 'tn': 'tennessee', 'de': 'delaware', 'ar': 'arkansas', 'pr': 'puerto rico', 'id': 'idaho', 'nm': 'new mexico', 'me': 'maine', 'la': 'louisiana', 'or': 'oregon', 'mi': 'michigan', 'sc': 'south carolina', 'wv': 'west virginia'}
{'oh': 'ohio', 'wy': 'wyoming', 'nh': 'new hampshire', 'wi': 'wisconsin', 'vt': 'vermont', 'hi': 'hawaii', 'la': 'louisiana', 'sd': 'south dakota', 'ms': 'mississippi', 'mi': 'michigan', 'tn': 'tennessee', 'ak': 'alaska', 'wa': 'washington', 'ma': 'massachusetts', 'pr': 'puerto rico', 'tx': 'texas', 'nd': 'north dakota', 'sc': 'south carolina', 'ky': 'kentucky', 'ga': 'georgia'}
{'wy': 'wyoming', 'fl': 'florida', 'nh': 'new hampshire', 'ky': 'kentucky', 'md': 'maryland', 'ar': 'arkansas', 'sd': 'south dakota', 'al': 'alabama', 'ak': 'alaska', 'az': 'arizona', 'mn': 'minnesota', 'ct': 'connecticut', 'tx': 'texas', 'nd': 'north dakota', 'ne': 'nebraska', 'tn': 'tennessee', 'wi': 'wisconsin', 'mo': 'missouri', 'oh': 'ohio', 'mi': 'michigan'}
{'or': 'oregon', 'nj': 'new jersey', 'sd': 'south dakota', 'sc': 'south carolina', 'mt': 'montana', 'ms': 'mississippi', 'id': 'idaho', 'ak': 'alaska', 'ut': 'utah', 'in': 'indiana', 'al': 'alabama', 'nc': 'north carolina', 'mo': 'missouri', 'ok': 'oklahoma', 'pr': 'puerto rico', 'fl': 'florida', 'tx': 'texas', 'de': 'delaware', 'vi': 'virgin islands', 'ks': 'kansas'}
{'or': 'oregon', 'nj': 'new jersey', 'sd': 'south dakota', 'sc': 'south carolina', 'mt': 'montana', 'ms': 'mississippi', 'id': 'idaho', 'ak': 'alaska', 'ut': 'utah', 'in': 'indiana', 'al': 'alabama', 'nc': 'north carolina', 'mo': 'missouri', 'ok': 'oklahoma', 'pr': 'puerto rico', 'fl': 'florida', 'tx': 'texas', 'de': 'delaware', 'vi': 'virgin islands', 'ks': 'kansas'}
{'ks': 'kansas', 'mi': 'michigan', 'md': 'maryland', 'pa': 'pennsylvania', 'dc': 'district of columbia', 'mn': 'minnesota', 'sc': 'south carolina', 'pr': 'puerto rico', 'ct': 'connecticut', 'il': 'illinois', 'wa': 'washington', 'de': 'delaware', 'vt': 'vermont', 'or': 'oregon', 'ut': 'utah', 'nc': 'north carolina', 'va': 'virginia', 'ms': 'mississippi', 'ca': 'california', 'me': 'maine'}
{'ak': 'alaska', 'wv': 'west virginia', 'wy': 'wyoming', 'la': 'louisiana', 'hi': 'hawaii', 'ga': 'georgia', 'nv': 'nevada', 'ia': 'iowa', 'nm': 'new mexico', 'ms': 'mississippi', 'ne': 'nebraska', 'mt': 'montana', 'il': 'illinois', 'mn': 'minnesota', 'ca': 'california', 'sc': 'south carolina', 'ut': 'utah', 'ks': 'kansas', 'sd': 'south dakota', 'wa': 'washington'}
{'nd': 'north dakota', 'ar': 'arkansas', 'ms': 'mississippi', 'az': 'arizona', 'tn': 'tennessee', 'mn': 'minnesota', 'wa': 'washington', 'nh': 'new hampshire', 'vi': 'virgin islands', 'nm': 'new mexico', 'wi': 'wisconsin', 'nj': 'new jersey', 'nc': 'north carolina', 'de': 'delaware', 'oh': 'ohio', 'ky': 'kentucky', 'va': 'virginia', 'wv': 'west virginia', 'ca': 'california', 'ga': 'georgia'}
{'wy': 'wyoming', 'fl': 'florida', 'nh': 'new hampshire', 'ky': 'kentucky', 'md': 'maryland', 'ar': 'arkansas', 'sd': 'south dakota', 'al': 'alabama', 'ak': 'alaska', 'az': 'arizona', 'mn': 'minnesota', 'ct': 'connecticut', 'tx': 'texas', 'nd': 'north dakota', 'ne': 'nebraska', 'tn': 'tennessee', 'wi': 'wisconsin', 'mo': 'missouri', 'oh': 'ohio', 'mi': 'michigan'}
{'ma': 'massachusetts', 'ga': 'georgia', 'ar': 'arkansas', 'la': 'louisiana', 'dc': 'district of columbia', 'al': 'alabama', 'mt': 'montana', 'nc': 'north carolina', 'mn': 'minnesota', 'nh': 'new hampshire', 'oh': 'ohio', 'or': 'oregon', 'wy': 'wyoming', 'ut': 'utah', 'mi': 'michigan', 'vi': 'virgin islands', 'sc': 'south carolina', 'nj': 'new jersey', 'ky': 'kentucky', 'ri': 'rhode island'}
{'ca': 'california', 'or': 'oregon', 'co': 'colorado', 'ks': 'kansas', 'ny': 'new york', 'ma': 'massachusetts', 'ms': 'mississippi', 'az': 'arizona', 'ky': 'kentucky', 'tx': 'texas', 'mo': 'missouri', 'id': 'idaho', 'tn': 'tennessee', 'ri': 'rhode island', 'wv': 'west virginia', 'md': 'maryland', 'me': 'maine', 'ia': 'iowa', 'wa': 'washington', 'ak': 'alaska'}
{'pa': 'pennsylvania', 'il': 'illinois', 'az': 'arizona', 'dc': 'district of columbia', 'mn': 'minnesota', 'nj': 'new jersey', 'va': 'virginia', 'ia': 'iowa', 'or': 'oregon', 'ak': 'alaska', 'oh': 'ohio', 'sd': 'south dakota', 'wy': 'wyoming', 'mo': 'missouri', 'ma': 'massachusetts', 'ny': 'new york', 'vi': 'virgin islands', 'co': 'colorado', 'ar': 'arkansas', 'hi': 'hawaii'}
{'al': 'alabama', 'ia': 'iowa', 'ok': 'oklahoma', 'ca': 'california', 'ri': 'rhode island', 'ks': 'kansas', 'me': 'maine', 'wv': 'west virginia', 'in': 'indiana', 'az': 'arizona', 'sc': 'south carolina', 'nj': 'new jersey', 'wy': 'wyoming', 'or': 'oregon', 'nd': 'north dakota', 'mn': 'minnesota', 'wa': 'washington', 'ny': 'new york', 'mi': 'michigan', 'ky': 'kentucky'}
{'nm': 'new mexico', 'az': 'arizona', 'pa': 'pennsylvania', 'de': 'delaware', 'mn': 'minnesota', 'wv': 'west virginia', 'wi': 'wisconsin', 'nh': 'new hampshire', 'ks': 'kansas', 'vt': 'vermont', 'nv': 'nevada', 'tx': 'texas', 'fl': 'florida', 'tn': 'tennessee', 'wa': 'washington', 'il': 'illinois', 'wy': 'wyoming', 'ne': 'nebraska', 'mt': 'montana', 'ga': 'georgia'}
{'pa': 'pennsylvania', 'il': 'illinois', 'az': 'arizona', 'dc': 'district of columbia', 'mn': 'minnesota', 'nj': 'new jersey', 'va': 'virginia', 'ia': 'iowa', 'or': 'oregon', 'ak': 'alaska', 'oh': 'ohio', 'sd': 'south dakota', 'wy': 'wyoming', 'mo': 'missouri', 'ma': 'massachusetts', 'ny': 'new york', 'vi': 'virgin islands', 'co': 'colorado', 'ar': 'arkansas', 'hi': 'hawaii'}
{'nm': 'new mexico', 'nh': 'new hampshire', 'ca': 'california', 'mt': 'montana', 'de': 'delaware', 'mn': 'minnesota', 'ms': 'mississippi', 'me': 'maine', 'wy': 'wyoming', 'ut': 'utah', 'az': 'arizona', 'mi': 'michigan', 'wa': 'washington', 'ar': 'arkansas', 'in': 'indiana', 'ok': 'oklahoma', 'ak': 'alaska', 'al': 'alabama', 'hi': 'hawaii', 'id': 'idaho'}
{'nd': 'north dakota', 'nh': 'new hampshire', 'in': 'indiana', 'mn': 'minnesota', 'ca': 'california', 'mi': 'michigan', 'ky': 'kentucky', 'ny': 'new york', 'id': 'idaho', 'wv': 'west virginia', 'pa': 'pennsylvania', 'pr': 'puerto rico', 'me': 'maine', 'sc': 'south carolina', 'ak': 'alaska', 'ks': 'kansas', 'nj': 'new jersey', 'nv': 'nevada', 'fl': 'florida', 'co': 'colorado'}
{'md': 'maryland', 'nj': 'new jersey', 'ms': 'mississippi', 'mt': 'montana', 'sd': 'south dakota', 'nc': 'north carolina', 'nh': 'new hampshire', 'ok': 'oklahoma', 'il': 'illinois', 'ne': 'nebraska', 'pa': 'pennsylvania', 'co': 'colorado', 'ak': 'alaska', 'tx': 'texas', 'az': 'arizona', 'al': 'alabama', 'wy': 'wyoming', 'wa': 'washington', 'me': 'maine', 'ga': 'georgia'}
{'ks': 'kansas', 'mi': 'michigan', 'md': 'maryland', 'pa': 'pennsylvania', 'dc': 'district of columbia', 'mn': 'minnesota', 'sc': 'south carolina', 'pr': 'puerto rico', 'ct': 'connecticut', 'il': 'illinois', 'wa': 'washington', 'de': 'delaware', 'vt': 'vermont', 'or': 'oregon', 'ut': 'utah', 'nc': 'north carolina', 'va': 'virginia', 'ms': 'mississippi', 'ca': 'california', 'me': 'maine'}
{'ky': 'kentucky', 'sc': 'south carolina', 'ut': 'utah', 'mo': 'missouri', 'ia': 'iowa', 'va': 'virginia', 'mn': 'minnesota', 'wi': 'wisconsin', 'ca': 'california', 'ne': 'nebraska', 'sd': 'south dakota', 'co': 'colorado', 'or': 'oregon', 'in': 'indiana', 'ar': 'arkansas', 'md': 'maryland', 'ks': 'kansas', 'mi': 'michigan', 'nd': 'north dakota', 'nh': 'new hampshire'}
{'mo': 'missouri', 'sc': 'south carolina', 'ks': 'kansas', 'mt': 'montana', 'ut': 'utah', 'oh': 'ohio', 'nm': 'new mexico', 'id': 'idaho', 'tx': 'texas', 'va': 'virginia', 'mn': 'minnesota', 'al': 'alabama', 'wv': 'west virginia', 'wa': 'washington', 'co': 'colorado', 'fl': 'florida', 'ky': 'kentucky', 'tn': 'tennessee', 'wi': 'wisconsin', 'ri': 'rhode island'}
{'mo': 'missouri', 'sc': 'south carolina', 'ks': 'kansas', 'mt': 'montana', 'ut': 'utah', 'oh': 'ohio', 'nm': 'new mexico', 'id': 'idaho', 'tx': 'texas', 'va': 'virginia', 'mn': 'minnesota', 'al': 'alabama', 'wv': 'west virginia', 'wa': 'washington', 'co': 'colorado', 'fl': 'florida', 'ky': 'kentucky', 'tn': 'tennessee', 'wi': 'wisconsin', 'ri': 'rhode island'}
{'ar': 'arkansas', 'de': 'delaware', 'ks': 'kansas', 'ca': 'california', 'nh': 'new hampshire', 'wi': 'wisconsin', 'mi': 'michigan', 'ak': 'alaska', 'fl': 'florida', 'md': 'maryland', 'mn': 'minnesota', 'nj': 'new jersey', 'nv': 'nevada', 'ky': 'kentucky', 'sc': 'south carolina', 'vt': 'vermont', 'nc': 'north carolina', 'dc': 'district of columbia', 'ms': 'mississippi', 'or': 'oregon'}
{'nm': 'new mexico', 'az': 'arizona', 'pa': 'pennsylvania', 'de': 'delaware', 'mn': 'minnesota', 'wv': 'west virginia', 'wi': 'wisconsin', 'nh': 'new hampshire', 'ks': 'kansas', 'vt': 'vermont', 'nv': 'nevada', 'tx': 'texas', 'fl': 'florida', 'tn': 'tennessee', 'wa': 'washington', 'il': 'illinois', 'wy': 'wyoming', 'ne': 'nebraska', 'mt': 'montana', 'ga': 'georgia'}
{'md': 'maryland', 'ok': 'oklahoma', 'tx': 'texas', 'ks': 'kansas', 'sd': 'south dakota', 'il': 'illinois', 'tn': 'tennessee', 'co': 'colorado', 'al': 'alabama', 'oh': 'ohio', 'ct': 'connecticut', 'pa': 'pennsylvania', 'az': 'arizona', 'ny': 'new york', 'ak': 'alaska', 'ms': 'mississippi', 'ar': 'arkansas', 'nv': 'nevada', 'ga': 'georgia', 'mi': 'michigan'}
{'nd': 'north dakota', 'ar': 'arkansas', 'ms': 'mississippi', 'az': 'arizona', 'tn': 'tennessee', 'mn': 'minnesota', 'wa': 'washington', 'nh': 'new hampshire', 'vi': 'virgin islands', 'nm': 'new mexico', 'wi': 'wisconsin', 'nj': 'new jersey', 'nc': 'north carolina', 'de': 'delaware', 'oh': 'ohio', 'ky': 'kentucky', 'va': 'virginia', 'wv': 'west virginia', 'ca': 'california', 'ga': 'georgia'}
{'in': 'indiana', 'de': 'delaware', 'nc': 'north carolina', 'nm': 'new mexico', 'me': 'maine', 'sd': 'south dakota', 'ct': 'connecticut', 'tx': 'texas', 'mn': 'minnesota', 'nh': 'new hampshire', 'ny': 'new york', 'pr': 'puerto rico', 'id': 'idaho', 'ga': 'georgia', 'co': 'colorado', 'md': 'maryland', 'ne': 'nebraska', 'vt': 'vermont', 'ma': 'massachusetts', 'va': 'virginia'}
{'ca': 'california', 'or': 'oregon', 'co': 'colorado', 'ks': 'kansas', 'ny': 'new york', 'ma': 'massachusetts', 'ms': 'mississippi', 'az': 'arizona', 'ky': 'kentucky', 'tx': 'texas', 'mo': 'missouri', 'id': 'idaho', 'tn': 'tennessee', 'ri': 'rhode island', 'wv': 'west virginia', 'md': 'maryland', 'me': 'maine', 'ia': 'iowa', 'wa': 'washington', 'ak': 'alaska'}
{'ia': 'iowa', 'nv': 'nevada', 'az': 'arizona', 'al': 'alabama', 'tx': 'texas', 'wv': 'west virginia', 'nm': 'new mexico', 'ga': 'georgia', 'id': 'idaho', 'ca': 'california', 'pa': 'pennsylvania', 'mt': 'montana', 'md': 'maryland', 'dc': 'district of columbia', 'vt': 'vermont', 'ut': 'utah', 'de': 'delaware', 'hi': 'hawaii', 'oh': 'ohio', 'ok': 'oklahoma'}
{'ky': 'kentucky', 'nm': 'new mexico', 'pa': 'pennsylvania', 'dc': 'district of columbia', 'oh': 'ohio', 'nh': 'new hampshire', 'ia': 'iowa', 'sd': 'south dakota', 'ct': 'connecticut', 'nj': 'new jersey', 'wy': 'wyoming', 'la': 'louisiana', 'ks': 'kansas', 'ga': 'georgia', 'sc': 'south carolina', 'me': 'maine', 'pr': 'puerto rico', 'wi': 'wisconsin', 'in': 'indiana', 'vi': 'virgin islands'}
{'mo': 'missouri', 'sc': 'south carolina', 'ks': 'kansas', 'mt': 'montana', 'ut': 'utah', 'oh': 'ohio', 'nm': 'new mexico', 'id': 'idaho', 'tx': 'texas', 'va': 'virginia', 'mn': 'minnesota', 'al': 'alabama', 'wv': 'west virginia', 'wa': 'washington', 'co': 'colorado', 'fl': 'florida', 'ky': 'kentucky', 'tn': 'tennessee', 'wi': 'wisconsin', 'ri': 'rhode island'}
{'ky': 'kentucky', 'sc': 'south carolina', 'nh': 'new hampshire', 'fl': 'florida', 'ne': 'nebraska', 'hi': 'hawaii', 'wi': 'wisconsin', 'wv': 'west virginia', 'ri': 'rhode island', 'dc': 'district of columbia', 'ct': 'connecticut', 'il': 'illinois', 'de': 'delaware', 'tn': 'tennessee', 'ca': 'california', 'or': 'oregon', 'nm': 'new mexico', 'mt': 'montana', 'me': 'maine', 'tx': 'texas'}
{'ia': 'iowa', 'nv': 'nevada', 'az': 'arizona', 'al': 'alabama', 'tx': 'texas', 'wv': 'west virginia', 'nm': 'new mexico', 'ga': 'georgia', 'id': 'idaho', 'ca': 'california', 'pa': 'pennsylvania', 'mt': 'montana', 'md': 'maryland', 'dc': 'district of columbia', 'vt': 'vermont', 'ut': 'utah', 'de': 'delaware', 'hi': 'hawaii', 'oh': 'ohio', 'ok': 'oklahoma'}
{'ar': 'arkansas', 'de': 'delaware', 'ks': 'kansas', 'ca': 'california', 'nh': 'new hampshire', 'wi': 'wisconsin', 'mi': 'michigan', 'ak': 'alaska', 'fl': 'florida', 'md': 'maryland', 'mn': 'minnesota', 'nj': 'new jersey', 'nv': 'nevada', 'ky': 'kentucky', 'sc': 'south carolina', 'vt': 'vermont', 'nc': 'north carolina', 'dc': 'district of columbia', 'ms': 'mississippi', 'or': 'oregon'}
{'md': 'maryland', 'nj': 'new jersey', 'ms': 'mississippi', 'mt': 'montana', 'sd': 'south dakota', 'nc': 'north carolina', 'nh': 'new hampshire', 'ok': 'oklahoma', 'il': 'illinois', 'ne': 'nebraska', 'pa': 'pennsylvania', 'co': 'colorado', 'ak': 'alaska', 'tx': 'texas', 'az': 'arizona', 'al': 'alabama', 'wy': 'wyoming', 'wa': 'washington', 'me': 'maine', 'ga': 'georgia'}
{'nm': 'new mexico', 'az': 'arizona', 'pa': 'pennsylvania', 'de': 'delaware', 'mn': 'minnesota', 'wv': 'west virginia', 'wi': 'wisconsin', 'nh': 'new hampshire', 'ks': 'kansas', 'vt': 'vermont', 'nv': 'nevada', 'tx': 'texas', 'fl': 'florida', 'tn': 'tennessee', 'wa': 'washington', 'il': 'illinois', 'wy': 'wyoming', 'ne': 'nebraska', 'mt': 'montana', 'ga': 'georgia'}
{'ct': 'connecticut', 'az': 'arizona', 'md': 'maryland', 'ga': 'georgia', 'tx': 'texas', 'nm': 'new mexico', 'in': 'indiana', 'wv': 'west virginia', 'de': 'delaware', 'nh': 'new hampshire', 'id': 'idaho', 'ms': 'mississippi', 'hi': 'hawaii', 'nd': 'north dakota', 'wi': 'wisconsin', 'dc': 'district of columbia', 'nv': 'nevada', 'ky': 'kentucky', 'ny': 'new york', 'ne': 'nebraska'}
{'ca': 'california', 'or': 'oregon', 'co': 'colorado', 'ks': 'kansas', 'ny': 'new york', 'ma': 'massachusetts', 'ms': 'mississippi', 'az': 'arizona', 'ky': 'kentucky', 'tx': 'texas', 'mo': 'missouri', 'id': 'idaho', 'tn': 'tennessee', 'ri': 'rhode island', 'wv': 'west virginia', 'md': 'maryland', 'me': 'maine', 'ia': 'iowa', 'wa': 'washington', 'ak': 'alaska'}
{'ct': 'connecticut', 'az': 'arizona', 'md': 'maryland', 'ga': 'georgia', 'tx': 'texas', 'nm': 'new mexico', 'in': 'indiana', 'wv': 'west virginia', 'de': 'delaware', 'nh': 'new hampshire', 'id': 'idaho', 'ms': 'mississippi', 'hi': 'hawaii', 'nd': 'north dakota', 'wi': 'wisconsin', 'dc': 'district of columbia', 'nv': 'nevada', 'ky': 'kentucky', 'ny': 'new york', 'ne': 'nebraska'}
{'pr': 'puerto rico', 'al': 'alabama', 'vt': 'vermont', 'ky': 'kentucky', 'tx': 'texas', 'md': 'maryland', 'ms': 'mississippi', 'ar': 'arkansas', 'dc': 'district of columbia', 'me': 'maine', 'mo': 'missouri', 'mt': 'montana', 'in': 'indiana', 'wy': 'wyoming', 'fl': 'florida', 'la': 'louisiana', 'nv': 'nevada', 'pa': 'pennsylvania', 'sc': 'south carolina', 'ct': 'connecticut'}
{'hi': 'hawaii', 'nh': 'new hampshire', 'ca': 'california', 'ga': 'georgia', 'nv': 'nevada', 'or': 'oregon', 'ok': 'oklahoma', 'wa': 'washington', 'va': 'virginia', 'mo': 'missouri', 'ne': 'nebraska', 'ct': 'connecticut', 'me': 'maine', 'tn': 'tennessee', 'vi': 'virgin islands', 'ks': 'kansas', 'nc': 'north carolina', 'sc': 'south carolina', 'mt': 'montana', 'az': 'arizona'}
{'ca': 'california', 'or': 'oregon', 'co': 'colorado', 'ks': 'kansas', 'ny': 'new york', 'ma': 'massachusetts', 'ms': 'mississippi', 'az': 'arizona', 'ky': 'kentucky', 'tx': 'texas', 'mo': 'missouri', 'id': 'idaho', 'tn': 'tennessee', 'ri': 'rhode island', 'wv': 'west virginia', 'md': 'maryland', 'me': 'maine', 'ia': 'iowa', 'wa': 'washington', 'ak': 'alaska'}
{'ia': 'iowa', 'nv': 'nevada', 'az': 'arizona', 'al': 'alabama', 'tx': 'texas', 'wv': 'west virginia', 'nm': 'new mexico', 'ga': 'georgia', 'id': 'idaho', 'ca': 'california', 'pa': 'pennsylvania', 'mt': 'montana', 'md': 'maryland', 'dc': 'district of columbia', 'vt': 'vermont', 'ut': 'utah', 'de': 'delaware', 'hi': 'hawaii', 'oh': 'ohio', 'ok': 'oklahoma'}
{'ky': 'kentucky', 'sc': 'south carolina', 'nh': 'new hampshire', 'fl': 'florida', 'ne': 'nebraska', 'hi': 'hawaii', 'wi': 'wisconsin', 'wv': 'west virginia', 'ri': 'rhode island', 'dc': 'district of columbia', 'ct': 'connecticut', 'il': 'illinois', 'de': 'delaware', 'tn': 'tennessee', 'ca': 'california', 'or': 'oregon', 'nm': 'new mexico', 'mt': 'montana', 'me': 'maine', 'tx': 'texas'}
{'md': 'maryland', 'ok': 'oklahoma', 'tx': 'texas', 'ks': 'kansas', 'sd': 'south dakota', 'il': 'illinois', 'tn': 'tennessee', 'co': 'colorado', 'al': 'alabama', 'oh': 'ohio', 'ct': 'connecticut', 'pa': 'pennsylvania', 'az': 'arizona', 'ny': 'new york', 'ak': 'alaska', 'ms': 'mississippi', 'ar': 'arkansas', 'nv': 'nevada', 'ga': 'georgia', 'mi': 'michigan'}
{'ma': 'massachusetts', 'ga': 'georgia', 'ar': 'arkansas', 'la': 'louisiana', 'dc': 'district of columbia', 'al': 'alabama', 'mt': 'montana', 'nc': 'north carolina', 'mn': 'minnesota', 'nh': 'new hampshire', 'oh': 'ohio', 'or': 'oregon', 'wy': 'wyoming', 'ut': 'utah', 'mi': 'michigan', 'vi': 'virgin islands', 'sc': 'south carolina', 'nj': 'new jersey', 'ky': 'kentucky', 'ri': 'rhode island'}
{'oh': 'ohio', 'wy': 'wyoming', 'nh': 'new hampshire', 'wi': 'wisconsin', 'vt': 'vermont', 'hi': 'hawaii', 'la': 'louisiana', 'sd': 'south dakota', 'ms': 'mississippi', 'mi': 'michigan', 'tn': 'tennessee', 'ak': 'alaska', 'wa': 'washington', 'ma': 'massachusetts', 'pr': 'puerto rico', 'tx': 'texas', 'nd': 'north dakota', 'sc': 'south carolina', 'ky': 'kentucky', 'ga': 'georgia'}
{'pr': 'puerto rico', 'al': 'alabama', 'vt': 'vermont', 'ky': 'kentucky', 'tx': 'texas', 'md': 'maryland', 'ms': 'mississippi', 'ar': 'arkansas', 'dc': 'district of columbia', 'me': 'maine', 'mo': 'missouri', 'mt': 'montana', 'in': 'indiana', 'wy': 'wyoming', 'fl': 'florida', 'la': 'louisiana', 'nv': 'nevada', 'pa': 'pennsylvania', 'sc': 'south carolina', 'ct': 'connecticut'}
{'nm': 'new mexico', 'nh': 'new hampshire', 'ca': 'california', 'mt': 'montana', 'de': 'delaware', 'mn': 'minnesota', 'ms': 'mississippi', 'me': 'maine', 'wy': 'wyoming', 'ut': 'utah', 'az': 'arizona', 'mi': 'michigan', 'wa': 'washington', 'ar': 'arkansas', 'in': 'indiana', 'ok': 'oklahoma', 'ak': 'alaska', 'al': 'alabama', 'hi': 'hawaii', 'id': 'idaho'}
{'nd': 'north dakota', 'ar': 'arkansas', 'ms': 'mississippi', 'az': 'arizona', 'tn': 'tennessee', 'mn': 'minnesota', 'wa': 'washington', 'nh': 'new hampshire', 'vi': 'virgin islands', 'nm': 'new mexico', 'wi': 'wisconsin', 'nj': 'new jersey', 'nc': 'north carolina', 'de': 'delaware', 'oh': 'ohio', 'ky': 'kentucky', 'va': 'virginia', 'wv': 'west virginia', 'ca': 'california', 'ga': 'georgia'}
{'oh': 'ohio', 'wy': 'wyoming', 'nh': 'new hampshire', 'wi': 'wisconsin', 'vt': 'vermont', 'hi': 'hawaii', 'la': 'louisiana', 'sd': 'south dakota', 'ms': 'mississippi', 'mi': 'michigan', 'tn': 'tennessee', 'ak': 'alaska', 'wa': 'washington', 'ma': 'massachusetts', 'pr': 'puerto rico', 'tx': 'texas', 'nd': 'north dakota', 'sc': 'south carolina', 'ky': 'kentucky', 'ga': 'georgia'}
{'ky': 'kentucky', 'pr': 'puerto rico', 'ri': 'rhode island', 'ne': 'nebraska', 'tn': 'tennessee', 'dc': 'district of columbia', 'or': 'oregon', 'ny': 'new york', 'pa': 'pennsylvania', 'hi': 'hawaii', 'in': 'indiana', 'ct': 'connecticut', 'ok': 'oklahoma', 'tx': 'texas', 'nm': 'new mexico', 'il': 'illinois', 'mt': 'montana', 'wi': 'wisconsin', 'vi': 'virgin islands', 'ut': 'utah'}
{'oh': 'ohio', 'wy': 'wyoming', 'nh': 'new hampshire', 'wi': 'wisconsin', 'vt': 'vermont', 'hi': 'hawaii', 'la': 'louisiana', 'sd': 'south dakota', 'ms': 'mississippi', 'mi': 'michigan', 'tn': 'tennessee', 'ak': 'alaska', 'wa': 'washington', 'ma': 'massachusetts', 'pr': 'puerto rico', 'tx': 'texas', 'nd': 'north dakota', 'sc': 'south carolina', 'ky': 'kentucky', 'ga': 'georgia'}
{'nv': 'nevada', 'nm': 'new mexico', 'ne': 'nebraska', 'ak': 'alaska', 'ar': 'arkansas', 'co': 'colorado', 'ia': 'iowa', 'mo': 'missouri', 'wa': 'washington', 'nh': 'new hampshire', 'ok': 'oklahoma', 'oh': 'ohio', 'tx': 'texas', 'mi': 'michigan', 'de': 'delaware', 'ms': 'mississippi', 'nj': 'new jersey', 'me': 'maine', 'wy': 'wyoming', 'ca': 'california'}
{'md': 'maryland', 'nj': 'new jersey', 'ms': 'mississippi', 'mt': 'montana', 'sd': 'south dakota', 'nc': 'north carolina', 'nh': 'new hampshire', 'ok': 'oklahoma', 'il': 'illinois', 'ne': 'nebraska', 'pa': 'pennsylvania', 'co': 'colorado', 'ak': 'alaska', 'tx': 'texas', 'az': 'arizona', 'al': 'alabama', 'wy': 'wyoming', 'wa': 'washington', 'me': 'maine', 'ga': 'georgia'}
{'nd': 'north dakota', 'nh': 'new hampshire', 'in': 'indiana', 'mn': 'minnesota', 'ca': 'california', 'mi': 'michigan', 'ky': 'kentucky', 'ny': 'new york', 'id': 'idaho', 'wv': 'west virginia', 'pa': 'pennsylvania', 'pr': 'puerto rico', 'me': 'maine', 'sc': 'south carolina', 'ak': 'alaska', 'ks': 'kansas', 'nj': 'new jersey', 'nv': 'nevada', 'fl': 'florida', 'co': 'colorado'}
{'ak': 'alaska', 'wv': 'west virginia', 'wy': 'wyoming', 'la': 'louisiana', 'hi': 'hawaii', 'ga': 'georgia', 'nv': 'nevada', 'ia': 'iowa', 'nm': 'new mexico', 'ms': 'mississippi', 'ne': 'nebraska', 'mt': 'montana', 'il': 'illinois', 'mn': 'minnesota', 'ca': 'california', 'sc': 'south carolina', 'ut': 'utah', 'ks': 'kansas', 'sd': 'south dakota', 'wa': 'washington'}
{'ca': 'california', 'or': 'oregon', 'co': 'colorado', 'ks': 'kansas', 'ny': 'new york', 'ma': 'massachusetts', 'ms': 'mississippi', 'az': 'arizona', 'ky': 'kentucky', 'tx': 'texas', 'mo': 'missouri', 'id': 'idaho', 'tn': 'tennessee', 'ri': 'rhode island', 'wv': 'west virginia', 'md': 'maryland', 'me': 'maine', 'ia': 'iowa', 'wa': 'washington', 'ak': 'alaska'}
{'md': 'maryland', 'ok': 'oklahoma', 'tx': 'texas', 'ks': 'kansas', 'sd': 'south dakota', 'il': 'illinois', 'tn': 'tennessee', 'co': 'colorado', 'al': 'alabama', 'oh': 'ohio', 'ct': 'connecticut', 'pa': 'pennsylvania', 'az': 'arizona', 'ny': 'new york', 'ak': 'alaska', 'ms': 'mississippi', 'ar': 'arkansas', 'nv': 'nevada', 'ga': 'georgia', 'mi': 'michigan'}
{'ia': 'iowa', 'nv': 'nevada', 'az': 'arizona', 'al': 'alabama', 'tx': 'texas', 'wv': 'west virginia', 'nm': 'new mexico', 'ga': 'georgia', 'id': 'idaho', 'ca': 'california', 'pa': 'pennsylvania', 'mt': 'montana', 'md': 'maryland', 'dc': 'district of columbia', 'vt': 'vermont', 'ut': 'utah', 'de': 'delaware', 'hi': 'hawaii', 'oh': 'ohio', 'ok': 'oklahoma'}
{'ks': 'kansas', 'mi': 'michigan', 'md': 'maryland', 'pa': 'pennsylvania', 'dc': 'district of columbia', 'mn': 'minnesota', 'sc': 'south carolina', 'pr': 'puerto rico', 'ct': 'connecticut', 'il': 'illinois', 'wa': 'washington', 'de': 'delaware', 'vt': 'vermont', 'or': 'oregon', 'ut': 'utah', 'nc': 'north carolina', 'va': 'virginia', 'ms': 'mississippi', 'ca': 'california', 'me': 'maine'}
{'ct': 'connecticut', 'az': 'arizona', 'md': 'maryland', 'ga': 'georgia', 'tx': 'texas', 'nm': 'new mexico', 'in': 'indiana', 'wv': 'west virginia', 'de': 'delaware', 'nh': 'new hampshire', 'id': 'idaho', 'ms': 'mississippi', 'hi': 'hawaii', 'nd': 'north dakota', 'wi': 'wisconsin', 'dc': 'district of columbia', 'nv': 'nevada', 'ky': 'kentucky', 'ny': 'new york', 'ne': 'nebraska'}
{'or': 'oregon', 'nj': 'new jersey', 'sd': 'south dakota', 'sc': 'south carolina', 'mt': 'montana', 'ms': 'mississippi', 'id': 'idaho', 'ak': 'alaska', 'ut': 'utah', 'in': 'indiana', 'al': 'alabama', 'nc': 'north carolina', 'mo': 'missouri', 'ok': 'oklahoma', 'pr': 'puerto rico', 'fl': 'florida', 'tx': 'texas', 'de': 'delaware', 'vi': 'virgin islands', 'ks': 'kansas'}
{'nv': 'nevada', 'nm': 'new mexico', 'ne': 'nebraska', 'ak': 'alaska', 'ar': 'arkansas', 'co': 'colorado', 'ia': 'iowa', 'mo': 'missouri', 'wa': 'washington', 'nh': 'new hampshire', 'ok': 'oklahoma', 'oh': 'ohio', 'tx': 'texas', 'mi': 'michigan', 'de': 'delaware', 'ms': 'mississippi', 'nj': 'new jersey', 'me': 'maine', 'wy': 'wyoming', 'ca': 'california'}
{'pr': 'puerto rico', 'de': 'delaware', 'al': 'alabama', 'ri': 'rhode island', 'fl': 'florida', 'ms': 'mississippi', 'wy': 'wyoming', 'ak': 'alaska', 'il': 'illinois', 'wa': 'washington', 'va': 'virginia', 'mt': 'montana', 'mo': 'missouri', 'ok': 'oklahoma', 'nh': 'new hampshire', 'mi': 'michigan', 'id': 'idaho', 'nd': 'north dakota', 'ky': 'kentucky', 'la': 'louisiana'}
{'ky': 'kentucky', 'nm': 'new mexico', 'pa': 'pennsylvania', 'dc': 'district of columbia', 'oh': 'ohio', 'nh': 'new hampshire', 'ia': 'iowa', 'sd': 'south dakota', 'ct': 'connecticut', 'nj': 'new jersey', 'wy': 'wyoming', 'la': 'louisiana', 'ks': 'kansas', 'ga': 'georgia', 'sc': 'south carolina', 'me': 'maine', 'pr': 'puerto rico', 'wi': 'wisconsin', 'in': 'indiana', 'vi': 'virgin islands'}
{'hi': 'hawaii', 'nh': 'new hampshire', 'ca': 'california', 'ga': 'georgia', 'nv': 'nevada', 'or': 'oregon', 'ok': 'oklahoma', 'wa': 'washington', 'va': 'virginia', 'mo': 'missouri', 'ne': 'nebraska', 'ct': 'connecticut', 'me': 'maine', 'tn': 'tennessee', 'vi': 'virgin islands', 'ks': 'kansas', 'nc': 'north carolina', 'sc': 'south carolina', 'mt': 'montana', 'az': 'arizona'}
{'al': 'alabama', 'ia': 'iowa', 'ok': 'oklahoma', 'ca': 'california', 'ri': 'rhode island', 'ks': 'kansas', 'me': 'maine', 'wv': 'west virginia', 'in': 'indiana', 'az': 'arizona', 'sc': 'south carolina', 'nj': 'new jersey', 'wy': 'wyoming', 'or': 'oregon', 'nd': 'north dakota', 'mn': 'minnesota', 'wa': 'washington', 'ny': 'new york', 'mi': 'michigan', 'ky': 'kentucky'}
{'hi': 'hawaii', 'nh': 'new hampshire', 'ca': 'california', 'ga': 'georgia', 'nv': 'nevada', 'or': 'oregon', 'ok': 'oklahoma', 'wa': 'washington', 'va': 'virginia', 'mo': 'missouri', 'ne': 'nebraska', 'ct': 'connecticut', 'me': 'maine', 'tn': 'tennessee', 'vi': 'virgin islands', 'ks': 'kansas', 'nc': 'north carolina', 'sc': 'south carolina', 'mt': 'montana', 'az': 'arizona'}
{'ct': 'connecticut', 'az': 'arizona', 'md': 'maryland', 'ga': 'georgia', 'tx': 'texas', 'nm': 'new mexico', 'in': 'indiana', 'wv': 'west virginia', 'de': 'delaware', 'nh': 'new hampshire', 'id': 'idaho', 'ms': 'mississippi', 'hi': 'hawaii', 'nd': 'north dakota', 'wi': 'wisconsin', 'dc': 'district of columbia', 'nv': 'nevada', 'ky': 'kentucky', 'ny': 'new york', 'ne': 'nebraska'}
{'ar': 'arkansas', 'de': 'delaware', 'ks': 'kansas', 'ca': 'california', 'nh': 'new hampshire', 'wi': 'wisconsin', 'mi': 'michigan', 'ak': 'alaska', 'fl': 'florida', 'md': 'maryland', 'mn': 'minnesota', 'nj': 'new jersey', 'nv': 'nevada', 'ky': 'kentucky', 'sc': 'south carolina', 'vt': 'vermont', 'nc': 'north carolina', 'dc': 'district of columbia', 'ms': 'mississippi', 'or': 'oregon'}
{'md': 'maryland', 'ok': 'oklahoma', 'tx': 'texas', 'ks': 'kansas', 'sd': 'south dakota', 'il': 'illinois', 'tn': 'tennessee', 'co': 'colorado', 'al': 'alabama', 'oh': 'ohio', 'ct': 'connecticut', 'pa': 'pennsylvania', 'az': 'arizona', 'ny': 'new york', 'ak': 'alaska', 'ms': 'mississippi', 'ar': 'arkansas', 'nv': 'nevada', 'ga': 'georgia', 'mi': 'michigan'}
{'mo': 'missouri', 'sc': 'south carolina', 'ks': 'kansas', 'mt': 'montana', 'ut': 'utah', 'oh': 'ohio', 'nm': 'new mexico', 'id': 'idaho', 'tx': 'texas', 'va': 'virginia', 'mn': 'minnesota', 'al': 'alabama', 'wv': 'west virginia', 'wa': 'washington', 'co': 'colorado', 'fl': 'florida', 'ky': 'kentucky', 'tn': 'tennessee', 'wi': 'wisconsin', 'ri': 'rhode island'}
{'ia': 'iowa', 'nv': 'nevada', 'az': 'arizona', 'al': 'alabama', 'tx': 'texas', 'wv': 'west virginia', 'nm': 'new mexico', 'ga': 'georgia', 'id': 'idaho', 'ca': 'california', 'pa': 'pennsylvania', 'mt': 'montana', 'md': 'maryland', 'dc': 'district of columbia', 'vt': 'vermont', 'ut': 'utah', 'de': 'delaware', 'hi': 'hawaii', 'oh': 'ohio', 'ok': 'oklahoma'}
{'ky': 'kentucky', 'nm': 'new mexico', 'pa': 'pennsylvania', 'dc': 'district of columbia', 'oh': 'ohio', 'nh': 'new hampshire', 'ia': 'iowa', 'sd': 'south dakota', 'ct': 'connecticut', 'nj': 'new jersey', 'wy': 'wyoming', 'la': 'louisiana', 'ks': 'kansas', 'ga': 'georgia', 'sc': 'south carolina', 'me': 'maine', 'pr': 'puerto rico', 'wi': 'wisconsin', 'in': 'indiana', 'vi': 'virgin islands'}
{'ky': 'kentucky', 'sc': 'south carolina', 'nh': 'new hampshire', 'fl': 'florida', 'ne': 'nebraska', 'hi': 'hawaii', 'wi': 'wisconsin', 'wv': 'west virginia', 'ri': 'rhode island', 'dc': 'district of columbia', 'ct': 'connecticut', 'il': 'illinois', 'de': 'delaware', 'tn': 'tennessee', 'ca': 'california', 'or': 'oregon', 'nm': 'new mexico', 'mt': 'montana', 'me': 'maine', 'tx': 'texas'}
{'ct': 'connecticut', 'az': 'arizona', 'md': 'maryland', 'ga': 'georgia', 'tx': 'texas', 'nm': 'new mexico', 'in': 'indiana', 'wv': 'west virginia', 'de': 'delaware', 'nh': 'new hampshire', 'id': 'idaho', 'ms': 'mississippi', 'hi': 'hawaii', 'nd': 'north dakota', 'wi': 'wisconsin', 'dc': 'district of columbia', 'nv': 'nevada', 'ky': 'kentucky', 'ny': 'new york', 'ne': 'nebraska'}
{'or': 'oregon', 'nj': 'new jersey', 'sd': 'south dakota', 'sc': 'south carolina', 'mt': 'montana', 'ms': 'mississippi', 'id': 'idaho', 'ak': 'alaska', 'ut': 'utah', 'in': 'indiana', 'al': 'alabama', 'nc': 'north carolina', 'mo': 'missouri', 'ok': 'oklahoma', 'pr': 'puerto rico', 'fl': 'florida', 'tx': 'texas', 'de': 'delaware', 'vi': 'virgin islands', 'ks': 'kansas'}
{'or': 'oregon', 'nj': 'new jersey', 'sd': 'south dakota', 'sc': 'south carolina', 'mt': 'montana', 'ms': 'mississippi', 'id': 'idaho', 'ak': 'alaska', 'ut': 'utah', 'in': 'indiana', 'al': 'alabama', 'nc': 'north carolina', 'mo': 'missouri', 'ok': 'oklahoma', 'pr': 'puerto rico', 'fl': 'florida', 'tx': 'texas', 'de': 'delaware', 'vi': 'virgin islands', 'ks': 'kansas'}
{'al': 'alabama', 'ia': 'iowa', 'ok': 'oklahoma', 'ca': 'california', 'ri': 'rhode island', 'ks': 'kansas', 'me': 'maine', 'wv': 'west virginia', 'in': 'indiana', 'az': 'arizona', 'sc': 'south carolina', 'nj': 'new jersey', 'wy': 'wyoming', 'or': 'oregon', 'nd': 'north dakota', 'mn': 'minnesota', 'wa': 'washington', 'ny': 'new york', 'mi': 'michigan', 'ky': 'kentucky'}
{'ca': 'california', 'or': 'oregon', 'co': 'colorado', 'ks': 'kansas', 'ny': 'new york', 'ma': 'massachusetts', 'ms': 'mississippi', 'az': 'arizona', 'ky': 'kentucky', 'tx': 'texas', 'mo': 'missouri', 'id': 'idaho', 'tn': 'tennessee', 'ri': 'rhode island', 'wv': 'west virginia', 'md': 'maryland', 'me': 'maine', 'ia': 'iowa', 'wa': 'washington', 'ak': 'alaska'}
{'oh': 'ohio', 'wy': 'wyoming', 'nh': 'new hampshire', 'wi': 'wisconsin', 'vt': 'vermont', 'hi': 'hawaii', 'la': 'louisiana', 'sd': 'south dakota', 'ms': 'mississippi', 'mi': 'michigan', 'tn': 'tennessee', 'ak': 'alaska', 'wa': 'washington', 'ma': 'massachusetts', 'pr': 'puerto rico', 'tx': 'texas', 'nd': 'north dakota', 'sc': 'south carolina', 'ky': 'kentucky', 'ga': 'georgia'}
{'ak': 'alaska', 'wv': 'west virginia', 'wy': 'wyoming', 'la': 'louisiana', 'hi': 'hawaii', 'ga': 'georgia', 'nv': 'nevada', 'ia': 'iowa', 'nm': 'new mexico', 'ms': 'mississippi', 'ne': 'nebraska', 'mt': 'montana', 'il': 'illinois', 'mn': 'minnesota', 'ca': 'california', 'sc': 'south carolina', 'ut': 'utah', 'ks': 'kansas', 'sd': 'south dakota', 'wa': 'washington'}
{'nd': 'north dakota', 'ar': 'arkansas', 'ms': 'mississippi', 'az': 'arizona', 'tn': 'tennessee', 'mn': 'minnesota', 'wa': 'washington', 'nh': 'new hampshire', 'vi': 'virgin islands', 'nm': 'new mexico', 'wi': 'wisconsin', 'nj': 'new jersey', 'nc': 'north carolina', 'de': 'delaware', 'oh': 'ohio', 'ky': 'kentucky', 'va': 'virginia', 'wv': 'west virginia', 'ca': 'california', 'ga': 'georgia'}
{'al': 'alabama', 'ia': 'iowa', 'ok': 'oklahoma', 'ca': 'california', 'ri': 'rhode island', 'ks': 'kansas', 'me': 'maine', 'wv': 'west virginia', 'in': 'indiana', 'az': 'arizona', 'sc': 'south carolina', 'nj': 'new jersey', 'wy': 'wyoming', 'or': 'oregon', 'nd': 'north dakota', 'mn': 'minnesota', 'wa': 'washington', 'ny': 'new york', 'mi': 'michigan', 'ky': 'kentucky'}
{'ak': 'alaska', 'wv': 'west virginia', 'wy': 'wyoming', 'la': 'louisiana', 'hi': 'hawaii', 'ga': 'georgia', 'nv': 'nevada', 'ia': 'iowa', 'nm': 'new mexico', 'ms': 'mississippi', 'ne': 'nebraska', 'mt': 'montana', 'il': 'illinois', 'mn': 'minnesota', 'ca': 'california', 'sc': 'south carolina', 'ut': 'utah', 'ks': 'kansas', 'sd': 'south dakota', 'wa': 'washington'}
{'al': 'alabama', 'ia': 'iowa', 'ok': 'oklahoma', 'ca': 'california', 'ri': 'rhode island', 'ks': 'kansas', 'me': 'maine', 'wv': 'west virginia', 'in': 'indiana', 'az': 'arizona', 'sc': 'south carolina', 'nj': 'new jersey', 'wy': 'wyoming', 'or': 'oregon', 'nd': 'north dakota', 'mn': 'minnesota', 'wa': 'washington', 'ny': 'new york', 'mi': 'michigan', 'ky': 'kentucky'}
{'nm': 'new mexico', 'nh': 'new hampshire', 'ca': 'california', 'mt': 'montana', 'de': 'delaware', 'mn': 'minnesota', 'ms': 'mississippi', 'me': 'maine', 'wy': 'wyoming', 'ut': 'utah', 'az': 'arizona', 'mi': 'michigan', 'wa': 'washington', 'ar': 'arkansas', 'in': 'indiana', 'ok': 'oklahoma', 'ak': 'alaska', 'al': 'alabama', 'hi': 'hawaii', 'id': 'idaho'}
{'or': 'oregon', 'nj': 'new jersey', 'sd': 'south dakota', 'sc': 'south carolina', 'mt': 'montana', 'ms': 'mississippi', 'id': 'idaho', 'ak': 'alaska', 'ut': 'utah', 'in': 'indiana', 'al': 'alabama', 'nc': 'north carolina', 'mo': 'missouri', 'ok': 'oklahoma', 'pr': 'puerto rico', 'fl': 'florida', 'tx': 'texas', 'de': 'delaware', 'vi': 'virgin islands', 'ks': 'kansas'}
{'hi': 'hawaii', 'nh': 'new hampshire', 'ca': 'california', 'ga': 'georgia', 'nv': 'nevada', 'or': 'oregon', 'ok': 'oklahoma', 'wa': 'washington', 'va': 'virginia', 'mo': 'missouri', 'ne': 'nebraska', 'ct': 'connecticut', 'me': 'maine', 'tn': 'tennessee', 'vi': 'virgin islands', 'ks': 'kansas', 'nc': 'north carolina', 'sc': 'south carolina', 'mt': 'montana', 'az': 'arizona'}
{'nv': 'nevada', 'nm': 'new mexico', 'ne': 'nebraska', 'ak': 'alaska', 'ar': 'arkansas', 'co': 'colorado', 'ia': 'iowa', 'mo': 'missouri', 'wa': 'washington', 'nh': 'new hampshire', 'ok': 'oklahoma', 'oh': 'ohio', 'tx': 'texas', 'mi': 'michigan', 'de': 'delaware', 'ms': 'mississippi', 'nj': 'new jersey', 'me': 'maine', 'wy': 'wyoming', 'ca': 'california'}
{'ky': 'kentucky', 'sc': 'south carolina', 'nh': 'new hampshire', 'fl': 'florida', 'ne': 'nebraska', 'hi': 'hawaii', 'wi': 'wisconsin', 'wv': 'west virginia', 'ri': 'rhode island', 'dc': 'district of columbia', 'ct': 'connecticut', 'il': 'illinois', 'de': 'delaware', 'tn': 'tennessee', 'ca': 'california', 'or': 'oregon', 'nm': 'new mexico', 'mt': 'montana', 'me': 'maine', 'tx': 'texas'}
{'pr': 'puerto rico', 'al': 'alabama', 'vt': 'vermont', 'ky': 'kentucky', 'tx': 'texas', 'md': 'maryland', 'ms': 'mississippi', 'ar': 'arkansas', 'dc': 'district of columbia', 'me': 'maine', 'mo': 'missouri', 'mt': 'montana', 'in': 'indiana', 'wy': 'wyoming', 'fl': 'florida', 'la': 'louisiana', 'nv': 'nevada', 'pa': 'pennsylvania', 'sc': 'south carolina', 'ct': 'connecticut'}
{'ar': 'arkansas', 'de': 'delaware', 'ks': 'kansas', 'ca': 'california', 'nh': 'new hampshire', 'wi': 'wisconsin', 'mi': 'michigan', 'ak': 'alaska', 'fl': 'florida', 'md': 'maryland', 'mn': 'minnesota', 'nj': 'new jersey', 'nv': 'nevada', 'ky': 'kentucky', 'sc': 'south carolina', 'vt': 'vermont', 'nc': 'north carolina', 'dc': 'district of columbia', 'ms': 'mississippi', 'or': 'oregon'}
{'ia': 'iowa', 'nv': 'nevada', 'az': 'arizona', 'al': 'alabama', 'tx': 'texas', 'wv': 'west virginia', 'nm': 'new mexico', 'ga': 'georgia', 'id': 'idaho', 'ca': 'california', 'pa': 'pennsylvania', 'mt': 'montana', 'md': 'maryland', 'dc': 'district of columbia', 'vt': 'vermont', 'ut': 'utah', 'de': 'delaware', 'hi': 'hawaii', 'oh': 'ohio', 'ok': 'oklahoma'}
{'md': 'maryland', 'ok': 'oklahoma', 'tx': 'texas', 'ks': 'kansas', 'sd': 'south dakota', 'il': 'illinois', 'tn': 'tennessee', 'co': 'colorado', 'al': 'alabama', 'oh': 'ohio', 'ct': 'connecticut', 'pa': 'pennsylvania', 'az': 'arizona', 'ny': 'new york', 'ak': 'alaska', 'ms': 'mississippi', 'ar': 'arkansas', 'nv': 'nevada', 'ga': 'georgia', 'mi': 'michigan'}
{'ct': 'connecticut', 'az': 'arizona', 'md': 'maryland', 'ga': 'georgia', 'tx': 'texas', 'nm': 'new mexico', 'in': 'indiana', 'wv': 'west virginia', 'de': 'delaware', 'nh': 'new hampshire', 'id': 'idaho', 'ms': 'mississippi', 'hi': 'hawaii', 'nd': 'north dakota', 'wi': 'wisconsin', 'dc': 'district of columbia', 'nv': 'nevada', 'ky': 'kentucky', 'ny': 'new york', 'ne': 'nebraska'}
{'ct': 'connecticut', 'az': 'arizona', 'md': 'maryland', 'ga': 'georgia', 'tx': 'texas', 'nm': 'new mexico', 'in': 'indiana', 'wv': 'west virginia', 'de': 'delaware', 'nh': 'new hampshire', 'id': 'idaho', 'ms': 'mississippi', 'hi': 'hawaii', 'nd': 'north dakota', 'wi': 'wisconsin', 'dc': 'district of columbia', 'nv': 'nevada', 'ky': 'kentucky', 'ny': 'new york', 'ne': 'nebraska'}
state_coordinates test ran 100 iterations in 0.21 seconds
Passed! Please submit.

Exercise 3: (1 points)

add_lat_long

Your task: define add_lat_long as follows:

Add two new columns to the input DataFrame (latitude and longitude), based on the Permanent Region column. Use the provided state_data dictionary to map state abbreviations to their corresponding geographic coordinates.

Input:

  • df: A Pandas DataFrame. It must include the column Permanent Region, which contains state abbreviations (e.g., "CA", "NY"). (For this exercise, the input df is the admission_df modified through prior steps.)
  • state_data: A list of dictionaries, where each dictionary represents a state and includes the following fields:
    • "abbr": State abbreviation (e.g., "CA", "NY")
    • "latitude": Latitude of the state (float)
    • "longitude": Longitude of the state (float)

Output:

  • A new Pandas DataFrame with two new columns added.:
    • latitude: Latitude of the state based on Permanent Region
    • longitude: Longitude of the state based on Permanent Region

Requirements/steps:

  1. Create a mapping from state abbreviations to their corresponding latitude and longitude using state_data. For example, the dictionary { "CA": {"latitude": 36.7783, "longitude": -119.4179} } maps "CA" to its coordinates.
  2. Use this mapping to populate the latitude and longitude columns based on the Permanent Region column in df.
  3. If a value in Permanent Region doesn't match any key in the mapping, set latitude and longitude to pd.NA.
In [9]:
### Solution - Exercise 3  
def add_lat_long(df: pd.DataFrame, state_data) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    
    #print(df.head())
    state_df = pd.DataFrame(state_data)
    state_df = state_df.rename(columns={'abbr': 'Permanent Region'})
    
    #print(state_df.head())

    merged_df = pd.merge(df, state_df, on='Permanent Region', how='left')
    #print(merged_df.head())

    merged_df['latitude'] = merged_df['latitude'].fillna(pd.NA)
    merged_df['longitude'] = merged_df['longitude'].fillna(pd.NA)

    #print(merged_df.head())
    final = merged_df.drop('state', axis=1)

    return final

### Demo function call
demo_df_add_lat_long = pd.DataFrame({"Permanent Region": ["FL", "NY", "CA", "GA", "FR"]})
demo_state_data = [
    {"state": "Florida", "latitude": 27.8333, "longitude": -81.717, "abbr": "FL"},
    {"state": "New York", "latitude": 42.1497, "longitude": -74.9384, "abbr": "NY"},
    {"state": "California", "latitude": 36.17, "longitude": -119.7462, "abbr": "CA"},
    {"state": "Georgia", "latitude": 32.9866, "longitude": -83.6487, "abbr": "GA"}
]
demo_output_add_lat_long = add_lat_long(demo_df_add_lat_long, demo_state_data)
display(demo_output_add_lat_long)
Permanent Region latitude longitude
0 FL 27.8333 -81.717
1 NY 42.1497 -74.9384
2 CA 36.17 -119.7462
3 GA 32.9866 -83.6487
4 FR <NA> <NA>

Example: A correct implementation should produce the following output:

Permanent Region latitude longitude
FL 27.8333 -81.717
NY 42.1497 -74.9384
CA 36.17 -119.7462
GA 32.9866 -83.6487
FR <NA> <NA>


The cell below will test your solution for add_lat_long (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [10]:
### Test Cell - Exercise 3  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=add_lat_long,
              ex_name='add_lat_long',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to add_lat_long did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
add_lat_long test ran 100 iterations in 0.94 seconds
Passed! Please submit.

Exercise 4: (2 points)

calculate_distance

Your task: define calculate_distance as follows:

Calculate the distance in miles from each student's location to the school. The location of each student is determined by the latitude and longitude columns in the input DataFrame, while the school's location is specified by the coordinates latitude 34.1271 and longitude -118.2109.

Input:

  • df: A Pandas DataFrame containing two columns:
    • latitude: Latitude of the student's location in decimal degrees.
    • longitude: Longitude of the student's location in decimal degrees.

Output:

  • A new Pandas DataFrame that includes all columns from the input df, with an additional column:
    • distance_from_school: The distance (in miles) from each student's location to the school.

Requirements/steps:

  1. Convert Columns to Numeric:
    • Use the pd.to_numeric function to convert the latitude and longitude columns to numeric values and coerce any errors. This step ensures that non-numeric or missing values are converted to NaN.
  2. Convert Coordinates to Radians:
    • Use the np.radians function to convert all latitude and longitude values (both student and school) from degrees to radians.
  3. Apply the Haversine Formula:

    • Compute the distance between each student's location and the school's location using the following formula:

      d=2rarcsin(sin2(Δϕ2)+cos(ϕ1)cos(ϕ2)sin2(Δλ2))

    • Definitions:

      • ϕ1 and ϕ2 are the latitudes (in radians) of the school and the student, respectively.
      • λ1 and λ2 are the longitudes (in radians) of the school and the student, respectively.
      • Δϕ=ϕ2ϕ1 (difference in latitudes).
      • Δλ=λ2λ1 ( ifference in longitudes).
      • r is the Earth's radius (mean radius = 3,959 miles).
  4. Add the New Column:
    • Create a new column, distance_from_school, to store the calculated distances.
In [11]:
### Solution - Exercise 4  
def calculate_distance(df: pd.DataFrame, school_lat: float, school_lon: float) -> pd.DataFrame:
    r"""Calculate the distance in miles from each student's location 
to the school. The location of each student is determined by the `latitude` and `longitude` columns in the input DataFrame, 
while the school's location is specified by the coordinates latitude `34.1271` and longitude `-118.2109`.

**Input:**
  - `df`: A Pandas DataFrame containing two columns:
    - `latitude`: Latitude of the student's location in decimal degrees.
    - `longitude`: Longitude of the student's location in decimal degrees.

**Output:**
  - A new Pandas DataFrame that includes all columns from the input `df`, with an additional column:
    - `distance_from_school`: The distance (in miles) from each student's location to the school.

**Requirements/steps:**
  1. **Convert Columns to Numeric:**
     - Use the `pd.to_numeric` function to convert the `latitude` and `longitude` columns to numeric values and coerce any errors. This step ensures that non-numeric or missing values are converted to `NaN`.
  2. **Convert Coordinates to Radians:**
     - Use the `np.radians` function to convert all latitude and longitude values (both student and school) from degrees to radians.
  3. **Apply the Haversine Formula:**
     - Compute the distance between each student's location and the school's location using the following formula:
     
       $$d = 2r \cdot \arcsin \left( \sqrt{\sin^2 \left( \frac{\Delta \phi}{2} \right) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2 \left( \frac{\Delta \lambda}{2} \right)} \right)$$
     
     - Definitions:
       - $\phi_1$ and $\phi_2$ are the latitudes (in radians) of the school and the student, respectively.
       - $\lambda_1$ and $\lambda_2$ are the longitudes (in radians) of the school and the student, respectively.
       - $\Delta \phi = \phi_2 - \phi_1$ (difference in latitudes).
       - $\Delta \lambda = \lambda_2 - \lambda_1$ (difference in longitudes).
       - $r$ is the Earth's radius (mean radius = 3,959 miles).
  4. **Add the New Column:**
     - Create a new column, `distance_from_school`, to store the calculated distances.
"""
    ###
    ### YOUR CODE HERE
    ###
    #print(df.head())
    #print(school_lat)
    #print(school_lon)
    
    school_lat_rad = np.radians(school_lat)
    school_lon_rad = np.radians(school_lon)

    df_copy = df.copy()
    
    df_copy['latitude'] = pd.to_numeric(df_copy['latitude'], errors='coerce')
    df_copy['longitude'] = pd.to_numeric(df_copy['longitude'], errors='coerce')

    df_copy['latitude_rad'] = np.radians(df_copy['latitude'])
    df_copy['longitude_rad'] = np.radians(df_copy['longitude'])

    df_copy['delta_lat'] = df_copy['latitude_rad'] - school_lat_rad
    df_copy['delta_lon'] = df_copy['longitude_rad'] - school_lon_rad

    a = (np.sin(df_copy['delta_lat'] / 2) ** 2 + 
         np.cos(school_lat_rad) * np.cos(df_copy['latitude_rad']) * np.sin(df_copy['delta_lon'] / 2) ** 2)
    c = 2 * np.arcsin(np.sqrt(a))
    r = 3959

    df_copy['distance_from_school'] = (r * c).round(3)

    final = df_copy.drop(columns=['latitude_rad', 'longitude_rad', 'delta_lat', 'delta_lon'])

    return final

### Demo function call
demo_df_calculate_distance = pd.DataFrame({
    "Permanent Region": ["FL", "NY", "CA", "GA", "TX"],
    "latitude": [27.8333, 42.1497, 36.17, 32.9866, "invalid"],
    "longitude": [-81.717, -74.9384, -119.7462, -83.6487, "missing"]
})

demo_output_calculate_distance = calculate_distance(
    demo_df_calculate_distance, 
    school_lat=34.1271, 
    school_lon=-118.2109
)
demo_output_calculate_distance
Out[11]:
Permanent Region latitude longitude distance_from_school
0 FL 27.8333 -81.7170 2193.210
1 NY 42.1497 -74.9384 2389.335
2 CA 36.1700 -119.7462 165.675
3 GA 32.9866 -83.6487 1982.194
4 TX NaN NaN NaN

Example: A correct implementation should produce the following output:

Permanent Region latitude longitude distance_from_school
0 FL 27.8333 -81.717 2193.210034
1 NY 42.1497 -74.9384 2389.334519
2 CA 36.17 -119.7462 165.674547
3 GA 32.9866 -83.6487 1982.193883
4 TX NaN NaN NaN


The cell below will test your solution for calculate_distance (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [12]:
### Test Cell - Exercise 4  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=calculate_distance,
              ex_name='calculate_distance',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_distance did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calculate_distance test ran 100 iterations in 1.06 seconds
Passed! Please submit.

Exercise 5: (1 points)

one_hot_encode

Your task: define one_hot_encode as follows:

Perform one-hot encoding on a dataset using pandas.get_dummies.

Input:

  • df: A pandas DataFrame containing the columns to be one-hot encoded.

Output:

  • encoded_df: A pandas DataFrame containing:
    • The specified columns one-hot encoded, with new columns for each unique category in the original columns.
    • All other columns from the input df retained unmodified.

Requirements/steps:

  1. One-hot encode the following columns:
    • "Financial Aid Intent"
    • "Scholarship"
    • "Ethnic Background"
    • "Permanent Region"
    • "Sex"
    • "Level of Financial Need"
  2. Ensure that the resulting one-hot encoded columns are encoded as floats.
  3. Retain all other columns in the original DataFrame without modification.

Additional Notes:

  • Refer to the pandas.get_dummies documentation for more information.
In [13]:
### Solution - Exercise 5  
def one_hot_encode(df: pd.DataFrame):

    ###
    ### YOUR CODE HERE
    ###
    
    cols = [
        "Financial Aid Intent",
        "Scholarship",
        "Ethnic Background",
        "Permanent Region",
        "Sex",
        "Level of Financial Need"]

    one_hot_df = pd.get_dummies(df, columns = cols, dtype=float)

    return one_hot_df

### Demo function call
demo_df_one_hot_encode = pd.DataFrame({
    'Financial Aid Intent': ['FAY', 'FAY', 'FAN', 'FAY', 'FAY', 'FAN', 'FAN', 'FAY', 'FAY', 'FAN'],
    'Scholarship': ['NO', 'LDRS', 'NO', 'NO', 'No Scholarship', 'DIR', 'DIR', 'NO', 'NO', 'DIR'],
    'Ethnic Background': ['White', 'Black', 'Asian', 'White', 'Multiracial', 'White', 'White', 'Black', 'Asian', 'White'],
    'Permanent Region': ['CA', 'NY', 'CA', 'CA', 'CA', 'WA', 'WA', 'NY', 'CA', 'WA'],
    'Sex': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M'],
    'Level of Financial Need': ['Low', 'Very High', 'Medium', 'Low', 'High', 'Low', 'Low', 'Medium', 'High', 'Very High']
})

demo_output_one_hot_encode = one_hot_encode(demo_df_one_hot_encode)
display(demo_output_one_hot_encode)
Financial Aid Intent_FAN Financial Aid Intent_FAY Scholarship_DIR Scholarship_LDRS Scholarship_NO Scholarship_No Scholarship Ethnic Background_Asian Ethnic Background_Black Ethnic Background_Multiracial Ethnic Background_White Permanent Region_CA Permanent Region_NY Permanent Region_WA Sex_F Sex_M Level of Financial Need_High Level of Financial Need_Low Level of Financial Need_Medium Level of Financial Need_Very High
0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
2 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
3 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
4 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
5 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
6 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
7 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
8 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
9 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0

Example: A correct implementation should produce the following output:

Financial Aid Intent_FAN Financial Aid Intent_FAY Scholarship_DIR Scholarship_LDRS Scholarship_NO Scholarship_No Scholarship Ethnic Background_Asian Ethnic Background_Black Ethnic Background_Multiracial Ethnic Background_White Permanent Region_CA Permanent Region_NY Permanent Region_WA Sex_F Sex_M Level of Financial Need_High Level of Financial Need_Low Level of Financial Need_Medium Level of Financial Need_Very High
0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0


The cell below will test your solution for one_hot_encode (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [14]:
### Test Cell - Exercise 5  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=one_hot_encode,
              ex_name='one_hot_encode',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to one_hot_encode did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
one_hot_encode test ran 50 iterations in 11.20 seconds
Passed! Please submit.

Exercise 6: (3 points)

balance_split_data

Your task: define balance_split_data as follows:

Balance the dataset by oversampling the minority class and split the data into training and test datasets. You will use the train_test_split and resample functions from sklearn to complete this task. The functions have been imported for you. Refer to the documentation below for more information.:

Input:

  • df: A pandas DataFrame containing the student admissions data.
  • ind_col: The indicator column containing the dependent variable encoded as binary: 0 or 1 (Gross Commit Indicator in our data).
  • test_size: A float between 0.0 and 1.0 representing the proportion of the dataset to include in the test split.
  • random_state: An integer representing the random state for reproducibility.

Output:

  • X_train: A pandas DataFrame containing the features for the training dataset.
  • X_test: A pandas DataFrame containing the features for the test dataset.
  • y_train: A pandas Series containing the target variable for the training dataset.
  • y_test: A pandas Series containing the target variable for the test dataset.

Each object represents a part of the split balanced dataset, with X_train and X_test containing the independent variables (features), and y_train and y_test containing the dependent variable (target). The features in X_train and X_test retain the same column names as the input DataFrame, excluding the indicator column (ind_col) used as the target.

Requirements/steps:

  1. Oversample the Minority Class:
    • Separate the DataFrame into two subsets: one for the minority class (ind_col = 1) and one for the majority class (ind_col = 0).
    • Use sklearn.utils.resample to oversample the minority class to match the majority class size.
    • Pass the random_state parameter to the resampling function for reproducibility.
    • Combine the two subsets to create a balanced dataset.
  2. Split the Balanced Data:
    • Use train_test_split to split the balanced data into training and test sets.
    • Pass the random_state parameter to the split function for reproducibility.
    • Set the stratify parameter to maintain the class balance in both splits.
  3. Return the features (X_train, X_test) and target variables (y_train, y_test) for both sets.

Notes:

  • The same random state value (from the function call) should be used for both the resampling and the split functions.
In [17]:
### Solution - Exercise 6  
def balance_split_data(df: pd.DataFrame, ind_col: str, test_size: float, random_state: int):
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample

    ###
    ### YOUR CODE HERE
    ###
    
    # First balance data
    
    majority_class = df[df[ind_col] == 0]
    minority_class = df[df[ind_col] == 1]
    
    #print(len(minority_class))
    #print(len(majority_class))

    minority_oversampled = resample(minority_class,
                                    #replace=True,
                                    n_samples=len(majority_class),
                                    random_state=random_state)
    
    #print(len(minority_oversampled))

    balanced_df = pd.concat([majority_class, minority_oversampled])
    
    #print(len(balanced_df))
    
    # Perform split on balanced data
    
    X = balanced_df.drop(columns=[ind_col])
    y = balanced_df[ind_col]

    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=test_size,
                                                        random_state=random_state,
                                                        stratify=y)

    return X_train, X_test, y_train, y_test
    

### Demo function call

demo_df_balance_encode_split_data = pd.DataFrame({
    'Gross Commit Indicator': [0, 0, 0, 0, 1, 1, 1, 1, 0, 1],
    "Financial Aid Intent_FAN": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
    "Financial Aid Intent_FAY": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "Scholarship_DIR": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "Scholarship_LDRS": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Scholarship_NO": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    "Scholarship_No Scholarship": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Ethnic Background_Asian": [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "Ethnic Background_Black": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    "Ethnic Background_White": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'GPA': [3.67, 3.76, 3.58, 4.0, 3.9, 3.72, 3.94, 3.76, 3.58, 3.67],
    'HS Class Size': [80.0, 26.0, 642.0, 303.0, 288.0, 288.0, 77.0, 77.0, 303.0, 288.0],
    "Ethnic Background_Multiracial": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Permanent Region_CA": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    "Permanent Region_NY": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    "Permanent Region_WA": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Sex_F": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    "Sex_M": [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Level of Financial Need_High": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Level of Financial Need_Low": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "Level of Financial Need_Medium": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "Level of Financial Need_Very High": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
})

X_train, X_test, y_train, y_test = balance_split_data(
    demo_df_balance_encode_split_data, 
    ind_col='Gross Commit Indicator', 
    test_size=0.2,
    random_state=42
)

print(f"X_train.shape: {X_train.shape}")
print(f"X_test.shape:  {X_test.shape}")
print(f"y_train.shape: {y_train.shape}")
print(f"y_test.shape:  {y_test.shape}")
X_train.shape: (8, 21)
X_test.shape:  (2, 21)
y_train.shape: (8,)
y_test.shape:  (2,)

Example: A correct implementation should produce the following output:

X_train.shape: (8, 21)
X_test.shape:  (2, 21)
y_train.shape: (8,)
y_test.shape:  (2,)


The cell below will test your solution for balance_split_data (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [19]:
### Test Cell - Exercise 6  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.series_handler(balance_split_data),
              ex_name='balance_split_data',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=30)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to balance_split_data did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
balance_split_data test ran 30 iterations in 2.04 seconds
Passed! Please submit.

Exercise 7: (2 points)

calculate_f1_score

Your task: define calculate_f1_score as follows:

Calculate the F1 score for a classification model's predictions.

The F1 score is a metric that balances precision and recall, making it particularly useful for evaluating models on imbalanced datasets.

Input:

  • y_true: A list or NumPy array of true class labels (must contain only 0 or 1).
  • y_pred: A list or NumPy array of predicted class labels (must contain only 0 or 1).

Output:

  • f1_score: A float representing the F1 score, rounded to 3 decimal places.

Definitions:

  • True Positives (TP): The number of correctly predicted positive class instances.
  • False Positives (FP): The number of instances incorrectly predicted as positive (but actually negative).
  • False Negatives (FN): The number of instances incorrectly predicted as negative (but actually positive).

Formulas:

  1. Precision (P):

    P=TPTP+FP
    Precision measures the proportion of positive predictions that are correct.

  2. Recall (R):

    R=TPTP+FN
    Recall measures the proportion of actual positive instances that are correctly predicted.

  3. F1 Score (F1):

    F1=2PRP+R
    The F1 score is the harmonic mean of precision and recall.

Special Cases:

  • If TP+FP=0 (no positive predictions), precision is undefined.
  • If TP+FN=0 (no actual positives), recall is undefined.
  • In both cases, set the F1 score to 0 to handle division by zero.

Hints:

  • Use NumPy's vectorized operations (e.g., np.sum) to efficiently calculate TP, FP, and FN.
In [20]:
### Solution - Exercise 7  
def calculate_f1_score(y_true, y_pred):
    r"""
    Calculate the F1 score for a classification model's predictions.

    The F1 score is a metric that balances precision and recall, making it particularly useful for evaluating models on imbalanced datasets.

    **Input:**
    - `y_true`: A list or NumPy array of true class labels (must contain only 0 or 1).
    - `y_pred`: A list or NumPy array of predicted class labels (must contain only 0 or 1).

    **Output:**
    - `f1_score`: A float representing the F1 score, rounded to 3 decimal places.

    **Definitions:**
    - **True Positives (TP):** The number of correctly predicted positive class instances.
    - **False Positives (FP):** The number of instances incorrectly predicted as positive (but actually negative).
    - **False Negatives (FN):** The number of instances incorrectly predicted as negative (but actually positive).

    **Formulas:**
    1. **Precision ($P$):**
        $$P = \frac{TP}{TP + FP}$$
        Precision measures the proportion of positive predictions that are correct.

    2. **Recall ($R$):**
        $$R = \frac{TP}{TP + FN}$$
        Recall measures the proportion of actual positive instances that are correctly predicted.

    3. **F1 Score ($F1$):**
        $$F1 = 2 \cdot \frac{P \cdot R}{P + R}$$
        The F1 score is the harmonic mean of precision and recall.

    **Special Cases:**
    - If $TP + FP = 0$ (no positive predictions), precision is undefined.
    - If $TP + FN = 0$ (no actual positives), recall is undefined.
    - In both cases, set the F1 score to 0 to handle division by zero.

    **Hints:**
    - Use NumPy's vectorized operations (e.g., `np.sum`) to efficiently calculate $TP$, $FP$, and $FN$.
    """
    ###
    ### YOUR CODE HERE
    ###
    
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    TP = np.sum((y_true == 1) & (y_pred == 1))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))

    precision = TP / (TP + FP)

    recall = TP / (TP + FN)
    
    f1_score = 2 * (precision * recall) / (precision + recall)

    return round(f1_score, 3)

### Demo function call
demo_y_true_f1_score = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
demo_y_pred_f1_score = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])

demo_output_f1_score = calculate_f1_score(demo_y_true_f1_score, demo_y_pred_f1_score)
print(demo_output_f1_score) 
0.8

Example: A correct implementation should produce the following output:

# Example Input
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

# Example Output
F1 Score: 0.8


The cell below will test your solution for calculate_f1_score (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [21]:
### Test Cell - Exercise 7  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=calculate_f1_score,
              ex_name='calculate_f1_score',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_f1_score did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calculate_f1_score test ran 100 iterations in 0.09 seconds
Passed! Please submit.

Exercise 8: (2 points)

compute_logistic_metrics

Your task: define compute_logistic_metrics as follows:

Calculate the predictions and gradients required for training a logistic regression model.

Input:

  • X: A NumPy array of shape (m, n+1) containing the input features. The first column must be all ones, representing the bias term. Each row represents a single observation, and each column represents a feature.
  • y_true: A NumPy array of shape (m,) containing the true class labels (0 or 1) for each observation.
  • weights: A NumPy array of shape (n+1,) representing the model weights, including the bias term.

Output:

  • sigmoid_predictions: A NumPy array of shape (m,) containing the predicted probabilities for each observation. These values are computed using the sigmoid function.
  • gradients_w: A NumPy array of shape (n+1,) containing the gradient of the cost with respect to each weight, including the bias term.

Requirements/steps:

  1. Compute the Linear Combination of Inputs:
    • Calculate z, the linear combination of inputs, using the formula:
      z=Xweights
  2. Compute Sigmoid Predictions:
    • Apply the sigmoid function to z to compute predicted probabilities:
      σ(z)=11+ez
    • Note: These predicted probabilities, σ(z), are referred to as y^ in logistic regression.
  3. Compute Gradients:
    • Calculate the gradient of the cost with respect to the weights:
      Jw=1mXT(y^y)
    • Here, y^ represents the predicted probabilities (sigmoid predictions), and m is the number of samples in the dataset.

Hints:

  • Use NumPy's vectorized operations for efficient computation of gradients.
  • Note that X includes the bias term as the first column.
In [22]:
### Solution - Exercise 8  
def compute_logistic_metrics(X, y_true, weights):
    r"""
    Calculate the predictions and gradients required for training a logistic regression model.

    **Input:**
    - `X`: A NumPy array of shape (m, n+1) containing the input features. The first column must be all ones, representing the bias term. Each row represents a single observation, and each column represents a feature.
    - `y_true`: A NumPy array of shape (m,) containing the true class labels (0 or 1) for each observation.
    - `weights`: A NumPy array of shape (n+1,) representing the model weights, including the bias term.

    **Output:**
    - `sigmoid_predictions`: A NumPy array of shape (m,) containing the predicted probabilities for each observation. These values are computed using the sigmoid function.
    - `gradients_w`: A NumPy array of shape (n+1,) containing the gradient of the cost with respect to each weight, including the bias term.

    **Requirements/steps:**
    1. **Compute the Linear Combination of Inputs**:
        - Calculate $z$, the linear combination of inputs, using the formula:
        $$z = X \cdot weights$$
    2. **Compute Sigmoid Predictions**:
        - Apply the sigmoid function to $z$ to compute predicted probabilities:
        $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
        - Note: These predicted probabilities, $\sigma(z)$, are referred to as $\hat{y}$ in logistic regression. 
    3. **Compute Gradients**:
        - Calculate the gradient of the cost with respect to the weights:
        $$\frac{\partial J}{\partial w} = \frac{1}{m} X^T (\hat{y} - y)$$
        - Here, $\hat{y}$ represents the predicted probabilities (sigmoid predictions), and $m$ is the number of samples in the dataset.

    **Hints:**
    - Use NumPy's vectorized operations for efficient computation of gradients.
    - Note that $X$ includes the bias term as the first column.
    """

    ###
    ### YOUR CODE HERE
    ###
    
    z = np.dot(X, weights)

    sig_preds = 1 / (1 + np.exp(-z))

    m = X.shape[0]

    error = sig_preds - y_true

    gradients = (1/m) * np.dot(X.T, error)

    return sig_preds, gradients

### Demo function call
# Example Input
demo_X_logistic_metrics = np.array([[1, 1, 2], [1, 3, 4], [1, 5, 6]])  # Includes the bias column
demo_y_true_logistic_metrics = np.array([0, 1, 1])
demo_weights_logistic_metrics = np.array([0.5, 0.1, 0.2])  # Includes bias as the first weight

# Compute Outputs
demo_sigmoid_predictions, demo_gradients_w = compute_logistic_metrics(
    demo_X_logistic_metrics, demo_y_true_logistic_metrics, demo_weights_logistic_metrics
)

# Print Outputs
print(f"Sigmoid Predictions: {demo_sigmoid_predictions}")
print(f"Gradients (Weights): {demo_gradients_w}")
Sigmoid Predictions: [0.73105858 0.83201839 0.90024951]
Gradients (Weights): [ 0.15444216 -0.09054624  0.06389592]

Example: A correct implementation should produce the following output:

# Example Input
   X = np.array([[1, 1, 2], [1, 3, 4], [1, 5, 6]])  # Includes the bias column as the first column
   y_true = np.array([0, 1, 1])
   weights = np.array([0.5, 0.1, 0.2])

   # Example Output
   Sigmoid Predictions: array([0.73105858 0.83201839 0.90024951])
   Gradients (Weights): array([ 0.15444216 -0.09054624  0.06389592])


The cell below will test your solution for compute_logistic_metrics (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [23]:
### Test Cell - Exercise 8  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=compute_logistic_metrics,
              ex_name='compute_logistic_metrics',
              key=b'apvdqSXE1hpoezgeyhgb6Y557k-QtNd5WaF1QCOuIQE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to compute_logistic_metrics did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
compute_logistic_metrics test ran 100 iterations in 0.17 seconds
Passed! Please submit.

Fin

If you have made it this far, congratulations! You are done. Please submit your exam!

The remainder of this notebook combines the work you have completed above with a few addition steps to build a working logistic regression model.

Epilogue: It's Time to Build a Model

## Data Exploration and Cleaning

# Step 0: Fill in missing values and standardize feature formats
cleaned_df = data_cleaning_and_standardization(admission_df)

# Step 1: Create clean state_data dictionary
abbr_dict = utils.load_object_from_publicdata('abbr_dict.dill')
coor_dict = utils.load_object_from_publicdata('coor_dict.dill')
state_data = (abbr_dict, coor_dict)


## Feature Engineering

# Step 2: Use state_data to feature engineer new columns in cleaned_df
lat_long_df = add_lat_long(cleaned_df, state_data)

# Step 3: Add student distance from school
distance_df = calculate_distance(lat_long_df, 34.1271, -118.2109)

# Step 4: One-hot encode categorical features
one_hot_df = one_hot_encode(distance_df)

## Model Building and Evaluation

# Step 5: Split data into train and test while fixing class imbalance
X_train, X_test, y_train, y_test = balance_split_data(one_hot_df, 'Gross Commit Indicator', 0.2, 42)

# Step 6: Train logistic regression model using (exercise is left to the reader)
def train_logistic_regression(X_train, y_train):
    # Initialize the weights and biases to zero

    # Implement the cost and gradients from previous exercises

    # Use gradient desent to update the weights and biases

  return weights, bias


# Step 7: Make predictions
def predict_logistic_regression(X_train, y_train, X_test):
    weights, bias = train_logistic_regression(X_train, y_train)

    # Apply sigmoid function to compute predicted probabilities
    y_pred = 

    y_pred_class = (y_pred >= 0.5).astype(int)
    return y_pred_class
y_pred = predict_logistic_regression(X_train, y_train, X_test)

# Step 8: Evaluate the model
f1_score = calculate_f1_score(y_test, y_pred)
print(f"Model F1 Score: {f1_score}")

Model F1 Score: 0.595

Our model converges!

In [ ]:
 
In [ ]: