Midterm 2, Spring 2025: Data Deidentification
Version 0.0.2
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of SQL, basic Python, pandas, and NumPy. It has 11 exercises, numbered 0 to 10. There are 22 available points; however, the threshold to earn 100% is 14 points. (Therefore, once you hit 14 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!!
load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head or a few rows at a time).
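For example, once a test cell has run, a short loop like the one below can show each debugging variable a few rows at a time. This is only a sketch; it assumes the test cell has already defined these dictionaries in your environment.

# Sketch: peek at the debugging dictionaries without flooding the notebook output
for label, variables in [('returned', returned_output_vars), ('expected', true_output_vars)]:
    for name, value in variables.items():
        print(f'--- {label}: {name} ---')
        # Show only the head of DataFrame/Series objects; print everything else as-is
        print(value.head() if hasattr(value, 'head') else value)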
Exercise point breakdown:
Exercise 0: 1 point(s)
Exercise 1: 2 point(s)
Exercise 2: 1 point(s)
Exercise 3: 2 point(s)
Exercise 4: 3 point(s)
Exercise 5: 3 point(s)
Exercise 6: 2 point(s)
Exercise 7: 1 point(s)
Exercise 8: 2 point(s)
Exercise 9: 2 point(s)
Exercise 10: 3 point(s)
Final reminders:
This exam mimics the process of exploring and extracting data from a university database for third-party research. The law in the United States requires your university to protect your personal information when working with outside organizations.
For this exam, you are given a mock university database, and you must "de-identify" it for sharing with outside researchers. Deidentification refers to hiding or removing details that could be used to associate data with specific people. Your overall task is to do the following:
Explore the Database Schema
Use SQL queries to explore the mock university database and understand its structure, including table names, columns, data types, and relationships between tables.
Pull the Required Data
Write SQL queries to retrieve the required data for analysis, ensuring the minimum necessary data is selected for the task.
Deidentify the Data
Use Python, Pandas, and NumPy to deidentify the data. This includes standardizing column names, shifting term dates, replacing student IDs with anonymized identifiers, replacing categorical values with generic codes, and perturbing GPA values.
By the end of this exam, you will have a deidentified dataset ready for third-party research.
Run the two cells below to load the global imports, database connection, and objects needed for the exam. In particular, take note of conn, a SQLite3 connection object. The exam will not run without these cells.
### Global imports
import dill
from cse6040_devkit import plugins, utils
import re
import sqlite3
import hashlib
import pandas as pd
import numpy as np
from pprint import pprint
# Load the database and objects
conn = sqlite3.connect('resource/asnlib/publicdata/university.db')
SEED = 6040
The database has eight (8) tables whose overall organization and columns are depicted below. You do not need to understand all of these details right now, but you may find it helpful to refer back to them as you progress through the exam. Consider taking a screenshot, opening the picture in a new browser tab, using the Table of Contents feature to get back to this information later, or referring to the following PDF copy of the image.
student_main
Contains primary student information, including:
- student_id: Unique identifier for each student (primary key).
- last_name, first_name, middle_initial: Basic personal details.
- email: Email address, which may be null for some records.
- gender, ethnicity: Demographic details.
- address: Residential address.
- us_citizen, us_resident, state_resident, pell_recipient, us_veteran: Indicators for citizenship, residency, financial aid, and veteran status.

student_enrollment
Tracks student enrollment details, including:
- id: Primary key for each enrollment record.
- student_id: References student_main.student_id.
- term: Academic term (e.g., "202308" for Fall 2023).
- major_code: Code for the student's major.
- semester_hours_attempted, semester_hours_earned: Academic credit details.
- semester_gpa, cumulative_hours_earned, cumulative_gpa: Academic performance metrics.

graduation
Stores graduation records, including:
- id: Primary key for each graduation record.
- student_id: References student_main.student_id.
- last_enroll_term, grad_term: Last enrollment and graduation terms.
- grad_level: Graduation level (e.g., Bachelor, Master, Doctorate).
- grad_status: Graduation status (e.g., active, completed).

student_key
Links students to financial and employee records:
- student_id: Primary key, references student_main.student_id.
- finance_id: Financial record identifier.
- employee_id: Employee identifier (nullable).

student_scholarship
Records student scholarships and financial aid details, including:
- id: Primary key for each scholarship record.
- finance_id: References student_key.finance_id.
- scholarship_term: Term during which the scholarship was awarded.
- scholarship_code: Code identifying the scholarship, references scholarship_rules.scholarship_code.
- scholarship_total, scholarship_payment, scholarship_refund: Financial details of the scholarship.

scholarship_rules
Defines eligibility rules for scholarships, including:
- id: Primary key for each scholarship rule.
- scholarship_crosswalk_id: References scholarship_crosswalk.id.
- scholarship_code: Code identifying the scholarship.
- major_code: Eligibility criteria based on major, references major_crosswalk.major_code.
- activation_date: Date the rule became effective.
- scholarship_active: Indicates if the scholarship is currently active.
- min_gpa: Minimum GPA required to qualify.
- gender, pell_recipient, us_veteran, us_citizen, us_resident, state_resident: Eligibility criteria for specific demographics.
- amount: Scholarship amount awarded.

scholarship_crosswalk
Maps scholarships to descriptive details:
- id: Primary key for each scholarship description.
- scholarship_code: Code identifying the scholarship.
- scholarship_description: Description of the scholarship.
- activation_date: Date the description became effective.

major_crosswalk
Maps major codes to detailed descriptions:
- id: Primary key for each major description.
- major_code: Code for the major (e.g., "CS" for Computer Science).
- major_description: Descriptive name of the major.
- activation_date: Date the description became effective.

Review the helper functions below before tackling the exercises.
We have provided the following functions to assist you in exploring the university database. Run each cell to load the functions into your environment and view a demonstration of their usage. You are encouraged to use these functions to explore the database, understand its structure, and develop solutions to the exercises.
def get_table_list(conn):
"""
Retrieves the list of tables from the SQLite database.
Args:
conn: SQLite database connection object
Returns:
list: A list of table names in the database
"""
# Query to get all tables from sqlite_master
tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
tables_df = pd.read_sql_query(tables_query, conn)
# Return the list of table names
return tables_df['name'].tolist()
get_table_list_demo = get_table_list(conn)
for table in get_table_list_demo:
print(table)
def get_column_details(conn, table_name):
"""
Retrieves column details for a specific table.
Args:
conn: SQLite database connection object
table_name (str): Name of the table to get column details for
Returns:
DataFrame: A DataFrame containing column details with columns:
- column_id: Column ID (0-based index)
- column_name: Column name
- data_type: Data type of the column
- not_null: Whether column allows NULL values (1=NOT NULL, 0=NULL allowed)
- primary_key: Whether column is a primary key (1=PRIMARY KEY, 0=not a primary key)
- table_name: Name of the table
Raises:
AssertionError: If the provided table_name does not exist in the database
"""
# Get the list of tables in the database and verify table_name exists
tables = get_table_list(conn)
assert table_name in tables, f"Table '{table_name}' does not exist in the database"
# Use PRAGMA table_info to get column details
pragma_query = f"PRAGMA table_info('{table_name}')"
column_details = pd.read_sql_query(pragma_query, conn)
# Add table name to the DataFrame
column_details['table_name'] = table_name
# Remove unnecessary columns
column_details = column_details[['cid', 'name', 'type', 'notnull', 'pk', 'table_name']]
# Rename columns for clarity
column_details = column_details.rename(columns={
'cid': 'column_id',
'name': 'column_name',
'type': 'data_type',
'notnull': 'not_null',
'pk': 'primary_key'
})
return column_details
get_column_details_demo_df = get_column_details(conn, 'student_main')
display(get_column_details_demo_df)
def get_foreign_key_details(conn, table_name):
"""
Retrieves foreign key details for a specific table.
Args:
conn: SQLite database connection object
table_name (str): Name of the table to get foreign key details for
Returns:
DataFrame: A DataFrame containing foreign key details with columns:
- column_name: Column name in the current table
- references_table: Referenced table name
- references_column: Referenced column name
- table_name: Name of the table
Raises:
AssertionError: If the provided table_name does not exist in the database
"""
# Get the list of tables in the database and verify table_name exists
tables = get_table_list(conn)
assert table_name in tables, f"Table '{table_name}' does not exist in the database"
# Use PRAGMA foreign_key_list to get foreign key details
pragma_query = f"PRAGMA foreign_key_list('{table_name}')"
fk_details = pd.read_sql_query(pragma_query, conn)
# If the DataFrame is empty, return an empty DataFrame with the required columns
if fk_details.empty:
empty_df = pd.DataFrame(columns=['column_name', 'references_table', 'references_column', 'table_name'])
empty_df['table_name'] = table_name
return empty_df
# Add table name to the DataFrame
fk_details['table_name'] = table_name
# Keep only relevant columns and rename them for clarity
fk_details = fk_details[['table_name', 'from', 'table', 'to']]
fk_details.rename(columns={
'from': 'column_name',
'table': 'references_table',
'to': 'references_column'
}, inplace=True)
return fk_details
get_foreign_key_details_demo_df = get_foreign_key_details(conn, 'student_key')
display(get_foreign_key_details_demo_df)
def get_table_schema(conn, table_name):
"""
Combines column details and foreign key details for a specific table.
Args:
conn: SQLite database connection object
table_name (str): Name of the table to get the schema for
Returns:
DataFrame: A DataFrame containing complete schema information for the table
Raises:
AssertionError: If the provided table_name does not exist in the database
"""
# Get the list of tables in the database and verify table_name exists
tables = get_table_list(conn)
assert table_name in tables, f"Table '{table_name}' does not exist in the database"
# Get column details
column_details = get_column_details(conn, table_name)
# Get foreign key details
fk_details = get_foreign_key_details(conn, table_name)
# Drop the table_name column from foreign key details before merging
if not fk_details.empty:
fk_details.drop(columns=['table_name'], inplace=True)
# Merge only the foreign key columns we need, excluding table_name
table_schema = column_details.merge(
fk_details[['column_name', 'references_table', 'references_column']],
how='left',
on='column_name'
)
return table_schema
get_table_schema_demo_df = get_table_schema(conn, 'student_main')
display(get_table_schema_demo_df)
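The helpers above describe table structure; to see actual rows, you can also query a table directly with pd.read_sql. The snippet below is just an illustration (student_main is used as an example; any of the eight tables works):

# Illustration: peek at a few raw rows of a table using the existing connection
peek_df = pd.read_sql("SELECT * FROM student_main LIMIT 5", conn)
display(peek_df)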
In the SQL section of the exam, you will use the student_scholarship
table several times. Run the FREE exercise cell below to examine the table, understand the structure, and get your free point!
get_student_scholarship_schema__FREE
Example: we have defined get_student_scholarship_schema__FREE
as follows:
This is a free exercise!
**Please run the test cell below to collect your FREE point**
### Solution - Exercise 0
def get_student_scholarship_schema__FREE(conn):
# Get schema information specifically for the student_scholarship table
student_scholarship_schema = get_table_schema(conn, 'student_scholarship')
# Return the schema information
return student_scholarship_schema
### Demo function call
student_scholarship_schema = get_student_scholarship_schema__FREE(conn)
student_scholarship_schema
### Run Me!!!
demo_result_get_student_scholarship_schema__FREE_TRUE = utils.load_object_from_publicdata('demo_result_get_student_scholarship_schema__FREE_TRUE')
The demo should display this output.
column_id | column_name | data_type | not_null | primary_key | table_name | references_table | references_column | |
---|---|---|---|---|---|---|---|---|
0 | 0 | id | INTEGER | 1 | 1 | student_scholarship | NaN | NaN |
1 | 1 | finance_id | VARCHAR(12) | 0 | 0 | student_scholarship | student_key | finance_id |
2 | 2 | scholarship_term | VARCHAR(6) | 0 | 0 | student_scholarship | NaN | NaN |
3 | 3 | scholarship_code | VARCHAR | 0 | 0 | student_scholarship | scholarship_rules | scholarship_code |
4 | 4 | scholarship_total | INTEGER | 0 | 0 | student_scholarship | NaN | NaN |
5 | 5 | scholarship_payment | INTEGER | 0 | 0 | student_scholarship | NaN | NaN |
6 | 6 | scholarship_refund | INTEGER | 0 | 0 | student_scholarship | NaN | NaN |
The test cell below will always pass. Please submit to collect your free points for get_student_scholarship_schema__FREE (exercise 0).
### Test Cell - Exercise 0
print('Passed! Please submit.')
In this section, you will use SQL queries to explore the mock university database and compile the required data for analysis. The connection to the database has been made for you, as part of the notebook starter code above, and is available in the environment as conn.
binary_indicators
Your task: define binary_indicators
as follows:
To begin, it would be helpful to get a sense of the data in the tables. Start by writing a SQL query that calculates the percentage of students with a positive value for each binary indicator.
Inputs:
Return: A SQL query string whose result has two columns:
- status: The name of the binary indicator (column name)
- percentage: The percentage of students with a positive value for that indicator (between 0 and 100)
The status column should be first and the percentage column second; the order of the rows is not important.
Requirements:
- Use the student_main table.
- Include one row for each of the following indicators in the status column:
  - us_citizen column → Use 'US Citizen' as the status value
  - us_resident column → Use 'US Resident' as the status value
  - state_resident column → Use 'State Resident' as the status value
  - pell_recipient column → Use 'Pell Recipient' as the status value
  - us_veteran column → Use 'US Veteran' as the status value
### Solution - Exercise 1
def binary_indicators() -> str:
### BEGIN SOLUTION
query = """
SELECT
'US Citizen' AS status
, SUM(CASE WHEN us_citizen = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
FROM student_main
GROUP BY 1
UNION ALL
SELECT
'US Resident' AS status
, SUM(CASE WHEN us_resident = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
FROM student_main
GROUP BY 1
UNION ALL
SELECT
'State Resident' AS status
, SUM(CASE WHEN state_resident = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
FROM student_main
GROUP BY 1
UNION ALL
SELECT
'Pell Recipient' AS status
, SUM(CASE WHEN pell_recipient = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
FROM student_main
GROUP BY 1
UNION ALL
SELECT
'US Veteran' AS status
, SUM(CASE WHEN us_veteran = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
FROM student_main
GROUP BY 1
"""
return query
### END SOLUTION
### Demo function call
demo_query_binary_indicators = binary_indicators()
demo_result_binary_indicators = pd.read_sql(demo_query_binary_indicators, conn)
demo_result_binary_indicators
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_binary_indicators_TRUE = utils.load_object_from_publicdata('demo_result_binary_indicators_TRUE')
The demo should display this output.
status | percentage | |
---|---|---|
0 | US Citizen | 88.6 |
1 | US Resident | 82.4 |
2 | State Resident | 81.2 |
3 | Pell Recipient | 25.6 |
4 | US Veteran | 11.2 |
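If your percentages look off, a rough pandas cross-check can help. The sketch below recomputes one of them directly; it assumes positive values are stored as 'Y', as in the solution above.

# Cross-check (sketch): recompute the 'US Citizen' percentage in pandas
main_df = pd.read_sql("SELECT us_citizen FROM student_main", conn)
print((main_df['us_citizen'] == 'Y').mean() * 100)   # should match the 88.6 shown above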
The cell below will test your solution for binary_indicators (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 1
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(binary_indicators),
ex_name='binary_indicators',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to binary_indicators did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(binary_indicators),
ex_name='binary_indicators',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to binary_indicators did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
scholarship_payments
Your task: define scholarship_payments
as follows:
Let's continue exploring the scholarship data. Write a SQL query to calculate the total amount paid out by each scholarship.
Inputs:
Return: A SQL query string whose result has two columns:
- scholarship_code: The 3-letter scholarship identifier
- total_payment: The sum of all scholarship_payment amounts made by that scholarship
The scholarship_code column should be first and the total_payment column second; the order of the rows is not important.
Requirements:
- Use the student_scholarship table.
### Solution - Exercise 2
def scholarship_payments() -> str:
### BEGIN SOLUTION
query = '''
SELECT
ss.scholarship_code
, SUM(ss.scholarship_payment) AS total_payment
FROM student_scholarship ss
GROUP BY ss.scholarship_code
'''
return query
### END SOLUTION
### Demo function call
demo_query_scholarship_payments = scholarship_payments()
demo_result_scholarship_payments = pd.read_sql(demo_query_scholarship_payments, conn)
demo_result_scholarship_payments
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_scholarship_payments_TRUE = utils.load_object_from_publicdata('demo_result_scholarship_payments_TRUE')
The demo should display this output.
scholarship_code | total_payment | |
---|---|---|
0 | ALE | 13738 |
1 | BCI | 51788 |
2 | FEG | 26333 |
3 | FGA | 128097 |
4 | FRS | 12500 |
5 | GAM | 71019 |
6 | GVS | 532829 |
7 | KCJ | 15708 |
8 | PRE | 1005649 |
9 | USE | 536667 |
10 | WIM | 6000 |
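The same totals can be reproduced in pandas as a sanity check; this sketch mirrors the SQL GROUP BY with a pandas groupby:

# Cross-check (sketch): total scholarship payments per scholarship_code in pandas
ss_df = pd.read_sql("SELECT scholarship_code, scholarship_payment FROM student_scholarship", conn)
print(ss_df.groupby('scholarship_code')['scholarship_payment'].sum())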
The cell below will test your solution for scholarship_payments (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 2
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(scholarship_payments),
ex_name='scholarship_payments',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to scholarship_payments did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(scholarship_payments),
ex_name='scholarship_payments',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to scholarship_payments did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
avg_gpa_by_scholarship
Your task: define avg_gpa_by_scholarship
as follows:
Write a SQL query to calculate the average semester GPA for students grouped by scholarship code.
Inputs:
Return: A SQL query string whose result has two columns:
- scholarship_code: The 3-letter scholarship identifier, or 'NONE' for students without scholarships
- avg_gpa: The average semester GPA for students with that scholarship, rounded to 2 decimal places
Requirements:
- Average the semester_gpa column, using the alias avg_gpa.
- Use the student_enrollment, student_key, and student_scholarship tables.
- Join student_enrollment and student_key on student_id.
- Use the appropriate JOIN type when joining to student_scholarship so that students without scholarships are included.
- Students without a scholarship should have 'NONE' as their scholarship_code value.
- Sort the results by avg_gpa in descending order, with scholarship_code as a tiebreaker (secondary sort) in ascending order for consistent results.
Hint:
- The COALESCE or IFNULL functions can be used to solve this exercise.
### Solution - Exercise 3
def avg_gpa_by_scholarship() -> str:
### BEGIN SOLUTION
query = '''
SELECT
COALESCE(ss.scholarship_code, 'NONE') AS scholarship_code
, ROUND(AVG(se.semester_gpa), 2) AS avg_gpa
FROM student_enrollment se
JOIN student_key sk
ON se.student_id = sk.student_id
LEFT JOIN student_scholarship ss
ON sk.finance_id = ss.finance_id
GROUP BY ss.scholarship_code
ORDER BY avg_gpa DESC, scholarship_code ASC
'''
return query
### END SOLUTION
### Demo function call
demo_query_avg_gpa_by_scholarship = avg_gpa_by_scholarship()
demo_result_avg_gpa_by_scholarship = pd.read_sql(demo_query_avg_gpa_by_scholarship, conn)
demo_result_avg_gpa_by_scholarship
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_avg_gpa_by_scholarship_TRUE = utils.load_object_from_publicdata('demo_result_avg_gpa_by_scholarship_TRUE')
The demo should display this output.
scholarship_code | avg_gpa | |
---|---|---|
0 | WIM | 2.92 |
1 | ALE | 2.86 |
2 | FEG | 2.84 |
3 | FRS | 2.84 |
4 | FGA | 2.81 |
5 | USE | 2.81 |
6 | KCJ | 2.78 |
7 | GVS | 2.76 |
8 | BCI | 2.75 |
9 | PRE | 2.75 |
10 | NONE | 2.72 |
11 | GAM | 2.69 |
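For intuition about the LEFT JOIN and COALESCE combination, here is a rough pandas equivalent (the merge keys follow the foreign keys described in the schema above; column handling is simplified):

# Cross-check (sketch): left-join so students without scholarships keep a 'NONE' code
se = pd.read_sql("SELECT student_id, semester_gpa FROM student_enrollment", conn)
sk = pd.read_sql("SELECT student_id, finance_id FROM student_key", conn)
ss = pd.read_sql("SELECT finance_id, scholarship_code FROM student_scholarship", conn)
merged = se.merge(sk, on='student_id').merge(ss, on='finance_id', how='left')
merged['scholarship_code'] = merged['scholarship_code'].fillna('NONE')
print(merged.groupby('scholarship_code')['semester_gpa'].mean().round(2)
            .sort_values(ascending=False))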
The cell below will test your solution for avg_gpa_by_scholarship (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(avg_gpa_by_scholarship),
ex_name='avg_gpa_by_scholarship',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to avg_gpa_by_scholarship did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(avg_gpa_by_scholarship),
ex_name='avg_gpa_by_scholarship',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to avg_gpa_by_scholarship did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
active_scholarships
Your task: define active_scholarships
as follows:
Write a SQL query to generate a comprehensive report of all currently active scholarships and their eligibility rules.
Inputs:
Return: A SQL query string whose result has the following columns:
- scholarship_description: Description of the scholarship
- major_description: Description of the eligible major
- min_gpa: Minimum GPA requirement
- gender: Gender requirement (if applicable)
- pell_recipient: Pell recipient status requirement
- us_veteran: US veteran status requirement
- us_citizen: US citizenship requirement
- us_resident: US residency requirement
- state_resident: State residency requirement
- amount: Scholarship award amount
Requirements:
- Use the scholarship_rules, scholarship_crosswalk, and major_crosswalk tables.
- Include only rules where scholarship_active = 'Y'.
- Use the scholarship_description that matches the activation date in scholarship_rules.
- Use the most recent major_description based on activation_date for each major_code.
- If a scholarship has no major requirement, its major_description should be NULL.
Hints:
- Pay attention to your join types (scholarship_rules.major_code could be null).
### Solution - Exercise 4
def active_scholarships() -> str:
### BEGIN SOLUTION
query = '''
SELECT
sc.scholarship_description
, mc.major_description
, sr.min_gpa
, sr.gender
, sr.pell_recipient
, sr.us_veteran
, sr.us_citizen
, sr.us_resident
, sr.state_resident
, sr.amount
FROM scholarship_rules sr
JOIN scholarship_crosswalk sc
ON sr.scholarship_code = sc.scholarship_code
AND sr.activation_date = sc.activation_date
LEFT JOIN (
SELECT
major_code,
major_description
FROM major_crosswalk mc
WHERE activation_date = (
SELECT MAX(activation_date)
FROM major_crosswalk mc2
WHERE mc2.major_code = mc.major_code
)
) mc ON sr.major_code = mc.major_code
WHERE sr.scholarship_active = 'Y'
'''
return query
### END SOLUTION
### Demo function call
demo_query_active_scholarships = active_scholarships()
demo_result_active_scholarships = pd.read_sql(demo_query_active_scholarships, conn)
demo_result_active_scholarships
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_active_scholarships_TRUE = utils.load_object_from_publicdata('demo_result_active_scholarships_TRUE')
The demo should display this output.
scholarship_description | major_description | min_gpa | gender | pell_recipient | us_veteran | us_citizen | us_resident | state_resident | amount | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Georgia Veterans Fund | None | 3.00 | None | None | Y | Y | Y | Y | 2500 |
1 | University Student Excellence Award | None | 3.70 | None | None | None | None | None | Y | 2000 |
2 | Women in Mathematics Award | Mathematics | 3.50 | F | None | None | None | None | None | 3000 |
3 | Future Educators of Georgia Scholarship | Education | 3.50 | None | None | None | None | None | Y | 2500 |
4 | First Generation Achievement Scholarship | None | 3.50 | None | Y | None | None | None | Y | 1500 |
5 | Pell Recipients Excellence Fund | None | 3.00 | None | Y | None | Y | Y | None | 1500 |
6 | Chris Kinkade Criminal Justice Fund | Criminal Justice | 3.50 | None | None | None | Y | Y | Y | 2500 |
7 | Georgia Aquarium Marine Biology Fund | Marine Biology | 3.00 | None | None | None | None | None | Y | 2500 |
8 | My Brother's Keeper Fund | None | 3.50 | None | Y | None | N | N | N | 1000 |
9 | Audre Lorde English Studies Award | English | 3.75 | None | None | None | None | Y | None | 2500 |
10 | Bell Center for International Studies Scholarship | International Relations | 3.50 | None | None | None | Y | Y | None | 2500 |
11 | Feingold Institute for Religious Studies Award | Religious Studies | 3.75 | None | None | None | Y | Y | None | 2500 |
12 | Professor Richard Vuduc Analytics Dreamers Scholarship | Analytics | 3.75 | None | None | None | Y | Y | Y | 3000 |
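The trickiest requirement is picking the most recent major_description for each major_code. A small pandas sketch of that idea (it assumes activation_date values sort correctly as stored):

# Sketch: keep only the most recent description for each major_code
mc = pd.read_sql("SELECT major_code, major_description, activation_date FROM major_crosswalk", conn)
latest_mc = (mc.sort_values('activation_date')
               .drop_duplicates('major_code', keep='last'))
print(latest_mc.head())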
The cell below will test your solution for active_scholarships (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(active_scholarships),
ex_name='active_scholarships',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to active_scholarships did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(active_scholarships),
ex_name='active_scholarships',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to active_scholarships did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
grad_retention_inds
Your task: define grad_retention_inds
as follows:
Write a SQL query to calculate retention and graduation indicators for each student.
Inputs:
Return: A SQL query string whose result has the following columns:
- student_id: The unique identifier for each student
- retention_ind: Binary indicator (1 or 0) showing whether a student was retained
- graduation_ind: Binary indicator (1 or 0) showing whether a student has graduated
Requirements:
- Use the student_enrollment and graduation tables.
- A student is retained (retention_ind = 1) if they have an enrollment record for the term exactly one year after their first term.
- A student has graduated (graduation_ind = 1) if they have a grad_level of 'B' and a grad_status of 'A' in the graduation table.
Hints:
- Term strings can be cast to integers for arithmetic; for example, CAST('202001' AS INTEGER) would become an integer value of 202001.
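To see why casting helps, note that the first four digits of a term encode the year, so simple integer arithmetic moves a term forward by whole years. A tiny sketch (the specific term value is just an example):

# Sketch: term arithmetic after casting to an integer
first_term = int('202001')      # what CAST('202001' AS INTEGER) produces in SQL
print(first_term + 100)         # 202101 -> the same term, one year later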
### Solution - Exercise 5
def grad_retention_inds() -> str:
### BEGIN SOLUTION
query = '''
SELECT
ft.student_id
, MAX(CASE WHEN se.term = CAST(ft.first_term AS INTEGER) + 100 THEN 1 ELSE 0 END) AS retention_ind
, MAX(CASE WHEN g.grad_level = 'B' AND g.grad_status = 'A' THEN 1 ELSE 0 END) AS graduation_ind
FROM
(
SELECT
student_id
, MIN(term) AS first_term
FROM student_enrollment
GROUP BY student_id
) AS ft
LEFT JOIN student_enrollment se
ON ft.student_id = se.student_id
LEFT JOIN graduation g
ON ft.student_id = g.student_id
GROUP BY ft.student_id
'''
return query
### END SOLUTION
### Demo function call
demo_query_grad_retention_inds = grad_retention_inds()
demo_result_grad_retention_inds = pd.read_sql(demo_query_grad_retention_inds, conn)
demo_result_grad_retention_inds.head(10)
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_grad_retention_inds_TRUE = utils.load_object_from_publicdata('demo_result_grad_retention_inds_TRUE')
The demo should display this output.
student_id | retention_ind | graduation_ind | |
---|---|---|---|
0 | 100052749 | 0 | 0 |
1 | 100636982 | 0 | 0 |
2 | 102689054 | 0 | 0 |
3 | 103503531 | 0 | 0 |
4 | 103972552 | 0 | 0 |
5 | 103975474 | 1 | 0 |
6 | 107116054 | 0 | 1 |
7 | 107372067 | 0 | 0 |
8 | 108283090 | 1 | 0 |
9 | 113524474 | 0 | 0 |
The cell below will test your solution for grad_retention_inds (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 5
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(grad_retention_inds),
ex_name='grad_retention_inds',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to grad_retention_inds did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(grad_retention_inds),
ex_name='grad_retention_inds',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to grad_retention_inds did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
In this section, you will shift your focus from exploring and compiling data to deidentifying the student records. The consolidated dataset has been built and loaded for you, as part of the notebook starter code above. It is available in the environment as df_consolidated_data.
Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.
df_consolidated_data = utils.load_object_from_publicdata('df_consolidated_data.dill')
rename_recpt_columns
Your task: define rename_recpt_columns
as follows:
Standardize the naming convention for scholarship recipient columns in a DataFrame.
Inputs:
- df: A DataFrame containing student records with columns that include the substring 'recpt'
Return:
Requirements:
Example:
Given a DataFrame with columns containing 'recpt' in their names:
['student_id', ..., 'b_recpt', 'a_recpt', ...]
The output DataFrame should have the following columns:
['student_id', ..., 'scholarship_2_recpt', 'scholarship_1_recpt', ...]
Note: In this example, 'a_recpt' becomes 'scholarship_1_recpt', 'b_recpt' becomes 'scholarship_2_recpt', based on alphabetical sorting, but the original order is preserved.
Hint:
- The enumerate() function can be used to generate numbers for the new column names. See the documentation for details.
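A minimal sketch of that enumerate() pattern, applied to the two-column example above (the names here are purely illustrative):

# Sketch: build numbered replacement names from an alphabetically sorted list
recpt_cols = sorted(['b_recpt', 'a_recpt'])
new_names = {old: f'scholarship_{i + 1}_recpt' for i, old in enumerate(recpt_cols)}
print(new_names)   # {'a_recpt': 'scholarship_1_recpt', 'b_recpt': 'scholarship_2_recpt'}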
### Solution - Exercise 6
def rename_recpt_columns(df: pd.DataFrame) -> pd.DataFrame:
### BEGIN SOLUTION
recpt_columns = [col for col in df.columns if 'recpt' in col]
sorted_recpt_columns = sorted(recpt_columns)
new_column_names = {old_name: f'scholarship_{i+1}_recpt' for i, old_name in enumerate(sorted_recpt_columns)}
renamed_df = df.rename(columns=new_column_names)
return renamed_df
### END SOLUTION
### Demo function call
renamed_df = rename_recpt_columns(df_consolidated_data)
print(renamed_df.columns)
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_rename_recpt_columns_TRUE = utils.load_object_from_publicdata('demo_result_rename_recpt_columns_TRUE')
The demo should display this printed output.
Index(['student_id', 'term', 'major_code', 'semester_hours_attempted',
'semester_hours_earned', 'semester_gpa', 'cumulative_hours_earned',
'cumulative_gpa', 'gender', 'ethnicity', 'state', 'us_citizen',
'us_resident', 'state_resident', 'pell_recipient', 'us_veteran',
'scholarship_7_recpt', 'scholarship_12_recpt', 'scholarship_13_recpt',
'scholarship_3_recpt', 'scholarship_4_recpt', 'scholarship_11_recpt',
'scholarship_8_recpt', 'scholarship_6_recpt', 'scholarship_9_recpt',
'scholarship_1_recpt', 'scholarship_2_recpt', 'scholarship_5_recpt',
'scholarship_10_recpt', 'retention_ind', 'graduation_ind'],
dtype='object')
The cell below will test your solution for rename_recpt_columns (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rename_recpt_columns),
ex_name='rename_recpt_columns',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to rename_recpt_columns did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rename_recpt_columns),
ex_name='rename_recpt_columns',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to rename_recpt_columns did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.
renamed_df = utils.load_object_from_publicdata('renamed_df.dill')
adjust_term
Your task: define adjust_term
as follows:
Anonymize student data by shifting the year component of all term values by a specified number of years.
Inputs:
- df: A DataFrame containing student records with a 'term' column
- n_years: An integer specifying how many years to shift (can be positive or negative)
Return:
Requirements:
Example:
Given an example df and n_years = 1:
Input:
student_id | term | |
---|---|---|
0 | 1 | 201801 |
1 | 2 | 201805 |
Output:
student_id | term | |
---|---|---|
0 | 1 | 201901 |
1 | 2 | 201905 |
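A quick sketch of the shift itself on a toy frame matching the example (handling of dtypes beyond this toy case is up to your solution):

# Sketch: shift the year portion of a term string (year followed by a term code) by n_years
toy = pd.DataFrame({'student_id': [1, 2], 'term': ['201801', '201805']})
toy['term'] = (toy['term'].astype(int) + 1 * 100).astype(str)   # n_years = 1
print(toy)   # terms become 201901 and 201905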
### Solution - Exercise 7
def adjust_term(df: pd.DataFrame, n_years: int) -> pd.DataFrame:
### BEGIN SOLUTION
new_term_df = df.copy()
new_term_df['term'] = new_term_df['term'].astype(int) + (n_years * 100)
new_term_df['term'] = new_term_df['term'].astype(str)
return new_term_df
### END SOLUTION
### Demo function call
demo_result_adjust_term = adjust_term(renamed_df, 2)
demo_result_adjust_term[["student_id", "term"]].head(10)
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_adjust_term_TRUE = utils.load_object_from_publicdata('demo_result_adjust_term_TRUE')
The demo should display this output.
student_id | term | |
---|---|---|
0 | 100052749 | 201901 |
1 | 100052749 | 201905 |
2 | 100052749 | 201908 |
3 | 100636982 | 201701 |
4 | 100636982 | 201708 |
5 | 100636982 | 201901 |
6 | 100636982 | 201905 |
7 | 100636982 | 201908 |
8 | 100636982 | 202201 |
9 | 100636982 | 202205 |
The cell below will test your solution for adjust_term (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(adjust_term),
ex_name='adjust_term',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to adjust_term did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(adjust_term),
ex_name='adjust_term',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to adjust_term did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.
new_term_df = utils.load_object_from_publicdata('new_term_df.dill')
generate_fake_ids
Your task: define generate_fake_ids
as follows:
Create anonymized student identifiers to protect student privacy in the dataset.
Inputs:
- df: A DataFrame containing student records with a student_id column
Return:
- A DataFrame in which the student_id column has been replaced with an anon_id column containing anonymized identifiers
Requirements:
- Use the generate_md5_hash helper function to create anonymized IDs.
Note: The generate_md5_hash function is already implemented and available for you to use.
We have defined a helper function, generate_md5_hash, as follows:
Outputs an MD5 hash for a given original_id input.
### Helper Function
def generate_md5_hash(original_id) -> str:
return hashlib.md5(str(original_id).encode()).hexdigest()
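A couple of properties worth noting before you use it: the hash is deterministic, and the input is converted to a string first, so an integer ID and its string form map to the same value. For example:

# Sketch: generate_md5_hash is deterministic and stringifies its input
print(generate_md5_hash(12345))                                   # a 32-character hex digest
print(generate_md5_hash(12345) == generate_md5_hash('12345'))     # True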
### Solution - Exercise 8
def generate_fake_ids(df: pd.DataFrame) -> pd.DataFrame:
### BEGIN SOLUTION
student_ids_df = df.copy()
if 'student_id' not in student_ids_df.columns:
return student_ids_df
student_ids_df['student_id'] = student_ids_df['student_id'].astype(str)
student_ids_df['student_id'] = student_ids_df['student_id'].apply(generate_md5_hash)
student_ids_df = student_ids_df.rename(columns={'student_id': 'anon_id'})
return student_ids_df
### END SOLUTION
### Demo function call
demo_result_generate_fake_ids = generate_fake_ids(new_term_df)
demo_result_generate_fake_ids[["anon_id"]].head(10)
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_generate_fake_ids_TRUE = utils.load_object_from_publicdata('demo_result_generate_fake_ids_TRUE')
The demo should display this output.
anon_id | |
---|---|
0 | 9ee43c210f34e3f962279d84d72633d6 |
1 | 9ee43c210f34e3f962279d84d72633d6 |
2 | 9ee43c210f34e3f962279d84d72633d6 |
3 | 06dff90a4976f6fa8c9abc1f698e27ee |
4 | 06dff90a4976f6fa8c9abc1f698e27ee |
5 | 06dff90a4976f6fa8c9abc1f698e27ee |
6 | 06dff90a4976f6fa8c9abc1f698e27ee |
7 | 06dff90a4976f6fa8c9abc1f698e27ee |
8 | 06dff90a4976f6fa8c9abc1f698e27ee |
9 | 06dff90a4976f6fa8c9abc1f698e27ee |
The cell below will test your solution for generate_fake_ids (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(generate_fake_ids),
ex_name='generate_fake_ids',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to generate_fake_ids did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(generate_fake_ids),
ex_name='generate_fake_ids',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to generate_fake_ids did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.
student_ids_df = utils.load_object_from_publicdata('student_ids_df.dill')
deidentify_columns
Your task: define deidentify_columns
as follows:
Standardize and deidentify values in categorical columns by replacing original values with generic identifiers.
Inputs:
- df: A DataFrame containing student records
- columns: A list of column names whose values should be deidentified
Return:
Requirements:
- Replace the values in each listed column with generic identifiers of the form column_name_X, where:
  - column_name is the original column name
  - X is a number starting at 1, corresponding to the position in the sorted list
Example:
Given a DataFrame with a gender column containing values F and M:
Before:
student_id gender age
0 1 F 21
1 2 M 19
2 3 F 22
After deidentify_columns(df, ['gender'])
:
student_id gender age
0 1 gender_1 21
1 2 gender_2 19
2 3 gender_1 22
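A minimal sketch of how that value-to-identifier mapping might be built for the gender column in this example (variable names are illustrative; the hint below points at enumerate()):

# Sketch: number the sorted unique values and map them to generic identifiers
unique_values = sorted(['F', 'M'])
mapping = {v: f'gender_{i + 1}' for i, v in enumerate(unique_values)}
print(mapping)   # {'F': 'gender_1', 'M': 'gender_2'}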
Hint:
- The enumerate() function can be used to generate numbers for the new identifiers. See the documentation for more information: Python enumerate()
### Solution - Exercise 9
def deidentify_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
### BEGIN SOLUTION
deidentified_df = df.copy()
for column in columns:
unique_values = sorted(deidentified_df[column].unique())
value_map = {value: f"{column}_{i+1}" for i, value in enumerate(unique_values)}
deidentified_df[column] = deidentified_df[column].map(value_map)
return deidentified_df
### END SOLUTION
### Demo function call
target_columns = ['major_code', 'gender', 'ethnicity', 'state']
demo_result_deidentify_columns = deidentify_columns(student_ids_df, target_columns)
demo_result_deidentify_columns[target_columns].head(10)
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_deidentify_columns_TRUE = utils.load_object_from_publicdata('demo_result_deidentify_columns_TRUE')
The demo should display this output.
major_code | gender | ethnicity | state | |
---|---|---|---|---|
0 | major_code_47 | gender_1 | ethnicity_5 | state_25 |
1 | major_code_47 | gender_1 | ethnicity_5 | state_25 |
2 | major_code_47 | gender_1 | ethnicity_5 | state_25 |
3 | major_code_18 | gender_1 | ethnicity_2 | state_58 |
4 | major_code_50 | gender_1 | ethnicity_2 | state_58 |
5 | major_code_18 | gender_1 | ethnicity_2 | state_58 |
6 | major_code_18 | gender_1 | ethnicity_2 | state_58 |
7 | major_code_18 | gender_1 | ethnicity_2 | state_58 |
8 | major_code_18 | gender_1 | ethnicity_2 | state_58 |
9 | major_code_18 | gender_1 | ethnicity_2 | state_58 |
The cell below will test your solution for deidentify_columns (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 9
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(deidentify_columns),
ex_name='deidentify_columns',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to deidentify_columns did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(deidentify_columns),
ex_name='deidentify_columns',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to deidentify_columns did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.
deidentified_df = utils.load_object_from_publicdata('deidentified_df.dill')
perturb_gpa
Your task: define perturb_gpa
as follows:
Apply data perturbation to GPA values to protect student privacy while preserving statistical properties.
Inputs:
- df: A DataFrame containing student records with 'semester_gpa' and scholarship columns ending with '_recpt'
- seed: An integer seed value for the random number generator (default: 6040)
Return:
Requirements:
- Use the provided random number generator rng to ensure reproducibility.
Steps:
1. Identify the students with no scholarships (all '_recpt' columns equal 0). Compute that group's mean semester_gpa and replace their GPAs with values drawn from a normal distribution centered on that mean (standard deviation 0.1).
2. For each '_recpt' column, in column order, identify the students who received only that scholarship. Compute that group's mean semester_gpa and replace their GPAs the same way.
Important Note on Random Number Generation:
The random number generator maintains its state between calls. This means the sequence of random numbers
depends on the order of operations. You must process the student groups in the exact order specified above
(no-scholarship students first, then single-scholarship students in column order) to ensure reproducible results.
Each call to rng.normal()
advances the generator's internal state.
Example:
If a group has a mean GPA of 3.5:
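The sketch below illustrates what that might look like (the group size is arbitrary; the 0.1 standard deviation matches the solution further down):

# Sketch: perturb a group whose mean GPA is 3.5, reproducibly
rng = np.random.default_rng(6040)
perturbed = rng.normal(loc=3.5, scale=0.1, size=5)
print(perturbed)   # five values clustered around 3.5
# The same seed and the same call order reproduce exactly the same draws
rng2 = np.random.default_rng(6040)
print(np.allclose(perturbed, rng2.normal(loc=3.5, scale=0.1, size=5)))   # True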
Hints:
- rng is already created for you using np.random.default_rng(seed)
- Use rng.normal(loc=mean, scale=standard_deviation, size=number_of_values) to generate perturbed values
### Solution - Exercise 10
def perturb_gpa(df: pd.DataFrame, seed: int = 6040) -> pd.DataFrame:
rng = np.random.default_rng(seed)
### BEGIN SOLUTION
if 'semester_gpa' not in df.columns:
return df
recpt_columns = [col for col in df.columns if col.endswith('_recpt')]
if not recpt_columns:
return df
result_df = df.copy()
zeros_mask = (result_df[recpt_columns] == 0).all(axis=1)
mean_zeros = result_df.loc[zeros_mask, 'semester_gpa'].mean()
num_zeros = zeros_mask.sum()
if num_zeros > 0:
perturb_zeros = rng.normal(mean_zeros, 0.1, size=num_zeros)
result_df.loc[zeros_mask, 'semester_gpa'] = perturb_zeros
for recpt_col in recpt_columns:
other_cols = [col for col in recpt_columns if col != recpt_col]
single_mask = (result_df[recpt_col] == 1) & (result_df[other_cols] == 0).all(axis=1)
mean_gpa = result_df.loc[single_mask, 'semester_gpa'].mean()
if single_mask.sum() == 0 or np.isnan(mean_gpa):
continue
num = single_mask.sum()
perturbations = rng.normal(mean_gpa, 0.1, size=num)
result_df.loc[single_mask, 'semester_gpa'] = perturbations
return result_df
### END SOLUTION
### Demo function call
demo_result_perturb_gpa = perturb_gpa(deidentified_df, seed=SEED)
filtered_demo_result_perturb_gpa = demo_result_perturb_gpa[['semester_gpa', 'scholarship_4_recpt', 'scholarship_12_recpt']]
no_scholarships = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 0) &
(filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 0)]
both_scholarships = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 1) &
(filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 1)]
only_scholarship1 = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 1) &
(filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 0)]
only_scholarship2 = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 0) &
(filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 1)]
sample = pd.concat([
no_scholarships.head(3), both_scholarships.head(3), only_scholarship1.head(2), only_scholarship2.head(2)
]).reset_index(drop=True)
sample
Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.
### Run Me!!!
demo_result_perturb_gpa_TRUE = utils.load_object_from_publicdata('demo_result_perturb_gpa_TRUE')
The demo should display this output.
semester_gpa | scholarship_4_recpt | scholarship_12_recpt | |
---|---|---|---|
0 | 2.867951 | 0 | 0 |
1 | 2.662993 | 0 | 0 |
2 | 2.854889 | 0 | 0 |
3 | 2.890000 | 1 | 1 |
4 | 2.920000 | 1 | 1 |
5 | 2.890000 | 1 | 1 |
6 | 3.061051 | 1 | 0 |
7 | 2.480000 | 1 | 0 |
8 | 3.060911 | 0 | 1 |
9 | 3.201331 | 0 | 1 |
The cell below will test your solution for perturb_gpa (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 10
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(perturb_gpa),
ex_name='perturb_gpa',
key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to perturb_gpa did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(perturb_gpa),
ex_name='perturb_gpa',
key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to perturb_gpa did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
Fin