Midterm 2, Spring 2025: Data Deidentification

Version 0.0.2

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of SQL, basic Python, pandas, and NumPy. It has 11 exercises, numbered 0 to 10, worth a total of 22 points. However, the threshold to earn 100% is 14 points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on the previous ones, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.

Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print or display them as needed. (Be careful when printing large objects; you may want to print just the head, or chunks of rows at a time.)

Exercise point breakdown:

  • Exercise 0: 1 point

  • Exercise 1: 2 points

  • Exercise 2: 1 point

  • Exercise 3: 2 points

  • Exercise 4: 3 points

  • Exercise 5: 3 points

  • Exercise 6: 2 points

  • Exercise 7: 1 point

  • Exercise 8: 2 points

  • Exercise 9: 2 points

  • Exercise 10: 3 points

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

The Problem

This exam mimics the process of exploring and extracting data from a university database for third-party research. The law in the United States requires your university to protect your personal information when working with outside organizations.

Your Overall Task

For this exam, you are given a mock university database, and you must "de-identify" it for sharing with outside researchers. Deidentification refers to hiding or removing details that could be used to associate data with specific people. Your overall task is to do the following:

  1. Explore the Database Schema
    Use SQL queries to explore the mock university database and understand its structure, including table names, columns, data types, and relationships between tables.

  2. Pull the Required Data
    Write SQL queries to retrieve the required data for analysis, ensuring the minimum necessary data is selected for the task.

  3. Deidentify the Data
    Use Python, Pandas, and NumPy to deidentify the data. This includes:

    • Masking: Obfuscating identifiers like student ID, ethnicity, location, and major.
    • Perturbing: Slightly modifying data values (e.g., GPA) to prevent reidentification while maintaining overall statistical properties.

By the end of this exam, you will have a deidentified dataset ready for third-party research.
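The two deidentification techniques above can be sketched with pandas and NumPy. This is a minimal illustration on made-up data; the column names, hash truncation, and noise scale here are assumptions for demonstration, not the exam's specification:

```python
import hashlib

import numpy as np
import pandas as pd

# Hypothetical records standing in for real student data.
df = pd.DataFrame({
    'student_id': ['903001', '903002', '903003'],
    'cumulative_gpa': [3.21, 2.87, 3.95],
})

# Masking: replace each identifier with a one-way hash so the
# original value cannot be read back from the shared data.
df['student_id'] = df['student_id'].apply(
    lambda s: hashlib.sha256(s.encode()).hexdigest()[:12])

# Perturbing: add small seeded noise to GPA, then clip to the
# valid range so aggregate statistics stay roughly intact.
rng = np.random.default_rng(6040)
df['cumulative_gpa'] = (df['cumulative_gpa']
                        + rng.normal(0, 0.05, len(df))).clip(0.0, 4.0)

print(df)
```

Seeding the generator (here with the exam's SEED value, 6040) makes the perturbation reproducible, which matters for grading and auditing.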

Run me!

Run the two cells below to load the global imports, database connection, and objects needed for the exam. In particular, take note of conn, a SQLite3 connection object. The exam will not run without these cells.

In [ ]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
import re 
import sqlite3 
import hashlib 
import pandas as pd
import numpy as np
from pprint import pprint
In [ ]:
# Load the database and objects
conn = sqlite3.connect('resource/asnlib/publicdata/university.db')

SEED = 6040

The Tables

The database has eight (8) tables whose overall organization and columns are depicted below. You do not need to understand all of these details right now, but you may find it helpful to refer back to them as you progress through the exam. Consider taking a screenshot, opening the picture in a new browser tab, using the Table of Contents feature to get back to this information later, or referring to the following PDF copy of the image.

Schema Diagram


  1. student_main
    Contains primary student information, including:

    • student_id: Unique identifier for each student (primary key).
    • last_name, first_name, middle_initial: Basic personal details.
    • email: Email address, which may be null for some records.
    • gender, ethnicity: Demographic details.
    • address: Residential address.
    • us_citizen, us_resident, state_resident, pell_recipient, us_veteran: Indicators for citizenship, residency, financial aid, and veteran status.
  2. student_enrollment
    Tracks student enrollment details, including:

    • id: Primary key for each enrollment record.
    • student_id: References student_main.student_id.
    • term: Academic term (e.g., "202308" for Fall 2023).
    • major_code: Code for the student's major.
    • semester_hours_attempted, semester_hours_earned: Academic credit details.
    • semester_gpa, cumulative_hours_earned, cumulative_gpa: Academic performance metrics.
  3. graduation
    Stores graduation records, including:

    • id: Primary key for each graduation record.
    • student_id: References student_main.student_id.
    • last_enroll_term, grad_term: Last enrollment and graduation terms.
    • grad_level: Graduation level (e.g., Bachelor, Master, Doctorate).
    • grad_status: Graduation status (e.g., active, completed).
  4. student_key
    Links students to financial and employee records:

    • student_id: Primary key, references student_main.student_id.
    • finance_id: Financial record identifier.
    • employee_id: Employee identifier (nullable).
  5. student_scholarship
    Records student scholarships and financial aid details, including:

    • id: Primary key for each scholarship record.
    • finance_id: References student_key.finance_id.
    • scholarship_term: Term during which the scholarship was awarded.
    • scholarship_code: Code identifying the scholarship, references scholarship_rules.scholarship_code.
    • scholarship_total, scholarship_payment, scholarship_refund: Financial details of the scholarship.
  6. scholarship_rules
    Defines eligibility rules for scholarships, including:

    • id: Primary key for each scholarship rule.
    • scholarship_crosswalk_id: References scholarship_crosswalk.id.
    • scholarship_code: Code identifying the scholarship.
    • major_code: Eligibility criteria based on major, references major_crosswalk.major_code.
    • activation_date: Date the rule became effective.
    • scholarship_active: Indicates if the scholarship is currently active.
    • min_gpa: Minimum GPA required to qualify.
    • gender, pell_recipient, us_veteran, us_citizen, us_resident, state_resident: Eligibility criteria for specific demographics.
    • amount: Scholarship amount awarded.
  7. scholarship_crosswalk
    Maps scholarships to descriptive details:

    • id: Primary key for each scholarship description.
    • scholarship_code: Code identifying the scholarship.
    • scholarship_description: Description of the scholarship.
    • activation_date: Date the description became effective.
  8. major_crosswalk
    Maps major codes to detailed descriptions:

    • id: Primary key for each major description.
    • major_code: Code for the major (e.g., "CS" for Computer Science).
    • major_description: Descriptive name of the major.
    • activation_date: Date the description became effective.

Review the helper functions below before tackling the exercises.

Helper Functions

We have provided the following functions to assist you in exploring the university database. Run each cell to load the functions into your environment and view a demonstration of their usage. You are encouraged to use these functions to explore the database, understand its structure, and develop solutions to the exercises.

In [ ]:
def get_table_list(conn):
    """
    Retrieves the list of tables from the SQLite database.

    Args:
        conn: SQLite database connection object
        
    Returns:
        list: A list of table names in the database
    """
    # Query to get all tables from sqlite_master
    tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
    tables_df = pd.read_sql_query(tables_query, conn)

    # Return the list of table names
    return tables_df['name'].tolist()

get_table_list_demo = get_table_list(conn)
for table in get_table_list_demo:
    print(table)
In [ ]:
def get_column_details(conn, table_name):
    """
    Retrieves column details for a specific table.

    Args:
        conn: SQLite database connection object
        table_name (str): Name of the table to get column details for
        
    Returns:
        DataFrame: A DataFrame containing column details with columns:
            - column_id: Column ID (0-based index)
            - column_name: Column name
            - data_type: Data type of the column
            - not_null: Whether column allows NULL values (1=NOT NULL, 0=NULL allowed)
            - primary_key: Whether column is a primary key (1=PRIMARY KEY, 0=not a primary key)
            - table_name: Name of the table
            
    Raises:
        AssertionError: If the provided table_name does not exist in the database
    """
    # Get the list of tables in the database and verify table_name exists
    tables = get_table_list(conn)
    assert table_name in tables, f"Table '{table_name}' does not exist in the database"

    # Use PRAGMA table_info to get column details
    pragma_query = f"PRAGMA table_info('{table_name}')"
    column_details = pd.read_sql_query(pragma_query, conn)

    # Add table name to the DataFrame
    column_details['table_name'] = table_name

    # Remove unnecessary columns
    column_details = column_details[['cid', 'name', 'type', 'notnull', 'pk', 'table_name']]

    # Rename columns for clarity
    column_details = column_details.rename(columns={
        'cid': 'column_id',
        'name': 'column_name',
        'type': 'data_type',
        'notnull': 'not_null',
        'pk': 'primary_key'
    })

    return column_details

get_column_details_demo_df = get_column_details(conn, 'student_main')
display(get_column_details_demo_df)
In [ ]:
def get_foreign_key_details(conn, table_name):
    """
    Retrieves foreign key details for a specific table.

    Args:
        conn: SQLite database connection object
        table_name (str): Name of the table to get foreign key details for
        
    Returns:
        DataFrame: A DataFrame containing foreign key details with columns:
            - column_name: Column name in the current table
            - references_table: Referenced table name
            - references_column: Referenced column name
            - table_name: Name of the table
            
    Raises:
        AssertionError: If the provided table_name does not exist in the database
    """
    # Get the list of tables in the database and verify table_name exists
    tables = get_table_list(conn)
    assert table_name in tables, f"Table '{table_name}' does not exist in the database"

    # Use PRAGMA foreign_key_list to get foreign key details
    pragma_query = f"PRAGMA foreign_key_list('{table_name}')"
    fk_details = pd.read_sql_query(pragma_query, conn)

    # If the DataFrame is empty, return an empty DataFrame with the required columns
    if fk_details.empty:
        empty_df = pd.DataFrame(columns=['column_name', 'references_table', 'references_column', 'table_name'])
        empty_df['table_name'] = table_name
        return empty_df

    # Add table name to the DataFrame
    fk_details['table_name'] = table_name

    # Keep only relevant columns and rename them for clarity
    fk_details = fk_details[['table_name', 'from', 'table', 'to']]
    fk_details.rename(columns={
        'from': 'column_name',
        'table': 'references_table', 
        'to': 'references_column'
    }, inplace=True)

    return fk_details

get_foreign_key_details_demo_df = get_foreign_key_details(conn, 'student_key')
display(get_foreign_key_details_demo_df)
In [ ]:
def get_table_schema(conn, table_name):
    """
    Combines column details and foreign key details for a specific table.

    Args:
        conn: SQLite database connection object
        table_name (str): Name of the table to get the schema for
        
    Returns:
        DataFrame: A DataFrame containing complete schema information for the table
        
    Raises:
        AssertionError: If the provided table_name does not exist in the database
    """
    # Get the list of tables in the database and verify table_name exists
    tables = get_table_list(conn)
    assert table_name in tables, f"Table '{table_name}' does not exist in the database"
    
    # Get column details
    column_details = get_column_details(conn, table_name)

    # Get foreign key details
    fk_details = get_foreign_key_details(conn, table_name)
    
    # Drop the table_name column from foreign key details before merging
    if not fk_details.empty:
        fk_details.drop(columns=['table_name'], inplace=True)

    # Merge only the foreign key columns we need, excluding table_name
    table_schema = column_details.merge(
        fk_details[['column_name', 'references_table', 'references_column']], 
        how='left',
        on='column_name'
    )
    
    return table_schema

get_table_schema_demo_df = get_table_schema(conn, 'student_main')
display(get_table_schema_demo_df)

In the SQL section of the exam, you will use the student_scholarship table several times. Run the FREE exercise cell below to examine the table, understand the structure, and get your free point!

Exercise 0: (1 point)

get_student_scholarship_schema__FREE

Example: we have defined get_student_scholarship_schema__FREE as follows:

This is a free exercise!

**Please run the test cell below to collect your FREE point**
In [ ]:
### Solution - Exercise 0  
def get_student_scholarship_schema__FREE(conn):
    # Get schema information specifically for the student_scholarship table
    student_scholarship_schema = get_table_schema(conn, 'student_scholarship')
    
    # Return the schema information
    return student_scholarship_schema

### Demo function call
student_scholarship_schema = get_student_scholarship_schema__FREE(conn)
student_scholarship_schema
In [ ]:
### Run Me!!!
demo_result_get_student_scholarship_schema__FREE_TRUE = utils.load_object_from_publicdata('demo_result_get_student_scholarship_schema__FREE_TRUE')

The demo should display this output.

column_id column_name data_type not_null primary_key table_name references_table references_column
0 0 id INTEGER 1 1 student_scholarship NaN NaN
1 1 finance_id VARCHAR(12) 0 0 student_scholarship student_key finance_id
2 2 scholarship_term VARCHAR(6) 0 0 student_scholarship NaN NaN
3 3 scholarship_code VARCHAR 0 0 student_scholarship scholarship_rules scholarship_code
4 4 scholarship_total INTEGER 0 0 student_scholarship NaN NaN
5 5 scholarship_payment INTEGER 0 0 student_scholarship NaN NaN
6 6 scholarship_refund INTEGER 0 0 student_scholarship NaN NaN


The test cell below will always pass. Please submit to collect your free point for get_student_scholarship_schema__FREE (exercise 0).

In [ ]:
### Test Cell - Exercise 0  


print('Passed! Please submit.')

Part 1: Exploring and Compiling Data with SQL

In this section, you will use SQL queries to explore the mock university database and compile the required data for analysis. The connection to the database has been made for you, as part of the notebook starter code above, and is available in the environment as conn.

Exercise 1: (2 points)

binary_indicators

Your task: define binary_indicators as follows:

To begin, it would be helpful to get a sense of the data in the tables. Start by writing a SQL query to calculate the percentage of students with positive values for binary indicators.

Inputs:

  • None

Return:

  • A Python string containing a SQLite query that produces a table with 2 columns:
    • status: The name of the binary indicator (column name)
    • percentage: The percentage of students with a positive value for that indicator (between 0 and 100)
    • The status column should be first and percentage column second; the order of the rows is not important.

Requirements:

  • Query the student_main table
  • Calculate percentages for these binary indicators and use the following text values in the status column:
    • For us_citizen column → Use 'US Citizen' as the status value
    • For us_resident column → Use 'US Resident' as the status value
    • For state_resident column → Use 'State Resident' as the status value
    • For pell_recipient column → Use 'Pell Recipient' as the status value
    • For us_veteran column → Use 'US Veteran' as the status value
  • A positive value is indicated by 'Y' in the corresponding column

Hints:

  • Use UNION ALL to combine individual queries for each indicator. See the following SQLite Union Documentation for an example.
  • Use SUM with CASE statements to calculate percentages of 'Y' values
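The hinted pattern can be seen on a toy example. The in-memory table and its columns below are invented purely for demonstration; the exam's real query runs against student_main:

```python
import sqlite3

import pandas as pd

# Build a tiny throwaway table with two Y/N indicator columns.
demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE t (flag_a TEXT, flag_b TEXT)")
demo.executemany("INSERT INTO t VALUES (?, ?)",
                 [('Y', 'N'), ('Y', 'Y'), ('N', 'N'), ('Y', 'N')])

# One SELECT per indicator, stitched together with UNION ALL.
# SUM(CASE ...) counts the 'Y' rows; dividing by COUNT(*) and
# multiplying by 100.0 yields a percentage (the .0 forces floats).
query = """
SELECT 'Flag A' AS status,
       SUM(CASE WHEN flag_a = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
FROM t
UNION ALL
SELECT 'Flag B',
       SUM(CASE WHEN flag_b = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)
FROM t
"""
print(pd.read_sql(query, demo))  # Flag A -> 75.0, Flag B -> 25.0
```

Note that the column names of a UNION ALL come from its first SELECT, so only the first branch needs aliases.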
In [ ]:
### Solution - Exercise 1  
def binary_indicators() -> str:
    ### BEGIN SOLUTION    
    query = """
      SELECT
          'US Citizen'                                                             AS status
        , SUM(CASE WHEN us_citizen = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)     AS percentage
      FROM student_main
      GROUP BY 1
    UNION ALL
      SELECT
          'US Resident'                                                            AS status
        , SUM(CASE WHEN us_resident = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)    AS percentage
      FROM student_main
      GROUP BY 1
    UNION ALL
      SELECT
          'State Resident'                                                         AS status
        , SUM(CASE WHEN state_resident = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
      FROM student_main
      GROUP BY 1
    UNION ALL
      SELECT
          'Pell Recipient'                                                         AS status
        , SUM(CASE WHEN pell_recipient = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
      FROM student_main
      GROUP BY 1
    UNION ALL
      SELECT
          'US Veteran'                                                         AS status
        , SUM(CASE WHEN us_veteran = 'Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS percentage
      FROM student_main
      GROUP BY 1
    """

    return query
    ### END SOLUTION

### Demo function call
demo_query_binary_indicators = binary_indicators()
demo_result_binary_indicators = pd.read_sql(demo_query_binary_indicators, conn)
demo_result_binary_indicators

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_binary_indicators_TRUE = utils.load_object_from_publicdata('demo_result_binary_indicators_TRUE')

The demo should display this output.

status percentage
0 US Citizen 88.6
1 US Resident 82.4
2 State Resident 81.2
3 Pell Recipient 25.6
4 US Veteran 11.2


The cell below will test your solution for binary_indicators (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 1  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(binary_indicators),
              ex_name='binary_indicators',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to binary_indicators did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(binary_indicators),
              ex_name='binary_indicators',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to binary_indicators did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 2: (1 point)

scholarship_payments

Your task: define scholarship_payments as follows:

Let's continue exploring the scholarship data. Write a SQL query to calculate the total amount paid out by each scholarship.

Inputs:

  • None

Return:

  • A Python string containing a SQLite query that produces a table with 2 columns:
    • scholarship_code: The 3-letter scholarship identifier
    • total_payment: The sum of all scholarship_payment amounts made by that scholarship
    • The scholarship_code column should be first and total_payment column second; the order of the rows is not important.

Requirements:

  • Query the student_scholarship table
In [ ]:
### Solution - Exercise 2  
def scholarship_payments() -> str:
    ### BEGIN SOLUTION
    query = '''
    SELECT
        ss.scholarship_code
      , SUM(ss.scholarship_payment) AS total_payment
    FROM student_scholarship ss
    GROUP BY ss.scholarship_code
    '''
    
    return query
    ### END SOLUTION

### Demo function call
demo_query_scholarship_payments = scholarship_payments()
demo_result_scholarship_payments = pd.read_sql(demo_query_scholarship_payments, conn)
demo_result_scholarship_payments

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_scholarship_payments_TRUE = utils.load_object_from_publicdata('demo_result_scholarship_payments_TRUE')

The demo should display this output.

scholarship_code total_payment
0 ALE 13738
1 BCI 51788
2 FEG 26333
3 FGA 128097
4 FRS 12500
5 GAM 71019
6 GVS 532829
7 KCJ 15708
8 PRE 1005649
9 USE 536667
10 WIM 6000


The cell below will test your solution for scholarship_payments (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 2  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(scholarship_payments),
              ex_name='scholarship_payments',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to scholarship_payments did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(scholarship_payments),
              ex_name='scholarship_payments',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to scholarship_payments did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 3: (2 points)

avg_gpa_by_scholarship

Your task: define avg_gpa_by_scholarship as follows:

Write a SQL query to calculate the average semester GPA for students grouped by scholarship code.

Inputs:

  • None

Return:

  • A Python string containing a SQLite query that produces a table with 2 columns:
    • scholarship_code: The 3-letter scholarship identifier or 'NONE' for students without scholarships
    • avg_gpa: The average semester GPA for students with that scholarship, rounded to 2 decimal places
    • The results must be ordered by avg_gpa first and scholarship_code second.

Requirements:

  • Calculate the average of the semester_gpa with the alias avg_gpa
  • Use the following tables:
    • student_enrollment
    • student_key
    • student_scholarship
  • Only include students that exist in both student_enrollment and student_key
  • Use the correct JOIN type when joining to student_scholarship so that students without scholarships are included
  • For students without a scholarship, use 'NONE' as their scholarship_code value
  • Sort results in descending order by avg_gpa
  • Use scholarship_code as a tiebreaker (secondary sort) in ascending order for consistent results
  • Round the average GPA to 2 decimal places

Hint:

  • The COALESCE or IFNULL functions can be used to solve this exercise.
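As an illustration of the hint, the toy example below (made-up tables and column names) shows why a LEFT JOIN keeps the unmatched students and how COALESCE substitutes 'NONE' for their missing code:

```python
import sqlite3

import pandas as pd

# Two tiny throwaway tables: every student enrolls, only one has an award.
demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE enroll (student_id TEXT, gpa REAL)")
demo.execute("CREATE TABLE award (student_id TEXT, code TEXT)")
demo.executemany("INSERT INTO enroll VALUES (?, ?)",
                 [('s1', 3.5), ('s2', 2.8), ('s3', 3.9)])
demo.execute("INSERT INTO award VALUES ('s1', 'ABC')")

# LEFT JOIN keeps s2 and s3 even though they have no award row;
# their a.code is NULL, which COALESCE replaces with 'NONE'.
query = """
SELECT e.student_id,
       COALESCE(a.code, 'NONE') AS code
FROM enroll e
LEFT JOIN award a ON e.student_id = a.student_id
"""
print(pd.read_sql(query, demo))  # s2 and s3 show code 'NONE'
```

An INNER JOIN here would silently drop s2 and s3, which is exactly the mistake the exercise's "correct JOIN type" requirement guards against.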
In [ ]:
### Solution - Exercise 3  
def avg_gpa_by_scholarship() -> str:
    ### BEGIN SOLUTION    
    query = '''
    SELECT 
        COALESCE(ss.scholarship_code, 'NONE') AS scholarship_code
      , ROUND(AVG(se.semester_gpa), 2) AS avg_gpa
    FROM student_enrollment se
    JOIN student_key sk 
      ON se.student_id = sk.student_id
    LEFT JOIN student_scholarship ss 
      ON sk.finance_id = ss.finance_id
    GROUP BY ss.scholarship_code
    ORDER BY avg_gpa DESC, scholarship_code ASC
    '''

    return query
    ### END SOLUTION

### Demo function call
demo_query_avg_gpa_by_scholarship = avg_gpa_by_scholarship()
demo_result_avg_gpa_by_scholarship = pd.read_sql(demo_query_avg_gpa_by_scholarship, conn)
demo_result_avg_gpa_by_scholarship

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_avg_gpa_by_scholarship_TRUE = utils.load_object_from_publicdata('demo_result_avg_gpa_by_scholarship_TRUE')

The demo should display this output.

scholarship_code avg_gpa
0 WIM 2.92
1 ALE 2.86
2 FEG 2.84
3 FRS 2.84
4 FGA 2.81
5 USE 2.81
6 KCJ 2.78
7 GVS 2.76
8 BCI 2.75
9 PRE 2.75
10 NONE 2.72
11 GAM 2.69


The cell below will test your solution for avg_gpa_by_scholarship (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 3  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(avg_gpa_by_scholarship),
              ex_name='avg_gpa_by_scholarship',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to avg_gpa_by_scholarship did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(avg_gpa_by_scholarship),
              ex_name='avg_gpa_by_scholarship',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to avg_gpa_by_scholarship did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 4: (3 points)

active_scholarships

Your task: define active_scholarships as follows:

Write a SQL query to generate a comprehensive report of all currently active scholarships and their eligibility rules.

Inputs:

  • None

Return:

  • A Python string containing a SQLite query that produces a table with the following columns:
    • scholarship_description: Description of the scholarship
    • major_description: Description of the eligible major
    • min_gpa: Minimum GPA requirement
    • gender: Gender requirement (if applicable)
    • pell_recipient: Pell recipient status requirement
    • us_veteran: US veteran status requirement
    • us_citizen: US citizenship requirement
    • us_resident: US residency requirement
    • state_resident: State residency requirement
    • amount: Scholarship award amount
    • The columns must be ordered as listed above; the order of rows is not important.

Requirements:

  • Query the following tables:
    • scholarship_rules
    • scholarship_crosswalk
    • major_crosswalk
  • Include only scholarships where the scholarship is active, i.e. scholarship_active = 'Y'
  • Use the current scholarship_description that matches the activation date in scholarship_rules
  • For major descriptions:
    • Use the most recent major_description based on activation_date for each major_code
    • If a scholarship is not limited to a specific major, major_description should be NULL

Hints:

  • You will need a subquery or CTE to find the most recent activation date for each major
  • Use appropriate joins to ensure scholarships without major requirements are still included (i.e. we want to include entries where scholarship_rules.major_code could be null)
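To illustrate the first hint, here is a toy example (invented crosswalk rows) of selecting the most recent row per group with a correlated subquery on MAX(activation_date):

```python
import sqlite3

import pandas as pd

# A throwaway crosswalk table where one code has two descriptions
# with different activation dates.
demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE cw (code TEXT, descr TEXT, activation_date TEXT)")
demo.executemany("INSERT INTO cw VALUES (?, ?, ?)",
                 [('CS', 'Comp Sci (old)', '2010-01-01'),
                  ('CS', 'Computer Science', '2020-01-01'),
                  ('EN', 'English', '2015-01-01')])

# The correlated subquery finds, for each outer row's code, the
# latest activation_date; only rows matching that date survive.
query = """
SELECT code, descr
FROM cw
WHERE activation_date = (SELECT MAX(activation_date)
                         FROM cw AS cw2
                         WHERE cw2.code = cw.code)
"""
print(pd.read_sql(query, demo))  # one row per code, latest descr only
```

The same filter can be wrapped in a CTE or a derived table (as in a LEFT JOIN subquery) so that it composes with the rest of the report query.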
In [ ]:
### Solution - Exercise 4  
def active_scholarships() -> str:
    ### BEGIN SOLUTION    
    query = '''
    SELECT 
        sc.scholarship_description 
      , mc.major_description
      , sr.min_gpa
      , sr.gender
      , sr.pell_recipient
      , sr.us_veteran
      , sr.us_citizen
      , sr.us_resident
      , sr.state_resident
      , sr.amount
    FROM scholarship_rules sr
    JOIN scholarship_crosswalk sc 
      ON sr.scholarship_code = sc.scholarship_code 
      AND sr.activation_date = sc.activation_date
    LEFT JOIN (
        SELECT 
            major_code,
            major_description
        FROM major_crosswalk mc
        WHERE activation_date = (
                SELECT MAX(activation_date)
                FROM major_crosswalk mc2
                WHERE mc2.major_code = mc.major_code
            )
    ) mc ON sr.major_code = mc.major_code
    WHERE sr.scholarship_active = 'Y'
    '''
    return query
    ### END SOLUTION

### Demo function call
demo_query_active_scholarships = active_scholarships()
demo_result_active_scholarships = pd.read_sql(demo_query_active_scholarships, conn)
demo_result_active_scholarships

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_active_scholarships_TRUE = utils.load_object_from_publicdata('demo_result_active_scholarships_TRUE')

The demo should display this output.

scholarship_description major_description min_gpa gender pell_recipient us_veteran us_citizen us_resident state_resident amount
0 Georgia Veterans Fund None 3.00 None None Y Y Y Y 2500
1 University Student Excellence Award None 3.70 None None None None None Y 2000
2 Women in Mathematics Award Mathematics 3.50 F None None None None None 3000
3 Future Educators of Georgia Scholarship Education 3.50 None None None None None Y 2500
4 First Generation Achievement Scholarship None 3.50 None Y None None None Y 1500
5 Pell Recipients Excellence Fund None 3.00 None Y None Y Y None 1500
6 Chris Kinkade Criminal Justice Fund Criminal Justice 3.50 None None None Y Y Y 2500
7 Georgia Aquarium Marine Biology Fund Marine Biology 3.00 None None None None None Y 2500
8 My Brother's Keeper Fund None 3.50 None Y None N N N 1000
9 Audre Lorde English Studies Award English 3.75 None None None None Y None 2500
10 Bell Center for International Studies Scholarship International Relations 3.50 None None None Y Y None 2500
11 Feingold Institute for Religious Studies Award Religious Studies 3.75 None None None Y Y None 2500
12 Professor Richard Vuduc Analytics Dreamers Scholarship Analytics 3.75 None None None Y Y Y 3000


The cell below will test your solution for active_scholarships (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 4  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(active_scholarships),
              ex_name='active_scholarships',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to active_scholarships did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(active_scholarships),
              ex_name='active_scholarships',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to active_scholarships did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 5: (3 points)

grad_retention_inds

Your task: define grad_retention_inds as follows:

Write a SQL query to calculate retention and graduation indicators for each student.

Inputs:

  • None

Return:

  • A Python string containing a SQLite query that produces a table with 3 columns:
    • student_id: The unique identifier for each student
    • retention_ind: Binary indicator (1 or 0) showing whether a student was retained
    • graduation_ind: Binary indicator (1 or 0) showing whether a student has graduated
    • The columns must be ordered as listed above; the order of rows is not important.

Requirements:

  • Query the following tables:
    • student_enrollment
    • graduation
  • A student is considered retained if they are enrolled EXACTLY one year after their first term
  • A student has graduated if they have a grad_level of 'B' and a grad_status of 'A' in the graduation table
  • Term format: The first 4 digits represent the year, and the last 2 digits represent the starting month

Hints:

  • You'll need to identify each student's first enrollment term
  • To calculate retention, you'll need to add 100 to the first term (representing one year later)
  • Use CAST() to convert string values to integers when performing arithmetic operations
    • Example: CAST('202001' AS INTEGER) would become an integer value of 202001
    • Documentation: SQLite CAST function
  • Use a subquery or CTE to determine each student's first term
  • Binary indicators can be created by using MAX() with CASE statements
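The MAX() + CASE hint can be sketched on toy data. This is a hypothetical, minimal example (invented table and values, not the exam data) showing how MAX() over a CASE expression collapses many enrollment rows per student into a single 0/1 flag:

```python
import sqlite3

# Toy enrollment table: student 1 re-enrolls exactly one year later; student 2 does not.
c = sqlite3.connect(':memory:')
c.executescript('''
CREATE TABLE enroll (student_id INTEGER, term TEXT);
INSERT INTO enroll VALUES (1, '201801'), (1, '201901'), (2, '201801');
''')
flags = c.execute('''
SELECT student_id,
       MAX(CASE WHEN CAST(term AS INTEGER) = first_term + 100
                THEN 1 ELSE 0 END) AS retention_ind
FROM (SELECT student_id, MIN(CAST(term AS INTEGER)) AS first_term
      FROM enroll GROUP BY student_id) ft
JOIN enroll USING (student_id)
GROUP BY student_id
ORDER BY student_id
''').fetchall()
print(flags)  # [(1, 1), (2, 0)]
```

The CASE expression emits 1 only on the matching row; MAX() then reports 1 for the group if any row matched, which is exactly the binary-indicator pattern the exercise asks for.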
In [ ]:
### Solution - Exercise 5  
def grad_retention_inds() -> str:
    ### BEGIN SOLUTION    
    query = '''
    SELECT 
        ft.student_id
      , MAX(CASE WHEN se.term = CAST(ft.first_term AS INTEGER) + 100 THEN 1 ELSE 0 END) AS retention_ind
      , MAX(CASE WHEN g.grad_level = 'B' AND g.grad_status = 'A' THEN 1 ELSE 0 END)     AS graduation_ind
    FROM 
        (
            SELECT 
                student_id 
              , MIN(term) AS first_term
            FROM student_enrollment
            GROUP BY student_id
        ) AS ft
    LEFT JOIN student_enrollment se 
      ON ft.student_id = se.student_id
    LEFT JOIN graduation g 
      ON ft.student_id = g.student_id
    GROUP BY ft.student_id
    '''
    return query
    ### END SOLUTION

### Demo function call
demo_query_grad_retention_inds = grad_retention_inds()
demo_result_grad_retention_inds = pd.read_sql(demo_query_grad_retention_inds, conn)
demo_result_grad_retention_inds.head(10)

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_grad_retention_inds_TRUE = utils.load_object_from_publicdata('demo_result_grad_retention_inds_TRUE')

The demo should display this output.

student_id retention_ind graduation_ind
0 100052749 0 0
1 100636982 0 0
2 102689054 0 0
3 103503531 0 0
4 103972552 0 0
5 103975474 1 0
6 107116054 0 1
7 107372067 0 0
8 108283090 1 0
9 113524474 0 0


The cell below will test your solution for grad_retention_inds (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 5  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(grad_retention_inds),
              ex_name='grad_retention_inds',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to grad_retention_inds did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(grad_retention_inds),
              ex_name='grad_retention_inds',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to grad_retention_inds did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Part 2: Data Deidentification with pandas and NumPy

In this section, you will shift your focus from exploring and compiling data to deidentifying the student records. The consolidated dataset has been built and loaded for you, as part of the notebook starter code above. It is available in the environment as df_consolidated_data.

Run me!

Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.

In [ ]:
df_consolidated_data = utils.load_object_from_publicdata('df_consolidated_data.dill')

Exercise 6: (2 points)

rename_recpt_columns

Your task: define rename_recpt_columns as follows:

Standardize the naming convention for scholarship recipient columns in a DataFrame.

Inputs:

  • df: A DataFrame containing student records with columns that include the substring 'recpt'

Return:

  • A new DataFrame with standardized column names for scholarship recipient fields

Requirements:

  • Identify all columns in the DataFrame that contain the substring 'recpt'
  • Sort these columns alphabetically
  • Rename each column using the pattern 'scholarship_X_recpt', where X is a number starting at 1
    • The first column in the alphabetically sorted list gets named 'scholarship_1_recpt'
    • The second column gets named 'scholarship_2_recpt', and so on
  • Preserve the original order of all columns in the DataFrame
  • Do not modify the input DataFrame
  • Do not use SQLite

Example:

Given a DataFrame with columns containing 'recpt' in their names:

['student_id', ...,  'b_recpt', 'a_recpt', ...]

The output DataFrame should have the following columns:

['student_id', ..., 'scholarship_2_recpt', 'scholarship_1_recpt', ...]

Note: In this example, 'a_recpt' becomes 'scholarship_1_recpt' and 'b_recpt' becomes 'scholarship_2_recpt' based on alphabetical sorting, but the original column order is preserved.

Hint:

  • The enumerate() function can be used to generate numbers for the new column names. See the documentation for details.

Note:

  • The demo cell below only displays the column names of the renamed DataFrame, not the entire DataFrame.
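The renaming idea can be sketched on hypothetical column names (the toy DataFrame below is not the exam data): sort the matching columns, number them with enumerate starting at 1, and rename them in place without reordering:

```python
import pandas as pd

# Toy DataFrame whose 'recpt' columns are deliberately out of alphabetical order.
demo = pd.DataFrame(columns=['student_id', 'b_recpt', 'a_recpt'])
matches = sorted(c for c in demo.columns if 'recpt' in c)  # ['a_recpt', 'b_recpt']
mapping = {old: f'scholarship_{i}_recpt' for i, old in enumerate(matches, start=1)}
renamed = demo.rename(columns=mapping)  # rename preserves column positions
print(list(renamed.columns))
# ['student_id', 'scholarship_2_recpt', 'scholarship_1_recpt']
```

Because the numbering comes from the sorted list but rename only relabels, 'b_recpt' keeps its original position while receiving the number 2.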
In [ ]:
### Solution - Exercise 6  
def rename_recpt_columns(df: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    recpt_columns = [col for col in df.columns if 'recpt' in col]
    sorted_recpt_columns = sorted(recpt_columns)
    new_column_names = {old_name: f'scholarship_{i+1}_recpt' for i, old_name in enumerate(sorted_recpt_columns)}
    renamed_df = df.rename(columns=new_column_names)
    
    return renamed_df
    ### END SOLUTION

### Demo function call
renamed_df = rename_recpt_columns(df_consolidated_data)
print(renamed_df.columns)

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_rename_recpt_columns_TRUE = utils.load_object_from_publicdata('demo_result_rename_recpt_columns_TRUE')

The demo should display this printed output.

Index(['student_id', 'term', 'major_code', 'semester_hours_attempted',
       'semester_hours_earned', 'semester_gpa', 'cumulative_hours_earned',
       'cumulative_gpa', 'gender', 'ethnicity', 'state', 'us_citizen',
       'us_resident', 'state_resident', 'pell_recipient', 'us_veteran',
       'scholarship_7_recpt', 'scholarship_12_recpt', 'scholarship_13_recpt',
       'scholarship_3_recpt', 'scholarship_4_recpt', 'scholarship_11_recpt',
       'scholarship_8_recpt', 'scholarship_6_recpt', 'scholarship_9_recpt',
       'scholarship_1_recpt', 'scholarship_2_recpt', 'scholarship_5_recpt',
       'scholarship_10_recpt', 'retention_ind', 'graduation_ind'],
      dtype='object')


The cell below will test your solution for rename_recpt_columns (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 6  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rename_recpt_columns),
              ex_name='rename_recpt_columns',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to rename_recpt_columns did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rename_recpt_columns),
              ex_name='rename_recpt_columns',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to rename_recpt_columns did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Run me!

Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.

In [ ]:
renamed_df = utils.load_object_from_publicdata('renamed_df.dill')

Exercise 7: (1 point)

adjust_term

Your task: define adjust_term as follows:

Anonymize student data by shifting the year component of all term values by a specified number of years.

Inputs:

  • df: A DataFrame containing student records with a 'term' column
  • n_years: An integer specifying how many years to shift (can be positive or negative)

Return:

  • A new DataFrame with the year component of all term values adjusted

Requirements:

  • The 'term' column format is 'YYYYMM' where:
    • The first 4 digits (YYYY) represent the year
    • The last 2 digits (MM) represent the starting month
  • Adjust only the year component of each term value
  • The DataFrame you return must have the same data types as the input DataFrame
  • Do not modify the original input DataFrame
  • Do not use SQLite

Example:

Given an example df and n_years = 1:

Input:

student_id term
0 1 201801
1 2 201805

Output:

student_id term
0 1 201901
1 2 201905
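The arithmetic behind the shift can be sketched on toy data (invented values, not the exam data): because terms are 'YYYYMM' strings, adding n_years * 100 to the integer form moves only the year digits, and casting back to str restores the original dtype:

```python
import pandas as pd

# Toy 'term' column; shifting by 1 year adds 100 to the YYYYMM integer.
toy = pd.Series(['201801', '201805'])
n_years = 1
shifted = (toy.astype(int) + n_years * 100).astype(str)
print(shifted.tolist())  # ['201901', '201905']
```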
In [ ]:
### Solution - Exercise 7  
def adjust_term(df: pd.DataFrame, n_years: int) -> pd.DataFrame:
    ### BEGIN SOLUTION
    new_term_df = df.copy()
    new_term_df['term'] = new_term_df['term'].astype(int) + (n_years * 100)
    new_term_df['term'] = new_term_df['term'].astype(str)

    return new_term_df
    ### END SOLUTION

### Demo function call
demo_result_adjust_term = adjust_term(renamed_df, 2)
demo_result_adjust_term[["student_id", "term"]].head(10)

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_adjust_term_TRUE = utils.load_object_from_publicdata('demo_result_adjust_term_TRUE')

The demo should display this output.

student_id term
0 100052749 201901
1 100052749 201905
2 100052749 201908
3 100636982 201701
4 100636982 201708
5 100636982 201901
6 100636982 201905
7 100636982 201908
8 100636982 202201
9 100636982 202205


The cell below will test your solution for adjust_term (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 7  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(adjust_term),
              ex_name='adjust_term',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to adjust_term did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(adjust_term),
              ex_name='adjust_term',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to adjust_term did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Run me!

Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.

In [ ]:
new_term_df = utils.load_object_from_publicdata('new_term_df.dill')

Exercise 8: (2 points)

generate_fake_ids

Your task: define generate_fake_ids as follows:

Create anonymized student identifiers to protect student privacy in the dataset.

Inputs:

  • df: A DataFrame containing student records with a student_id column

Return:

  • A new DataFrame where the student_id column has been replaced with an anon_id column containing anonymized identifiers

Requirements:

  • Use the provided generate_md5_hash helper function to create anonymized IDs
    • This function takes a student_id as input and returns a hashed identifier
  • Replace the 'student_id' column with a new 'anon_id' column containing the hashed identifiers
  • Maintain the same column ordering as the input DataFrame (with 'anon_id' in the same position as 'student_id' was)
  • If the 'student_id' column does not exist in the input, return the DataFrame unchanged
  • Do not modify the original input DataFrame
  • Do not use SQLite

Note: The generate_md5_hash function is already implemented and available for you to use.
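For intuition, here is a quick check of why an MD5 hash works as a pseudonym: hashing is deterministic, so every row for a given student_id maps to the same 32-character anon_id (the example ID below is arbitrary):

```python
import hashlib

# Hashing the same id twice yields identical digests of fixed length.
h1 = hashlib.md5(str(100052749).encode()).hexdigest()
h2 = hashlib.md5(str(100052749).encode()).hexdigest()
print(h1 == h2, len(h1))  # True 32
```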


We have defined a helper function, generate_md5_hash as follows:

Outputs an MD5 hash for a given original_id input.

In [ ]:
### Helper Function
def generate_md5_hash(original_id) -> str:
    return hashlib.md5(str(original_id).encode()).hexdigest()

### Solution - Exercise 8  
def generate_fake_ids(df: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    student_ids_df = df.copy()
    if 'student_id' not in student_ids_df.columns:
        return student_ids_df
    student_ids_df['student_id'] = student_ids_df['student_id'].astype(str)

    student_ids_df['student_id'] = student_ids_df['student_id'].apply(generate_md5_hash)
    student_ids_df = student_ids_df.rename(columns={'student_id': 'anon_id'})
    
    return student_ids_df
    ### END SOLUTION

### Demo function call
demo_result_generate_fake_ids = generate_fake_ids(new_term_df)
demo_result_generate_fake_ids[["anon_id"]].head(10)

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_generate_fake_ids_TRUE = utils.load_object_from_publicdata('demo_result_generate_fake_ids_TRUE')

The demo should display this output.

anon_id
0 9ee43c210f34e3f962279d84d72633d6
1 9ee43c210f34e3f962279d84d72633d6
2 9ee43c210f34e3f962279d84d72633d6
3 06dff90a4976f6fa8c9abc1f698e27ee
4 06dff90a4976f6fa8c9abc1f698e27ee
5 06dff90a4976f6fa8c9abc1f698e27ee
6 06dff90a4976f6fa8c9abc1f698e27ee
7 06dff90a4976f6fa8c9abc1f698e27ee
8 06dff90a4976f6fa8c9abc1f698e27ee
9 06dff90a4976f6fa8c9abc1f698e27ee


The cell below will test your solution for generate_fake_ids (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 8  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(generate_fake_ids),
              ex_name='generate_fake_ids',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to generate_fake_ids did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(generate_fake_ids),
              ex_name='generate_fake_ids',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to generate_fake_ids did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Run me!

Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.

In [ ]:
student_ids_df = utils.load_object_from_publicdata('student_ids_df.dill')

Exercise 9: (2 points)

deidentify_columns

Your task: define deidentify_columns as follows:

Standardize and deidentify values in categorical columns by replacing original values with generic identifiers.

Inputs:

  • df: A DataFrame containing student records
  • columns: A list of column names whose values should be deidentified

Return:

  • A new DataFrame with standardized values in the specified columns

Requirements:

  • For each specified column:
    1. Identify all unique values in the column
    2. Sort these unique values alphabetically
    3. Replace each value with a new identifier following the pattern column_name_X
      • Where column_name is the original column name
      • And X is a number starting at 1, corresponding to the position in the sorted list
  • Preserve the original column order of the DataFrame
  • Do not modify the original input DataFrame
  • Do not use SQLite

Example:

Given a DataFrame with a gender column containing values F and M:

Before:

   student_id  gender  age
0          1       F   21
1          2       M   19
2          3       F   22

After deidentify_columns(df, ['gender']):

   student_id     gender  age
0          1  gender_1   21
1          2  gender_2   19
2          3  gender_1   22

Hint:

  • The pandas Series.map() function is useful for replacing values based on a dictionary. See the documentation for more information: pandas.Series.map
  • The enumerate() function can be used to generate numbers for the new identifiers. See the documentation for more information: Python enumerate()
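The two hints combine naturally on toy data (the values below are invented for illustration): build a dict from the sorted unique values to generic labels, then let Series.map apply it row-wise:

```python
import pandas as pd

# Toy gender column; sorted unique values determine the label numbering.
s = pd.Series(['F', 'M', 'F'])
labels = {v: f'gender_{i}' for i, v in enumerate(sorted(s.unique()), start=1)}
deidentified = s.map(labels)
print(deidentified.tolist())  # ['gender_1', 'gender_2', 'gender_1']
```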
In [ ]:
### Solution - Exercise 9  
def deidentify_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    ### BEGIN SOLUTION
       
    deidentified_df = df.copy()
    for column in columns:
        unique_values = sorted(deidentified_df[column].unique())
        value_map = {value: f"{column}_{i+1}" for i, value in enumerate(unique_values)}
        deidentified_df[column] = deidentified_df[column].map(value_map)
    
    return deidentified_df
    ### END SOLUTION

### Demo function call
target_columns = ['major_code', 'gender', 'ethnicity', 'state']
demo_result_deidentify_columns = deidentify_columns(student_ids_df, target_columns)
demo_result_deidentify_columns[target_columns].head(10)

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_deidentify_columns_TRUE = utils.load_object_from_publicdata('demo_result_deidentify_columns_TRUE')

The demo should display this output.

major_code gender ethnicity state
0 major_code_47 gender_1 ethnicity_5 state_25
1 major_code_47 gender_1 ethnicity_5 state_25
2 major_code_47 gender_1 ethnicity_5 state_25
3 major_code_18 gender_1 ethnicity_2 state_58
4 major_code_50 gender_1 ethnicity_2 state_58
5 major_code_18 gender_1 ethnicity_2 state_58
6 major_code_18 gender_1 ethnicity_2 state_58
7 major_code_18 gender_1 ethnicity_2 state_58
8 major_code_18 gender_1 ethnicity_2 state_58
9 major_code_18 gender_1 ethnicity_2 state_58


The cell below will test your solution for deidentify_columns (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 9  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(deidentify_columns),
              ex_name='deidentify_columns',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to deidentify_columns did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(deidentify_columns),
              ex_name='deidentify_columns',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to deidentify_columns did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Run me!

Whether you solved the previous exercise or not, run the following code cell. It will load the proper results into memory for the next exercise.

In [ ]:
deidentified_df = utils.load_object_from_publicdata('deidentified_df.dill')

Exercise 10: (3 points)

perturb_gpa

Your task: define perturb_gpa as follows:

Apply data perturbation to GPA values to protect student privacy while preserving statistical properties.

Inputs:

  • df: A DataFrame containing student records with 'semester_gpa' and scholarship columns ending with '_recpt'
  • seed: An integer seed value for the random number generator (default: 6040)

Return:

  • A new DataFrame with the same structure as the input but with perturbed 'semester_gpa' values

Requirements:

  • Perturb 'semester_gpa' values based on scholarship recipient status
  • Use the provided random number generator instance rng to ensure reproducibility
  • Do not modify the original input DataFrame
  • Do not use SQLite
  • Follow the exact calculation order specified in the steps below to maintain consistency

Steps:

  1. Check if 'semester_gpa' column exists in the DataFrame. If not, return the DataFrame unchanged.
  2. Identify scholarship columns (those ending with '_recpt' suffix).
  3. If no scholarship columns exist, return the DataFrame unchanged.
  4. Identify students with no scholarships (all '_recpt' columns = 0):
    • Calculate the mean semester_gpa for this group
    • Perturb their GPAs using a normal distribution with the group mean and standard deviation = 0.1
  5. For each specific scholarship type, IN THE ORDER THEY APPEAR IN THE COLUMNS LIST:
    • Identify students who received only this specific scholarship (this '_recpt' column = 1 and all others = 0)
    • Calculate the mean semester_gpa for students with this specific scholarship
    • Perturb their GPAs using a normal distribution with this scholarship-specific mean and standard deviation = 0.1
  6. Leave semester_gpa values unchanged for students with multiple scholarships.
  7. Return the DataFrame with scholarship-specific perturbed values.

Important Note on Random Number Generation: The random number generator maintains its state between calls. This means the sequence of random numbers depends on the order of operations. You must process the student groups in the exact order specified above (no-scholarship students first, then single-scholarship students in column order) to ensure reproducible results. Each call to rng.normal() advances the generator's internal state.

Example:

If a group has a mean GPA of 3.5:

  • A student with a GPA of 3.6 might be perturbed to 3.45
  • A student with a GPA of 3.4 might be perturbed to 3.58
  • All while maintaining a group mean close to 3.5

Hints:

  • The random number generator rng is already created for you using np.random.default_rng(seed)
  • Use boolean masks to select the appropriate rows for each scholarship group
  • Use rng.normal(loc=mean, scale=standard_deviation, size=number_of_values) to generate perturbed values
  • For more information on NumPy random number generator state, see this Stack Overflow discussion
  • Use boolean indexing to update the 'semester_gpa' values
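The note on generator state can be sketched directly (toy parameters, not the exam data): two generators with the same seed produce identical draws only if their calls happen in the same order, because each call advances the internal state:

```python
import numpy as np

# Same seed, same first call: identical draws.
rng_a = np.random.default_rng(6040)
rng_b = np.random.default_rng(6040)
first_a = rng_a.normal(loc=3.5, scale=0.1, size=3)
first_b = rng_b.normal(loc=3.5, scale=0.1, size=3)
print(np.allclose(first_a, first_b))   # True: identical seed and call order

# A second call on the same generator draws from an advanced state.
second_a = rng_a.normal(loc=3.5, scale=0.1, size=3)
print(np.allclose(first_a, second_a))  # False: the state moved on
```

This is why the exercise insists on processing the no-scholarship group first and then the single-scholarship groups in column order: reordering the groups reorders the draws each group receives.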
In [ ]:
### Solution - Exercise 10
def perturb_gpa(df: pd.DataFrame, seed: int = 6040) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    ### BEGIN SOLUTION

    # Nothing to perturb if the GPA column is missing
    if 'semester_gpa' not in df.columns:
        return df

    # Scholarship indicator columns all end in '_recpt'
    recpt_columns = [col for col in df.columns if col.endswith('_recpt')]

    if not recpt_columns:
        return df

    # Work on a copy so the input DataFrame is not modified
    result_df = df.copy()

    # Group 1: students with no scholarships (every indicator is 0)
    zeros_mask = (result_df[recpt_columns] == 0).all(axis=1)
    mean_zeros = result_df.loc[zeros_mask, 'semester_gpa'].mean()

    num_zeros = zeros_mask.sum()
    if num_zeros > 0:
        # Replace each GPA with a draw centered on the group mean
        perturb_zeros = rng.normal(mean_zeros, 0.1, size=num_zeros)
        result_df.loc[zeros_mask, 'semester_gpa'] = perturb_zeros

    # Group 2: single-scholarship students, processed in column order
    # so the generator's state advances deterministically
    for recpt_col in recpt_columns:
        other_cols = [col for col in recpt_columns if col != recpt_col]
        single_mask = (result_df[recpt_col] == 1) & (result_df[other_cols] == 0).all(axis=1)

        mean_gpa = result_df.loc[single_mask, 'semester_gpa'].mean()

        # Skip empty groups (the mean is NaN when no rows match)
        if single_mask.sum() == 0 or np.isnan(mean_gpa):
            continue

        num = single_mask.sum()
        perturbations = rng.normal(mean_gpa, 0.1, size=num)
        result_df.loc[single_mask, 'semester_gpa'] = perturbations

    # Students with multiple scholarships are left unchanged
    return result_df
    ### END SOLUTION

### Demo function call
demo_result_perturb_gpa = perturb_gpa(deidentified_df, seed=SEED)
filtered_demo_result_perturb_gpa = demo_result_perturb_gpa[['semester_gpa', 'scholarship_4_recpt', 'scholarship_12_recpt']]
no_scholarships = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 0) & 
                                            (filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 0)]
both_scholarships = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 1) & 
                                            (filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 1)]
only_scholarship1 = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 1) & 
                                            (filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 0)]
only_scholarship2 = filtered_demo_result_perturb_gpa[(filtered_demo_result_perturb_gpa['scholarship_4_recpt'] == 0) & 
                                            (filtered_demo_result_perturb_gpa['scholarship_12_recpt'] == 1)]
sample = pd.concat([
    no_scholarships.head(3), both_scholarships.head(3), only_scholarship1.head(2), only_scholarship2.head(2)
]).reset_index(drop=True)
sample

Run me!

Whether your solution is working or not, run the following code cell. It will load the proper results into memory and show the expected output for the demo cell above.

In [ ]:
### Run Me!!!
demo_result_perturb_gpa_TRUE = utils.load_object_from_publicdata('demo_result_perturb_gpa_TRUE')

The demo should display this output.

   semester_gpa  scholarship_4_recpt  scholarship_12_recpt
0      2.867951                    0                     0
1      2.662993                    0                     0
2      2.854889                    0                     0
3      2.890000                    1                     1
4      2.920000                    1                     1
5      2.890000                    1                     1
6      3.061051                    1                     0
7      2.480000                    1                     0
8      3.060911                    0                     1
9      3.201331                    0                     1


The cell below will test your solution for perturb_gpa (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 10  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(perturb_gpa),
              ex_name='perturb_gpa',
              key=b'O0BU75J0b9vpmE9I87bfmsoeq8QjLtdkUmFw3nrTbWQ=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to perturb_gpa did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(perturb_gpa),
              ex_name='perturb_gpa',
              key=b'oA5C_k4ec6_fPwNqcV2KHNqkNId_fOV35lsPuoXv2VM=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
if e: raise e
assert passed, 'The solution to perturb_gpa did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Fin