Spring 2026 Midterm 2: Predicting At-Risk Students with Learning Analytics Data
Version 1.0.0
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of SQL queries and aggregations, pandas feature engineering, and NumPy classification metrics. It has 11 exercises, numbered 0 to 10. There are 22 available points; however, to earn 100%, the threshold is 14 points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head, or chunks of rows at a time).
Exercise point breakdown:
Exercise 0: 2 point(s)
Exercise 1: 2 point(s)
Exercise 2: 3 point(s)
Exercise 3: 2 point(s)
Exercise 4: 2 point(s)
Exercise 5: 1 point(s)
Exercise 6: 2 point(s)
Exercise 7: 3 point(s)
Exercise 8: 3 point(s)
Exercise 9: 1 point(s)
Exercise 10: 1 point(s)
Final reminders:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
from pprint import pprint
import sqlite3
import pandas as pd
import numpy as np
DATABASE_URL = r'./resource/asnlib/publicdata/oulad2.db'
conn = sqlite3.connect(DATABASE_URL)
This exam explores early identification of at-risk students using the Open University Learning Analytics Dataset (OULAD). Student dropout is a critical challenge in distance education. At the Open University, approximately 78% of students fail to complete their degree. The highest dropout rates occur within the first few weeks of a course, making early intervention essential but difficult.
Traditional predictive models rely on legacy data from previous courses to identify students at risk of failing or withdrawing. However, this approach cannot be used for new courses since they lack historical data. The Ouroboros methodology, introduced by Hlosta et al. (2017), proposes a solution: a "self-learning" model that uses data from the current course rather than legacy data.
In this exam, you will implement components of the Ouroboros approach for early risk prediction. The methodology centers on a critical insight: the first assessment (A1) serves as a key indicator of student success. Students who submit A1 have dramatically different outcomes than those who don't, and their behaviors can be used to predict which currently-enrolled students are at risk of not submitting.
The exam has three parts:
Explore the OULAD Database Use SQL queries to understand course-level outcomes, assessment submission patterns, and Virtual Learning Environment (VLE) engagement data that reveal early behavioral indicators of student risk.
Engineer Predictive Features Use Python, Pandas, and NumPy to transform raw student interaction data into meaningful features (VLE clicks, days active, registration timing) that capture engagement patterns before assessment deadlines.
Implement Temporal Modeling Implement the Ouroboros "self-learning" concept by creating temporal train/test splits and calculating evaluation metrics that account for severe class imbalance inherent in early prediction windows.
By the end of this exam, you will have implemented key parts of the pipeline that identifies at-risk students without relying on legacy course data.
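Part three hinges on metrics that remain informative under class imbalance. As a quick refresher, here is a minimal NumPy sketch of precision, recall, and F1 (the function name and toy arrays are invented for illustration; this is not part of any exercise):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# With 90% negatives, predicting "not at risk" for everyone scores 90%
# accuracy but recall of 0 -- which is why these metrics matter here.
y_true = np.array([0]*9 + [1])
y_pred = np.array([0]*8 + [1, 1])
print(precision_recall_f1(y_true, y_pred))  # (0.5, 1.0, 0.666...)
```

The early prediction windows in this exam have few positive (at-risk) examples, so accuracy alone would reward trivially optimistic models.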
The Open University Learning Analytics Dataset (OULAD) contains anonymized data from 32,593 students across 7 modules and 22 course presentations spanning 2013-2014.
The database consists of 7 tables:
Core Student & Course Tables:
- studentInfo - Student demographics and final course results
- courses - Module and presentation
- studentRegistration - Enrollment dates and withdrawal timestamps

Assessment Tables:
- assessments - Assessment definitions including type, deadlines, and weights
- studentAssessment - Individual student scores and submission dates

VLE Interaction Tables:
- vle - Learning materials (activity_type, availability weeks) organized by module
- studentVle - Daily student interaction logs

Schema Diagram:

The OULAD contains 32,593 students across 22 course presentations, with over 10 million VLE interaction records and detailed assessment submission data.
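If you want to verify the schema for yourself, SQLite's catalog makes it easy. This sketch uses an in-memory database as a stand-in for oulad2.db, with a hypothetical subset of the vle columns; the same two queries work against the real `conn`:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for oulad2.db; the `vle` columns here are
# an illustrative subset, not the full schema.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute(
    "CREATE TABLE vle (id_site INTEGER, code_module TEXT, activity_type TEXT)"
)

# List every table from SQLite's catalog
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name",
    conn_demo
)
print(tables['name'].tolist())

# PRAGMA table_info lists each column's name and declared type
cols = pd.read_sql("PRAGMA table_info(vle)", conn_demo)
print(cols[['name', 'type']])
```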
Pass rates vary significantly across courses. Some courses see less than 50% of students pass while others exceed 60%. Understanding these baseline statistics provides context into why early intervention matters and helps us identify which courses might benefit most from predictive modeling.
Before we can analyze how students interact with learning materials, we need to understand what types of materials exist and how they're distributed across courses.
The following exercise explores the distribution of learning materials. This foundational knowledge shapes how we'll aggregate VLE interactions into meaningful features for prediction.
vle_activity_summary
Your task: define vle_activity_summary_query as follows:
Write a SQL query that summarizes Virtual Learning Environment (VLE) activity types and their usage patterns across all courses.
Inputs: None
Return: query: a query that produces a table with 3 columns:
- activity_type: The type of VLE activity
- resource_count: Total number of VLE resources of each activity type across all courses
- courses_using: Number of distinct course presentations utilizing each activity type

Requirements/steps:
- Query the vle table
- Group by activity_type
- resource_count: count all VLE entries of each activity_type
- courses_using: count distinct courses (combination of code_module and code_presentation) for each activity_type
- Order by resource_count in descending order, then courses_using in ascending order
- Return the columns activity_type, resource_count, courses_using in that order

### Solution - Exercise 0
vle_activity_summary_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
vle_activity_summary_query = '''
SELECT activity_type,
COUNT(*) AS resource_count,
COUNT(DISTINCT code_module || '_' || code_presentation) AS courses_using
FROM vle
GROUP BY activity_type
ORDER BY resource_count DESC,
courses_using ASC;
'''
### END SOLUTION
### Demo function call
demo_result_vle_activity_summary = pd.read_sql(vle_activity_summary_query, conn)
demo_result_vle_activity_summary
The demo should display this output.
| | activity_type | resource_count | courses_using |
|---|---|---|---|
| 0 | resource | 807 | 4 |
| 1 | subpage | 122 | 4 |
| 2 | oucontent | 77 | 4 |
| 3 | forumng | 56 | 4 |
| 4 | url | 50 | 4 |
| 5 | quiz | 19 | 4 |
| 6 | oucollaborate | 6 | 3 |
| 7 | glossary | 5 | 4 |
| 8 | questionnaire | 4 | 1 |
| 9 | homepage | 4 | 4 |
| 10 | sharedsubpage | 3 | 3 |
| 11 | ouelluminate | 1 | 1 |
The cell below will test your solution for vle_activity_summary (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 0
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(vle_activity_summary_query),
ex_name='vle_activity_summary',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to vle_activity_summary did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The first assessment (A1) is an important indicator. Submission of A1 correlates strongly with eventual course completion. The following exercise examines this relationship by analyzing submission patterns across courses. You will explore differences between students who submit A1 early versus those who submit late. This behavioral difference is fundamental to the self-learning approach: early submitters provide the signal that enables prediction of which current students are at risk.
course_pass_rate
Your task: define course_pass_rate_query as follows:
Write a SQL query that calculates pass rate statistics for each course presentation.
Inputs: None
Return: query: a query that produces a table with 5 columns:
- code_module: The course module code
- code_presentation: The presentation code
- total_students: Total number of students enrolled in the course presentation
- passed_students: Number of students who achieved 'Pass' or 'Distinction' as their final result
- pass_rate: Percentage of students who passed (rounded to 2 decimal places)

Requirements/steps:
- Join the courses table with the studentInfo table using both code_module and code_presentation
- Group by code_module and code_presentation
- Compute total_students by counting all students in each grouping
- Compute passed_students by summing students whose final_result is either 'Pass' or 'Distinction'
- Compute pass_rate as a percentage: (passed_students / total_students) * 100, rounded to 2 decimal places
- Keep only groupings with at least 500 total students
- Order by pass_rate descending, then code_presentation ascending (to break ties deterministically)
- Return the columns code_module, code_presentation, total_students, passed_students, pass_rate in that order

### Solution - Exercise 1
course_pass_rate_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
course_pass_rate_query = '''
SELECT c.code_module,
c.code_presentation,
COUNT(s.id_student) AS total_students,
SUM(CASE WHEN s.final_result IN ('Pass', 'Distinction') THEN 1 ELSE 0 END) AS passed_students,
ROUND(100.0 * SUM(CASE WHEN s.final_result IN ('Pass', 'Distinction') THEN 1 ELSE 0 END) / COUNT(s.id_student), 2) AS pass_rate
FROM courses AS c
JOIN studentInfo AS s
ON c.code_module = s.code_module
AND c.code_presentation = s.code_presentation
GROUP BY c.code_module, c.code_presentation
HAVING COUNT(s.id_student) >= 500
ORDER BY pass_rate DESC, c.code_presentation ASC;
'''
### END SOLUTION
### Demo function call
demo_result_course_pass_rate = pd.read_sql(course_pass_rate_query, conn)
demo_result_course_pass_rate
The demo should display this output.
| | code_module | code_presentation | total_students | passed_students | pass_rate |
|---|---|---|---|---|---|
| 0 | BBB | 2014J | 2292 | 1152 | 50.26 |
| 1 | BBB | 2013J | 2237 | 1072 | 47.92 |
| 2 | BBB | 2013B | 1767 | 803 | 45.44 |
| 3 | BBB | 2014B | 1613 | 727 | 45.07 |
The cell below will test your solution for course_pass_rate (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(course_pass_rate_query),
ex_name='course_pass_rate',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to course_pass_rate did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Not all A1 submitters are equal. Students who submit assessments early versus late show different engagement patterns in the Virtual Learning Environment.
In the following exercise, you'll compare VLE interaction patterns before the median A1 submission date. This reveals behavioral differences that separate engaged students from those who may be struggling. These differences are visible long before the assignment deadline.
DEBUG_early_late_submitters
Your task: define DEBUG_early_late_submitters_query as follows:
Write a SQL query that compares VLE engagement patterns between students who submitted the first assessment early versus late.
Inputs: None
Return: query: a query that produces a table with 5 columns:
- submission_category: Either 'Early' or 'Late' based on submission timing
- student_count: Total number of unique students in each category
- avg_total_clicks: Average total VLE clicks per student (rounded to 2 decimal places)
- avg_days_active: Average number of distinct days with VLE activity (rounded to 2 decimal places)
- avg_submission_day: Average day when the assessment was submitted (rounded to 2 decimal places)

Requirements/steps:
- Join studentAssessment with the assessments table on id_assessment
- Build a subquery over studentVle grouped by id_student, code_module, and code_presentation. Merge it with the joined studentAssessment and assessments tables on id_student, code_module, and code_presentation:
  - total_clicks: SUM of sum_click from the studentVle table
  - days_active: COUNT of DISTINCT date values from the studentVle table
- In the subquery, restrict to code_module = 'BBB' and code_presentation = '2013B'
- 'Early' means date_submitted <= 25; 'Late' means date_submitted > 25
- Filter to id_assessment = 14984 (the first assessment)
- Group by submission_category
- For each submission_category, calculate the four aggregate metrics (student_count, avg_total_clicks (rounded to 2 decimal places), avg_days_active (rounded to 2 decimal places), avg_submission_day (rounded to 2 decimal places))
- Treat missing values as 0 for total_clicks and days_active
- Return the columns submission_category, student_count, avg_total_clicks, avg_days_active, avg_submission_day in that order

### Solution - Exercise 2
# DEBUG_early_late_submitters_query = '''
# SELECT
# CASE
# WHEN sa.date_submitted < 25 THEN 'Early'
# ELSE 'Late'
# END AS submission_category,
# COUNT(sa.id_student) AS student_count,
# AVG(sv.total_clicks) AS avg_total_clicks,
# AVG(sv.days_active) AS avg_days_active,
# AVG(sa.date_submitted) AS avg_submission_day
# FROM studentAssessment sa
# INNER JOIN assessments a ON sa.id_assessment = a.id_assessment
# INNER JOIN (
# SELECT
# id_student,
# code_module,
# code_presentation,
# SUM(sum_click) as total_clicks,
# COUNT(date) as days_active
# FROM studentVle
# GROUP BY id_student, code_module, code_presentation
# ) sv
# ON sa.id_student = sv.id_student
# AND a.code_module = sv.code_module
# AND a.code_presentation = sv.code_presentation
# WHERE sa.id_assessment = 14984
# GROUP BY submission_category
# ORDER BY submission_category;
# '''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
DEBUG_early_late_submitters_query = '''
SELECT
CASE
WHEN sa.date_submitted <= 25 THEN 'Early'
ELSE 'Late'
END AS submission_category,
COUNT(DISTINCT sa.id_student) AS student_count,
ROUND(AVG(COALESCE(sv.total_clicks, 0)), 2) AS avg_total_clicks,
ROUND(AVG(COALESCE(sv.days_active, 0)), 2) AS avg_days_active,
ROUND(AVG(sa.date_submitted), 2) AS avg_submission_day
FROM studentAssessment sa
INNER JOIN assessments a ON sa.id_assessment = a.id_assessment
LEFT JOIN (
SELECT
id_student,
code_module,
code_presentation,
SUM(sum_click) as total_clicks,
COUNT(DISTINCT date) as days_active
FROM studentVle
WHERE code_module = 'BBB'
AND code_presentation = '2013B'
GROUP BY id_student, code_module, code_presentation
) sv
ON sa.id_student = sv.id_student
AND a.code_module = sv.code_module
AND a.code_presentation = sv.code_presentation
WHERE sa.id_assessment = 14984
GROUP BY submission_category
ORDER BY submission_category;
'''
### END SOLUTION
### Demo function call
demo_result_DEBUG_early_late_submitters = pd.read_sql(DEBUG_early_late_submitters_query, conn)
demo_result_DEBUG_early_late_submitters
The demo should display this output.
| | submission_category | student_count | avg_total_clicks | avg_days_active | avg_submission_day |
|---|---|---|---|---|---|
| 0 | Early | 1276 | 1022.25 | 59.48 | 17.45 |
| 1 | Late | 76 | 377.47 | 23.51 | 34.68 |
The cell below will test your solution for DEBUG_early_late_submitters (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 2
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(DEBUG_early_late_submitters_query),
ex_name='DEBUG_early_late_submitters',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_early_late_submitters did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
There are additional early indicators beyond submission patterns that predict at-risk students. VLE engagement metrics, such as total clicks and days active, correlate strongly with course outcomes.
The following exercise introduces the importance of behavioral features. Students with low engagement early in the course likely need intervention.
vle_engagement_risk
Your task: define vle_engagement_risk_query as follows:
Write a SQL query that identifies at-risk students based on low VLE engagement in the first 30 days of a course.
Inputs: None
Return: query: a query that produces a table with 5 columns:
- id_student: The student identifier
- total_clicks: Total sum of VLE clicks across all interactions before day 30
- days_active: Count of distinct days with VLE activity before day 30
- at_risk_flag: Binary flag (0 or 1) indicating risk status
- final_result: The student's final course outcome

Requirements/steps:
- Filter studentVle records to code_module = 'BBB', code_presentation = '2013J', and date < 30
- Join studentVle with studentInfo on id_student, code_module, and code_presentation
- Group by id_student and final_result
- total_clicks: SUM of sum_click values
- days_active: COUNT of DISTINCT date values
- Compute at_risk_flag using a CASE statement: 1 when total_clicks < 50 OR days_active < 5, else 0
- Include the final_result column from the studentInfo table
- Order by id_student ascending, then final_result ascending
- Return the columns id_student, total_clicks, days_active, at_risk_flag, final_result in that order

### Solution - Exercise 3
vle_engagement_risk_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
vle_engagement_risk_query = '''
SELECT sv.id_student,
SUM(sv.sum_click) AS total_clicks,
COUNT(DISTINCT sv.date) AS days_active,
CASE
WHEN SUM(sv.sum_click) < 50 OR COUNT(DISTINCT sv.date) < 5
THEN 1
ELSE 0
END AS at_risk_flag,
s.final_result
FROM studentVle AS sv
JOIN studentInfo AS s
ON sv.id_student = s.id_student
AND sv.code_module = s.code_module
AND sv.code_presentation = s.code_presentation
WHERE sv.code_module = 'BBB'
AND sv.code_presentation = '2013J'
AND sv.date < 30
GROUP BY sv.id_student, s.final_result
ORDER BY sv.id_student ASC, s.final_result ASC;
'''
### END SOLUTION
### Demo function call
demo_result_vle_engagement_risk = pd.read_sql(vle_engagement_risk_query, conn)
demo_result_vle_engagement_risk.head(20)
The demo should display this output.
| | id_student | total_clicks | days_active | at_risk_flag | final_result |
|---|---|---|---|---|---|
| 0 | 23798 | 152 | 12 | 0 | Distinction |
| 1 | 27759 | 123 | 12 | 0 | Fail |
| 2 | 30091 | 219 | 8 | 0 | Pass |
| 3 | 31014 | 409 | 26 | 0 | Withdrawn |
| 4 | 31849 | 414 | 17 | 0 | Pass |
| 5 | 37622 | 88 | 4 | 1 | Pass |
| 6 | 38234 | 34 | 5 | 1 | Fail |
| 7 | 47855 | 5 | 3 | 1 | Withdrawn |
| 8 | 48040 | 307 | 26 | 0 | Pass |
| 9 | 48503 | 121 | 8 | 0 | Pass |
| 10 | 51301 | 502 | 36 | 0 | Withdrawn |
| 11 | 52797 | 280 | 15 | 0 | Pass |
| 12 | 52899 | 448 | 21 | 0 | Pass |
| 13 | 53360 | 22 | 3 | 1 | Withdrawn |
| 14 | 54388 | 42 | 1 | 1 | Withdrawn |
| 15 | 55026 | 118 | 13 | 0 | Pass |
| 16 | 56340 | 4 | 1 | 1 | Fail |
| 17 | 56789 | 124 | 8 | 0 | Pass |
| 18 | 57079 | 112 | 12 | 0 | Fail |
| 19 | 63044 | 74 | 6 | 0 | Fail |
The cell below will test your solution for vle_engagement_risk (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(vle_engagement_risk_query),
ex_name='vle_engagement_risk',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to vle_engagement_risk did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
students_by_course = utils.load_object_from_publicdata('students_by_course')
registrations_by_course = utils.load_object_from_publicdata('registrations_by_course')
vle_by_course = utils.load_object_from_publicdata('vle_by_course')
STUDENT_FEATURES = utils.load_object_from_publicdata('STUDENT_FEATURES')
REGISTRATION_FEATURES = utils.load_object_from_publicdata('REGISTRATION_FEATURES')
RESULT_FEATURES = utils.load_object_from_publicdata('RESULT_FEATURES')
Now we will shift from exploration to implementation. You'll transform raw VLE interaction data into structured features for machine learning and implement the temporal modeling framework that enables self-learning predictions.
The analysis requires that we create features that capture engagement patterns before assessment submissions. We need to aggregate VLE interactions, calculate days active, and extract registration timing given a temporal cutoff day. In the following exercise, you will produce one row per student with engineered behavioral features computed only using data available up to the cutoff day. These features are the foundation for subsequent modeling.
create_student_features
Your task: define create_student_features as follows:
Create a comprehensive student feature matrix that combines demographics, registration timing, and VLE engagement metrics.
Inputs:
- students_df (DataFrame): Students from the studentInfo table with columns code_module, code_presentation, id_student, age_band, highest_education, disability, num_of_prev_attempts, final_result
- registrations_df (DataFrame): Registration records from the studentRegistration table with columns code_module, code_presentation, id_student, date_registration
- vle_df (DataFrame): VLE interaction records from the studentVle table with columns code_module, code_presentation, id_student, date, sum_click
- cutoff_day (int): Temporal cutoff for VLE activity analysis (only include VLE activity before this day)

Return: A pandas DataFrame with the following structure:
- indexed by id_student (student identifier)
- retains code_module and code_presentation for course identification

Requirements/steps:
- Start from students_df using only the columns in STUDENT_FEATURES
- Left-merge registrations_df (on the right) to add registration timing features:
  - merge on code_module, code_presentation, and id_student
  - from the registrations_df dataframe, keep only the columns in REGISTRATION_FEATURES
- Aggregate vle_df:
  - filter vle_df to only include records where date < cutoff_day
  - group by code_module, code_presentation, and id_student
  - total_clicks: SUM of sum_click
  - total_active_days: COUNT of DISTINCT date values
  - compute avg_clicks_per_day, which equals total_clicks/total_active_days
- Left-merge the VLE aggregates on code_module, code_presentation, and id_student
- Fill missing VLE values with 0 so that the dtypes of (total_clicks, total_active_days, avg_clicks_per_day) are (int64, int64, float64)
- Create days_registered_before_start equal to date_registration
- Return the columns in RESULT_FEATURES, indexed by id_student

Note: The preloaded lists STUDENT_FEATURES, REGISTRATION_FEATURES, and RESULT_FEATURES are provided to save typing.
### Solution - Exercise 4
def create_student_features(students_df, registrations_df, vle_df, cutoff_day):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
features = students_df[STUDENT_FEATURES].copy()
features = features.merge(
registrations_df[REGISTRATION_FEATURES],
on=['code_module', 'code_presentation', 'id_student'],
how='left'
)
vle_filtered = vle_df[vle_df['date'] < cutoff_day].copy()
vle_agg = vle_filtered.groupby(['code_module', 'code_presentation', 'id_student']).agg(
total_clicks=('sum_click', 'sum'),
total_active_days=('date', 'nunique')
).reset_index()
vle_agg['avg_clicks_per_day'] = vle_agg['total_clicks'] / vle_agg['total_active_days']
features = features.merge(vle_agg, on=['code_module', 'code_presentation', 'id_student'], how='left')
features['total_clicks'] = features['total_clicks'].fillna(0).astype(int)
features['total_active_days'] = features['total_active_days'].fillna(0).astype(int)
features['avg_clicks_per_day'] = features['avg_clicks_per_day'].fillna(0.0)
features['days_registered_before_start'] = features['date_registration']
result = features[RESULT_FEATURES].set_index('id_student')
return result
### END SOLUTION
### Demo function call
demo_students = students_by_course[('AAA', '2013J')].head(50)
demo_registrations = registrations_by_course[('AAA', '2013J')]
demo_vle = vle_by_course[('AAA', '2013J')]
demo_cutoff = 30
result = create_student_features(demo_students, demo_registrations, demo_vle, demo_cutoff)
result.head(10)
The demo should display this output.
| id_student | code_module | code_presentation | age_band | highest_education | disability | num_of_prev_attempts | days_registered_before_start | total_clicks | total_active_days | avg_clicks_per_day | final_result |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11391 | AAA | 2013J | 55<= | HE Qualification | N | 0 | -159.0 | 419 | 9 | 46.555556 | Pass |
| 28400 | AAA | 2013J | 35-55 | HE Qualification | N | 0 | -53.0 | 618 | 19 | 32.526316 | Pass |
| 30268 | AAA | 2013J | 35-55 | A Level or Equivalent | Y | 0 | -92.0 | 281 | 12 | 23.416667 | Withdrawn |
| 31604 | AAA | 2013J | 35-55 | A Level or Equivalent | N | 0 | -52.0 | 503 | 23 | 21.869565 | Pass |
| 32885 | AAA | 2013J | 0-35 | Lower Than A Level | N | 0 | -176.0 | 567 | 24 | 23.625000 | Pass |
| 38053 | AAA | 2013J | 35-55 | A Level or Equivalent | N | 0 | -110.0 | 727 | 32 | 22.718750 | Pass |
| 45462 | AAA | 2013J | 0-35 | HE Qualification | N | 0 | -67.0 | 561 | 15 | 37.400000 | Pass |
| 45642 | AAA | 2013J | 0-35 | A Level or Equivalent | N | 0 | -29.0 | 423 | 20 | 21.150000 | Pass |
| 52130 | AAA | 2013J | 0-35 | A Level or Equivalent | N | 0 | -33.0 | 366 | 19 | 19.263158 | Pass |
| 53025 | AAA | 2013J | 55<= | Post Graduate Qualification | N | 0 | -179.0 | 571 | 20 | 28.550000 | Pass |
The cell below will test your solution for create_student_features (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(create_student_features),
ex_name='create_student_features',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to create_student_features did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
all_students = utils.load_object_from_publicdata('all_students')
all_student_assessments = utils.load_object_from_publicdata('all_student_assessments')
In the following exercise, you'll create binary labels indicating whether students submitted A1 (1) or did not submit (0). These labels must be merged with the features from Exercise 4. Together, the features and labels create our machine learning dataset, with each row representing a student's behavioral profile and their eventual submission outcome.
create_submission_labels
Example: we have defined create_submission_labels as follows:
Generate binary submission labels indicating whether students submitted the first assessment within a specific temporal window.
THIS IS A FREE EXERCISE. Submit your exam to obtain your free point.
Inputs:
- students_df (DataFrame): Students from studentInfo table with columns including: code_module, code_presentation, id_student
- student_assessments_df (DataFrame): Assessment submissions from studentAssessment table merged with assessments table, containing columns including: id_student, id_assessment, date_submitted, code_module, code_presentation
- first_assessment_id (int): ID of the first assessment to analyze
- label_start_day (int): Start of the label window (inclusive)
- label_end_day (int): End of the label window (inclusive)

Return: A pandas DataFrame with the following columns:
- id_student: Student identifier
- code_module: Course module code
- code_presentation: Course presentation code
- label: Binary label (1 if submitted within window, 0 otherwise)

Requirements/steps:
- Filter student_assessments_df to only include records where id_assessment equals first_assessment_id
- Keep only submissions where label_start_day <= date_submitted <= label_end_day
- Deduplicate on id_student, code_module, and code_presentation
- Left-merge students_df (on left) with the submission DataFrame (from step 2):
  - keep id_student, code_module, and code_presentation from students_df
  - join on id_student, code_module, and code_presentation
  - all students in students_df are included, even those who never submitted
- Create a label column which corresponds to 1 for students who submitted within the window, 0 otherwise
- Return columns id_student, code_module, code_presentation and label, all indexed by id_student
### Solution - Exercise 5
def create_submission_labels(students_df, student_assessments_df, first_assessment_id, label_start_day, label_end_day):
filtered_assessments = student_assessments_df[
student_assessments_df['id_assessment'] == first_assessment_id
]
submitted_in_window = filtered_assessments[
(filtered_assessments['date_submitted'] >= label_start_day) &
(filtered_assessments['date_submitted'] <= label_end_day)
][['id_student', 'code_module', 'code_presentation']].drop_duplicates()
result = students_df[['id_student', 'code_module', 'code_presentation']].copy()
result = result.merge(
submitted_in_window,
on=['id_student', 'code_module', 'code_presentation'],
how='left',
indicator=True
)
result['label'] = (result['_merge'] == 'both').astype(int)
result = result.drop(columns=['_merge'])
result = result.set_index('id_student')
return result
### Demo function call
demo_students = all_students[
(all_students['code_module'] == 'AAA') &
(all_students['code_presentation'] == '2013J')
].head(50)
demo_student_assessments = all_student_assessments[
(all_student_assessments['code_module'] == 'AAA') &
(all_student_assessments['code_presentation'] == '2013J')
]
course_assessments = demo_student_assessments.sort_values('date_submitted')
first_assessment_id = course_assessments.iloc[0]['id_assessment'] if len(course_assessments) > 0 else None
result = create_submission_labels(
demo_students,
demo_student_assessments,
first_assessment_id,
label_start_day=0,
label_end_day=30
)
print(result['label'].iloc[:10])
The demo should display this printed output.
id_student
11391 1
28400 1
30268 0
31604 1
32885 1
38053 1
45462 1
45642 1
52130 1
53025 1
Name: label, dtype: int64
The test cell below will always pass. Please submit to collect your free points for create_submission_labels (exercise 5).
### Test Cell - Exercise 5
print('Passed! Please submit.')
A difficulty with the Ouroboros approach is class imbalance. 75% or more of students submit A1. This imbalance can lead to misleading evaluation metrics. For example, if 90% of students submit A1, a model that always predicts "submit" would achieve 90% accuracy but would be useless for identifying at-risk students.
This exercise quantifies the imbalance ratio and calculates how many samples we have in each class. Understanding this imbalance is essential because it affects which evaluation metrics we should use.
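To see why imbalance distorts accuracy, here is a minimal sketch with made-up numbers (a hypothetical cohort, not the OULAD data): a classifier that always predicts "will submit" looks accurate but flags no one.

```python
import numpy as np

# Hypothetical cohort: 90 of 100 students submit A1 (label 1), 10 do not (label 0).
y_true = np.array([1] * 90 + [0] * 10)

# Naive classifier: always predict "will submit".
y_naive = np.ones_like(y_true)

accuracy = (y_naive == y_true).mean()
print(accuracy)  # 0.9 -- high accuracy...

# ...yet it identifies none of the 10 at-risk students.
at_risk_found = int(np.sum((y_naive == 0) & (y_true == 0)))
print(at_risk_found)  # 0
```

This is exactly the failure mode that precision, recall, and F1 (later exercises) are designed to expose.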
analyze_class_imbalance
Your task: define analyze_class_imbalance as follows:
Analyze class imbalance in binary classification labels. In early prediction scenarios, positive examples (students who submit) are typically much fewer than negative examples (students who don't submit), creating severe class imbalance. Understanding this imbalance is crucial for model evaluation and metric selection.
Inputs:
- labels (Series): Binary labels (0 or 1) indexed by id_student

Return: A Python dictionary containing the following keys and values:
- 'total_samples': Total number of labels (integer)
- 'positive_count': Count of labels with value 1 (integer)
- 'negative_count': Count of labels with value 0 (integer)
- 'positive_ratio': Ratio of positive samples to total samples (float, rounded to 4 decimal places)
- 'imbalance_ratio': Ratio of majority class to minority class (float, rounded to 4 decimal places)
- 'majority_class': The more frequent class label (integer, either 0 or 1)

Requirements/steps:
- Count the total number of labels as total_samples
- Count labels equal to 1 as positive_count
- Count labels equal to 0 as negative_count
- Compute positive_count / total_samples, rounded to 4 decimal places, and name it positive_ratio
- Compute max(positive_count, negative_count) / min(positive_count, negative_count), rounded to 4 decimal places, and name it imbalance_ratio
- If either positive_count or negative_count is 0, then set imbalance_ratio to inf
- Determine the more frequent class as majority_class
- In case of a tie, set majority_class to 1
### Solution - Exercise 6
def analyze_class_imbalance(labels):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
total_samples = len(labels)
positive_count = int(labels.sum())
negative_count = total_samples - positive_count
positive_ratio = round(positive_count / total_samples, 4)
if negative_count == 0:
imbalance_ratio = float('inf')
elif positive_count == 0:
imbalance_ratio = float('inf')
else:
imbalance_ratio = round(max(positive_count, negative_count) / min(positive_count, negative_count), 4)
majority_class = 1 if positive_count >= negative_count else 0
return {
'total_samples': total_samples,
'positive_count': positive_count,
'negative_count': negative_count,
'positive_ratio': positive_ratio,
'imbalance_ratio': imbalance_ratio,
'majority_class': majority_class
}
### END SOLUTION
### Demo function call
demo_labels = pd.Series(
[1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1] * 2,
index=range(11391, 11391 + 40)
)
result = analyze_class_imbalance(demo_labels)
pprint(result)
The demo should display this printed output.
{'imbalance_ratio': 4.0,
'majority_class': 1,
'negative_count': 8,
'positive_count': 32,
'positive_ratio': 0.8,
'total_samples': 40}
The cell below will test your solution for analyze_class_imbalance (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=analyze_class_imbalance,
ex_name='analyze_class_imbalance',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to analyze_class_imbalance did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
all_student_features = utils.load_object_from_publicdata('all_student_features')
all_submission_labels = utils.load_object_from_publicdata('all_submission_labels')
Not all behavioral metrics are equally informative. In the following exercise, you will compute correlations between VLE engagement features and submission labels to show which patterns have the strongest predictive power.
By identifying which features correlate most with submission outcomes, we can focus on the most relevant behavioral signals. This step is crucial for building an effective predictive model.
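As a quick illustration of the hinted technique (with made-up numbers, not the course data), pandas computes a Pearson correlation between a feature column and a label Series via `.corr()`:

```python
import pandas as pd

# Hypothetical engagement feature and binary outcome for six students.
clicks = pd.Series([20, 150, 35, 200, 10, 180], name='total_clicks')
label = pd.Series([0, 1, 0, 1, 0, 1], name='label')

r = clicks.corr(label)  # Pearson correlation by default
print(round(r, 4))      # strongly positive: high-click students tend to submit
```

In the exercise you apply this per feature, take absolute values, and sort to rank features by predictive signal.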
compute_feature_correlations
Your task: define compute_feature_correlations as follows:
Compute Pearson correlation coefficients between numerical features and binary outcome labels. This analysis identifies which features have the strongest linear relationships with the outcome, guiding feature selection for predictive modeling.
Inputs:
- features_df (DataFrame): Student features with id_student as the index
- outcome_series (Series): Binary outcomes (0 or 1) with id_student as the index
- numerical_features (list): List of column names for numerical features to analyze

Return: A pandas DataFrame with three columns:
- feature_name: Name of the feature (string)
- correlation: Pearson correlation coefficient with the outcome (float, rounded to 4 decimal places)
- abs_correlation: Absolute value of the correlation (float, rounded to 4 decimal places)

The DataFrame should be sorted by abs_correlation with the strongest correlations first.
Requirements/steps:
- Align features_df and outcome_series using their shared id_student index
- For each feature in numerical_features, compute its Pearson correlation with outcome_series
- Build the result DataFrame with columns feature_name, correlation, and abs_correlation as specified above
- Sort by abs_correlation in descending order
Hint: Use pandas .corr() method to calculate Pearson correlations.
### Solution - Exercise 7
def compute_feature_correlations(features_df, outcome_series, numerical_features):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
indexes=list(set(outcome_series.index) & set(features_df.index))
aligned_features = features_df.loc[indexes]
aligned_outcome = outcome_series.loc[indexes]
correlations = []
for feature in numerical_features:
if feature not in aligned_features.columns:
continue
feature_data = aligned_features[feature]
valid_mask = feature_data.notna() & aligned_outcome.notna()
feature_clean = feature_data[valid_mask]
outcome_clean = aligned_outcome[valid_mask]
if len(feature_clean) > 1 and feature_clean.std() > 0:
corr = feature_clean.corr(outcome_clean)
correlations.append({
'feature_name': feature,
'correlation': round(corr, 4),
'abs_correlation': round(abs(corr), 4)
})
result_df = pd.DataFrame(correlations)
result_df = result_df.sort_values('abs_correlation', ascending=False).reset_index(drop=True)
return result_df
### END SOLUTION
### Demo function call
demo_features = pd.DataFrame({
'num_of_prev_attempts': [0, 1, 0, 2, 0, 1, 3, 0, 1, 0],
'days_registered_before_start': [-50, -30, -20, -60, -10, -45, -25, -15, -35, -5],
'total_clicks': [120, 80, 150, 45, 200, 90, 60, 180, 70, 160],
'total_active_days': [15, 10, 18, 5, 20, 12, 8, 19, 9, 17],
'avg_clicks_per_day': [8.0, 8.0, 8.3, 9.0, 10.0, 7.5, 7.5, 9.5, 7.8, 9.4]
}, index=pd.Index(range(11391, 11401), name='id_student'))
demo_labels = pd.Series(
[1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
index=pd.Index(range(11396, 11406), name='id_student')
)
demo_numerical_features = [
'num_of_prev_attempts',
'days_registered_before_start',
'total_clicks',
'total_active_days',
'avg_clicks_per_day'
]
demo_result_compute_feature_correlations = compute_feature_correlations(
demo_features,
demo_labels,
demo_numerical_features
)
demo_result_compute_feature_correlations
The demo should display this output.
| feature_name | correlation | abs_correlation | |
|---|---|---|---|
| 0 | total_active_days | 0.8474 | 0.8474 |
| 1 | total_clicks | 0.7873 | 0.7873 |
| 2 | num_of_prev_attempts | -0.7454 | 0.7454 |
| 3 | avg_clicks_per_day | 0.6168 | 0.6168 |
| 4 | days_registered_before_start | 0.2887 | 0.2887 |
The cell below will test your solution for compute_feature_correlations (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(compute_feature_correlations),
ex_name='compute_feature_correlations',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to compute_feature_correlations did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
all_students = utils.load_object_from_publicdata('all_students')
all_student_assessments = utils.load_object_from_publicdata('all_student_assessments')
In the following exercise, you will implement the core Ouroboros temporal windowing concept. Given a current_day and the cutoff_day for A1, we must split students into training and testing sets based on their submission timing.
Training set: Students who already submitted A1 by current_day. Their features (computed from behavior before they submitted) and their submission timing (early vs. late) provide labeled examples for learning.
Testing set: Students who haven't submitted yet. These are the at-risk candidates we want to identify. We compute their current behavioral features and predict whether they will eventually submit.
This is "self-learning": recent submitters train the model to recognize patterns that predict outcomes for students still enrolled. As each day passes, more students move from the test set to the training set, continuously improving our predictions.
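The daily migration can be sketched with a toy example (synthetic submission days, not the exercise's data): as current_day advances, each new submitter leaves the test pool and joins the training pool.

```python
import pandas as pd
import numpy as np

# Hypothetical A1 submission days for six students (NaN = no submission yet).
subs = pd.Series({'S1': 5.0, 'S2': 10.0, 'S3': 18.0, 'S4': 25.0,
                  'S5': np.nan, 'S6': np.nan})

for current_day in (8, 15, 22):
    in_train = subs <= current_day            # submitters so far form the training pool
    train_ids = list(subs.index[in_train])
    test_ids = list(subs.index[~in_train])    # everyone else is still a test candidate
    print(current_day, train_ids, test_ids)
```

Note that `NaN <= current_day` evaluates to False, so never-submitters stay in the test pool at every step.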
This is a DEBUG EXERCISE. We have provided the code and requirements. The task is to identify and fix any issues to ensure the function works as intended.
Example: demo case (current_day=15, cutoff_day=25)
Day: 0────────[current_day=15]───────────[cutoff_day=25]──>
Student Submission day Label
──────── ────────────────────────────────────────────────── ──────────
S1 ●(5) | | train = 1
S2 ●──(10) | | train = 1
S3 ●──(12) | | train = 1
S4 | | test = 0 (never)
S5 | | test = 0 (never)
S6 | ●─(18) | test = 1
S7 | ●──────(25) | test = 1
S8 | | ●─(30) test = 0 (past cutoff)
S9 | | test = 0 (never)
S10 | | test = 0 (never)
◄──── TRAIN ────►◄────── TEST WINDOW ──────►
- ● marks the day a student submitted A1; · means no submission yet at current_day
- Students who submitted by current_day → TRAIN (train_labels)
- Students who have not submitted by current_day → TEST (test_labels); label = 1 only if submission falls within the test window (current_day, cutoff_day]

DEBUG_create_temporal_split
Your task: define DEBUG_create_temporal_split as follows:
Implement the temporal windowing methodology to create train/test splits for self-learning prediction.
Note: This is a debugging exercise!
Inputs:
- students_df (DataFrame): Student records with columns: id_student, code_module, code_presentation
- student_assessments_df (DataFrame): Assessment submissions with columns: id_student, id_assessment, date_submitted, code_module, code_presentation
- first_assessment_id (int): ID of the first assessment to predict
- current_day (int): Current day for making predictions
- cutoff_day (int): Deadline for assessment submission

Return: A tuple containing two pandas Series: (train_labels, test_labels)
- train_labels: Series indexed by id_student for all course students
  - label = 1 if the student submitted by current_day (inclusive)
  - label = 0 if the student had not submitted by current_day
- test_labels: Series indexed by id_student
  - contains only students who had not submitted by current_day (i.e., students with train_labels == 0)
  - label = 1 if the student submitted after current_day and by cutoff_day
  - label = 0 if the student never submitted or submitted after cutoff_day

Requirements/steps:
- Extract code_module and code_presentation from students_df first row. Filter both students_df and student_assessments_df to keep only rows where code_module and code_presentation match these extracted values.
- Filter student_assessments_df to only include records where id_assessment equals first_assessment_id
- From course_assessments (step 4 result), identify students who submitted by current_day (inclusive): date_submitted <= current_day
- For each student in course_students (step 3 result), assign label = 1 if they appear in the submitted set, 0 otherwise
- Index train_labels by id_student and set index.name = 'id_student'
- Select the students in course_students whose id_student is NOT in the submitted-by-current set (i.e., students with train_labels == 0); this subset defines both the test population and the test Series index
- Identify students who submitted after current_day and by cutoff_day (inclusive): current_day < date_submitted <= cutoff_day
- Assign label = 1 for those students, 0 for students who never submitted or submitted after cutoff_day
- Index test_labels by id_student and set index.name = 'id_student'
- Ensure test_labels contains only students with train_labels == 0
- Return (train_labels, test_labels)
### Solution - Exercise 8
def DEBUG_create_temporal_split(students_df, student_assessments_df,
first_assessment_id, current_day, cutoff_day):
###
### YOUR CODE HERE
###
# target_module = students_df['code_module'].iloc[0]
# target_presentation = students_df['code_presentation'].iloc[0]
# course_students = students_df.copy()
# course_assessments = student_assessments_df[
# (student_assessments_df['code_module'] == target_module) &
# (student_assessments_df['code_presentation'] == target_presentation) &
# (student_assessments_df['id_assessment'] == first_assessment_id)
# ].copy()
# submitted_by_current = course_assessments[
# course_assessments['date_submitted'] < current_day
# ]['id_student'].unique()
# submitted_after_current = course_assessments[
# course_assessments['date_submitted'] > current_day
# ]['id_student'].unique()
# train_labels = pd.Series(
# course_students['id_student'].isin(submitted_by_current).astype(int).values,
# index=course_students['id_student']
# )
# test_labels = pd.Series(
# course_students['id_student'].isin(submitted_after_current).astype(int).values,
# index=course_students['id_student']
# )
# test_labels.index.name = 'id_student'
# return train_labels, test_labels
### BEGIN SOLUTION
target_module = students_df['code_module'].iloc[0]
target_presentation = students_df['code_presentation'].iloc[0]
course_students = students_df[
(students_df['code_module'] == target_module) &
(students_df['code_presentation'] == target_presentation)
].copy()
course_assessments = student_assessments_df[
(student_assessments_df['code_module'] == target_module) &
(student_assessments_df['code_presentation'] == target_presentation) &
(student_assessments_df['id_assessment'] == first_assessment_id)
].copy()
submitted_by_current = course_assessments[
course_assessments['date_submitted'] <= current_day
]['id_student'].unique()
submitted_after_current = course_assessments[
(course_assessments['date_submitted'] > current_day) &
(course_assessments['date_submitted'] <= cutoff_day)
]['id_student'].unique()
train_labels = pd.Series(
course_students['id_student'].isin(submitted_by_current).astype(int).values,
index=course_students['id_student']
)
train_labels.index.name = 'id_student'
students_not_submitted_by_current = course_students[
~course_students['id_student'].isin(submitted_by_current)
]
test_labels = pd.Series(
students_not_submitted_by_current['id_student'].isin(submitted_after_current).astype(int).values,
index=students_not_submitted_by_current['id_student']
)
test_labels.index.name = 'id_student'
return train_labels, test_labels
### END SOLUTION
### Demo function call
demo_students = pd.DataFrame({
'id_student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'code_module': ['AAA'] * 10,
'code_presentation': ['2013J'] * 10
})
demo_student_assessments = pd.DataFrame({
'id_student': [1, 2, 3, 6, 7, 8],
'id_assessment': [1001] * 6,
'date_submitted': [5, 10, 12, 18, 25, 30],
'code_module': ['AAA'] * 6,
'code_presentation': ['2013J'] * 6
})
first_assessment_id = 1001
current_day = 15
cutoff_day = 25
train_labels, test_labels = DEBUG_create_temporal_split(
demo_students, demo_student_assessments,
first_assessment_id, current_day, cutoff_day
)
print('Training set:')
pprint(train_labels)
print('\nTesting set:')
pprint(test_labels)
The demo should display this printed output.
Training set:
id_student
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
dtype: int64
Testing set:
id_student
4 0
5 0
6 1
7 1
8 0
9 0
10 0
dtype: int64
The cell below will test your solution for DEBUG_create_temporal_split (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_create_temporal_split,
ex_name='DEBUG_create_temporal_split',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=40)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_create_temporal_split did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Now we need to evaluate predictions on this imbalanced dataset. However, standard accuracy is misleading when 90% of students submit A1: a naive classifier that always predicts "will submit" achieves 90% accuracy but cannot help identify at-risk students.
Instead, you will calculate precision, recall, and F1-score using a 0.5 probability threshold. These metrics balance the tradeoff between identifying at-risk students and avoiding false alarms. The F1-score combines precision and recall, providing a single metric that reflects model quality on imbalanced classification tasks.
Use the following formulas to compute these metrics:
- precision = True Positives / (True Positives + False Positives)
- recall = True Positives / (True Positives + False Negatives)
- f1_score = 2 * (precision * recall) / (precision + recall)
calculate_metrics
Your task: define calculate_metrics as follows:
Calculate standard binary classification metrics (precision, recall, and F1-score) for predicted probabilities.
Inputs:
- y_pred_proba (array): Predicted probabilities for the positive class (values between 0.0 and 1.0)
- y_true (array): True binary labels (0 or 1)

Return: A Python dictionary containing three metrics:
- 'precision': Proportion of positive predictions that are correct (float, rounded to 4 decimal places)
- 'recall': Proportion of actual positives correctly identified (float, rounded to 4 decimal places)
- 'f1_score': Harmonic mean of precision and recall (float, rounded to 4 decimal places)

Requirements/steps:
- Convert probabilities to binary predictions: predict 1 where y_pred_proba >= 0.5, and 0 where y_pred_proba < 0.5
- Compute precision = True Positives / (True Positives + False Positives)
- Compute recall = True Positives / (True Positives + False Negatives)
- Compute f1_score = 2 * (precision * recall) / (precision + recall)
Hint: The metric should be 0.0 when division by zero occurs.
### Solution - Exercise 9
def calculate_metrics(y_pred_proba, y_true):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
y_pred_proba = np.array(y_pred_proba)
y_true = np.array(y_true)
y_pred = (y_pred_proba >= 0.5).astype(int)
true_positives = np.sum((y_pred == 1) & (y_true == 1))
false_positives = np.sum((y_pred == 1) & (y_true == 0))
true_negatives = np.sum((y_pred == 0) & (y_true == 0))
false_negatives = np.sum((y_pred == 0) & (y_true == 1))
if true_positives + false_positives > 0:
precision = true_positives / (true_positives + false_positives)
else:
precision = 0.0
if true_positives + false_negatives > 0:
recall = true_positives / (true_positives + false_negatives)
else:
recall = 0.0
if precision + recall > 0:
f1_score = 2 * (precision * recall) / (precision + recall)
else:
f1_score = 0.0
return {
'precision': round(precision, 4),
'recall': round(recall, 4),
'f1_score': round(f1_score, 4)
}
### END SOLUTION
### Demo function call
from pprint import pprint
demo_y_pred_proba = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.3, 0.2, 0.1, 0.05])
demo_y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
result = calculate_metrics(demo_y_pred_proba, demo_y_true)
pprint(result)
The demo should display this printed output.
{'f1_score': 0.8, 'precision': 0.8, 'recall': 0.8}
The cell below will test your solution for calculate_metrics (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 9
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=calculate_metrics,
ex_name='calculate_metrics',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to calculate_metrics did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Universities can't provide intensive support to every student. They must prioritize due to limited resources. We need to be able to answer the question, "If we can only intervene with a certain percent of students we are confident are at-risk, how accurate are our top predictions?"
Top-K Precision, which you will implement in the following exercise, answers this question. With K=10%, we rank all students by predicted risk and evaluate only the top 10%. This will allow student support teams to make the most of their limited resources.
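For intuition, here is a hand-worked sketch with synthetic numbers (not the exercise's data): with 10 students and K=20%, we evaluate only the two highest-probability predictions.

```python
import numpy as np

# Synthetic predictions for 10 students; K = 20% -> evaluate the top 2.
proba = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.3, 0.6, 0.5, 0.05])
truth = np.array([1,   0,   1,   0,   0,   0,   1,   1,   0,   0])

k = int(np.ceil(0.20 * len(proba)))           # k = 2
top_k = np.argsort(proba)[::-1][:k]           # indices of the two largest probabilities
topk_precision = truth[top_k].sum() / k
print(topk_precision)  # 0.5: one of the two top-ranked predictions is a true positive
```

Note the descending sort (`[::-1]`) and the use of `ceil` when converting the percentage to a count; both are common bugs in Top-K implementations.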
This is a DEBUG EXERCISE. We have provided the code and requirements. The task is to identify and fix any issues to ensure the function works as intended.
DEBUG_calculate_topk_precision
Your task: define DEBUG_calculate_topk_precision as follows:
Calculate precision among the top K% of predictions ranked by predicted probability.
Note: This is a debugging exercise!
Inputs:
- y_pred_proba (array): Predicted probabilities for the positive class (values between 0.0 and 1.0)
- y_true (array): True binary labels (0 or 1)
- k_percent (float): Percentage of top predictions to consider (e.g., 0.10 for top 10%, 0.25 for top 25%)

Return: A float representing the top-K precision, rounded to 2 decimal places.
Requirements/steps:
- Compute k = ceil(k_percent * total_number_of_samples)
- Sort y_pred_proba with the highest probabilities first
- Compute top_k_precision = number_of_true_positives_in_top_k / k
### Solution - Exercise 10
def DEBUG_calculate_topk_precision(y_pred_proba, y_true, k_percent):
    ###
    ### YOUR CODE HERE
    ###
    # y_pred_proba = np.array(y_pred_proba)
    # y_true = np.array(y_true)
    # k = int(len(y_pred_proba) * k_percent)
    # if k < 1:
    #     k = 1
    # sorted_indices = np.argsort(y_pred_proba)
    # top_k_indices = sorted_indices[:k]
    # top_k_labels = y_true[top_k_indices]
    # precision = np.sum(top_k_labels) / k
    # return precision
    ### BEGIN SOLUTION
    y_pred_proba = np.array(y_pred_proba)
    y_true = np.array(y_true)
    # Fix 1: round k up with ceil rather than truncating with int()
    k = int(np.ceil(len(y_pred_proba) * k_percent))
    if k < 1:
        k = 1
    # Fix 2: reverse the argsort so the highest probabilities come first
    sorted_indices = np.argsort(y_pred_proba)[::-1]
    top_k_indices = sorted_indices[:k]
    top_k_labels = y_true[top_k_indices]
    precision = np.sum(top_k_labels) / k
    # Fix 3: round to 2 decimal places as required
    return round(precision, 2)
    ### END SOLUTION
### Demo function call
demo_y_pred_proba = np.array([0.95, 0.88, 0.82, 0.76, 0.65, 0.58, 0.45, 0.32, 0.21, 0.15,
                              0.92, 0.84, 0.73, 0.61, 0.54, 0.48, 0.38, 0.28, 0.18, 0.09])
demo_y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
                        1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
demo_k_percent = 0.10
result = DEBUG_calculate_topk_precision(demo_y_pred_proba, demo_y_true, demo_k_percent)
print(f"Top-K Precision (K={demo_k_percent*100}%): {result}")
The demo should display this printed output.
Top-K Precision (K=10.0%): 1.0
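You can sanity-check this result by hand: with 20 samples and K=10%, k = ceil(20 * 0.10) = 2; the two highest probabilities are 0.95 and 0.92, and both of those students have true label 1, so precision is 2/2 = 1.0. A quick verification (assuming NumPy from the global imports):

```python
import numpy as np

# Hand-check of the demo output: k = ceil(20 * 0.10) = 2
proba = np.array([0.95, 0.88, 0.82, 0.76, 0.65, 0.58, 0.45, 0.32, 0.21, 0.15,
                  0.92, 0.84, 0.73, 0.61, 0.54, 0.48, 0.38, 0.28, 0.18, 0.09])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
                   1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
k = int(np.ceil(len(proba) * 0.10))
top_k = np.argsort(proba)[::-1][:k]            # two highest scores: 0.95 and 0.92
print(labels[top_k], labels[top_k].sum() / k)  # -> [1 1] 1.0
```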
The cell below will test your solution for DEBUG_calculate_topk_precision (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 10
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_calculate_topk_precision,
                                          ex_name='DEBUG_calculate_topk_precision',
                                          key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
                                          n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_calculate_topk_precision did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')