Spring 2026 Midterm 2: Predicting At-Risk Students with Learning Analytics Data
Version 1.0.0
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of SQL queries and aggregations, pandas feature engineering, and NumPy classification metrics. It has 11 exercises, numbered 0 to 10. There are 22 available points; however, to earn 100%, the threshold is 14 points. (Therefore, once you hit 14 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head, or chunks of rows at a time).
Exercise point breakdown:
Exercise 0: 2 point(s)
Exercise 1: 2 point(s)
Exercise 2: 3 point(s)
Exercise 3: 2 point(s)
Exercise 4: 2 point(s)
Exercise 5: 1 point(s)
Exercise 6: 2 point(s)
Exercise 7: 3 point(s)
Exercise 8: 3 point(s)
Exercise 9: 1 point(s)
Exercise 10: 1 point(s)
Final reminders:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
from pprint import pprint
import sqlite3
import pandas as pd
import numpy as np
DATABASE_URL = r'./resource/asnlib/publicdata/oulad2.db'
conn = sqlite3.connect(DATABASE_URL)
This exam explores early identification of at-risk students using the Open University Learning Analytics Dataset (OULAD). Student dropout is a critical challenge in distance education. At the Open University, approximately 78% of students fail to complete their degree. The highest dropout rates occur within the first few weeks of a course, making early intervention essential but difficult.
Traditional predictive models rely on legacy data from previous courses to identify students at risk of failing or withdrawing. However, this approach cannot be used for new courses since they lack historical data. The Ouroboros methodology, introduced by Hlosta et al. (2017), proposes a solution: a "self-learning" model that uses data from the current course rather than legacy data.
In this exam, you will implement components of the Ouroboros approach for early risk prediction. The methodology centers on a critical insight: the first assessment (A1) serves as a key indicator of student success. Students who submit A1 have dramatically different outcomes than those who don't, and their behaviors can be used to predict which currently-enrolled students are at risk of not submitting.
The exam has three parts:
Explore the OULAD Database Use SQL queries to understand course-level outcomes, assessment submission patterns, and Virtual Learning Environment (VLE) engagement data that reveal early behavioral indicators of student risk.
Engineer Predictive Features Use Python, Pandas, and NumPy to transform raw student interaction data into meaningful features (VLE clicks, days active, registration timing) that capture engagement patterns before assessment deadlines.
Implement Temporal Modeling Implement the Ouroboros "self-learning" concept by creating temporal train/test splits and calculating evaluation metrics that account for severe class imbalance inherent in early prediction windows.
By the end of this exam, you will have implemented key parts of the pipeline that identifies at-risk students without relying on legacy course data.
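Part three hinges on metrics that remain informative under class imbalance. As a quick refresher, here is a minimal NumPy sketch of precision, recall, and F1 (the function name and toy arrays are invented for illustration; this is not part of any exercise):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# With 90% negatives, predicting "not at risk" for everyone scores 90%
# accuracy but recall of 0 -- which is why these metrics matter here.
y_true = np.array([0]*9 + [1])
y_pred = np.array([0]*8 + [1, 1])
print(precision_recall_f1(y_true, y_pred))  # (0.5, 1.0, 0.666...)
```

The early prediction windows in this exam have few positive (at-risk) examples, so accuracy alone would reward trivially optimistic models.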
The Open University Learning Analytics Dataset (OULAD) contains anonymized data from 32,593 students across 7 modules and 22 course presentations spanning 2013-2014.
The database consists of 7 tables:
Core Student & Course Tables:
- studentInfo - Student demographics and final course results
- courses - Module and presentation
- studentRegistration - Enrollment dates and withdrawal timestamps

Assessment Tables:
- assessments - Assessment definitions including type, deadlines, and weights
- studentAssessment - Individual student scores and submission dates

VLE Interaction Tables:
- vle - Learning materials (activity_type, availability weeks) organized by module
- studentVle - Daily student interaction logs

Schema Diagram:

The OULAD contains 32,593 students across 22 course presentations, with over 10 million VLE interaction records and detailed assessment submission data.
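If you want to verify the schema for yourself, SQLite's catalog makes it easy. This sketch uses an in-memory database as a stand-in for oulad2.db, with a hypothetical subset of the vle columns; the same two queries work against the real `conn`:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for oulad2.db; the `vle` columns here are
# an illustrative subset, not the full schema.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute(
    "CREATE TABLE vle (id_site INTEGER, code_module TEXT, activity_type TEXT)"
)

# List every table from SQLite's catalog
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name",
    conn_demo
)
print(tables['name'].tolist())

# PRAGMA table_info lists each column's name and declared type
cols = pd.read_sql("PRAGMA table_info(vle)", conn_demo)
print(cols[['name', 'type']])
```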
Pass rates vary significantly across courses. Some courses see less than 50% of students pass while others exceed 60%. Understanding these baseline statistics provides context into why early intervention matters and helps us identify which courses might benefit most from predictive modeling.
Before we can analyze how students interact with learning materials, we need to understand what types of materials exist and how they're distributed across courses.
The following exercise explores the distribution of learning materials. This foundational knowledge shapes how we'll aggregate VLE interactions into meaningful features for prediction.
vle_activity_summary
Your task: define vle_activity_summary_query as follows:
Write a SQL query that summarizes Virtual Learning Environment (VLE) activity types and their usage patterns across all courses.
Inputs: None
Return: query: a query that produces a table with 3 columns:
- activity_type: The type of VLE activity
- resource_count: Total number of VLE resources of each activity type across all courses
- courses_using: Number of distinct course presentations utilizing each activity type

Requirements/steps:
- Query the vle table
- Group by activity_type
- resource_count: count all VLE entries of each activity_type
- courses_using: count distinct courses (combination of code_module and code_presentation) for each activity_type
- Order by resource_count in descending order, then courses_using in ascending order
- Return the columns activity_type, resource_count, courses_using in that order

### Solution - Exercise 0
vle_activity_summary_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
vle_activity_summary_query = '''
SELECT activity_type,
COUNT(*) AS resource_count,
COUNT(DISTINCT code_module || '_' || code_presentation) AS courses_using
FROM vle
GROUP BY activity_type
ORDER BY resource_count DESC,
courses_using ASC;
'''
### END SOLUTION
### Demo function call
demo_result_vle_activity_summary = pd.read_sql(vle_activity_summary_query, conn)
demo_result_vle_activity_summary
The demo should display this output.
| | activity_type | resource_count | courses_using |
|---|---|---|---|
| 0 | resource | 807 | 4 |
| 1 | subpage | 122 | 4 |
| 2 | oucontent | 77 | 4 |
| 3 | forumng | 56 | 4 |
| 4 | url | 50 | 4 |
| 5 | quiz | 19 | 4 |
| 6 | oucollaborate | 6 | 3 |
| 7 | glossary | 5 | 4 |
| 8 | questionnaire | 4 | 1 |
| 9 | homepage | 4 | 4 |
| 10 | sharedsubpage | 3 | 3 |
| 11 | ouelluminate | 1 | 1 |
The cell below will test your solution for vle_activity_summary (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 0
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(vle_activity_summary_query),
ex_name='vle_activity_summary',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to vle_activity_summary did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The first assessment (A1) is an important indicator. Submission of A1 correlates strongly with eventual course completion. The following exercise examines this relationship by analyzing submission patterns across courses. You will explore differences between students who submit A1 early versus those who submit late. This behavioral difference is fundamental to the self-learning approach: early submitters provide the signal that enables prediction of which current students are at risk.
course_pass_rate
Your task: define course_pass_rate_query as follows:
Write a SQL query that calculates pass rate statistics for each course presentation.
Inputs: None
Return: query: a query that produces a table with 5 columns:
- code_module: The course module code
- code_presentation: The presentation code
- total_students: Total number of students enrolled in the course presentation
- passed_students: Number of students who achieved 'Pass' or 'Distinction' as their final result
- pass_rate: Percentage of students who passed (rounded to 2 decimal places)

Requirements/steps:
- Join the courses table with the studentInfo table using both code_module and code_presentation
- Group by code_module and code_presentation
- Compute total_students by counting all students in each grouping
- Compute passed_students by summing students whose final_result is either 'Pass' or 'Distinction'
- Compute pass_rate as a percentage: (passed_students / total_students) * 100, rounded to 2 decimal places
- Keep only groupings with at least 500 total students
- Order by pass_rate descending, then code_presentation ascending (to break ties deterministically)
- Return the columns code_module, code_presentation, total_students, passed_students, pass_rate in that order

### Solution - Exercise 1
course_pass_rate_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
course_pass_rate_query = '''
SELECT c.code_module,
c.code_presentation,
COUNT(s.id_student) AS total_students,
SUM(CASE WHEN s.final_result IN ('Pass', 'Distinction') THEN 1 ELSE 0 END) AS passed_students,
ROUND(100.0 * SUM(CASE WHEN s.final_result IN ('Pass', 'Distinction') THEN 1 ELSE 0 END) / COUNT(s.id_student), 2) AS pass_rate
FROM courses AS c
JOIN studentInfo AS s
ON c.code_module = s.code_module
AND c.code_presentation = s.code_presentation
GROUP BY c.code_module, c.code_presentation
HAVING COUNT(s.id_student) >= 500
ORDER BY pass_rate DESC, c.code_presentation ASC;
'''
### END SOLUTION
### Demo function call
demo_result_course_pass_rate = pd.read_sql(course_pass_rate_query, conn)
demo_result_course_pass_rate
The demo should display this output.
| | code_module | code_presentation | total_students | passed_students | pass_rate |
|---|---|---|---|---|---|
| 0 | BBB | 2014J | 2292 | 1152 | 50.26 |
| 1 | BBB | 2013J | 2237 | 1072 | 47.92 |
| 2 | BBB | 2013B | 1767 | 803 | 45.44 |
| 3 | BBB | 2014B | 1613 | 727 | 45.07 |
The cell below will test your solution for course_pass_rate (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(course_pass_rate_query),
ex_name='course_pass_rate',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to course_pass_rate did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Not all A1 submitters are equal. Students who submit assessments early versus late show different engagement patterns in the Virtual Learning Environment.
In the following exercise, you'll compare VLE interaction patterns before the median A1 submission date. This reveals behavioral differences that separate engaged students from those who may be struggling. These differences are visible long before the assignment deadline.
DEBUG_early_late_submitters
Your task: define DEBUG_early_late_submitters_query as follows:
Write a SQL query that compares VLE engagement patterns between students who submitted the first assessment early versus late.
Inputs: None
Return: query: a query that produces a table with 5 columns:
- submission_category: Either 'Early' or 'Late' based on submission timing
- student_count: Total number of unique students in each category
- avg_total_clicks: Average total VLE clicks per student (rounded to 2 decimal places)
- avg_days_active: Average number of distinct days with VLE activity (rounded to 2 decimal places)
- avg_submission_day: Average day when the assessment was submitted (rounded to 2 decimal places)

Requirements/steps:
- Join studentAssessment with the assessments table on id_assessment
- Build a subquery over studentVle grouped by id_student, code_module, and code_presentation. Merge it with the joined studentAssessment and assessments tables on id_student, code_module, and code_presentation:
  - total_clicks: SUM of sum_click from the studentVle table
  - days_active: COUNT of DISTINCT date values from the studentVle table
- In the subquery, restrict to code_module = 'BBB' and code_presentation = '2013B'
- 'Early' means date_submitted <= 25; 'Late' means date_submitted > 25
- Filter to id_assessment = 14984 (the first assessment)
- Group by submission_category
- For each submission_category, calculate the four aggregate metrics (student_count, avg_total_clicks (rounded to 2 decimal places), avg_days_active (rounded to 2 decimal places), avg_submission_day (rounded to 2 decimal places))
- Treat missing values as 0 for total_clicks and days_active
- Return the columns submission_category, student_count, avg_total_clicks, avg_days_active, avg_submission_day in that order

### Solution - Exercise 2
# DEBUG_early_late_submitters_query = '''
# SELECT
# CASE
# WHEN sa.date_submitted < 25 THEN 'Early'
# ELSE 'Late'
# END AS submission_category,
# COUNT(sa.id_student) AS student_count,
# AVG(sv.total_clicks) AS avg_total_clicks,
# AVG(sv.days_active) AS avg_days_active,
# AVG(sa.date_submitted) AS avg_submission_day
# FROM studentAssessment sa
# INNER JOIN assessments a ON sa.id_assessment = a.id_assessment
# INNER JOIN (
# SELECT
# id_student,
# code_module,
# code_presentation,
# SUM(sum_click) as total_clicks,
# COUNT(date) as days_active
# FROM studentVle
# GROUP BY id_student, code_module, code_presentation
# ) sv
# ON sa.id_student = sv.id_student
# AND a.code_module = sv.code_module
# AND a.code_presentation = sv.code_presentation
# WHERE sa.id_assessment = 14984
# GROUP BY submission_category
# ORDER BY submission_category;
# '''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
DEBUG_early_late_submitters_query = '''
SELECT
CASE
WHEN sa.date_submitted <= 25 THEN 'Early'
ELSE 'Late'
END AS submission_category,
COUNT(DISTINCT sa.id_student) AS student_count,
ROUND(AVG(COALESCE(sv.total_clicks, 0)), 2) AS avg_total_clicks,
ROUND(AVG(COALESCE(sv.days_active, 0)), 2) AS avg_days_active,
ROUND(AVG(sa.date_submitted), 2) AS avg_submission_day
FROM studentAssessment sa
INNER JOIN assessments a ON sa.id_assessment = a.id_assessment
LEFT JOIN (
SELECT
id_student,
code_module,
code_presentation,
SUM(sum_click) as total_clicks,
COUNT(DISTINCT date) as days_active
FROM studentVle
WHERE code_module = 'BBB'
AND code_presentation = '2013B'
GROUP BY id_student, code_module, code_presentation
) sv
ON sa.id_student = sv.id_student
AND a.code_module = sv.code_module
AND a.code_presentation = sv.code_presentation
WHERE sa.id_assessment = 14984
GROUP BY submission_category
ORDER BY submission_category;
'''
### END SOLUTION
### Demo function call
demo_result_DEBUG_early_late_submitters = pd.read_sql(DEBUG_early_late_submitters_query, conn)
demo_result_DEBUG_early_late_submitters
The demo should display this output.
| | submission_category | student_count | avg_total_clicks | avg_days_active | avg_submission_day |
|---|---|---|---|---|---|
| 0 | Early | 1276 | 1022.25 | 59.48 | 17.45 |
| 1 | Late | 76 | 377.47 | 23.51 | 34.68 |
The cell below will test your solution for DEBUG_early_late_submitters (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 2
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(DEBUG_early_late_submitters_query),
ex_name='DEBUG_early_late_submitters',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_early_late_submitters did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
There are additional early indicators beyond submission patterns that predict at-risk students. VLE engagement metrics, such as total clicks and days active, correlate strongly with course outcomes.
The following exercise introduces the importance of behavioral features. Students with low engagement early in the course likely need intervention.
vle_engagement_risk
Your task: define vle_engagement_risk_query as follows:
Write a SQL query that identifies at-risk students based on low VLE engagement in the first 30 days of a course.
Inputs: None
Return: query: a query that produces a table with 5 columns:
- id_student: The student identifier
- total_clicks: Total sum of VLE clicks across all interactions before day 30
- days_active: Count of distinct days with VLE activity before day 30
- at_risk_flag: Binary flag (0 or 1) indicating risk status
- final_result: The student's final course outcome

Requirements/steps:
- Filter studentVle records to code_module = 'BBB', code_presentation = '2013J', and date < 30
- Join studentVle with studentInfo on id_student, code_module, and code_presentation
- Group by id_student and final_result
- total_clicks: SUM of sum_click values
- days_active: COUNT of DISTINCT date values
- Compute at_risk_flag using a CASE statement: 1 when total_clicks < 50 OR days_active < 5, else 0
- Include the final_result column from the studentInfo table
- Order by id_student ascending, then final_result ascending
- Return the columns id_student, total_clicks, days_active, at_risk_flag, final_result in that order

### Solution - Exercise 3
vle_engagement_risk_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
vle_engagement_risk_query = '''
SELECT sv.id_student,
SUM(sv.sum_click) AS total_clicks,
COUNT(DISTINCT sv.date) AS days_active,
CASE
WHEN SUM(sv.sum_click) < 50 OR COUNT(DISTINCT sv.date) < 5
THEN 1
ELSE 0
END AS at_risk_flag,
s.final_result
FROM studentVle AS sv
JOIN studentInfo AS s
ON sv.id_student = s.id_student
AND sv.code_module = s.code_module
AND sv.code_presentation = s.code_presentation
WHERE sv.code_module = 'BBB'
AND sv.code_presentation = '2013J'
AND sv.date < 30
GROUP BY sv.id_student, s.final_result
ORDER BY sv.id_student ASC, s.final_result ASC;
'''
### END SOLUTION
### Demo function call
demo_result_vle_engagement_risk = pd.read_sql(vle_engagement_risk_query, conn)
demo_result_vle_engagement_risk.head(20)
The demo should display this output.
| | id_student | total_clicks | days_active | at_risk_flag | final_result |
|---|---|---|---|---|---|
| 0 | 23798 | 152 | 12 | 0 | Distinction |
| 1 | 27759 | 123 | 12 | 0 | Fail |
| 2 | 30091 | 219 | 8 | 0 | Pass |
| 3 | 31014 | 409 | 26 | 0 | Withdrawn |
| 4 | 31849 | 414 | 17 | 0 | Pass |
| 5 | 37622 | 88 | 4 | 1 | Pass |
| 6 | 38234 | 34 | 5 | 1 | Fail |
| 7 | 47855 | 5 | 3 | 1 | Withdrawn |
| 8 | 48040 | 307 | 26 | 0 | Pass |
| 9 | 48503 | 121 | 8 | 0 | Pass |
| 10 | 51301 | 502 | 36 | 0 | Withdrawn |
| 11 | 52797 | 280 | 15 | 0 | Pass |
| 12 | 52899 | 448 | 21 | 0 | Pass |
| 13 | 53360 | 22 | 3 | 1 | Withdrawn |
| 14 | 54388 | 42 | 1 | 1 | Withdrawn |
| 15 | 55026 | 118 | 13 | 0 | Pass |
| 16 | 56340 | 4 | 1 | 1 | Fail |
| 17 | 56789 | 124 | 8 | 0 | Pass |
| 18 | 57079 | 112 | 12 | 0 | Fail |
| 19 | 63044 | 74 | 6 | 0 | Fail |
The cell below will test your solution for vle_engagement_risk (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(vle_engagement_risk_query),
ex_name='vle_engagement_risk',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to vle_engagement_risk did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
students_by_course = utils.load_object_from_publicdata('students_by_course')
registrations_by_course = utils.load_object_from_publicdata('registrations_by_course')
vle_by_course = utils.load_object_from_publicdata('vle_by_course')
STUDENT_FEATURES = utils.load_object_from_publicdata('STUDENT_FEATURES')
REGISTRATION_FEATURES = utils.load_object_from_publicdata('REGISTRATION_FEATURES')
RESULT_FEATURES = utils.load_object_from_publicdata('RESULT_FEATURES')
Now we will shift from exploration to implementation. You'll transform raw VLE interaction data into structured features for machine learning and implement the temporal modeling framework that enables self-learning predictions.
The analysis requires that we create features that capture engagement patterns before assessment submissions. We need to aggregate VLE interactions, calculate days active, and extract registration timing given a temporal cutoff day. In the following exercise, you will produce one row per student with engineered behavioral features computed only using data available up to the cutoff day. These features are the foundation for subsequent modeling.
create_student_features
Your task: define create_student_features as follows:
Create a comprehensive student feature matrix that combines demographics, registration timing, and VLE engagement metrics.
Inputs:
- students_df (DataFrame): Students from the studentInfo table with columns code_module, code_presentation, id_student, age_band, highest_education, disability, num_of_prev_attempts, final_result
- registrations_df (DataFrame): Registration records from the studentRegistration table with columns code_module, code_presentation, id_student, date_registration
- vle_df (DataFrame): VLE interaction records from the studentVle table with columns code_module, code_presentation, id_student, date, sum_click
- cutoff_day (int): Temporal cutoff for VLE activity analysis (only include VLE activity before this day)

Return: A pandas DataFrame with the following structure:
- indexed by id_student (student identifier)
- retains code_module and code_presentation for course identification

Requirements/steps:
- Start from students_df using only the columns in STUDENT_FEATURES
- Left-merge registrations_df (on the right) to add registration timing features:
  - merge on code_module, code_presentation, and id_student
  - from the registrations_df dataframe, keep only the columns in REGISTRATION_FEATURES
- Aggregate vle_df:
  - filter vle_df to only include records where date < cutoff_day
  - group by code_module, code_presentation, and id_student
  - total_clicks: SUM of sum_click
  - total_active_days: COUNT of DISTINCT date values
  - compute avg_clicks_per_day, which equals total_clicks/total_active_days
- Left-merge the VLE aggregates on code_module, code_presentation, and id_student
- Fill missing VLE values with 0 so that the dtypes of (total_clicks, total_active_days, avg_clicks_per_day) are (int64, int64, float64)
- Create days_registered_before_start equal to date_registration
- Return the columns in RESULT_FEATURES, indexed by id_student

Note: The preloaded lists STUDENT_FEATURES, REGISTRATION_FEATURES, and RESULT_FEATURES are provided to save typing.
### Solution - Exercise 4
def create_student_features(students_df, registrations_df, vle_df, cutoff_day):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
features = students_df[STUDENT_FEATURES].copy()
features = features.merge(
registrations_df[REGISTRATION_FEATURES],
on=['code_module', 'code_presentation', 'id_student'],
how='left'
)
vle_filtered = vle_df[vle_df['date'] < cutoff_day].copy()
vle_agg = vle_filtered.groupby(['code_module', 'code_presentation', 'id_student']).agg(
total_clicks=('sum_click', 'sum'),
total_active_days=('date', 'nunique')
).reset_index()
vle_agg['avg_clicks_per_day'] = vle_agg['total_clicks'] / vle_agg['total_active_days']
features = features.merge(vle_agg, on=['code_module', 'code_presentation', 'id_student'], how='left')
features['total_clicks'] = features['total_clicks'].fillna(0).astype(int)
features['total_active_days'] = features['total_active_days'].fillna(0).astype(int)
features['avg_clicks_per_day'] = features['avg_clicks_per_day'].fillna(0.0)
features['days_registered_before_start'] = features['date_registration']
result = features[RESULT_FEATURES].set_index('id_student')
return result
### END SOLUTION
### Demo function call
demo_students = students_by_course[('AAA', '2013J')].head(50)
demo_registrations = registrations_by_course[('AAA', '2013J')]
demo_vle = vle_by_course[('AAA', '2013J')]
demo_cutoff = 30
result = create_student_features(demo_students, demo_registrations, demo_vle, demo_cutoff)
result.head(10)
The demo should display this output.
| id_student | code_module | code_presentation | age_band | highest_education | disability | num_of_prev_attempts | days_registered_before_start | total_clicks | total_active_days | avg_clicks_per_day | final_result |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11391 | AAA | 2013J | 55<= | HE Qualification | N | 0 | -159.0 | 419 | 9 | 46.555556 | Pass |
| 28400 | AAA | 2013J | 35-55 | HE Qualification | N | 0 | -53.0 | 618 | 19 | 32.526316 | Pass |
| 30268 | AAA | 2013J | 35-55 | A Level or Equivalent | Y | 0 | -92.0 | 281 | 12 | 23.416667 | Withdrawn |
| 31604 | AAA | 2013J | 35-55 | A Level or Equivalent | N | 0 | -52.0 | 503 | 23 | 21.869565 | Pass |
| 32885 | AAA | 2013J | 0-35 | Lower Than A Level | N | 0 | -176.0 | 567 | 24 | 23.625000 | Pass |
| 38053 | AAA | 2013J | 35-55 | A Level or Equivalent | N | 0 | -110.0 | 727 | 32 | 22.718750 | Pass |
| 45462 | AAA | 2013J | 0-35 | HE Qualification | N | 0 | -67.0 | 561 | 15 | 37.400000 | Pass |
| 45642 | AAA | 2013J | 0-35 | A Level or Equivalent | N | 0 | -29.0 | 423 | 20 | 21.150000 | Pass |
| 52130 | AAA | 2013J | 0-35 | A Level or Equivalent | N | 0 | -33.0 | 366 | 19 | 19.263158 | Pass |
| 53025 | AAA | 2013J | 55<= | Post Graduate Qualification | N | 0 | -179.0 | 571 | 20 | 28.550000 | Pass |
The cell below will test your solution for create_student_features (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(create_student_features),
ex_name='create_student_features',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to create_student_features did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
all_students = utils.load_object_from_publicdata('all_students')
all_student_assessments = utils.load_object_from_publicdata('all_student_assessments')
In the following exercise, you'll create binary labels indicating whether students submitted A1 (1) or did not submit (0). These labels must be merged with the features from Exercise 4. Together, the features and labels create our machine learning dataset, with each row representing a student's behavioral profile and their eventual submission outcome.
create_submission_labels
Example: we have defined create_submission_labels as follows:
Generate binary submission labels indicating whether students submitted the first assessment within a specific temporal window.
THIS IS A FREE EXERCISE. Submit your exam to obtain your free point.
Inputs:
- students_df (DataFrame): Students from studentInfo table with columns including: code_module, code_presentation, id_student
- student_assessments_df (DataFrame): Assessment submissions from studentAssessment table merged with assessments table, containing columns including: id_student, id_assessment, date_submitted, code_module, code_presentation
- first_assessment_id (int): ID of the first assessment to analyze
- label_start_day (int): Start of the label window (inclusive)
- label_end_day (int): End of the label window (inclusive)

Return: A pandas DataFrame with the following columns:
- id_student: Student identifier
- code_module: Course module code
- code_presentation: Course presentation code
- label: Binary label (1 if submitted within window, 0 otherwise)

Requirements/steps:
- Filter student_assessments_df to only include records where id_assessment equals first_assessment_id
- Keep only submissions where label_start_day <= date_submitted <= label_end_day
- Deduplicate on id_student, code_module, and code_presentation
- Left-merge students_df (on left) with the submission DataFrame (from step 2):
  - keep id_student, code_module, and code_presentation from students_df
  - join on id_student, code_module, and code_presentation
  - all students in students_df are included, even those who never submitted
- Create a label column which corresponds to 1 for students who submitted within the window, 0 otherwise
- Return columns id_student, code_module, code_presentation and label, all indexed by id_student
### Solution - Exercise 5
def create_submission_labels(students_df, student_assessments_df, first_assessment_id, label_start_day, label_end_day):
filtered_assessments = student_assessments_df[
student_assessments_df['id_assessment'] == first_assessment_id
]
submitted_in_window = filtered_assessments[
(filtered_assessments['date_submitted'] >= label_start_day) &
(filtered_assessments['date_submitted'] <= label_end_day)
][['id_student', 'code_module', 'code_presentation']].drop_duplicates()
result = students_df[['id_student', 'code_module', 'code_presentation']].copy()
result = result.merge(
submitted_in_window,
on=['id_student', 'code_module', 'code_presentation'],
how='left',
indicator=True
)
result['label'] = (result['_merge'] == 'both').astype(int)
result = result.drop(columns=['_merge'])
result = result.set_index('id_student')
return result
### Demo function call
demo_students = all_students[
(all_students['code_module'] == 'AAA') &
(all_students['code_presentation'] == '2013J')
].head(50)
demo_student_assessments = all_student_assessments[
(all_student_assessments['code_module'] == 'AAA') &
(all_student_assessments['code_presentation'] == '2013J')
]
course_assessments = demo_student_assessments.sort_values('date_submitted')
first_assessment_id = course_assessments.iloc[0]['id_assessment'] if len(course_assessments) > 0 else None
result = create_submission_labels(
demo_students,
demo_student_assessments,
first_assessment_id,
label_start_day=0,
label_end_day=30
)
print(result['label'].iloc[:10])
The demo should display this printed output.
id_student
11391 1
28400 1
30268 0
31604 1
32885 1
38053 1
45462 1
45642 1
52130 1
53025 1
Name: label, dtype: int64
The test cell below will always pass. Please submit to collect your free points for create_submission_labels (exercise 5).
### Test Cell - Exercise 5
print('Passed! Please submit.')
A difficulty with the Ouroboros approach is class imbalance. 75% or more of students submit A1. This imbalance can lead to misleading evaluation metrics. For example, if 90% of students submit A1, a model that always predicts "submit" would achieve 90% accuracy but would be useless for identifying at-risk students.
This exercise quantifies the imbalance ratio and calculates how many samples we have in each class. Understanding this imbalance is essential because it affects which evaluation metrics we should use.
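To see why imbalance distorts accuracy, here is a minimal sketch with made-up numbers (a hypothetical cohort, not the OULAD data): a classifier that always predicts "will submit" looks accurate but flags no one.

```python
import numpy as np

# Hypothetical cohort: 90 of 100 students submit A1 (label 1), 10 do not (label 0).
y_true = np.array([1] * 90 + [0] * 10)

# Naive classifier: always predict "will submit".
y_naive = np.ones_like(y_true)

accuracy = (y_naive == y_true).mean()
print(accuracy)  # 0.9 -- high accuracy...

# ...yet it identifies none of the 10 at-risk students.
at_risk_found = int(np.sum((y_naive == 0) & (y_true == 0)))
print(at_risk_found)  # 0
```

This is exactly the failure mode that precision, recall, and F1 (later exercises) are designed to expose.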
analyze_class_imbalance
Your task: define analyze_class_imbalance as follows:
Analyze class imbalance in binary classification labels. In early prediction scenarios, positive examples (students who submit) are typically much fewer than negative examples (students who don't submit), creating severe class imbalance. Understanding this imbalance is crucial for model evaluation and metric selection.
Inputs:
- labels (Series): Binary labels (0 or 1) indexed by id_student

Return: A Python dictionary containing the following keys and values:
- 'total_samples': Total number of labels (integer)
- 'positive_count': Count of labels with value 1 (integer)
- 'negative_count': Count of labels with value 0 (integer)
- 'positive_ratio': Ratio of positive samples to total samples (float, rounded to 4 decimal places)
- 'imbalance_ratio': Ratio of majority class to minority class (float, rounded to 4 decimal places)
- 'majority_class': The more frequent class label (integer, either 0 or 1)

Requirements/steps:
- Count the total number of labels as total_samples
- Count labels equal to 1 as positive_count
- Count labels equal to 0 as negative_count
- Compute positive_count / total_samples, rounded to 4 decimal places, and name it positive_ratio
- Compute max(positive_count, negative_count) / min(positive_count, negative_count), rounded to 4 decimal places, and name it imbalance_ratio
- If either positive_count or negative_count is 0, then set imbalance_ratio to inf
- Determine the more frequent class as majority_class
- In case of a tie, set majority_class to 1
### Solution - Exercise 6
def analyze_class_imbalance(labels):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
total_samples = len(labels)
positive_count = int(labels.sum())
negative_count = total_samples - positive_count
positive_ratio = round(positive_count / total_samples, 4)
if negative_count == 0:
imbalance_ratio = float('inf')
elif positive_count == 0:
imbalance_ratio = float('inf')
else:
imbalance_ratio = round(max(positive_count, negative_count) / min(positive_count, negative_count), 4)
majority_class = 1 if positive_count >= negative_count else 0
return {
'total_samples': total_samples,
'positive_count': positive_count,
'negative_count': negative_count,
'positive_ratio': positive_ratio,
'imbalance_ratio': imbalance_ratio,
'majority_class': majority_class
}
### END SOLUTION
### Demo function call
demo_labels = pd.Series(
[1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1] * 2,
index=range(11391, 11391 + 40)
)
result = analyze_class_imbalance(demo_labels)
pprint(result)
The demo should display this printed output.
{'imbalance_ratio': 4.0,
'majority_class': 1,
'negative_count': 8,
'positive_count': 32,
'positive_ratio': 0.8,
'total_samples': 40}
The cell below will test your solution for analyze_class_imbalance (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=analyze_class_imbalance,
ex_name='analyze_class_imbalance',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to analyze_class_imbalance did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
all_student_features = utils.load_object_from_publicdata('all_student_features')
all_submission_labels = utils.load_object_from_publicdata('all_submission_labels')
Not all behavioral metrics are equally informative. In the following exercise, you will compute correlations between VLE engagement features and submission labels to show which patterns have the strongest predictive power.
By identifying which features correlate most with submission outcomes, we can focus on the most relevant behavioral signals. This step is crucial for building an effective predictive model.
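As a quick illustration of the hinted technique (with made-up numbers, not the course data), pandas computes a Pearson correlation between a feature column and a label Series via `.corr()`:

```python
import pandas as pd

# Hypothetical engagement feature and binary outcome for six students.
clicks = pd.Series([20, 150, 35, 200, 10, 180], name='total_clicks')
label = pd.Series([0, 1, 0, 1, 0, 1], name='label')

r = clicks.corr(label)  # Pearson correlation by default
print(round(r, 4))      # strongly positive: high-click students tend to submit
```

In the exercise you apply this per feature, take absolute values, and sort to rank features by predictive signal.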
compute_feature_correlations
Your task: define compute_feature_correlations as follows:
Compute Pearson correlation coefficients between numerical features and binary outcome labels. This analysis identifies which features have the strongest linear relationships with the outcome, guiding feature selection for predictive modeling.
Inputs:
- features_df (DataFrame): Student features with id_student as the index
- outcome_series (Series): Binary outcomes (0 or 1) with id_student as the index
- numerical_features (list): List of column names for numerical features to analyze

Return: A pandas DataFrame with three columns:
- feature_name: Name of the feature (string)
- correlation: Pearson correlation coefficient with the outcome (float, rounded to 4 decimal places)
- abs_correlation: Absolute value of the correlation (float, rounded to 4 decimal places)

The DataFrame should be sorted by abs_correlation with the strongest correlations first.
Requirements/steps:
- Align features_df and outcome_series using their shared id_student index
- For each feature in numerical_features, compute its Pearson correlation with outcome_series
- Build the result DataFrame with columns feature_name, correlation, and abs_correlation as specified above
- Sort by abs_correlation in descending order
Hint: Use pandas .corr() method to calculate Pearson correlations.
### Solution - Exercise 7
def compute_feature_correlations(features_df, outcome_series, numerical_features):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
indexes=list(set(outcome_series.index) & set(features_df.index))
aligned_features = features_df.loc[indexes]
aligned_outcome = outcome_series.loc[indexes]
correlations = []
for feature in numerical_features:
if feature not in aligned_features.columns:
continue
feature_data = aligned_features[feature]
valid_mask = feature_data.notna() & aligned_outcome.notna()
feature_clean = feature_data[valid_mask]
outcome_clean = aligned_outcome[valid_mask]
if len(feature_clean) > 1 and feature_clean.std() > 0:
corr = feature_clean.corr(outcome_clean)
correlations.append({
'feature_name': feature,
'correlation': round(corr, 4),
'abs_correlation': round(abs(corr), 4)
})
result_df = pd.DataFrame(correlations)
result_df = result_df.sort_values('abs_correlation', ascending=False).reset_index(drop=True)
return result_df
### END SOLUTION
### Demo function call
demo_features = pd.DataFrame({
'num_of_prev_attempts': [0, 1, 0, 2, 0, 1, 3, 0, 1, 0],
'days_registered_before_start': [-50, -30, -20, -60, -10, -45, -25, -15, -35, -5],
'total_clicks': [120, 80, 150, 45, 200, 90, 60, 180, 70, 160],
'total_active_days': [15, 10, 18, 5, 20, 12, 8, 19, 9, 17],
'avg_clicks_per_day': [8.0, 8.0, 8.3, 9.0, 10.0, 7.5, 7.5, 9.5, 7.8, 9.4]
}, index=pd.Index(range(11391, 11401), name='id_student'))
demo_labels = pd.Series(
[1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
index=pd.Index(range(11396, 11406), name='id_student')
)
demo_numerical_features = [
'num_of_prev_attempts',
'days_registered_before_start',
'total_clicks',
'total_active_days',
'avg_clicks_per_day'
]
demo_result_compute_feature_correlations = compute_feature_correlations(
demo_features,
demo_labels,
demo_numerical_features
)
demo_result_compute_feature_correlations
The demo should display this output.
| feature_name | correlation | abs_correlation | |
|---|---|---|---|
| 0 | total_active_days | 0.8474 | 0.8474 |
| 1 | total_clicks | 0.7873 | 0.7873 |
| 2 | num_of_prev_attempts | -0.7454 | 0.7454 |
| 3 | avg_clicks_per_day | 0.6168 | 0.6168 |
| 4 | days_registered_before_start | 0.2887 | 0.2887 |
The cell below will test your solution for compute_feature_correlations (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(compute_feature_correlations),
ex_name='compute_feature_correlations',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to compute_feature_correlations did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
all_students = utils.load_object_from_publicdata('all_students')
all_student_assessments = utils.load_object_from_publicdata('all_student_assessments')
In the following exercise, you will implement the core Ouroboros temporal windowing concept. Given a current_day and the cutoff_day for A1, we must split students into training and testing sets based on their submission timing.
Training set: Students who already submitted A1 by current_day. Their features (computed from behavior before they submitted) and their submission timing (early vs. late) provide labeled examples for learning.
Testing set: Students who haven't submitted yet. These are the at-risk candidates we want to identify. We compute their current behavioral features and predict whether they will eventually submit.
This is "self-learning": recent submitters train the model to recognize patterns that predict outcomes for students still enrolled. As each day passes, more students move from the test set to the training set, continuously improving our predictions.
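The daily migration can be sketched with a toy example (synthetic submission days, not the exercise's data): as current_day advances, each new submitter leaves the test pool and joins the training pool.

```python
import pandas as pd
import numpy as np

# Hypothetical A1 submission days for six students (NaN = no submission yet).
subs = pd.Series({'S1': 5.0, 'S2': 10.0, 'S3': 18.0, 'S4': 25.0,
                  'S5': np.nan, 'S6': np.nan})

for current_day in (8, 15, 22):
    in_train = subs <= current_day            # submitters so far form the training pool
    train_ids = list(subs.index[in_train])
    test_ids = list(subs.index[~in_train])    # everyone else is still a test candidate
    print(current_day, train_ids, test_ids)
```

Note that `NaN <= current_day` evaluates to False, so never-submitters stay in the test pool at every step.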
This is a DEBUG EXERCISE. We have provided the code and requirements. The task is to identify and fix any issues to ensure the function works as intended.
Example: demo case (current_day=15, cutoff_day=25)
Day: 0────────[current_day=15]───────────[cutoff_day=25]──>
Student Submission day Label
──────── ────────────────────────────────────────────────── ──────────
S1 ●(5) | | train = 1
S2 ●──(10) | | train = 1
S3 ●──(12) | | train = 1
S4 | | test = 0 (never)
S5 | | test = 0 (never)
S6 | ●─(18) | test = 1
S7 | ●──────(25) | test = 1
S8 | | ●─(30) test = 0 (past cutoff)
S9 | | test = 0 (never)
S10 | | test = 0 (never)
◄──── TRAIN ────►◄────── TEST WINDOW ──────►
- ● marks the day a student submitted A1; · means no submission yet at current_day
- Students who submitted by current_day → TRAIN (train_labels)
- Students who have not submitted by current_day → TEST (test_labels); label = 1 only if submission falls within the test window (current_day, cutoff_day]

DEBUG_create_temporal_split
Your task: define DEBUG_create_temporal_split as follows:
Implement the temporal windowing methodology to create train/test splits for self-learning prediction.
Note: This is a debugging exercise!
Inputs:
- students_df (DataFrame): Student records with columns: id_student, code_module, code_presentation
- student_assessments_df (DataFrame): Assessment submissions with columns: id_student, id_assessment, date_submitted, code_module, code_presentation
- first_assessment_id (int): ID of the first assessment to predict
- current_day (int): Current day for making predictions
- cutoff_day (int): Deadline for assessment submission

Return: A tuple containing two pandas Series: (train_labels, test_labels)
- train_labels: Series indexed by id_student for all course students
  - label = 1 if the student submitted by current_day (inclusive)
  - label = 0 if the student had not submitted by current_day
- test_labels: Series indexed by id_student
  - contains only students who had not submitted by current_day (i.e., students with train_labels == 0)
  - label = 1 if the student submitted after current_day and by cutoff_day
  - label = 0 if the student never submitted or submitted after cutoff_day

Requirements/steps:
- Extract code_module and code_presentation from students_df first row. Filter both students_df and student_assessments_df to keep only rows where code_module and code_presentation match these extracted values.
- Filter student_assessments_df to only include records where id_assessment equals first_assessment_id
- From course_assessments (step 4 result), identify students who submitted by current_day (inclusive): date_submitted <= current_day
- For each student in course_students (step 3 result), assign label = 1 if they appear in the submitted set, 0 otherwise
- Index train_labels by id_student and set index.name = 'id_student'
- Select the students in course_students whose id_student is NOT in the submitted-by-current set (i.e., students with train_labels == 0); this subset defines both the test population and the test Series index
- Identify students who submitted after current_day and by cutoff_day (inclusive): current_day < date_submitted <= cutoff_day
- Assign label = 1 for those students, 0 for students who never submitted or submitted after cutoff_day
- Index test_labels by id_student and set index.name = 'id_student'
- Ensure test_labels contains only students with train_labels == 0
- Return (train_labels, test_labels)
### Solution - Exercise 8
def DEBUG_create_temporal_split(students_df, student_assessments_df,
first_assessment_id, current_day, cutoff_day):
###
### YOUR CODE HERE
###
# target_module = students_df['code_module'].iloc[0]
# target_presentation = students_df['code_presentation'].iloc[0]
# course_students = students_df.copy()
# course_assessments = student_assessments_df[
# (student_assessments_df['code_module'] == target_module) &
# (student_assessments_df['code_presentation'] == target_presentation) &
# (student_assessments_df['id_assessment'] == first_assessment_id)
# ].copy()
# submitted_by_current = course_assessments[
# course_assessments['date_submitted'] < current_day
# ]['id_student'].unique()
# submitted_after_current = course_assessments[
# course_assessments['date_submitted'] > current_day
# ]['id_student'].unique()
# train_labels = pd.Series(
# course_students['id_student'].isin(submitted_by_current).astype(int).values,
# index=course_students['id_student']
# )
# test_labels = pd.Series(
# course_students['id_student'].isin(submitted_after_current).astype(int).values,
# index=course_students['id_student']
# )
# test_labels.index.name = 'id_student'
# return train_labels, test_labels
### BEGIN SOLUTION
target_module = students_df['code_module'].iloc[0]
target_presentation = students_df['code_presentation'].iloc[0]
course_students = students_df[
(students_df['code_module'] == target_module) &
(students_df['code_presentation'] == target_presentation)
].copy()
course_assessments = student_assessments_df[
(student_assessments_df['code_module'] == target_module) &
(student_assessments_df['code_presentation'] == target_presentation) &
(student_assessments_df['id_assessment'] == first_assessment_id)
].copy()
submitted_by_current = course_assessments[
course_assessments['date_submitted'] <= current_day
]['id_student'].unique()
submitted_after_current = course_assessments[
(course_assessments['date_submitted'] > current_day) &
(course_assessments['date_submitted'] <= cutoff_day)
]['id_student'].unique()
train_labels = pd.Series(
course_students['id_student'].isin(submitted_by_current).astype(int).values,
index=course_students['id_student']
)
train_labels.index.name = 'id_student'
students_not_submitted_by_current = course_students[
~course_students['id_student'].isin(submitted_by_current)
]
test_labels = pd.Series(
students_not_submitted_by_current['id_student'].isin(submitted_after_current).astype(int).values,
index=students_not_submitted_by_current['id_student']
)
test_labels.index.name = 'id_student'
return train_labels, test_labels
### END SOLUTION
### Demo function call
demo_students = pd.DataFrame({
'id_student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'code_module': ['AAA'] * 10,
'code_presentation': ['2013J'] * 10
})
demo_student_assessments = pd.DataFrame({
'id_student': [1, 2, 3, 6, 7, 8],
'id_assessment': [1001] * 6,
'date_submitted': [5, 10, 12, 18, 25, 30],
'code_module': ['AAA'] * 6,
'code_presentation': ['2013J'] * 6
})
first_assessment_id = 1001
current_day = 15
cutoff_day = 25
train_labels, test_labels = DEBUG_create_temporal_split(
demo_students, demo_student_assessments,
first_assessment_id, current_day, cutoff_day
)
print('Training set:')
pprint(train_labels)
print('\nTesting set:')
pprint(test_labels)
The demo should display this printed output.
Training set:
id_student
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
dtype: int64
Testing set:
id_student
4 0
5 0
6 1
7 1
8 0
9 0
10 0
dtype: int64
The cell below will test your solution for DEBUG_create_temporal_split (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_create_temporal_split,
ex_name='DEBUG_create_temporal_split',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=40)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_create_temporal_split did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Now we need to evaluate predictions on this imbalanced dataset. However, standard accuracy is misleading when 90% of students submit A1: a naive classifier that always predicts "will submit" achieves 90% accuracy but cannot help identify at-risk students.
Instead, you will calculate precision, recall, and F1-score using a 0.5 probability threshold. These metrics balance the tradeoff between identifying at-risk students and avoiding false alarms. The F1-score combines precision and recall, providing a single metric that reflects model quality on imbalanced classification tasks.
Use the following formulas to compute these metrics:
- precision = True Positives / (True Positives + False Positives)
- recall = True Positives / (True Positives + False Negatives)
- f1_score = 2 * (precision * recall) / (precision + recall)
calculate_metrics
Your task: define calculate_metrics as follows:
Calculate standard binary classification metrics (precision, recall, and F1-score) for predicted probabilities.
Inputs:
- y_pred_proba (array): Predicted probabilities for the positive class (values between 0.0 and 1.0)
- y_true (array): True binary labels (0 or 1)

Return: A Python dictionary containing three metrics:
- 'precision': Proportion of positive predictions that are correct (float, rounded to 4 decimal places)
- 'recall': Proportion of actual positives correctly identified (float, rounded to 4 decimal places)
- 'f1_score': Harmonic mean of precision and recall (float, rounded to 4 decimal places)

Requirements/steps:
- Convert probabilities to binary predictions: predict 1 where y_pred_proba >= 0.5, and 0 where y_pred_proba < 0.5
- Compute precision = True Positives / (True Positives + False Positives)
- Compute recall = True Positives / (True Positives + False Negatives)
- Compute f1_score = 2 * (precision * recall) / (precision + recall)
Hint: The metric should be 0.0 when division by zero occurs.
### Solution - Exercise 9
def calculate_metrics(y_pred_proba, y_true):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
y_pred_proba = np.array(y_pred_proba)
y_true = np.array(y_true)
y_pred = (y_pred_proba >= 0.5).astype(int)
true_positives = np.sum((y_pred == 1) & (y_true == 1))
false_positives = np.sum((y_pred == 1) & (y_true == 0))
true_negatives = np.sum((y_pred == 0) & (y_true == 0))
false_negatives = np.sum((y_pred == 0) & (y_true == 1))
if true_positives + false_positives > 0:
precision = true_positives / (true_positives + false_positives)
else:
precision = 0.0
if true_positives + false_negatives > 0:
recall = true_positives / (true_positives + false_negatives)
else:
recall = 0.0
if precision + recall > 0:
f1_score = 2 * (precision * recall) / (precision + recall)
else:
f1_score = 0.0
return {
'precision': round(precision, 4),
'recall': round(recall, 4),
'f1_score': round(f1_score, 4)
}
### END SOLUTION
### Demo function call
from pprint import pprint
demo_y_pred_proba = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.3, 0.2, 0.1, 0.05])
demo_y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
result = calculate_metrics(demo_y_pred_proba, demo_y_true)
pprint(result)
The demo should display this printed output.
{'f1_score': 0.8, 'precision': 0.8, 'recall': 0.8}
The cell below will test your solution for calculate_metrics (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 9
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=calculate_metrics,
ex_name='calculate_metrics',
key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to calculate_metrics did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Universities can't provide intensive support to every student. They must prioritize due to limited resources. We need to be able to answer the question, "If we can only intervene with a certain percent of students we are confident are at-risk, how accurate are our top predictions?"
Top-K Precision, which you will implement in the following exercise, answers this question. With K=10%, we rank all students by predicted risk and evaluate only the top 10%. This will allow student support teams to make the most of their limited resources.
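For intuition, here is a hand-worked sketch with synthetic numbers (not the exercise's data): with 10 students and K=20%, we evaluate only the two highest-probability predictions.

```python
import numpy as np

# Synthetic predictions for 10 students; K = 20% -> evaluate the top 2.
proba = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.3, 0.6, 0.5, 0.05])
truth = np.array([1,   0,   1,   0,   0,   0,   1,   1,   0,   0])

k = int(np.ceil(0.20 * len(proba)))           # k = 2
top_k = np.argsort(proba)[::-1][:k]           # indices of the two largest probabilities
topk_precision = truth[top_k].sum() / k
print(topk_precision)  # 0.5: one of the two top-ranked predictions is a true positive
```

Note the descending sort (`[::-1]`) and the use of `ceil` when converting the percentage to a count; both are common bugs in Top-K implementations.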
This is a DEBUG EXERCISE. We have provided the code and requirements. The task is to identify and fix any issues to ensure the function works as intended.
DEBUG_calculate_topk_precision
Your task: define DEBUG_calculate_topk_precision as follows:
Calculate precision among the top K% of predictions ranked by predicted probability.
Note: This is a debugging exercise!
Inputs:
- y_pred_proba (array): Predicted probabilities for the positive class (values between 0.0 and 1.0)
- y_true (array): True binary labels (0 or 1)
- k_percent (float): Percentage of top predictions to consider (e.g., 0.10 for top 10%, 0.25 for top 25%)

Return: A float representing the top-K precision, rounded to 2 decimal places.
Requirements/steps:
- Compute k = ceil(k_percent * total_number_of_samples)
- Sort y_pred_proba with the highest probabilities first
- Compute top_k_precision = number_of_true_positives_in_top_k / k
### Solution - Exercise 10
def DEBUG_calculate_topk_precision(y_pred_proba, y_true, k_percent):
    ###
    ### YOUR CODE HERE
    ###
    # y_pred_proba = np.array(y_pred_proba)
    # y_true = np.array(y_true)
    # k = int(len(y_pred_proba) * k_percent)
    # if k < 1:
    #     k = 1
    # sorted_indices = np.argsort(y_pred_proba)
    # top_k_indices = sorted_indices[:k]
    # top_k_labels = y_true[top_k_indices]
    # precision = np.sum(top_k_labels) / k
    # return precision
    ### BEGIN SOLUTION
    y_pred_proba = np.array(y_pred_proba)
    y_true = np.array(y_true)
    # Fix 1: round k up with ceil rather than truncating with int()
    k = int(np.ceil(len(y_pred_proba) * k_percent))
    if k < 1:
        k = 1
    # Fix 2: reverse the argsort so the highest probabilities come first
    sorted_indices = np.argsort(y_pred_proba)[::-1]
    top_k_indices = sorted_indices[:k]
    top_k_labels = y_true[top_k_indices]
    precision = np.sum(top_k_labels) / k
    # Fix 3: round to 2 decimal places as required
    return round(precision, 2)
    ### END SOLUTION
### Demo function call
demo_y_pred_proba = np.array([0.95, 0.88, 0.82, 0.76, 0.65, 0.58, 0.45, 0.32, 0.21, 0.15,
                              0.92, 0.84, 0.73, 0.61, 0.54, 0.48, 0.38, 0.28, 0.18, 0.09])
demo_y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
                        1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
demo_k_percent = 0.10
result = DEBUG_calculate_topk_precision(demo_y_pred_proba, demo_y_true, demo_k_percent)
print(f"Top-K Precision (K={demo_k_percent*100}%): {result}")
The demo should display this printed output.
Top-K Precision (K=10.0%): 1.0
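You can sanity-check this result by hand: with 20 samples and K=10%, k = ceil(20 * 0.10) = 2; the two highest probabilities are 0.95 and 0.92, and both of those students have true label 1, so precision is 2/2 = 1.0. A quick verification (assuming NumPy from the global imports):

```python
import numpy as np

# Hand-check of the demo output: k = ceil(20 * 0.10) = 2
proba = np.array([0.95, 0.88, 0.82, 0.76, 0.65, 0.58, 0.45, 0.32, 0.21, 0.15,
                  0.92, 0.84, 0.73, 0.61, 0.54, 0.48, 0.38, 0.28, 0.18, 0.09])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
                   1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
k = int(np.ceil(len(proba) * 0.10))
top_k = np.argsort(proba)[::-1][:k]            # two highest scores: 0.95 and 0.92
print(labels[top_k], labels[top_k].sum() / k)  # -> [1 1] 1.0
```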
The cell below will test your solution for DEBUG_calculate_topk_precision (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 10
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_calculate_topk_precision,
                                          ex_name='DEBUG_calculate_topk_precision',
                                          key=b'lHBM3PNfS53dAewPCes_bZJ3D8A_5sWq911kmfTrzK0=',
                                          n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_calculate_topk_precision did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')