Final Exam, Fall 2023: Exam Analysis¶

Version 1.0

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of pandas, Numpy, and linear regression. It has 12 exercises, numbered 0 to 11. There are 22 available points. However, the threshold to earn 100% is 15 points. (Therefore, once you hit 15 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### define demo inputs ### load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head or a few rows at a time).

Exercise point breakdown:

  • Ex. 0: 1 (FREE) point(s)
  • Ex. 1: 2 point(s)
  • Ex. 2: 1 point(s)
  • Ex. 3: 1 (FREE) point(s)
  • Ex. 4: 2 point(s)
  • Ex. 5: 1 point(s)
  • Ex. 6: 1 (FREE) point(s)
  • Ex. 7: 3 point(s)
  • Ex. 8: 3 point(s)
  • Ex. 9: 2 point(s)
  • Ex. 10: 2 point(s)
  • Ex. 11: 3 point(s)

Exercises 0, 3, and 6 involve reading only (no coding), so they are "free" points. However, you still need to run the associated test cells and submit to get those points.

Final reminders:

  • We recommend submitting after every exercise.
  • Please review the generated grade report after you submit to see what errors were returned.
  • Stay calm, skip problems as needed, and take short breaks at your leisure.

Overview: Exam Examination¶

An instructor for another course has asked you to help analyze some of their exam data. The exams were completed electronically, in a format similar to our class's, but with some twists:

  • The class has multiple exams, named 'exam1', 'exam2', and so on. Exams were timed, but with different time limits for each exam.
  • Each exam consisted of multiple "milestones" (or "problems"). There were different numbers of problems for each exam, and different problems can be worth different numbers of points. (Thus, the total points on different exams could also differ.)
  • Students were allowed to work on the problems in any order and submit in any order, similar to our exams.

The dataset itself has two main parts:

  • The grades data, which shows, for each student, exam, and problem, how many points they scored.
  • The timing data, which shows, for each student and problem, when (i.e., at what date and time) the student started the problem and when they made their last submission to it.

Each exam has its own grading and timing data, as you'll see.

Your overall task: You will help clean up and analyze this data, culminating in an assessment, based on linear regression, to see whether or not the last exam score can be predicted from the scores and timing information of earlier exams.

Setup¶

To get started, run the following code cells. Take note of preloaded modules, as it's likely you'll want to use them.

In [ ]:
### Global Imports
### BEGIN HIDDEN TESTS
%load_ext autoreload
%autoreload 2

if False: # set to True to set up
    REGENERATE_OUTPUTS = False

    import dill
    import hashlib
    def hash_check(f1, f2, verbose=True):
        with open(f1, 'rb') as f:
            h1 = hashlib.md5(f.read()).hexdigest()
        with open(f2, 'rb') as f:
            h2 = hashlib.md5(f.read()).hexdigest()
        if verbose:
            print(h1)
            print(h2)
        assert h1 == h2, f'The file "{f1}" has been modified'
    with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
        dill.dump(hash_check, f)
    del hash_check
    with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
        hash_check = dill.load(f)
    for fname in ['testers.py', '__init__.py', 'test_utils.py']:
        hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
    for fname in ['__init__.py', 'utils.py']:
        hash_check(f'cse6040/{fname}', f'resource/asnlib/public/cse6040/{fname}')
    del hash_check
### END HIDDEN TESTS

# Modules you may find useful:
import pandas as pd
import numpy as np

# Some extra functions that this notebook needs:
import cse6040.utils
from pprint import pprint # Pretty-printer

# Load the dataset:
print("\n==> Loading the dataset...")
from cse6040.examdata import load_dataset
grades_by_exid, times_by_exid = load_dataset()
print("\n==> Done executing the setup and ready for take-off!")

Ex. 0 (1 pt; FREE): Understanding the grades dataframe¶

This exercise involves looking at the grades data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

The global variable grades_by_exid is a dictionary of pandas dataframes. Each key is an exam ID (e.g., 'exam2'), and each value is a dataframe holding the grading data for that exam. Let's take a look at an example: the grades for the exam with exam ID 'exam1'.

In [ ]:
list(grades_by_exid.keys())

Evidently, there are three exams in this dataset. However, the code you write later should not assume three exams unless otherwise specified.

In [ ]:
grades_by_exid['exam1']

Note: The data has been anonymized with fake email addresses. Any resemblance to actual email addresses is purely coincidental.

Observations:

  • Recall that we are looking at just one exam, 'exam1', in the example above. There are similar dataframes for 'exam2' and 'exam3'.
  • There is one row per student, each identified by their unique email address.
  • Each column corresponds to an exam problem or "milestone." The value is the number of points the student got from an autograder.
  • If the student did not submit a solution for a problem, the score appears as a string with exactly eight hyphens: '--------'. For example, see row 1, where juan.quay@aol.com did not submit problem4.
  • Some students (probably) dropped the class or did not attempt the exam at all. In those cases, all problem scores for that student are missing. For instance, see rows 1544-1548 in the output.
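
If you'd like to poke around yourself, the following quick sketch (assuming the setup cell above ran, so grades_by_exid is loaded) counts the placeholder strings and the fully missing rows:

# Exploration sketch (assumes `grades_by_exid` from the setup cell):
gdf1 = grades_by_exid['exam1']
problem_cols = [c for c in gdf1.columns if c.startswith('problem')]
print((gdf1[problem_cols] == '--------').sum())   # placeholder count per problem
missing = gdf1[problem_cols].isna() | (gdf1[problem_cols] == '--------')
print(missing.all(axis=1).sum())                  # students with no submissions at all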

Ex. 0 test (FREEBIE)¶

In [ ]:
### test_cell_ex0 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')

Ex. 1 (2 pts): remove_invalid_grades¶

Given a grades dataframe, gdf, complete the function,

def remove_invalid_grades(gdf):
    ...

so that it returns a clean copy of gdf.

Input: A grades dataframe, gdf. It will have a column named 'Email' and one or more columns named 'problem#' where # is an integer.

Your task: Copy this dataframe and clean it as follows:

  • Any problem that was not submitted, which will equal a string of 8 hyphens, '--------', should be replaced with a not-a-number (NaN) value, i.e., a Numpy np.nan.
  • Any row where none of the problems were submitted should be removed completely.
  • All problem columns should be converted to have a .dtype of float.

Output: Return a clean tibble copy, without modifying the input gdf. It should contain exactly the same columns as the input, even if the result is an empty dataframe, and the .dtype should be as specified above.

Caveat(s)/comment(s)/hint(s):

  1. Since we are asking for a tibble output, you should reset the index.
  2. The dataframe .replace() and .filter() methods may be helpful (but are not required).
  3. You may assume there is at least 1 problem column. However, do not assume the input has a fixed number of problems. That is, the input might have 'problem0' through 'problem4' as shown in Exercise 0, but it might also only have, say, two columns named 'problem1' and 'problem3', or it might have four problem columns named 'problem0', 'problem1', 'problem25', and 'problem234'. Your code should robustly handle these cases. The dataframe's .dropna() method may be helpful.
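
To make hints 2 and 3 concrete, here is a minimal, self-contained sketch of the .replace()/.filter()/.dropna() pattern on a toy dataframe (the toy values are illustrative, not from the dataset):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'Email': ['a@x.com', 'b@x.com'],
                    'problem1': ['10', '--------'],
                    'problem3': ['--------', '--------']})
toy = toy.replace('--------', np.nan)            # placeholders -> NaN
pcols = toy.filter(regex=r'problem\d+').columns  # robust to any problem numbering
toy = toy.dropna(how='all', subset=pcols)        # drop rows with no submissions
toy[pcols] = toy[pcols].astype(float)            # problem columns become float
print(toy.reset_index(drop=True))                # clean tibble index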

Ex. 1 demo input¶

In [ ]:
### Define demo inputs ###
demo_gdf_ex1 = grades_by_exid['exam1'].loc[[0, 1, 2, 683, 3, 957, 4, 1546]].reset_index(drop=True)
demo_gdf_ex1

Consider the demo input, demo_gdf_ex1, shown above. A correct output is:

Email problem0 problem1 problem2 problem3 problem4
0 sammie.jaydaa@gmail.com 7 10 10 10 10
1 juan.quay@aol.com 10 10 6.5 0 nan
2 gino.branston@gatech.edu 10 5 10 10 10
3 danny.jameriah@gmail.com 7 nan nan nan nan
4 robert.gioanna@gatech.edu 10 10 10 10 10
5 adam.evemarie@gmail.com 10 10 10 10 10

Observe that rows 5 and 7 from the input are removed entirely since all of their problem values are missing. By contrast, rows 1 and 3 from the input have some non-NaN values and thus are retained.

In addition, the output types (inspected via .info()) should match the following:

 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Email     6 non-null      object 
 1   problem0  6 non-null      float64
 2   problem1  5 non-null      float64
 3   problem2  5 non-null      float64
 4   problem3  5 non-null      float64
 5   problem4  4 non-null      float64
dtypes: float64(5), object(1)

Ex. 1 your code¶

In [ ]:
### Exercise 1 solution
def remove_invalid_grades(gdf):
    pass # Can be removed
    ### BEGIN SOLUTION
    from numpy import nan
    gdf = gdf.replace(to_replace='--------', value=nan)
    problem_cols = gdf.filter(regex=r'problem\d+').columns
    gdf = gdf.dropna(how='all', subset=problem_cols)
    gdf[problem_cols] = gdf[problem_cols].astype(float)
    gdf = gdf.reset_index(drop=True)
    return gdf
    ### END SOLUTION

### demo function call ###
demo_result_ex1 = remove_invalid_grades(demo_gdf_ex1)
display(demo_result_ex1)
demo_result_ex1.info()

Ex. 1 test¶

The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex1
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_1', 
    'func': remove_invalid_grades, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'gdf':{
            'dtype': 'df', # data type of param.
            'check_modified': True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 1 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, valid_grades_by_exid. It is the result of a correct implementation of remove_invalid_grades when run on grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [ ]:
valid_grades_by_exid = cse6040.utils.load_obj_from_file('valid_grades.dill')

# Example:
display(valid_grades_by_exid['exam1'].iloc[:4])
valid_grades_by_exid['exam1'].info()

Ex. 2 (1 pt): calc_earned_points¶

Suppose you have a grades dataframe, valid_gdf, having only valid rows as calculated in Exercise 1. Complete the function,

def calc_earned_points(valid_gdf):
    ...

so that it returns a new copy of valid_gdf with an additional column holding the total points earned.

Input: A dataframe valid_gdf with only valid rows, per Exercise 1.

Your task: Create a copy and sum the points earned by each student.

Output: Return a new tibble with the same columns as valid_gdf plus an additional column, 'Earned', of .dtype float. This column should contain the student's total score, that is, the sum of points on their problems. Any NaN values should be ignored. You should not modify the input valid_gdf.
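
Note that pandas row-wise sums skip NaN values by default, which is exactly the behavior required here. A tiny self-contained illustration:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'problem0': [7.0, 10.0], 'problem1': [np.nan, 2.0]})
print(toy.sum(axis='columns'))  # NaN is ignored: row sums are 7.0 and 12.0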

Ex. 2 demo input¶

In [ ]:
### Define demo inputs ###
demo_valid_gdf_ex2 = valid_grades_by_exid['exam1'].sample(6, random_state=3_456_789_012).reset_index(drop=True)
demo_valid_gdf_ex2

The demo included in the solution cell below should display the following output:

Email problem0 problem1 problem2 problem3 problem4 Earned
0 rene.amritpal@gmail.com 10 10 10 10 10 50
1 albert.averley@gmail.com 10 2 8 10 10 40
2 mary.yahshua@gatech.edu 7 2 6.5 5 nan 20.5
3 jillian.myler@gmail.com 10 10 10 10 1 41
4 raina.sheehan@gmail.com 10 10 10 5 10 45
5 allison.link@yahoo.com 7 2 nan nan nan 9

Note: Although the values appear to include integers, in fact, the .dtype of the columns should be float.

Ex. 2 your code¶

In [ ]:
### Exercise 2 solution ###
def calc_earned_points(valid_gdf):
    pass # Can be removed
    ### BEGIN SOLUTION
    gdf = valid_gdf.copy()
    gdf['Earned'] = gdf.filter(regex=r'problem\d+').sum(axis='columns')
    return gdf.reset_index(drop=True)
    ### END SOLUTION

### demo function call ###
calc_earned_points(demo_valid_gdf_ex2)

Ex. 2 test¶

The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex2 ###
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_2', 
    'func': calc_earned_points, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'valid_gdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 2 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, earned_by_exid. It is the result of a correct implementation of calc_earned_points when run on valid_grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [ ]:
earned_by_exid = cse6040.utils.load_obj_from_file('earned_grades.dill')

# Example:
display(earned_by_exid['exam1'].iloc[:4])
earned_by_exid['exam1'].info()

Ex. 3 (1 pt; FREE): Understanding the timings dataset¶

This exercise involves looking at the timings data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

Similar to the grades dataset (see Ex. 0), each exam has a timings dataset. Here's what it looks like for 'exam3':

In [ ]:
demo_times3_ex3 = times_by_exid['exam3']
demo_times3_ex3

Observations:

  • Each row is for a student and a problem ("milestone" column). There may be multiple rows for the same student, but a (student, problem) pair appears at most once.
  • The 'start date-time' and 'submit date-time' columns hold timestamps; each entry is a special pandas Timestamp object. The mini-tutorial below shows you how they work.
  • The 'start date-time' value is when the student began the exam. It will have the same value for all rows of a given student.
  • The 'submit date-time' value is the timestamp of the student's last submission for that problem. That is, only one submission is recorded per student and problem.

Implications: One consequence of the data we have is that we cannot really know how the student navigated the exam: we know when they started, but if they jumped around or submitted a problem multiple times, the only information we have is the last submission for each problem. Nevertheless, we will use the information we have to estimate their timeline, after some cleanup.

Timestamps: The printed timestamps have the following form:

2018-12-07 18:35:23

That corresponds to December 7, 2018, at 6:35 pm (and 23 seconds).

Luckily, you don't need to parse timestamps! These objects make it easy to sort and compute time differences, which is what we'll need. For instance, we can check whether one date is later than another using the usual comparison operators, like > or <=:

In [ ]:
(demo_times3_ex3['submit date-time'] > demo_times3_ex3['start date-time']).head()

And we can subtract times using the usual - operator, as we would for integers or floats:

In [ ]:
(demo_times3_ex3['submit date-time'] - demo_times3_ex3['start date-time']).head()

The difference is stored as a timedelta64 object. The way to read printed item 0 above is "20 hours, 45 minutes, and 19 seconds."

Time-delta objects can be converted into a regular float representing the number of seconds using a method called .total_seconds():

In [ ]:
(demo_times3_ex3['submit date-time'].iloc[0] - demo_times3_ex3['start date-time'].iloc[0]).total_seconds()

We will need this transformation to simplify some calculations later on, so tuck this fact away.
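
For reference, here is the same arithmetic on standalone Timestamp objects, as a self-contained example:

import pandas as pd

t0 = pd.Timestamp('2018-12-07 18:35:23')
t1 = pd.Timestamp('2018-12-08 15:20:42')
delta = t1 - t0                      # a Timedelta of 20 hours, 45 minutes, 19 seconds
print(delta.total_seconds() / 3600)  # elapsed time in hours (~20.755)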

Some messiness: The data has some "impurities." In particular, check out the data for 'kenneth.bryland@gatech.edu':

In [ ]:
demo_times3_ex3_messy = demo_times3_ex3.loc[[206, 1111, 1780, 2511, 3375]]
demo_times3_ex3_messy

Observe that their 'problem0' and 'problem1' submission times occur before their start times, which we confirm next:

In [ ]:
demo_times3_ex3_messy['submit date-time'] > demo_times3_ex3_messy['start date-time']

While it turns out there is a reason for this, we will omit such cases to simplify our analysis.

Ex. 3 test (FREEBIE)¶

In [ ]:
### test_cell_ex3 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')

Ex. 4 (2 pts): remove_causality_violators¶

Suppose you are given a timings dataframe, tdf, per Exercise 3. Complete the function,

def remove_causality_violators(tdf):
    ...

so that it returns a new, cleaned copy that omits records for any student who has timestamps that "violate causality." A causality violation means that their submission time is strictly less than their start time.

Input: A timings dataframe, tdf, with the columns 'email', 'milestone', 'start date-time', and 'submit date-time'.

Your task: Identify all students who have any causality violations and omit their rows from the output dataframe.

Output: Return a new tibble that includes only rows from tdf for students whose problem-records do not violate causality. The output column .dtypes should match those of the input. You should not modify the input, tdf.

Comment(s): If a student has even one problem that violates causality, you should omit all of their rows. That is, we will consider their entire exam data invalid for the purpose of our subsequent analysis.
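
A minimal sketch of this student-level filtering pattern on toy data (names and times are made up):

import pandas as pd

toy = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'start date-time': pd.to_datetime(['2018-12-06 08:00:00'] * 3),
    'submit date-time': pd.to_datetime(['2018-12-06 07:00:00',  # violates causality
                                        '2018-12-06 09:00:00',
                                        '2018-12-06 10:00:00']),
})
violators = toy.loc[toy['submit date-time'] < toy['start date-time'], 'email'].unique()
print(toy[~toy['email'].isin(violators)])  # every row for a@x.com is dropped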

Ex. 4 demo input¶

In [ ]:
### Define demo inputs ###
demo_tdf_ex4 = times_by_exid['exam3'].loc[[206, 279, 439, 1111, 1344, 1780, 2511, 3224, 3375]].reset_index(drop=True)
demo_tdf_ex4

The demo included in the solution cell below should display the following output:

email milestone start date-time submit date-time
0 daniel.tashera@gmail.com problem0 2018-12-09 01:12:56 2018-12-09 02:29:33
1 barbara.wilhemina@gatech.edu problem0 2018-12-10 02:43:28 2018-12-10 06:21:30
2 barbara.wilhemina@gatech.edu problem2 2018-12-10 02:43:28 2018-12-10 09:09:43
3 barbara.wilhemina@gatech.edu problem5 2018-12-10 02:43:28 2018-12-10 09:04:35

Observe that kenneth.bryland@gatech.edu is omitted from this output because their 'problem0' and 'problem1' records violate causality. (Even though their remaining problems obey causality, we still drop them because of the problematic rows.)

Ex. 4 your code¶

In [ ]:
### Exercise 4 solution ###
def remove_causality_violators(tdf):
    pass # Can be removed
    ### BEGIN SOLUTION
    tdf = tdf.copy()
    violation = tdf['submit date-time'] < tdf['start date-time']
    violators = set(tdf[violation]['email'].unique())
    tdf = tdf[~tdf['email'].isin(violators)]
    tdf = tdf.reset_index(drop=True)
    return tdf
    
# Example of an incorrect solution
def remove_causality_violators__ERROR(tdf):
    df = tdf.copy()
    return df[df["submit date-time"] > df["start date-time"]]
    ### END SOLUTION

### demo function call ###
remove_causality_violators(demo_tdf_ex4)

Ex. 4 test¶

The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex4
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_4', 
    'func': remove_causality_violators, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'tdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 4 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, valid_times_by_exid. It is the result of a correct implementation of remove_causality_violators when run on times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [ ]:
valid_times_by_exid = cse6040.utils.load_obj_from_file('valid_times.dill')

# Example:
assert (valid_times_by_exid['exam3']['submit date-time'] >= valid_times_by_exid['exam3']['start date-time']).all()
display(valid_times_by_exid['exam1'].iloc[:4])
valid_times_by_exid['exam1'].info()

Ex. 5 (1 pt): clean_times_colnames¶

Let's tidy up the column names for the timings data. Complete the function,

def clean_times_colnames(tdf):
    ...

to return a new tibble from tdf with columns renamed as follows.

  • 'email' => 'Email'
  • 'milestone' => 'Milestone'
  • 'start date-time' => 'Start'
  • 'submit date-time' => 'Submit'

Element types for the columns (.dtype) should be preserved, and the input dataframe should not be modified.
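
As a reminder, DataFrame.rename with a columns= mapping returns a new dataframe and leaves the input untouched; a short toy example:

import pandas as pd

toy = pd.DataFrame({'email': ['a@x.com'], 'milestone': ['problem0']})
print(toy.rename(columns={'email': 'Email', 'milestone': 'Milestone'}))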

Ex. 5 demo input¶

In [ ]:
### define demo inputs ###
demo_tdf_ex5 = valid_times_by_exid['exam2'].sample(5, random_state=567_890_123, replace=False).reset_index(drop=True)
demo_tdf_ex5

The demo included in the solution cell below should display the following output:

Email Milestone Start Submit
0 jazlyn.florin@gmail.com problem4 2018-11-04 12:45:10 2018-11-05 20:50:41
1 christian.miyisha@yahoo.com problem2 2018-11-03 07:38:00 2018-11-04 12:21:12
2 taryn.raneshia@yahoo.co.id problem0 2018-11-04 08:25:24 2018-11-05 17:59:02
3 joseph.sherburne@gmail.com problem1 2018-11-02 16:45:06 2018-11-03 01:47:58
4 thomas.dyanni@gmail.com problem3 2018-11-03 14:20:56 2018-11-04 23:31:18

Ex. 5 your code¶

In [ ]:
### Exercise 5 solution
def clean_times_colnames(tdf):
    pass # Can be removed
    ### BEGIN SOLUTION
    return tdf.rename(columns={'email': 'Email',
                               'milestone': 'Milestone',
                               'start date-time': 'Start',
                               'submit date-time': 'Submit',
                               'dt': 'Dt'})
    ### END SOLUTION

### demo function call ###
clean_times_colnames(demo_tdf_ex5)

Ex. 5 test¶

The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex5
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_5', 
    'func': clean_times_colnames, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'tdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 5 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, clean_times_by_exid. It is the result of a correct implementation of clean_times_colnames when run on valid_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [ ]:
clean_times_by_exid = cse6040.utils.load_obj_from_file('clean_times.dill')

# Example:
display(clean_times_by_exid['exam1'].head())
clean_times_by_exid['exam1'].info()

Ex. 6 (1 pt; FREE): Submission sequences¶

Our next analysis will estimate the test-takers' behavior by analyzing their submission times.

For example, consider the submission times for 'sherry.ayleigh@aol.com', stored in the dataframe demo_ex6_tdf below:

In [ ]:
demo_ex6_tdf = clean_times_by_exid['exam3'].loc[[34, 593, 1624, 2189, 2344, 3122]]
demo_ex6_tdf

Based on the submission times, ordering the rows by the 'Submit' timestamp would yield a sensible timeline of submission events:

Email Milestone Start Submit
34 sherry.ayleigh@aol.com problem0 2018-12-06 08:32:21 2018-12-06 09:07:16
593 sherry.ayleigh@aol.com problem1 2018-12-06 08:32:21 2018-12-06 10:04:10
1624 sherry.ayleigh@aol.com problem2 2018-12-06 08:32:21 2018-12-06 11:30:28
3122 sherry.ayleigh@aol.com problem5 2018-12-06 08:32:21 2018-12-06 13:34:13
2344 sherry.ayleigh@aol.com problem4 2018-12-06 08:32:21 2018-12-06 17:53:48
2189 sherry.ayleigh@aol.com problem3 2018-12-06 08:32:21 2018-12-06 20:19:21

We refer to this ordering of submission times as the submission sequence. It is our best guess as to how they moved through the exam if we assume they did one problem at a time. From this submission sequence for this student, we might conclude the following.

First, the student started the exam at 2018-12-06 08:32:21 and submitted 'problem0' at 2018-12-06 09:07:16. That would be:

In [ ]:
demo_ex6_tdf['Submit'].iloc[0] - demo_ex6_tdf['Start'].iloc[0]

or 34 minutes and 55 seconds, which is also 2,095 seconds = (2095/3600) ~ 0.582 hours, rounding to three decimal digits. In other words, we will assume that the student started the exam and then spent 0.582 hours on Problem 0.

Next, the student submitted 'problem1' at 2018-12-06 10:04:10. Relative to the previous submission time, that is 56 minutes and 54 seconds, or 3,414 seconds = (3414/3600) ~ 0.948 hours, again rounded to three digits. So we might guess that the student spent that much time on Problem 1.

Continuing, the sequence of problems solved was Problem 0, 1, 2, 5, 4, and finally, 3. The approximate number of hours spent on each is 0.582, 0.948, 1.438, 2.062, 4.326, and 2.426, respectively.
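
Here is a hedged sketch of that arithmetic for this one student, using .diff() over the ordered event times (it reuses demo_ex6_tdf from above):

import pandas as pd

one = demo_ex6_tdf.sort_values('Submit')
events = pd.concat([one['Start'].iloc[:1], one['Submit']])  # start time, then each submission
hours = events.diff().dropna().dt.total_seconds().div(3600).round(3)
print(hours.tolist())  # ~[0.582, 0.948, 1.438, 2.062, 4.326, 2.426]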

Ex. 6 test (FREEBIE)¶

In [ ]:
### test_cell_ex6 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')

Ex. 7 (3 pts): sequence_submissions¶

Suppose you are given a (clean) timings tibble, tdf, such as computed in Exercise 5. Complete the function,

def sequence_submissions(tdf):
    ...

so that it sequences the submissions for each student and returns a new tibble matching the format described below.

Input: A clean timings tibble, tdf, with the columns 'Email', 'Milestone', 'Start', and 'Submit', per Ex. 5.

Your task: For each student, order their submissions and determine the amount of time spent on each problem, as illustrated in Ex. 6.

Output: Without modifying the input tibble, return a new tibble with the following columns:

  • 'Email', 'Milestone', 'Start', and 'Submit', which keep the same values as given in tdf.
  • 'Hours', the time in hours (a floating-point value) rounded to 3 decimal digits, corresponding to the estimated time spent on each problem.

That is, each row of the output will indicate the estimated time spent by the student on the given problem based on the estimated timeline.

Additional requirements: The rows of this new tibble should be sorted by 'Email' and 'Submit' timestamp in ascending order.

Ex. 7 demo input¶

In [ ]:
### Define demo inputs ###
demo_tdf_ex7 = cse6040.utils.load_df_from_file('demo_tdf_ex7.df')
demo_tdf_ex7

In the preceding demo input, there are three students: 'sherry.ayleigh@aol.com', 'chelsea.jaretzi@utexas.edu', and 'jessica.rinda@gatech.edu'. Observe that

  • 'sherry.ayleigh@aol.com' submitted Problems 0-5 inclusive.
  • 'chelsea.jaretzi@utexas.edu' submitted Problems 0, 1, 2, 3, and 5. (So they skipped Problem 4.)
  • 'jessica.rinda@gatech.edu' submitted Problems 1, 2, and 5. (So they skipped Problems 0, 3, and 4.)

A correct implementation of sequence_submissions on this input would produce:

Email Milestone Start Submit Hours
0 chelsea.jaretzi@utexas.edu problem1 2018-12-11 09:50:18 2018-12-12 02:34:35 16.738
1 chelsea.jaretzi@utexas.edu problem0 2018-12-11 09:50:18 2018-12-12 02:40:16 0.095
2 chelsea.jaretzi@utexas.edu problem5 2018-12-11 09:50:18 2018-12-12 03:45:49 1.092
3 chelsea.jaretzi@utexas.edu problem3 2018-12-11 09:50:18 2018-12-12 05:49:20 2.059
4 chelsea.jaretzi@utexas.edu problem2 2018-12-11 09:50:18 2018-12-12 06:17:25 0.468
5 jessica.rinda@gatech.edu problem1 2018-12-07 12:05:41 2018-12-07 18:04:19 5.977
6 jessica.rinda@gatech.edu problem5 2018-12-07 12:05:41 2018-12-07 18:11:47 0.125
7 jessica.rinda@gatech.edu problem2 2018-12-07 12:05:41 2018-12-07 18:21:51 0.167
8 sherry.ayleigh@aol.com problem0 2018-12-06 08:32:21 2018-12-06 09:07:16 0.582
9 sherry.ayleigh@aol.com problem1 2018-12-06 08:32:21 2018-12-06 10:04:10 0.948
10 sherry.ayleigh@aol.com problem2 2018-12-06 08:32:21 2018-12-06 11:30:28 1.439
11 sherry.ayleigh@aol.com problem5 2018-12-06 08:32:21 2018-12-06 13:34:13 2.062
12 sherry.ayleigh@aol.com problem4 2018-12-06 08:32:21 2018-12-06 17:53:48 4.326
13 sherry.ayleigh@aol.com problem3 2018-12-06 08:32:21 2018-12-06 20:19:21 2.426

Observe the ordering by 'Email' and 'Submit' in ascending order. Also observe that the values in 'Hours' are rounded to 3 decimal digits and correspond to differences in submission times (and submission to start time for the first entry).

Ex. 7 your code¶

In [ ]:
### Exercise 7 solution
def sequence_submissions(tdf):
    pass # Can be removed
    ### BEGIN SOLUTION
    tdf = tdf.sort_values(['Email', 'Submit']).reset_index(drop=True)
    tf = tdf.groupby('Email').apply(calc_hours_one).reset_index(drop=True)
    tdf['Hours'] = tf 
    return tdf
    
def calc_hours_one(tdf): # `tdf` for **one** student; assume sorted by `'Submit'`
    def to_hours(e, d=3):
        return round(e.total_seconds() / 3600, 3)
    
    dt = (tdf['Submit'] - tdf['Start']).apply(to_hours)
    t0 = dt.iloc[0]
    tf = dt.diff().fillna(t0)
    return tf

def sequence_submissions__waldron(tdf):
    def calc_hours(group):
        group = group.sort_values('Submit').reset_index(drop=True)
        seq = pd.Series([group['Start'].iloc[0], *group['Submit']])
        group['Hours'] = seq.diff().dropna()\
            .apply(lambda td: td.total_seconds()/3600)\
            .round(3)\
            .reset_index(drop=True)
        return group
    return tdf.groupby('Email', as_index=False).apply(calc_hours).reset_index(drop=True)
    ### END SOLUTION
    
### demo function call ###
sequence_submissions(demo_tdf_ex7)

Ex. 7 test¶

The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex7
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_7', 
    'func': sequence_submissions, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'tdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 2*10 ** (-3)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 7 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, hours_by_exid. It is the result of a correct implementation of sequence_submissions when run on clean_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [ ]:
hours_by_exid = cse6040.utils.load_obj_from_file('hours.dill')

# Example:
display(hours_by_exid['exam1'].head())
hours_by_exid['exam1'].info()

Ex. 8 (3 pts): merge_gt¶

Given a tibble pointsdf of earned points (per Ex. 2) and another tibble hoursdf of sequenced hours-per-problem (per Ex. 7), complete the function,

def merge_gt(pointsdf, hoursdf):
    ...

so that it merges these into a single tibble formatted as described below.

Inputs:

  • pointsdf: A tibble of earned points, as computed by Exercise 2. It should have an 'Email' column, one or more columns named 'problem#' where # is an integer, and an 'Earned' column.
  • hoursdf: A tibble of sequenced hours-per-problem, as computed in Exercise 7. It should have the columns 'Email', 'Milestone', 'Start', 'Submit', and 'Hours'.

Your tasks:

  • Melt a copy of pointsdf so that its 'problem#' columns become values of a new column named 'Milestone' and the point values become a new column named 'Points'. The 'Points' column should hold floats.
  • Merge this melted result with hoursdf on the common 'Email' and 'Milestone' key-columns. (Use an inner-join/merge so that only keys that match both inputs are retained.)
  • Finally, sort this result by 'Email' and 'Submit' in ascending order.

Output: Return the result as a new tibble with the columns 'Email', 'Start', 'Submit', 'Milestone', 'Points', and 'Hours'. You should not modify either input, pointsdf or hoursdf.
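
If .melt() is unfamiliar, this tiny self-contained example shows the wide-to-long reshaping being asked for (toy values only):

import pandas as pd

toy = pd.DataFrame({'Email': ['a@x.com'], 'Earned': [12.0],
                    'problem0': [10.0], 'problem1': [2.0]})
print(toy.melt(id_vars=['Email', 'Earned'], var_name='Milestone', value_name='Points'))
#      Email  Earned Milestone  Points
# 0  a@x.com    12.0  problem0    10.0
# 1  a@x.com    12.0  problem1     2.0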

Ex. 8 demo input¶

In [ ]:
### Define demo inputs ###
demo_pointsdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['pointsdf']
demo_hoursdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['hoursdf']

print("* `demo_pointsdf_ex8`:")
display(demo_pointsdf_ex8)

print("* `demo_hoursdf_ex8`:")
display(demo_hoursdf_ex8)

In the above inputs, observe that only 'justin.jernee@gatech.edu' has data in both the points and hours tibbles. Therefore, the merged result will only include that student's data, since we are using an inner merge. For this input, the result is:

Email Earned Milestone Points Start Submit Hours
0 justin.jernee@gatech.edu 40 problem2 2 2018-12-08 10:53:50 2018-12-08 15:12:42 4.314
1 justin.jernee@gatech.edu 40 problem0 10 2018-12-08 10:53:50 2018-12-08 19:21:39 4.15
2 justin.jernee@gatech.edu 40 problem1 10 2018-12-08 10:53:50 2018-12-08 20:16:38 0.916
3 justin.jernee@gatech.edu 40 problem5 5 2018-12-08 10:53:50 2018-12-09 13:53:43 17.618
4 justin.jernee@gatech.edu 40 problem3 10 2018-12-08 10:53:50 2018-12-09 20:06:24 6.211
5 justin.jernee@gatech.edu 40 problem4 3 2018-12-08 10:53:50 2018-12-09 22:08:52 2.042

Observe the ordering of the rows. The problems appear in ascending order of submission time, per the sorting requirement of this exercise.

Ex. 8 your code¶

In [ ]:
### Exercise 8 solution
def merge_gt(pointsdf, hoursdf):
    pass # Can be removed
    ### BEGIN SOLUTION
    df = pointsdf.melt(id_vars=['Email', 'Earned'], var_name='Milestone', value_name='Points')
    df = df.merge(hoursdf, on=['Email', 'Milestone'], how='inner')
    df = df.sort_values(['Email', 'Submit']).reset_index(drop=True)
    return df
    ### END SOLUTION
    
### demo function call ###
merge_gt(demo_pointsdf_ex8, demo_hoursdf_ex8)

Ex. 8 test¶

The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex8
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_8', 
    'func': merge_gt, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'pointsdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        },
        'hoursdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 8 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, merged_by_exid. It is the result of a correct implementation of merge_gt when run on earned_by_exid and hours_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [ ]:
merged_by_exid = cse6040.utils.load_obj_from_file('merged.dill')

# Example:
display(merged_by_exid['exam1'].head())
merged_by_exid['exam1'].info()

Ex. 9 (2 pts): concat_exams¶

Exercise 8 produces a (merged) dataframe for each exam. But when we have multiple exams, we may want to analyze them all at once.

Suppose you are given a dictionary of exams, df_by_exid, whose keys are exam IDs (strings) and values are dataframes. Complete the function,

def concat_exams(df_by_exid):
    ...

to "concatenate" these as explained below.

Input: df_by_exid. If exid is a string exam ID, then df_by_exid[exid] is a dataframe associated with that exam. Assume that all dataframes in df_by_exid have the same columns.

Output:

  • Return a new combined tibble formed by "stacking" these dataframes one on top of the other.
  • To tell which rows came from which dataframe, include a column named 'Exam' whose value records the originating exid key.

Comments/hints:

  • You are asked to return a tibble, so be sure your returned output has a clean index.
  • As long as the 'Exam' value is set correctly, the ordering of rows and columns does not matter.
  • Remember that you should not modify any of the input dataframes! If you do, you may observe strange behavior or cause the autograder to fail.
  • You may assume that all input dataframes are nonempty. However, you should not assume anything else about what keys exist or what the columns of the input dataframe are, other than all dataframes will have the same columns.
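
One hedged sketch of the stacking idea, using pandas' dict-of-frames form of concat on toy inputs (a copy-tag-and-concatenate loop is an equally valid approach):

import pandas as pd

a = pd.DataFrame({'Points': [1.0, 2.0]})
b = pd.DataFrame({'Points': [3.0]})
combined = (pd.concat({'exam1': a, 'exam2': b})  # keys become the first index level
              .reset_index(level=0)              # move that level into a column...
              .rename(columns={'level_0': 'Exam'})
              .reset_index(drop=True))           # ...and leave a clean tibble index
print(combined)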

Ex. 9 demo input¶

In [ ]:
### Define demo inputs ###
demo_df_by_exid_ex9 = cse6040.utils.load_obj_from_file('demo_df_by_exid_ex9.dill')

for exid, df in demo_df_by_exid_ex9.items():
    print(f"\n*** exid = '{exid}' ***")
    display(demo_df_by_exid_ex9[exid])

For the preceding inputs, a valid concatenated output is:

Start Points Earned Email Submit Milestone Hours Exam
0 2018-12-09 08:55:36 2 38 chelsea.dilann@gatech.edu 2018-12-10 11:17:14 problem2 8.557 exam3
1 2018-12-08 12:27:33 5.5 25.5 kristopher.erandi@gmail.com 2018-12-09 16:09:17 problem0 2.831 exam3
2 2018-09-30 01:05:14 10 45 meredith.rakyla@hotmail.com 2018-09-30 19:11:33 problem2 3.386 exam1
3 2018-09-28 16:21:04 10 45 stacy.loneta@gmail.com 2018-09-28 17:51:52 problem0 1.513 exam1
4 2018-11-03 20:05:45 2 22 leta.kartier@gmail.com 2018-11-05 05:16:51 problem1 2.097 exam2
5 2018-11-04 20:02:24 5 16 christopher.jullianne@gatech.edu 2018-11-05 10:51:45 problem0 14.822 exam2
6 2018-11-03 06:59:28 10 50 lola.fysher@gmail.com 2018-11-03 08:06:09 problem1 1.111 exam2
7 2018-11-03 10:19:14 10 39 susan.tedrina@gatech.edu 2018-11-04 18:28:46 problem3 1.337 exam2

Ex. 9 your code¶

In [ ]:
### Exercise 9 solution
def concat_exams(df_by_exid):
    pass # Can be removed
    ### BEGIN SOLUTION
    from pandas import concat
    dfs = []
    for exid, df in df_by_exid.items():
        dfs.append(df.copy())
        dfs[-1]['Exam'] = exid
    return concat(dfs, ignore_index=True)
    ### END SOLUTION
    
### demo function call ###
concat_exams(demo_df_by_exid_ex9)

Ex. 9 test¶

The cell below will test your solution for Exercise 9. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex9
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_9', 
    'func': concat_exams, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'df_by_exid':{
            'dtype':'dict', # data type of param.
            'check_modified':False, # @TODO: FIXME
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 9 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, concatdf. It is the result of a correct implementation of concat_exams when run on merged_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [ ]:
concatdf = cse6040.utils.load_obj_from_file('concat.dill')
concatdf

Ex. 10 (2 pts): summarize_exams¶

Suppose you are given df, a concatenated dataframe like the one produced in Ex. 9. Complete the function,

def summarize_exams(df):
    ...

so that it determines the mean exam score and mean time to complete the exam for every exam, returning the result as a dataframe (see below).

Input: A dataframe, similar to the output of Exercise 9. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:

  • 'Email' (string): a student email address
  • 'Exam' (string): the ID of an exam
  • 'Milestone' (string): the problem number
  • 'Points' (float): the points scored on that problem (milestone) for that exam
  • 'Hours' (float): the estimated number of hours spent on that problem for that exam

Your task: For each student and exam, you need to determine their total points and total time. Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.

Output: Return a pandas dataframe with three columns:

  • 'Exam' (string): the exam ID
  • 'Points' (float): the total points for that exam averaged over all students
  • 'Hours' (float): the total time in hours for that exam averaged over all students

Comments/hints: The averages you are to report are per exam, taken over all students, not over all (student, problem) pairs!
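
To see the two-stage aggregation concretely, here is a toy sketch (illustrative values only):

import pandas as pd

toy = pd.DataFrame({'Exam': ['exam1'] * 3, 'Email': ['a@x.com', 'a@x.com', 'b@x.com'],
                    'Milestone': ['problem0', 'problem1', 'problem0'],
                    'Points': [10.0, 5.0, 20.0], 'Hours': [1.0, 2.0, 4.0]})
per_student = toy.groupby(['Exam', 'Email'])[['Points', 'Hours']].sum().reset_index()
print(per_student.groupby('Exam')[['Points', 'Hours']].mean().reset_index())
# a@x.com totals 15 pts / 3 hrs and b@x.com totals 20 pts / 4 hrs,
# so the exam1 means are 17.5 points and 3.5 hours.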

Ex. 10 demo input¶

In [ ]:
### Define demo inputs ###

demo_df_ex10 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex10

The above input should produce:

Exam Points Hours
0 exam1 45 14.3085
1 exam3 42 31.255

As a spot check, observe that the input dataset has two students. Consider their 'exam3' scores: student 'matthew.monise@hotmail.com' got a total of 38 points and 'paul.laconya@gmail.com' got 46 points, yielding an average of (38+46)/2 = 42 points.

Ex. 10 your code¶

In [ ]:
### Exercise 10 solution
def summarize_exams(df):
    pass # Can be removed
    ### BEGIN SOLUTION
    df = sum_by_student_exam(df)  # totals per (exam, student)
    df = avg_by_exam(df)          # means of those totals, per exam
    return df

def sum_by_student_exam(df):
    # Total points and hours for each (exam, student) pair.
    return df.groupby(['Exam', 'Email'])[['Points', 'Hours']].sum().reset_index()

def avg_by_exam(df_stats_by_student_exam):
    # Average the per-student totals over all students, one row per exam.
    df = df_stats_by_student_exam
    df = df.groupby('Exam')[['Points', 'Hours']].mean().reset_index()
    return df
    ### END SOLUTION
    
### demo function call ###
summarize_exams(demo_df_ex10)

Ex. 10 test¶

The cell below will test your solution for Exercise 10. The testing variables will be available for debugging under the following names in a dictionary format. (A quick inspection sketch follows the list.)

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
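If the test reports a mismatch, one quick way to inspect it is to compare a few rows of your output against the expected one (a sketch; 'output_0' is the output name used in the test configuration below, and these dictionaries are populated only after the test cell runs):

# Compare your result against the expected one, a few rows at a time:
print(returned_output_vars['output_0'].head())
print(true_output_vars['output_0'].head())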
In [ ]:
### test_cell_ex10
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_10', 
    'func': summarize_exams, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'df':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 10 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, exam_summary. It is the result of a correct implementation of summarize_exams when run on concatdf from the full dataset.

In [ ]:
exam_summary = cse6040.utils.load_obj_from_file('exam_summary.dill')
exam_summary

Final task: A simple regression¶

For the case of three exams, the instructor would like to know whether the first two scores and the timing numbers can predict the third exam score. To carry out this analysis, the instructor needs to build a data matrix and wants your help.

Ex. 11 (3 pts): form_data_table¶

Suppose you are given the same input as Ex. 10, that is, a concatenated dataframe df like the one produced in Ex. 9. Complete the function,

def form_data_table(df):
    ...

so that it constructs a dataframe whose values will (later) become the entries of a linear-regression data matrix.

Input: The same dataframe as Ex. 10. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:

  • 'Email' (string): a student email address
  • 'Exam' (string): the ID of an exam
  • 'Milestone' (string): the problem number
  • 'Points' (float): the points scored on that problem (milestone) for that exam
  • 'Hours' (float): the estimated number of hours spent on that problem for that exam

Output: Construct and return a new dataframe where each row corresponds with a student, and the columns are as follows:

  1. 'Email' (string): The student email. There will be one row per student in the output.
  2. 'Points#' (float): One column of total points scored for each exam present in the input. For example, if df has data for 'exam2' and 'exam3', then the output should have 'Points2' and 'Points3' columns, with the values being the total points earned by the student on the corresponding exam.
  3. 'Hours#' (float): Similarly, one column of total exam time for each exam present in the input.

As a final step, if there are any NaNs, drop the corresponding rows.

The idea behind this format is that each student will be an "observation" or "sample" for linear regression, and the columns will serve as predictors or features.

Comments/hints: As usual, you should not modify the input dataframe, df. (A minimal reshaping sketch follows.)
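If the long-to-wide reshaping is unfamiliar, here is a minimal sketch on made-up data (the emails and numbers are hypothetical; the flattening step mirrors the MultiIndex columns that pivot_table produces):

import pandas as pd

tiny = pd.DataFrame({
    'Email':     ['a@x.com', 'a@x.com', 'b@x.com', 'b@x.com'],
    'Exam':      ['exam1',   'exam3',   'exam1',   'exam3'],
    'Milestone': ['0',       '0',       '0',       '0'],
    'Points':    [7.0,       8.0,       5.0,       6.0],
    'Hours':     [1.5,       2.0,       1.0,       3.0],
})

# One row per student, exams spread across columns:
wide = tiny.pivot_table(index='Email', values=['Points', 'Hours'],
                        columns='Exam', aggfunc='sum')
# wide now has MultiIndex columns like ('Points', 'exam1'); flatten them:
wide.columns = [f"{a}{b.replace('exam', '')}" for a, b in wide.columns]
wide = wide.reset_index().dropna()  # restore 'Email' as a column; drop NaN rows
print(wide)  # columns: Email, Hours1, Hours3, Points1, Points3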

Ex. 11 demo input¶

In [ ]:
### Define demo inputs ###
demo_df_ex11 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex11
For the sample input above, a correct implementation would produce:

|    | Email                      |   Hours1 |   Hours3 |   Points1 |   Points3 |
|---:|:---------------------------|---------:|---------:|----------:|----------:|
|  0 | matthew.monise@hotmail.com |   16.349 |   35.548 |        45 |        38 |
|  1 | paul.laconya@gmail.com     |   12.268 |   26.962 |        45 |        46 |

This format makes it easy to, say, try to apply one of our models, like using linear regression to predict `'Points3'` from `'Hours1'` and `'Points1'` (assuming that is even a good model, which it might not be!).

Ex. 11 your code¶

In [ ]:
### Exercise 11 solution
def form_data_table(df):
    pass # Can be removed
    ### BEGIN SOLUTION
    def clean_colname(t):
        # Flatten a column label like ('Points', 'exam1') into 'Points1'.
        # The index column flattens as ('Email', ''), which maps back to 'Email'.
        assert isinstance(t, tuple)
        return t[0] + t[1].replace('exam', '')

    # Total points/hours per (student, exam), with exams spread across columns:
    df = df.pivot_table(index='Email', values=['Points', 'Hours'], columns='Exam', aggfunc='sum')
    df = df.reset_index()
    df.columns = [clean_colname(c) for c in df.columns.to_flat_index()]
    return df.dropna()  # drop any student missing data for an exam
    ### END SOLUTION
    
### demo function call ###
form_data_table(demo_df_ex11)

Ex. 11 test¶

The cell below will test your solution for Exercise 11. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### test_cell_ex11
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
    hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_11', 
    'func': form_data_table, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'df':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
### END HIDDEN TESTS
print('Passed! Please submit.')

Ex. 11 RUNME (precomputed solutions)¶

The cell below creates a precomputed object, data_table. It is the result of a correct implementation of form_data_table when run on concatdf from the full dataset.

In [ ]:
data_table = cse6040.utils.load_obj_from_file('data_table.dill')
data_table

Fin (+ postscript)¶

If you have made it this far, congratulations on completing the exam. Don't forget to submit!

Postscript: We went through a lot of work in this exam to build the data table. So if we try to regress, say, 'Points3' on the other columns, how well does that work? Here is a code fragment to check. What does the result say?

In [ ]:
# Predictors:
data_cols = data_table[['Points1', 'Hours1', 'Points2', 'Hours2']]
X = data_cols.to_numpy(copy=True) # Copy, so normalizing cannot mutate data_table
X /= X.max(axis=0) # Normalize columns

# Response:
resp_col = 'Points3'
y = data_table[resp_col].to_numpy(copy=True) # Copy, for the same reason
y /= y.max()

from sklearn.linear_model import LinearRegression
model_type = LinearRegression(fit_intercept=True, positive=True)
model = model_type.fit(X, y)

print("* Predictors:", data_cols)
print("* Response:", resp_col)
print("* Model coefficients:", model.coef_)
print("* Model intercept:", model.intercept_)

y_pred = model.predict(X)
print(f"* Mean relative error: {abs((y_pred - y)/y).mean()*100:.1f}%")