Final Exam, Fall 2023: Exam Analysis

Version 1.0

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of pandas, NumPy, and linear regression. It has 12 exercises, numbered 0 to 11. There are 22 available points. However, the threshold to earn 100% is 15 points. (Therefore, once you hit 15 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### define demo inputs ### load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.

Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed. (Be careful when printing large objects; you may want to print just the head or a chunk of rows at a time.)
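For instance, a large dataframe can be inspected piecewise rather than printed whole. A generic sketch (here `df` is a made-up stand-in for any of the debugging variables):

```python
import pandas as pd

# A stand-in dataframe; during the exam, `df` would be one of the
# debugging variables (e.g., input_vars or returned_output_vars).
df = pd.DataFrame({'x': range(100), 'y': range(100, 200)})

print(df.head())       # first 5 rows only
print(df.iloc[10:15])  # a specific chunk of rows
print(df.shape)        # overall size, without printing everything
```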

Exercise point breakdown:

  • Ex. 0: 1 (FREE) point(s)
  • Ex. 1: 2 point(s)
  • Ex. 2: 1 point(s)
  • Ex. 3: 1 (FREE) point(s)
  • Ex. 4: 2 point(s)
  • Ex. 5: 1 point(s)
  • Ex. 6: 1 (FREE) point(s)
  • Ex. 7: 3 point(s)
  • Ex. 8: 3 point(s)
  • Ex. 9: 2 point(s)
  • Ex. 10: 2 point(s)
  • Ex. 11: 3 point(s)

Exercises 0, 3, and 6 involve reading only (no coding), so they are "free" points. However, you still need to run the associated test cells and submit to get those points.

Final reminders:

  • We recommend submitting after every exercise.
  • Please review the generated grade report after you submit to see what errors were returned.
  • Stay calm, skip problems as needed, and take short breaks at your leisure.

Overview: Exam Examination

An instructor for another course has asked you to help analyze some of their exam data. The exams were completed electronically, in a format similar to our class's, but with some twists:

  • The class has multiple exams, named 'exam1', 'exam2', and so on. Exams were timed, but with different time limits for each exam.
  • Each exam consisted of multiple "milestones" (or "problems"). There were different numbers of problems for each exam, and different problems can be worth different numbers of points. (Thus, the total points on different exams could also differ.)
  • Students were allowed to work on the problems in any order and submit in any order, similar to our exams.

The dataset itself has two main parts:

  • The grades data, which shows, for each student, exam, and problem, how many points they scored.
  • The timing data, which shows, for each student and problem, when (i.e., at what date and time) the student started the problem and made their last submission for it.

Each exam has its own grading and timing data, as you'll see.

Your overall task: You will help clean up and analyze this data, culminating in an assessment, based on linear regression, to see whether or not the last exam score can be predicted from the scores and timing information of earlier exams.

Setup

To get started, run the following code cells. Take note of preloaded modules, as it's likely you'll want to use them.

In [1]:
### Global Imports
###
### AUTOGRADER TEST - DO NOT REMOVE
###

# Modules you may find useful:
import pandas as pd
import numpy as np

# Some extra functions that this notebook needs:
import cse6040.utils
from pprint import pprint # Pretty-printer

# Load the dataset:
print("\n==> Loading the dataset...")
from cse6040.examdata import load_dataset
grades_by_exid, times_by_exid = load_dataset()
print("\n==> Done executing the setup and ready for take-off!")
==> Loading the dataset...
Loading `DataFrame` from 'resource/asnlib/publicdata/anon_grades_exam3.csv'...
  ==> Done!
Loading `DataFrame` from 'resource/asnlib/publicdata/anon_grades_exam1.csv'...
  ==> Done!
Loading `DataFrame` from 'resource/asnlib/publicdata/anon_grades_exam2.csv'...
  ==> Done!
Loading `DataFrame` from 'resource/asnlib/publicdata/anon_times_exam3.csv'...
  ==> Done!
Loading `DataFrame` from 'resource/asnlib/publicdata/anon_times_exam1.csv'...
  ==> Done!
Loading `DataFrame` from 'resource/asnlib/publicdata/anon_times_exam2.csv'...
  ==> Done!

==> Done executing the setup and ready for take-off!

Ex. 0 (1 pt; FREE): Understanding the grades dataframe

This exercise involves looking at the grades data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

The global variable grades_by_exid is a dictionary of pandas dataframes. Each key is an exam ID (e.g., 'exam2'), and each value is a dataframe holding the grading data for that exam. Let's take a look at an example: the grades for the exam with ID 'exam1'.

In [4]:
list(grades_by_exid.keys())
Out[4]:
['exam3', 'exam1', 'exam2']

Evidently, there are three exams in this dataset. However, the code you write later should not assume three exams unless otherwise specified.

In [5]:
grades_by_exid['exam1']
Out[5]:
Email problem0 problem1 problem2 problem3 problem4
0 sammie.jaydaa@gmail.com 7 10 10 10 10
1 juan.quay@aol.com 10 10 6.5 0 --------
2 gino.branston@gatech.edu 10 5 10 10 10
3 robert.gioanna@gatech.edu 10 10 10 10 10
4 adam.evemarie@gmail.com 10 10 10 10 10
... ... ... ... ... ... ...
1544 francis.auden@gmail.com -------- -------- -------- -------- --------
1545 alexander.amilianna@noahliot.com -------- -------- -------- -------- --------
1546 robert.jyn@live.com -------- -------- -------- -------- --------
1547 donald.bernon@gatech.edu -------- -------- -------- -------- --------
1548 christopher.julies@hotmail.com -------- -------- -------- -------- --------

1549 rows × 6 columns

Note: The data has been anonymized with fake email addresses. Any resemblance to actual email addresses is purely coincidental.

Observations:

  • Recall that we are looking at just one exam, 'exam1', in the example above. There are similar dataframes for 'exam2' and 'exam3'.
  • There is one row per student, each identified by their unique email address.
  • Each column corresponds to an exam problem or "milestone." The value is the number of points the student got from an autograder.
  • If the student did not submit a solution for a problem, the score appears as a string with exactly eight hyphens: '--------'. For example, see row 1, where juan.quay@aol.com did not submit problem4.
  • Some students (probably) dropped the class or did not attempt the exam at all. In those cases, all problem scores for that student are missing. For instance, see rows 1544-1548 in the output.
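As a quick sanity check of the sentinel convention above, you can count the eight-hyphen strings directly. A sketch on a tiny made-up frame (the column names mirror the real data, but the values are invented):

```python
import pandas as pd

# Toy grades frame with the same '--------' convention as the real data.
gdf = pd.DataFrame({
    'Email': ['a@x.com', 'b@x.com'],
    'problem0': ['7', '--------'],
    'problem1': ['--------', '--------'],
})

# Count missing submissions per problem column.
missing = (gdf[['problem0', 'problem1']] == '--------').sum()
print(missing)  # problem0: 1, problem1: 2
```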

Ex. 0 test (FREEBIE)

In [2]:
### test_cell_ex0 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
This exercise is free. To get the points, please run this cell _and_ submit.

Ex. 1 (2 pt): remove_invalid_grades

Given a grades dataframe, gdf, complete the function,

def remove_invalid_grades(gdf):
    ...

so that it returns a clean copy of gdf.

Input: A grades dataframe, gdf. It will have a column named 'Email' and one or more columns named 'problem#' where # is an integer.

Your task: Copy this dataframe and clean it as follows:

  • Any problem that was not submitted, which will equal a string of 8 hyphens, '--------', should be replaced with a not-a-number (NaN) value, i.e., a Numpy np.nan.
  • Any row where none of the problems were submitted should be removed completely.
  • All problem columns should be converted to have a .dtype of float.

Output: Return a clean tibble copy, without modifying the input gdf. It should contain exactly the same columns as the input, even if the result is an empty dataframe, and the .dtype should be as specified above.

Caveat(s)/comment(s)/hint(s):

  1. Since we are asking for a tibble output, you should reset the index.
  2. The dataframe .replace() and .filter() methods may be helpful (but are not required).
  3. You may assume there is at least 1 problem column. However, do not assume the input has a fixed number of problems. That is, the input might have 'problem0' through 'problem4' as shown in Exercise 0, but it might also only have, say, two columns named 'problem1' and 'problem3', or it might have four problem columns named 'problem0', 'problem1', 'problem25', and 'problem234'. Your code should robustly handle these cases. The dataframe's .dropna() method may be helpful.
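To see how hint 2 can help with hint 3, `.filter(regex=...)` selects matching columns no matter how many there are or what their numbers look like. A sketch with hypothetical column names (not necessarily the reference approach):

```python
import pandas as pd

# Toy frame with irregularly numbered problem columns, per hint 3.
df = pd.DataFrame({
    'Email': ['a@x.com'],
    'problem1': [10.0],
    'problem3': [5.0],
    'problem234': [7.5],
})

# Matches any column whose name starts with 'problem' followed by digits.
problem_columns = list(df.filter(regex=r'^problem\d+', axis=1))
print(problem_columns)  # ['problem1', 'problem3', 'problem234']
```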

Ex. 1 demo input

In [3]:
### Define demo inputs ###
demo_gdf_ex1 = grades_by_exid['exam1'].loc[[0, 1, 2, 683, 3, 957, 4, 1546]].reset_index(drop=True)
demo_gdf_ex1
Out[3]:
Email problem0 problem1 problem2 problem3 problem4
0 sammie.jaydaa@gmail.com 7 10 10 10 10
1 juan.quay@aol.com 10 10 6.5 0 --------
2 gino.branston@gatech.edu 10 5 10 10 10
3 danny.jameriah@gmail.com 7 -------- -------- -------- --------
4 robert.gioanna@gatech.edu 10 10 10 10 10
5 leslie.natalyia@hotmail.com -------- -------- -------- -------- --------
6 adam.evemarie@gmail.com 10 10 10 10 10
7 robert.jyn@live.com -------- -------- -------- -------- --------

Consider the demo input, demo_gdf_ex1, shown above. A correct output is:

Email problem0 problem1 problem2 problem3 problem4
0 sammie.jaydaa@gmail.com 7 10 10 10 10
1 juan.quay@aol.com 10 10 6.5 0 nan
2 gino.branston@gatech.edu 10 5 10 10 10
3 danny.jameriah@gmail.com 7 nan nan nan nan
4 robert.gioanna@gatech.edu 10 10 10 10 10
5 adam.evemarie@gmail.com 10 10 10 10 10

Observe that rows 5 and 7 from the input are removed entirely since all of their problem values are missing. By contrast, rows 1 and 3 from the input have some non-NaN values and thus are retained.

In addition, the output types (inspected via .info()) should match the following:

 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Email     6 non-null      object 
 1   problem0  6 non-null      float64
 2   problem1  5 non-null      float64
 3   problem2  5 non-null      float64
 4   problem3  5 non-null      float64
 5   problem4  4 non-null      float64
dtypes: float64(5), object(1)

Ex. 1 your code

In [7]:
### Exercise 1 solution
def remove_invalid_grades(gdf):
    '''
    INPUT:
    A grades dataframe called gdf. Columns are 'Email' and one or more columns named 'problem#' where # is an integer.
    
    GOAL: 
    Copy this dataframe and clean it as follows:
    -Replace 8 hyphens '--------' with np.nan
    -Remove rows where no problems were submitted
    -Convert problem columns to be float
    
    Return this cleaned data frame.
    
    STRATEGY:
    1. Make copy of input data frame. Let's call this 'gdf_final'
    2. Get a list of all of the columns that start with 'problem' 
    3. Replace any values in those problem columns that contain 8 hyphens with np.nan
    4. Remove any rows where the problem columns are all np.nan now
    5. Convert problem columns to be float
    6. Reset index and return this data frame
    '''
    
    # SOLUTION:
    gdf_final = gdf.copy()    
    
    # Most robust (safest) way is to use Filter + Regex, specify digit after problem:
    problem_columns = list(gdf_final.filter(regex=r'^problem\d+', axis=1))
    
    # Alternative way: find all columns that start with 'problem':
    # https://stackoverflow.com/questions/27275236/how-to-select-all-columns-whose-names-start-with-x-in-a-pandas-dataframe
    #problem_columns = [col for col in gdf_final if col.startswith('problem')]
    
    # Super Risky Way (may or may not work depending on column input/order):
    #problem_columns = list(gdf_final.columns[1:])
    
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
    gdf_final[problem_columns] = gdf_final[problem_columns].replace('--------', np.nan)
    
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
    gdf_final = gdf_final.dropna(how='all', subset=problem_columns)
    
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
    gdf_final[problem_columns] = gdf_final[problem_columns].astype('float64')
    
    return gdf_final.reset_index(drop=True)

### demo function call ###
demo_result_ex1 = remove_invalid_grades(demo_gdf_ex1)
display(demo_result_ex1)
demo_result_ex1.info()
Email problem0 problem1 problem2 problem3 problem4
0 sammie.jaydaa@gmail.com 7.0 10.0 10.0 10.0 10.0
1 juan.quay@aol.com 10.0 10.0 6.5 0.0 NaN
2 gino.branston@gatech.edu 10.0 5.0 10.0 10.0 10.0
3 danny.jameriah@gmail.com 7.0 NaN NaN NaN NaN
4 robert.gioanna@gatech.edu 10.0 10.0 10.0 10.0 10.0
5 adam.evemarie@gmail.com 10.0 10.0 10.0 10.0 10.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Email     6 non-null      object 
 1   problem0  6 non-null      float64
 2   problem1  5 non-null      float64
 3   problem2  5 non-null      float64
 4   problem3  5 non-null      float64
 5   problem4  4 non-null      float64
dtypes: float64(5), object(1)
memory usage: 416.0+ bytes

Ex. 1 test

The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [8]:
### test_cell_ex1
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_1', 
    'func': remove_invalid_grades, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'gdf':{
            'dtype': 'df', # data type of param.
            'check_modified': True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 1 RUNME (precomputed solutions)

The cell below creates a precomputed object, valid_grades_by_exid. It is the result of a correct implementation of remove_invalid_grades when run on grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [9]:
valid_grades_by_exid = cse6040.utils.load_obj_from_file('valid_grades.dill')

# Example:
display(valid_grades_by_exid['exam1'].iloc[:4])
valid_grades_by_exid['exam1'].info()
Email problem0 problem1 problem2 problem3 problem4
0 sammie.jaydaa@gmail.com 7.0 10.0 10.0 10.0 10.0
1 juan.quay@aol.com 10.0 10.0 6.5 0.0 NaN
2 gino.branston@gatech.edu 10.0 5.0 10.0 10.0 10.0
3 robert.gioanna@gatech.edu 10.0 10.0 10.0 10.0 10.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 688 entries, 0 to 687
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Email     688 non-null    object 
 1   problem0  684 non-null    float64
 2   problem1  672 non-null    float64
 3   problem2  654 non-null    float64
 4   problem3  655 non-null    float64
 5   problem4  638 non-null    float64
dtypes: float64(5), object(1)
memory usage: 32.4+ KB

Ex. 2 (1 pt): calc_earned_points

Suppose you have a grades dataframe, valid_gdf, having only valid rows as calculated in Exercise 1. Complete the function,

def calc_earned_points(valid_gdf):
    ...

so that it returns a new copy of valid_gdf with an additional column holding the total points earned.

Input: A dataframe valid_gdf with only valid rows, per Exercise 1.

Your task: Create a copy and sum the points earned by each student.

Output: Return a new tibble with the same columns as valid_gdf plus an additional column, 'Earned', of .dtype float. This column should contain the student's total score, that is, the sum of points on their problems. Any NaN values should be ignored. You should not modify the input valid_gdf.
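The row-wise total described above can be computed with `.sum(axis=1)`, which skips NaN values by default. A sketch on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'problem0': [7.0, 10.0], 'problem1': [np.nan, 2.0]})

# Row-wise sum; NaN entries contribute nothing (skipna=True by default).
earned = df.sum(axis=1)
print(earned.tolist())  # [7.0, 12.0]
```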

Ex. 2 demo input

In [10]:
### Define demo inputs ###
demo_valid_gdf_ex2 = valid_grades_by_exid['exam1'].sample(6, random_state=3_456_789_012).reset_index(drop=True)
demo_valid_gdf_ex2
Out[10]:
Email problem0 problem1 problem2 problem3 problem4
0 rene.amritpal@gmail.com 10.0 10.0 10.0 10.0 10.0
1 albert.averley@gmail.com 10.0 2.0 8.0 10.0 10.0
2 mary.yahshua@gatech.edu 7.0 2.0 6.5 5.0 NaN
3 jillian.myler@gmail.com 10.0 10.0 10.0 10.0 1.0
4 raina.sheehan@gmail.com 10.0 10.0 10.0 5.0 10.0
5 allison.link@yahoo.com 7.0 2.0 NaN NaN NaN

The demo included in the solution cell below should display the following output:

Email problem0 problem1 problem2 problem3 problem4 Earned
0 rene.amritpal@gmail.com 10 10 10 10 10 50
1 albert.averley@gmail.com 10 2 8 10 10 40
2 mary.yahshua@gatech.edu 7 2 6.5 5 nan 20.5
3 jillian.myler@gmail.com 10 10 10 10 1 41
4 raina.sheehan@gmail.com 10 10 10 5 10 45
5 allison.link@yahoo.com 7 2 nan nan nan 9

Note: Although the values appear to include integers, in fact, the .dtype of the columns should be float.

Ex. 2 your code

In [11]:
### Exercise 2 solution ###
def calc_earned_points(valid_gdf):
    '''
    INPUT:
    A dataframe 'valid_gdf'
    
    GOAL:
    Create a copy and sum the points earned by each student.
    Return new data frame with same columns + Earned column. 
    Earned column should be a float. Ignore NaNs.
    
    STRATEGY:
    1. Make a copy of the input data frame. Let's call this 'valid_gdf_final'
    2. Sum the problem columns and create a new column called 'Earned'
    3. Convert 'Earned' column to be float
    4. Return the data frame
    '''
    
    # SOLUTION:
    valid_gdf_final = valid_gdf.copy()

    # Using same method as in Exercise 1 to identify problem columns
    problem_columns = list(valid_gdf_final.filter(regex=r'^problem\d+', axis=1))
    
    valid_gdf_final['Earned'] = valid_gdf_final[problem_columns].sum(axis=1)
    
    valid_gdf_final['Earned'] = valid_gdf_final['Earned'].astype('float64')
    
    return valid_gdf_final
    

### demo function call ###
calc_earned_points(demo_valid_gdf_ex2)
Out[11]:
Email problem0 problem1 problem2 problem3 problem4 Earned
0 rene.amritpal@gmail.com 10.0 10.0 10.0 10.0 10.0 50.0
1 albert.averley@gmail.com 10.0 2.0 8.0 10.0 10.0 40.0
2 mary.yahshua@gatech.edu 7.0 2.0 6.5 5.0 NaN 20.5
3 jillian.myler@gmail.com 10.0 10.0 10.0 10.0 1.0 41.0
4 raina.sheehan@gmail.com 10.0 10.0 10.0 5.0 10.0 45.0
5 allison.link@yahoo.com 7.0 2.0 NaN NaN NaN 9.0

Ex. 2 test

The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [12]:
### test_cell_ex2 ###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_2', 
    'func': calc_earned_points, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'valid_gdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 2 RUNME (precomputed solutions)

The cell below creates a precomputed object, earned_by_exid. It is the result of a correct implementation of calc_earned_points when run on valid_grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [13]:
earned_by_exid = cse6040.utils.load_obj_from_file('earned_grades.dill')

# Example:
display(earned_by_exid['exam1'].iloc[:4])
earned_by_exid['exam1'].info()
Email problem0 problem1 problem2 problem3 problem4 Earned
0 sammie.jaydaa@gmail.com 7.0 10.0 10.0 10.0 10.0 47.0
1 juan.quay@aol.com 10.0 10.0 6.5 0.0 NaN 26.5
2 gino.branston@gatech.edu 10.0 5.0 10.0 10.0 10.0 45.0
3 robert.gioanna@gatech.edu 10.0 10.0 10.0 10.0 10.0 50.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 688 entries, 0 to 687
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Email     688 non-null    object 
 1   problem0  684 non-null    float64
 2   problem1  672 non-null    float64
 3   problem2  654 non-null    float64
 4   problem3  655 non-null    float64
 5   problem4  638 non-null    float64
 6   Earned    688 non-null    float64
dtypes: float64(6), object(1)
memory usage: 37.8+ KB

Ex. 3 (1 FREE pt): Understanding the timings dataset

This exercise involves looking at the timings data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

Similar to the grades dataset (see Ex. 0), each exam has a timings dataset. Here's what it looks like for 'exam3':

In [14]:
demo_times3_ex3 = times_by_exid['exam3']
demo_times3_ex3
Out[14]:
email milestone start date-time submit date-time
0 kathy.tavaughn@gmail.com problem0 2018-12-07 18:35:23 2018-12-08 15:20:42
1 jeri.robynn@gmail.com problem0 2018-12-08 18:22:20 2018-12-09 21:34:09
2 jacolyn.deborahh@live.de problem0 2018-12-08 11:08:02 2018-12-08 18:25:06
3 alex.mesias@gmail.com problem0 2018-12-10 12:38:00 2018-12-11 21:32:30
4 james.marcisha@gatech.edu problem0 2018-12-10 10:09:25 2018-12-11 05:52:07
... ... ... ... ...
3453 patricia.ariha@gatech.edu problem5 2018-12-08 11:43:24 2018-12-09 17:15:05
3454 conrad.rhonisha@vt.edu problem5 2018-12-07 07:10:28 2018-12-07 06:06:09
3455 robert.burnest@gmail.com problem5 2018-12-09 11:38:58 2018-12-10 11:08:58
3456 dustin.maleeah@gmail.com problem5 2018-12-10 15:35:53 2018-12-11 16:23:29
3457 alexandra.kogan@gmail.com problem5 2018-12-09 15:02:24 2018-12-10 10:45:23

3458 rows × 4 columns

Observations:

  • Each row is for a student and a problem ("milestone" column). There may be multiple rows for the same student, but a (student, problem) pair appears at most once.
  • The 'start date-time' and 'submit date-time' columns hold timestamps, and their .dtypes are special pandas Timestamp objects. The mini-tutorial below shows you how they work.
  • The 'start date-time' value is when the student began the exam. It will have the same value for all rows of a given student.
  • The 'submit date-time' value is the timestamp of the student's last submission for that problem. That is, only one submission is recorded per student and problem.

Implications: One consequence of the data we have is that we cannot really know how the student navigated the exam: we know when they started, but if they jumped around or submitted a problem multiple times, the only information we have is the last submission for each problem. Nevertheless, we will use the information we have to guess at their timeline, after some clean up.

Timestamps: The printed timestamps have the following form:

2018-12-07 18:35:23

That corresponds to December 7, 2018, at 6:35 pm (and 23 seconds).

Luckily, you don't need to parse timestamps! These objects make it easy to sort and calculate time differences, which is what we'll need. For instance, we can check if one date is larger than another using the usual comparison-operators, like > or <=:

In [15]:
(demo_times3_ex3['submit date-time'] > demo_times3_ex3['start date-time']).head()
Out[15]:
0    True
1    True
2    True
3    True
4    True
dtype: bool

And we can subtract times using the usual - operator, as we would for integers or floats:

In [16]:
(demo_times3_ex3['submit date-time'] - demo_times3_ex3['start date-time']).head()
Out[16]:
0   0 days 20:45:19
1   1 days 03:11:49
2   0 days 07:17:04
3   1 days 08:54:30
4   0 days 19:42:42
dtype: timedelta64[ns]

The difference is stored as a timedelta64 object. The way to read printed item 0 above is "20 hours, 45 minutes, and 19 seconds."

Time-delta objects can be converted into a regular float representing the number of seconds using a method called .total_seconds():

In [17]:
(demo_times3_ex3['submit date-time'].iloc[0] - demo_times3_ex3['start date-time'].iloc[0]).total_seconds()
Out[17]:
74719.0

We will need this transformation to simplify some calculations later on, so tuck this fact away.
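The scalar example above extends to whole columns via the `.dt` accessor. A sketch with made-up timestamps (the first pair reproduces the 74719-second difference shown earlier):

```python
import pandas as pd

start = pd.to_datetime(pd.Series(['2018-12-07 18:35:23', '2018-12-08 00:00:00']))
submit = pd.to_datetime(pd.Series(['2018-12-08 15:20:42', '2018-12-08 01:30:00']))

# Element-wise subtraction yields a timedelta Series; .dt.total_seconds()
# converts each entry to a float number of seconds.
elapsed = (submit - start).dt.total_seconds()
print(elapsed.tolist())  # [74719.0, 5400.0]
```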

Some messiness: The data has some "impurities." In particular, check out the data for 'kenneth.bryland@gatech.edu':

In [19]:
demo_times3_ex3_messy = demo_times3_ex3.loc[[206, 1111, 1780, 2511, 3375]]
demo_times3_ex3_messy
Out[19]:
email milestone start date-time submit date-time
206 kenneth.bryland@gatech.edu problem0 2018-12-08 01:27:05 2018-12-08 00:54:28
1111 kenneth.bryland@gatech.edu problem1 2018-12-08 01:27:05 2018-12-08 01:26:03
1780 kenneth.bryland@gatech.edu problem2 2018-12-08 01:27:05 2018-12-08 12:04:59
2511 kenneth.bryland@gatech.edu problem4 2018-12-08 01:27:05 2018-12-08 13:14:38
3375 kenneth.bryland@gatech.edu problem5 2018-12-08 01:27:05 2018-12-08 13:38:08

Observe that their 'problem0' and 'problem1' submission times occur before their start times, which we confirm next:

In [20]:
demo_times3_ex3_messy['submit date-time'] > demo_times3_ex3_messy['start date-time']
Out[20]:
206     False
1111    False
1780     True
2511     True
3375     True
dtype: bool

While it turns out there is a reason for that, we will want to omit cases like that to simplify our analysis.

Ex. 3 test (FREEBIE)

In [21]:
### test_cell_ex3 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
This exercise is free. To get the points, please run this cell _and_ submit.

Ex. 4 (2 pts): remove_causality_violators

Suppose you are given a timings dataframe, tdf, per Exercise 3. Complete the function,

def remove_causality_violators(tdf):
    ...

so that it returns a new, cleaned copy that omits records for any student who has timestamps that "violate causality." A causality violation means that their submission time is strictly less than their start time.

Input: A timings dataframe, tdf, with the columns 'email', 'milestone', 'start date-time', and 'submit date-time'.

Your task: Identify all students who have any causality violations and omit their rows from the output dataframe.

Output: Return a new tibble that includes only rows from tdf for students whose problem-records do not violate causality. The output column .dtypes should match those of the input. You should not modify the input, tdf.

Comment(s): If a student has even one problem that violates causality, you should omit all of their rows. That is, we will consider their entire exam data invalid for the purpose of our subsequent analysis.
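One way to realize this all-or-nothing rule is to collect the offending emails first and then filter. A sketch on a toy frame (not necessarily the reference solution):

```python
import pandas as pd

# Toy timings frame: student k@x.com has one submit-before-start row.
tdf = pd.DataFrame({
    'email': ['k@x.com', 'k@x.com', 'b@x.com'],
    'start date-time': pd.to_datetime(
        ['2018-12-08 01:27:05', '2018-12-08 01:27:05', '2018-12-10 02:43:28']),
    'submit date-time': pd.to_datetime(
        ['2018-12-08 00:54:28', '2018-12-08 12:04:59', '2018-12-10 06:21:30']),
})

# Any student with at least one violating row is dropped entirely.
violators = tdf.loc[tdf['submit date-time'] < tdf['start date-time'], 'email'].unique()
clean = tdf[~tdf['email'].isin(violators)].reset_index(drop=True)
print(clean['email'].tolist())  # ['b@x.com']
```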

Ex. 4 demo input

In [22]:
### Define demo inputs ###
demo_tdf_ex4 = times_by_exid['exam3'].loc[[206, 279, 439, 1111, 1344, 1780, 2511, 3224, 3375]].reset_index(drop=True)
demo_tdf_ex4
Out[22]:
email milestone start date-time submit date-time
0 kenneth.bryland@gatech.edu problem0 2018-12-08 01:27:05 2018-12-08 00:54:28
1 daniel.tashera@gmail.com problem0 2018-12-09 01:12:56 2018-12-09 02:29:33
2 barbara.wilhemina@gatech.edu problem0 2018-12-10 02:43:28 2018-12-10 06:21:30
3 kenneth.bryland@gatech.edu problem1 2018-12-08 01:27:05 2018-12-08 01:26:03
4 barbara.wilhemina@gatech.edu problem2 2018-12-10 02:43:28 2018-12-10 09:09:43
5 kenneth.bryland@gatech.edu problem2 2018-12-08 01:27:05 2018-12-08 12:04:59
6 kenneth.bryland@gatech.edu problem4 2018-12-08 01:27:05 2018-12-08 13:14:38
7 barbara.wilhemina@gatech.edu problem5 2018-12-10 02:43:28 2018-12-10 09:04:35
8 kenneth.bryland@gatech.edu problem5 2018-12-08 01:27:05 2018-12-08 13:38:08

The demo included in the solution cell below should display the following output:

email milestone start date-time submit date-time
0 daniel.tashera@gmail.com problem0 2018-12-09 01:12:56 2018-12-09 02:29:33
1 barbara.wilhemina@gatech.edu problem0 2018-12-10 02:43:28 2018-12-10 06:21:30
2 barbara.wilhemina@gatech.edu problem2 2018-12-10 02:43:28 2018-12-10 09:09:43
3 barbara.wilhemina@gatech.edu problem5 2018-12-10 02:43:28 2018-12-10 09:04:35

Observe that kenneth.bryland@gatech.edu is omitted from this output because their 'problem0' and 'problem1' records violate causality. (Even though their remaining problems obey causality, we still drop them because of the problematic rows.)

Ex. 4 your code

In [24]:
### Exercise 4 solution ###
def remove_causality_violators(tdf):
    '''
    INPUT:
    A timings dataframe 'tdf' with the columns 'email', 'milestone', 'start date-time', and 'submit date-time'
    
    GOAL:
    Find rows where submit date_time < start date-time.
    Return a new dataframe with these bad rows omitted. 
    Omit all rows for that student if one or more bad rows exist.
    
    STRATEGY:
    1. Make a copy of input data frame. Let's call this 'tdf_final'
    2. Find rows where 'submit date-time' < 'start date-time'. Let's call these 'tdf_bad_rows'.
    3. Get a list of all of the student emails in the bad rows identified in Step 2. Let's call this 'bad_emails_list'
    4. Grab the rows from tdf_final where the email address is NOT in 'bad_emails_list'
    5. Reset index and return tdf_final
    '''
    
    # SOLUTION:
    tdf_final = tdf.copy()
    
    tdf_bad_rows = tdf_final[tdf_final['submit date-time'] < tdf_final['start date-time']]
    
    bad_emails_list = list(tdf_bad_rows['email'])
    
    tdf_final = tdf_final[~tdf_final['email'].isin(bad_emails_list)]
    
    return tdf_final.reset_index(drop=True)
    
    

### demo function call ###
remove_causality_violators(demo_tdf_ex4)
Out[24]:
email milestone start date-time submit date-time
0 daniel.tashera@gmail.com problem0 2018-12-09 01:12:56 2018-12-09 02:29:33
1 barbara.wilhemina@gatech.edu problem0 2018-12-10 02:43:28 2018-12-10 06:21:30
2 barbara.wilhemina@gatech.edu problem2 2018-12-10 02:43:28 2018-12-10 09:09:43
3 barbara.wilhemina@gatech.edu problem5 2018-12-10 02:43:28 2018-12-10 09:04:35
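
The drop-by-email strategy above is one way to go; an equivalent sketch uses pandas' groupby(...).filter to keep only students whose every row obeys causality. (The tiny frame below is illustrative; names and timestamps are made up.)

```python
import pandas as pd

# Hypothetical mini-frame mimicking the timings layout (illustrative data only).
tdf = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'milestone': ['problem0', 'problem1', 'problem0'],
    'start date-time': pd.to_datetime(['2018-12-08 01:00', '2018-12-08 01:00',
                                       '2018-12-08 02:00']),
    'submit date-time': pd.to_datetime(['2018-12-08 00:30', '2018-12-08 02:00',
                                        '2018-12-08 03:00']),
})

# Keep only students whose every row satisfies submit >= start.
kept = (tdf.groupby('email')
           .filter(lambda g: (g['submit date-time'] >= g['start date-time']).all())
           .reset_index(drop=True))
print(kept['email'].tolist())  # 'a@x.com' is dropped entirely
```

Note that filter drops all of a student's rows as soon as any one row violates causality, which matches the requirement of this exercise.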

Ex. 4 test

The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [25]:
### test_cell_ex4
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_4', 
    'func': remove_causality_violators, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'tdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 4 RUNME (precomputed solutions)

The cell below creates a precomputed object, valid_times_by_exid. It is the result of a correct implementation of remove_causality_violators when run on times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [26]:
valid_times_by_exid = cse6040.utils.load_obj_from_file('valid_times.dill')

# Example:
assert (valid_times_by_exid['exam3']['submit date-time'] >= valid_times_by_exid['exam3']['start date-time']).all()
display(valid_times_by_exid['exam1'].iloc[:4])
valid_times_by_exid['exam1'].info()
email milestone start date-time submit date-time
0 thelma.ousman@gmail.com problem0 2018-09-28 11:24:09 2018-09-28 19:06:18
1 christopher.dailee@gmail.com problem0 2018-09-28 08:42:26 2018-09-28 17:45:59
2 paul.jawaski@gmail.com problem0 2018-09-29 07:40:21 2018-09-29 22:50:55
3 matthew.taw@gmail.com problem0 2018-09-30 08:05:01 2018-09-30 09:41:23
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3304 entries, 0 to 3303
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   email             3304 non-null   object        
 1   milestone         3304 non-null   object        
 2   start date-time   3304 non-null   datetime64[ns]
 3   submit date-time  3304 non-null   datetime64[ns]
dtypes: datetime64[ns](2), object(2)
memory usage: 103.4+ KB

Ex. 5 (1 pt): clean_times_colnames

Let's tidy up the column names for the timings data. Complete the function,

def clean_times_colnames(tdf):
    ...

to return a new tibble from tdf with columns renamed as follows.

  • 'email' => 'Email'
  • 'milestone' => 'Milestone'
  • 'start date-time' => 'Start'
  • 'submit date-time' => 'Submit'

Element types for the columns (.dtype) should be preserved, and the input dataframe should not be modified.

Ex. 5 demo input

In [27]:
### define demo inputs ###
demo_tdf_ex5 = valid_times_by_exid['exam2'].sample(5, random_state=567_890_123, replace=False).reset_index(drop=True)
demo_tdf_ex5
Out[27]:
email milestone start date-time submit date-time
0 jazlyn.florin@gmail.com problem4 2018-11-04 12:45:10 2018-11-05 20:50:41
1 christian.miyisha@yahoo.com problem2 2018-11-03 07:38:00 2018-11-04 12:21:12
2 taryn.raneshia@yahoo.co.id problem0 2018-11-04 08:25:24 2018-11-05 17:59:02
3 joseph.sherburne@gmail.com problem1 2018-11-02 16:45:06 2018-11-03 01:47:58
4 thomas.dyanni@gmail.com problem3 2018-11-03 14:20:56 2018-11-04 23:31:18

The demo included in the solution cell below should display the following output:

Email Milestone Start Submit
0 jazlyn.florin@gmail.com problem4 2018-11-04 12:45:10 2018-11-05 20:50:41
1 christian.miyisha@yahoo.com problem2 2018-11-03 07:38:00 2018-11-04 12:21:12
2 taryn.raneshia@yahoo.co.id problem0 2018-11-04 08:25:24 2018-11-05 17:59:02
3 joseph.sherburne@gmail.com problem1 2018-11-02 16:45:06 2018-11-03 01:47:58
4 thomas.dyanni@gmail.com problem3 2018-11-03 14:20:56 2018-11-04 23:31:18

Ex. 5 your code

In [28]:
### Exercise 5 solution
def clean_times_colnames(tdf):
    '''
    INPUT: 
    a data frame 'tdf'
    
    GOAL:
    Copy data frame and rename columns from:
    'email' => 'Email'
    'milestone' => 'Milestone'
    'start date-time' => 'Start'
    'submit date-time' => 'Submit'
    
    STRATEGY:
    1. Make a copy of input data frame. Let's call this 'tdf_final'
    2. Rename columns as instructed
    3. Return 'tdf_final'
    '''
    
    #SOLUTION:
    tdf_final = tdf.copy()
    
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
    tdf_final = tdf_final.rename(columns={'email':'Email', 'milestone':'Milestone', 'start date-time':'Start', 
                                          'submit date-time':'Submit'})
    
    return tdf_final  

### demo function call ###
clean_times_colnames(demo_tdf_ex5)
Out[28]:
Email Milestone Start Submit
0 jazlyn.florin@gmail.com problem4 2018-11-04 12:45:10 2018-11-05 20:50:41
1 christian.miyisha@yahoo.com problem2 2018-11-03 07:38:00 2018-11-04 12:21:12
2 taryn.raneshia@yahoo.co.id problem0 2018-11-04 08:25:24 2018-11-05 17:59:02
3 joseph.sherburne@gmail.com problem1 2018-11-02 16:45:06 2018-11-03 01:47:58
4 thomas.dyanni@gmail.com problem3 2018-11-03 14:20:56 2018-11-04 23:31:18

Ex. 5 test

The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [29]:
### test_cell_ex5
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_5', 
    'func': clean_times_colnames, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'tdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 5 RUNME (precomputed solutions)

The cell below creates a precomputed object, clean_times_by_exid. It is the result of a correct implementation of clean_times_colnames when run on valid_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [30]:
clean_times_by_exid = cse6040.utils.load_obj_from_file('clean_times.dill')

# Example:
display(clean_times_by_exid['exam1'].head())
clean_times_by_exid['exam1'].info()
Email Milestone Start Submit
0 thelma.ousman@gmail.com problem0 2018-09-28 11:24:09 2018-09-28 19:06:18
1 christopher.dailee@gmail.com problem0 2018-09-28 08:42:26 2018-09-28 17:45:59
2 paul.jawaski@gmail.com problem0 2018-09-29 07:40:21 2018-09-29 22:50:55
3 matthew.taw@gmail.com problem0 2018-09-30 08:05:01 2018-09-30 09:41:23
4 elijah.lareta@gatech.edu problem0 2018-09-28 11:23:46 2018-09-28 13:30:46
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3304 entries, 0 to 3303
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Email      3304 non-null   object        
 1   Milestone  3304 non-null   object        
 2   Start      3304 non-null   datetime64[ns]
 3   Submit     3304 non-null   datetime64[ns]
dtypes: datetime64[ns](2), object(2)
memory usage: 103.4+ KB

Ex. 6 (1 pt; FREE): Submission sequences

Our next analysis will estimate the test-takers' behavior by analyzing their submission times.

For example, consider the submission times for 'sherry.ayleigh@aol.com', stored in the dataframe demo_ex6_tdf below:

In [31]:
demo_ex6_tdf = clean_times_by_exid['exam3'].loc[[34, 593, 1624, 2189, 2344, 3122]]
demo_ex6_tdf
Out[31]:
Email Milestone Start Submit
34 sherry.ayleigh@aol.com problem0 2018-12-06 08:32:21 2018-12-06 09:07:16
593 sherry.ayleigh@aol.com problem1 2018-12-06 08:32:21 2018-12-06 10:04:10
1624 sherry.ayleigh@aol.com problem2 2018-12-06 08:32:21 2018-12-06 11:30:28
2189 sherry.ayleigh@aol.com problem3 2018-12-06 08:32:21 2018-12-06 20:19:21
2344 sherry.ayleigh@aol.com problem4 2018-12-06 08:32:21 2018-12-06 17:53:48
3122 sherry.ayleigh@aol.com problem5 2018-12-06 08:32:21 2018-12-06 13:34:13

Based on the submission times, ordering the rows by the 'Submit' timestamp would yield a sensible timeline of submission events:

Email Milestone Start Submit
34 sherry.ayleigh@aol.com problem0 2018-12-06 08:32:21 2018-12-06 09:07:16
593 sherry.ayleigh@aol.com problem1 2018-12-06 08:32:21 2018-12-06 10:04:10
1624 sherry.ayleigh@aol.com problem2 2018-12-06 08:32:21 2018-12-06 11:30:28
3122 sherry.ayleigh@aol.com problem5 2018-12-06 08:32:21 2018-12-06 13:34:13
2344 sherry.ayleigh@aol.com problem4 2018-12-06 08:32:21 2018-12-06 17:53:48
2189 sherry.ayleigh@aol.com problem3 2018-12-06 08:32:21 2018-12-06 20:19:21

We refer to this ordering of submission times as the submission sequence. It is our best guess as to how they moved through the exam if we assume they did one problem at a time. From this submission sequence for this student, we might conclude the following.

First, the student started the exam at 2018-12-06 08:32:21 and submitted 'problem0' at 2018-12-06 09:07:16. That would be:

In [32]:
demo_ex6_tdf['Submit'].iloc[0] - demo_ex6_tdf['Start'].iloc[0]
Out[32]:
Timedelta('0 days 00:34:55')

or 34 minutes and 55 seconds, which is also 2,095 seconds = (2095/3600) ~ 0.582 hours, rounding to three decimal digits. In other words, we will assume that the student started the exam and then spent 0.582 hours on Problem 0.

Next, the student submitted 'problem1' at 2018-12-06 10:04:10. Relative to the previous submission time, that is 56 minutes and 54 seconds, or 3,414 seconds = (3414/3600) ~ 0.948 hours, again rounded to three digits. So we might guess that the student spent that much time on Problem 1.

Continuing, the sequence of problems solved was Problem 0, 1, 2, 5, 4, and finally, 3. The approximate time spent on each is 0.582, 0.948, 1.438, 2.062, 4.326, and 2.426 hours, respectively.
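
The arithmetic above can be reproduced directly in pandas: sort by 'Submit', take the first gap relative to 'Start', and then successive differences between submissions. This is a sketch using the first three of sherry.ayleigh@aol.com's timestamps from the demo, with the cleaned (Ex. 5) column names.

```python
import pandas as pd

# One student's records in the cleaned (Ex. 5) schema; timestamps from the demo above.
df = pd.DataFrame({
    'Start':  pd.to_datetime(['2018-12-06 08:32:21'] * 3),
    'Submit': pd.to_datetime(['2018-12-06 09:07:16',
                              '2018-12-06 10:04:10',
                              '2018-12-06 11:30:28']),
    'Milestone': ['problem0', 'problem1', 'problem2'],
})

df = df.sort_values('Submit')
# Gap to the previous event: the Start time for the first row, the prior Submit otherwise.
prev = df['Submit'].shift(1).fillna(df['Start'].iloc[0])
hours = ((df['Submit'] - prev) / pd.Timedelta(hours=1)).round(3)
print(hours.tolist())  # [0.582, 0.948, 1.438]
```

Dividing a Timedelta series by pd.Timedelta(hours=1) is a convenient way to convert elapsed time to fractional hours.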

Ex. 6 test (FREEBIE)

In [33]:
### test_cell_ex6 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
This exercise is free. To get the points, please run this cell _and_ submit.

Ex. 7 (3 pts): sequence_submissions

Suppose you are given a (clean) timings tibble, tdf, such as computed in Exercise 5. Complete the function,

def sequence_submissions(tdf):
    ...

so that it sequences the submissions for each student and returns a new tibble matching the format described below.

Input: A clean timings tibble, tdf, with the columns 'Email', 'Milestone', 'Start', and 'Submit', per Ex. 5.

Your task: For each student, order their submissions and determine the amount of time spent on each problem, as illustrated in Ex. 6.

Output: Without modifying the input tibble, return a new tibble with the following columns:

  • 'Email', 'Milestone', 'Start', and 'Submit', which keep the same values as given in tdf.
  • 'Hours', the time in hours (a floating-point value) rounded to 3 decimal digits, corresponding to the estimated time spent on each problem.

That is, each row of the output will indicate the estimated time spent by the student on the given problem based on the estimated timeline.

Additional requirements: The rows of this new tibble should be sorted by 'Email' and 'Submit' timestamp in ascending order.

Ex. 7 demo input

In [34]:
### Define demo inputs ###
demo_tdf_ex7 = cse6040.utils.load_df_from_file('demo_tdf_ex7.df')
demo_tdf_ex7
Out[34]:
Email Milestone Start Submit
0 sherry.ayleigh@aol.com problem0 2018-12-06 08:32:21 2018-12-06 09:07:16
1 chelsea.jaretzi@utexas.edu problem0 2018-12-11 09:50:18 2018-12-12 02:40:16
2 sherry.ayleigh@aol.com problem1 2018-12-06 08:32:21 2018-12-06 10:04:10
3 jessica.rinda@gatech.edu problem1 2018-12-07 12:05:41 2018-12-07 18:04:19
4 chelsea.jaretzi@utexas.edu problem1 2018-12-11 09:50:18 2018-12-12 02:34:35
5 chelsea.jaretzi@utexas.edu problem2 2018-12-11 09:50:18 2018-12-12 06:17:25
6 jessica.rinda@gatech.edu problem2 2018-12-07 12:05:41 2018-12-07 18:21:51
7 sherry.ayleigh@aol.com problem2 2018-12-06 08:32:21 2018-12-06 11:30:28
8 chelsea.jaretzi@utexas.edu problem3 2018-12-11 09:50:18 2018-12-12 05:49:20
9 sherry.ayleigh@aol.com problem3 2018-12-06 08:32:21 2018-12-06 20:19:21
10 sherry.ayleigh@aol.com problem4 2018-12-06 08:32:21 2018-12-06 17:53:48
11 jessica.rinda@gatech.edu problem5 2018-12-07 12:05:41 2018-12-07 18:11:47
12 sherry.ayleigh@aol.com problem5 2018-12-06 08:32:21 2018-12-06 13:34:13
13 chelsea.jaretzi@utexas.edu problem5 2018-12-11 09:50:18 2018-12-12 03:45:49

In the preceding demo input, there are three students: 'sherry.ayleigh@aol.com', 'chelsea.jaretzi@utexas.edu', and 'jessica.rinda@gatech.edu'. Observe that

  • 'sherry.ayleigh@aol.com' submitted Problems 0-5 inclusive.
  • 'chelsea.jaretzi@utexas.edu' submitted Problems 0, 1, 2, 3, and 5. (So they skipped Problem 4.)
  • 'jessica.rinda@gatech.edu' submitted Problems 1, 2, and 5. (So they skipped Problems 0, 3, and 4.)

A correct implementation of sequence_submissions on this input would produce:

Email Milestone Start Submit Hours
0 chelsea.jaretzi@utexas.edu problem1 2018-12-11 09:50:18 2018-12-12 02:34:35 16.738
1 chelsea.jaretzi@utexas.edu problem0 2018-12-11 09:50:18 2018-12-12 02:40:16 0.095
2 chelsea.jaretzi@utexas.edu problem5 2018-12-11 09:50:18 2018-12-12 03:45:49 1.092
3 chelsea.jaretzi@utexas.edu problem3 2018-12-11 09:50:18 2018-12-12 05:49:20 2.059
4 chelsea.jaretzi@utexas.edu problem2 2018-12-11 09:50:18 2018-12-12 06:17:25 0.468
5 jessica.rinda@gatech.edu problem1 2018-12-07 12:05:41 2018-12-07 18:04:19 5.977
6 jessica.rinda@gatech.edu problem5 2018-12-07 12:05:41 2018-12-07 18:11:47 0.125
7 jessica.rinda@gatech.edu problem2 2018-12-07 12:05:41 2018-12-07 18:21:51 0.167
8 sherry.ayleigh@aol.com problem0 2018-12-06 08:32:21 2018-12-06 09:07:16 0.582
9 sherry.ayleigh@aol.com problem1 2018-12-06 08:32:21 2018-12-06 10:04:10 0.948
10 sherry.ayleigh@aol.com problem2 2018-12-06 08:32:21 2018-12-06 11:30:28 1.439
11 sherry.ayleigh@aol.com problem5 2018-12-06 08:32:21 2018-12-06 13:34:13 2.062
12 sherry.ayleigh@aol.com problem4 2018-12-06 08:32:21 2018-12-06 17:53:48 4.326
13 sherry.ayleigh@aol.com problem3 2018-12-06 08:32:21 2018-12-06 20:19:21 2.426

Observe the ordering by 'Email' and 'Submit' in ascending order. Also observe that the values in 'Hours' are rounded to 3 decimal digits and correspond to differences between consecutive submission times (or, for each student's first entry, between the submission and start times).

Ex. 7 your code

In [35]:
### Exercise 7 solution
def sequence_submissions(tdf):
    '''
    INPUT: 
    A data frame 'tdf' with the columns 'Email', 'Milestone', 'Start', and 'Submit'
    
    GOAL:
    -Do not modify input. 
    -Sort by 'Email' and 'Submit' timestamp in ascending order.
    -Return new data frame with columns 'Email', 'Milestone', 'Start', and 'Submit' with same values as given in tdf.
    -Add new column 'Hours', the time in hours (a floating-point value) rounded to 3 decimal digits, 
    corresponding to the estimated time spent on each problem.
    
    STRATEGY (not using groupby + apply + helper function):
    1. Make a copy of the input data frame. Let's call this 'df'
    2. Sort by 'Email' and 'Submit' in ascending order.
    3. Reset index.
    4. Create a new column that holds the difference between the Submit time and the Start time. 
       Let's call this 'Hours from Start'
    5. Convert this column to hours (use hint above or something you find online)
    6. Now let's create a new column that's the same as 'Hours from Start' but shifted down one row for each GROUP 
       (so we should group by Email). Let's call this 'Shifted Hours'
    7. Now calculate 'Hours from Start' - 'Shifted Hours'. Make this a new column called 'Hours'
    8. Round the 'Hours' column to 3 decimal places
    9. Drop the columns 'Hours from Start' and 'Shifted Hours'
    10. Return the data frame
    '''
    
    # SOLUTION:
    df = tdf.copy()
    
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
    df = df.sort_values(by=['Email','Submit'])
    df = df.reset_index(drop=True)
    
    df['Hours from Start'] = df['Submit'] - df['Start']
    
    # https://www.statology.org/pandas-convert-timedelta-to-int/
    df['Hours from Start'] = df['Hours from Start'] / pd.Timedelta(hours=1)
    
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
    g = df.groupby('Email')
    df['Shifted Hours'] = g['Hours from Start'].shift(periods=1, fill_value=0)
    
    df['Hours'] = df['Hours from Start'] - df['Shifted Hours']
    df['Hours'] = df['Hours'].round(3)
    df = df.drop(columns=['Hours from Start', 'Shifted Hours'])
    
    return df
    
### demo function call ###
sequence_submissions(demo_tdf_ex7)
Out[35]:
Email Milestone Start Submit Hours
0 chelsea.jaretzi@utexas.edu problem1 2018-12-11 09:50:18 2018-12-12 02:34:35 16.738
1 chelsea.jaretzi@utexas.edu problem0 2018-12-11 09:50:18 2018-12-12 02:40:16 0.095
2 chelsea.jaretzi@utexas.edu problem5 2018-12-11 09:50:18 2018-12-12 03:45:49 1.093
3 chelsea.jaretzi@utexas.edu problem3 2018-12-11 09:50:18 2018-12-12 05:49:20 2.059
4 chelsea.jaretzi@utexas.edu problem2 2018-12-11 09:50:18 2018-12-12 06:17:25 0.468
5 jessica.rinda@gatech.edu problem1 2018-12-07 12:05:41 2018-12-07 18:04:19 5.977
6 jessica.rinda@gatech.edu problem5 2018-12-07 12:05:41 2018-12-07 18:11:47 0.124
7 jessica.rinda@gatech.edu problem2 2018-12-07 12:05:41 2018-12-07 18:21:51 0.168
8 sherry.ayleigh@aol.com problem0 2018-12-06 08:32:21 2018-12-06 09:07:16 0.582
9 sherry.ayleigh@aol.com problem1 2018-12-06 08:32:21 2018-12-06 10:04:10 0.948
10 sherry.ayleigh@aol.com problem2 2018-12-06 08:32:21 2018-12-06 11:30:28 1.438
11 sherry.ayleigh@aol.com problem5 2018-12-06 08:32:21 2018-12-06 13:34:13 2.063
12 sherry.ayleigh@aol.com problem4 2018-12-06 08:32:21 2018-12-06 17:53:48 4.326
13 sherry.ayleigh@aol.com problem3 2018-12-06 08:32:21 2018-12-06 20:19:21 2.426
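
A more compact variant of the shift-based strategy above uses groupby(...)['Submit'].diff(), which computes within-student gaps directly; each group's first gap comes back as NaT and can be backfilled with that row's Submit minus Start. A minimal sketch (the two-student frame below is made up for illustration):

```python
import pandas as pd

# Tiny illustrative frame in the Ex. 5 schema (two students, invented times).
df = pd.DataFrame({
    'Email':  ['a@x.com', 'a@x.com', 'b@x.com'],
    'Start':  pd.to_datetime(['2018-12-06 08:00:00'] * 2 + ['2018-12-06 09:00:00']),
    'Submit': pd.to_datetime(['2018-12-06 09:30:00',
                              '2018-12-06 10:00:00',
                              '2018-12-06 09:45:00']),
})

df = df.sort_values(['Email', 'Submit']).reset_index(drop=True)
# Within each student, diff() gives Submit-to-Submit gaps; the first gap in
# each group is NaT, which we fill with that row's Submit - Start.
gaps = df.groupby('Email')['Submit'].diff()
gaps = gaps.fillna(df['Submit'] - df['Start'])
df['Hours'] = (gaps / pd.Timedelta(hours=1)).round(3)
print(df['Hours'].tolist())  # [1.5, 0.5, 0.75]
```

Either approach passes within the test cell's float tolerance; small last-digit differences (e.g., 2.062 vs. 2.063) come from rounding order in floating-point arithmetic.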

Ex. 7 test

The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [36]:
### test_cell_ex7
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_7', 
    'func': sequence_submissions, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'tdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 2*10 ** (-3)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 7 RUNME (precomputed solutions)

The cell below creates a precomputed object, hours_by_exid. It is the result of a correct implementation of sequence_submissions when run on clean_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [37]:
hours_by_exid = cse6040.utils.load_obj_from_file('hours.dill')

# Example:
display(hours_by_exid['exam1'].head())
hours_by_exid['exam1'].info()
Email Milestone Start Submit Hours
0 aaliyah.tayviona@gatech.edu problem1 2018-09-28 08:35:59 2018-09-28 15:23:24 6.790
1 aaliyah.tayviona@gatech.edu problem0 2018-09-28 08:35:59 2018-09-29 01:13:46 9.840
2 aaliyah.tayviona@gatech.edu problem3 2018-09-28 08:35:59 2018-09-29 07:20:57 6.119
3 aaron.kyerra@foxmail.com problem0 2018-09-29 09:57:52 2018-09-29 11:21:16 1.390
4 aaron.kyerra@foxmail.com problem1 2018-09-29 09:57:52 2018-09-29 12:07:36 0.772
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3304 entries, 0 to 3303
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Email      3304 non-null   object        
 1   Milestone  3304 non-null   object        
 2   Start      3304 non-null   datetime64[ns]
 3   Submit     3304 non-null   datetime64[ns]
 4   Hours      3304 non-null   float64       
dtypes: datetime64[ns](2), float64(1), object(2)
memory usage: 129.2+ KB

Ex. 8 (3 pts): merge_gt

Given a tibble pointsdf of earned points (per Ex. 2) and another tibble hoursdf of sequenced hours-per-problem (per Ex. 7), complete the function,

def merge_gt(pointsdf, hoursdf):
    ...

so that it merges these into a single tibble formatted as described below.

Inputs:

  • pointsdf: A tibble of earned points, as computed by Exercise 2. It should have the columns 'Email', one or more columns named 'problem#' (where # is an integer string), and 'Earned'.
  • hoursdf: A tibble of sequenced hours-per-problem, as computed in Exercise 7. It should have the columns 'Email', 'Milestone', 'Start', 'Submit', and 'Hours'.

Your tasks:

  • Melt a copy of pointsdf so that its 'problem#' columns become values of a new column named 'Milestone' and the point values become a new column named 'Points'. The 'Points' column should hold floats.
  • Merge this melted result with hoursdf on the common 'Email' and 'Milestone' key-columns. (Use an inner-join/merge so that only keys that match both inputs are retained.)
  • Finally, sort this result by 'Email' and 'Submit' in ascending order.

Output: Return the result as a new tibble with the columns 'Email', 'Start', 'Submit', 'Milestone', 'Points', and 'Hours'. You should not modify either input, pointsdf or hoursdf.

Ex. 8 demo input

In [38]:
### Define demo inputs ###
demo_pointsdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['pointsdf']
demo_hoursdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['hoursdf']

print("* `demo_pointsdf_ex8`:")
display(demo_pointsdf_ex8)

print("* `demo_hoursdf_ex8`:")
display(demo_hoursdf_ex8)
* `demo_pointsdf_ex8`:
Email problem0 problem1 problem2 problem3 problem4 problem5 Earned
0 justin.jernee@gatech.edu 10.0 10.0 2.0 10.0 3.0 5.0 40.0
1 canaan.cynsere@hotmail.com NaN 10.0 8.0 10.0 3.0 10.0 41.0
2 terry.shantrece@gmail.com 10.0 10.0 3.0 3.0 3.0 2.0 31.0
* `demo_hoursdf_ex8`:
Email Milestone Start Submit Hours
0 justin.jernee@gatech.edu problem2 2018-12-08 10:53:50 2018-12-08 15:12:42 4.314
1 justin.jernee@gatech.edu problem0 2018-12-08 10:53:50 2018-12-08 19:21:39 4.150
2 justin.jernee@gatech.edu problem1 2018-12-08 10:53:50 2018-12-08 20:16:38 0.916
3 justin.jernee@gatech.edu problem5 2018-12-08 10:53:50 2018-12-09 13:53:43 17.618
4 justin.jernee@gatech.edu problem3 2018-12-08 10:53:50 2018-12-09 20:06:24 6.211
5 justin.jernee@gatech.edu problem4 2018-12-08 10:53:50 2018-12-09 22:08:52 2.042
6 micheal.glenola@gatech.edu problem0 2018-12-08 21:46:59 2018-12-09 03:48:29 6.025
7 micheal.glenola@gatech.edu problem1 2018-12-08 21:46:59 2018-12-09 05:19:48 1.522
8 micheal.glenola@gatech.edu problem2 2018-12-08 21:46:59 2018-12-09 06:59:44 1.666
9 micheal.glenola@gatech.edu problem5 2018-12-08 21:46:59 2018-12-09 17:02:13 10.041
10 micheal.glenola@gatech.edu problem4 2018-12-08 21:46:59 2018-12-09 20:09:08 3.115

In the above inputs, observe that only 'justin.jernee@gatech.edu' has data in both the points and hours tibbles. Therefore, the merged result will only include that student's data, since we are using an inner merge. For this input, the result is:

Email Earned Milestone Points Start Submit Hours
0 justin.jernee@gatech.edu 40.0 problem2 2.0 2018-12-08 10:53:50 2018-12-08 15:12:42 4.314
1 justin.jernee@gatech.edu 40.0 problem0 10.0 2018-12-08 10:53:50 2018-12-08 19:21:39 4.150
2 justin.jernee@gatech.edu 40.0 problem1 10.0 2018-12-08 10:53:50 2018-12-08 20:16:38 0.916
3 justin.jernee@gatech.edu 40.0 problem5 5.0 2018-12-08 10:53:50 2018-12-09 13:53:43 17.618
4 justin.jernee@gatech.edu 40.0 problem3 10.0 2018-12-08 10:53:50 2018-12-09 20:06:24 6.211
5 justin.jernee@gatech.edu 40.0 problem4 3.0 2018-12-08 10:53:50 2018-12-09 22:08:52 2.042

Observe the ordering of the rows. The problems appear in ascending order of submission time, per the sorting requirement of this exercise.

Ex. 8 your code

In [40]:
### Exercise 8 solution
def merge_gt(pointsdf, hoursdf):
    '''
    INPUT:
    1. pointsdf: has columns 'Email', 'problem#' columns where # is an integer string, and 'Earned'
    2. hoursdf: has columns 'Email', 'Milestone', 'Start', 'Submit', and 'Hours'
    
    GOAL:
    Melt pointdf so that the 'problem#' columns become rows (in a column called 'Milestone'), 
    and the points become floats in a column called 'Points'.
    Inner merge this with hoursdf on 'Email' and 'Milestone'
    Final data frame should have columns: 'Email', 'Earned', 'Milestone', 'Points', 'Start', 'Submit', and 'Hours'
    
    STRATEGY:
    1. Melt pointsdf. Keep 'Email' and 'Earned' as is, all other columns (aka the problem ones) get melted. 
       Problem column should be named 'Milestone', and point values should be called 'Points'. 
       Let's call this new data frame melted_df.
    2. Inner merge melted_df with hoursdf on 'Email' and 'Milestone'. Let's call this new data frame merged_df.
    3. Sort merged_df by 'Email' and 'Submit' columns in ascending order
    4. Reset index
    5. Return this data frame
    '''
    
    # SOLUTION:
    # Easiest Way to Melt:
    # https://pandas.pydata.org/docs/reference/api/pandas.melt.html
    #melted_df = pd.melt(pointsdf, id_vars=['Email','Earned'], var_name='Milestone', value_name='Points')

    # Alternative melt based on the Notebook 7 melt function (refactored into a
    # helper per Office Hours feedback). Here we know which columns to KEEP
    # (keep_vars) rather than which to melt, so the column difference is taken
    # the other way around and the parameter is keep_vars instead of col_vars.
    def melt(df, keep_vars, key, value):
        col_vals = df.columns.difference(keep_vars)
        melted_sections = []
        for c in col_vals:
            melted_c = df[keep_vars].copy()
            melted_c[key] = c
            melted_c[value] = df[c]
            melted_sections.append(melted_c)
        melted = pd.concat(melted_sections)
        return melted
    
    
    melted_df = melt(pointsdf, ['Email', 'Earned'], 'Milestone', 'Points')
    
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
    merged_df = melted_df.merge(hoursdf, how='inner', on=['Email', 'Milestone'])
    
    sorted_df = merged_df.sort_values(by=['Email','Submit'])
    
    return sorted_df.reset_index(drop=True)
    
    
### demo function call ###
merge_gt(demo_pointsdf_ex8, demo_hoursdf_ex8)
Out[40]:
Email Earned Milestone Points Start Submit Hours
0 justin.jernee@gatech.edu 40.0 problem2 2.0 2018-12-08 10:53:50 2018-12-08 15:12:42 4.314
1 justin.jernee@gatech.edu 40.0 problem0 10.0 2018-12-08 10:53:50 2018-12-08 19:21:39 4.150
2 justin.jernee@gatech.edu 40.0 problem1 10.0 2018-12-08 10:53:50 2018-12-08 20:16:38 0.916
3 justin.jernee@gatech.edu 40.0 problem5 5.0 2018-12-08 10:53:50 2018-12-09 13:53:43 17.618
4 justin.jernee@gatech.edu 40.0 problem3 10.0 2018-12-08 10:53:50 2018-12-09 20:06:24 6.211
5 justin.jernee@gatech.edu 40.0 problem4 3.0 2018-12-08 10:53:50 2018-12-09 22:08:52 2.042
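
The hand-rolled melt helper above and the commented-out pd.melt call produce the same reshaping. A minimal sketch of pd.melt's behavior on a toy points table (invented data; the column names follow this exercise's schema):

```python
import pandas as pd

# Toy points table shaped like `pointsdf` (one student, two problems).
pointsdf = pd.DataFrame({
    'Email': ['x@x.com'],
    'problem0': [10.0],
    'problem1': [5.0],
    'Earned': [15.0],
})

# id_vars stay as columns; every other column becomes a (Milestone, Points) pair.
melted = pd.melt(pointsdf, id_vars=['Email', 'Earned'],
                 var_name='Milestone', value_name='Points')
print(melted.sort_values('Milestone').to_string(index=False))
```

After melting, the frame has one row per (student, problem), which is exactly the shape needed for the inner merge with hoursdf on 'Email' and 'Milestone'.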

Ex. 8 test

The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [41]:
### test_cell_ex8
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_8', 
    'func': merge_gt, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'pointsdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        },
        'hoursdf':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 8 RUNME (precomputed solutions)

The cell below creates a precomputed object, merged_by_exid. It is the result of a correct implementation of merge_gt when run on earned_by_exid and hours_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [42]:
merged_by_exid = cse6040.utils.load_obj_from_file('merged.dill')

# Example:
display(merged_by_exid['exam1'].head())
merged_by_exid['exam1'].info()
Email Earned Milestone Points Start Submit Hours
0 aaliyah.tayviona@gatech.edu 23.0 problem1 8.0 2018-09-28 08:35:59 2018-09-28 15:23:24 6.790
1 aaliyah.tayviona@gatech.edu 23.0 problem0 10.0 2018-09-28 08:35:59 2018-09-29 01:13:46 9.840
2 aaliyah.tayviona@gatech.edu 23.0 problem3 5.0 2018-09-28 08:35:59 2018-09-29 07:20:57 6.119
3 aaron.kyerra@foxmail.com 47.0 problem0 7.0 2018-09-29 09:57:52 2018-09-29 11:21:16 1.390
4 aaron.kyerra@foxmail.com 47.0 problem1 10.0 2018-09-29 09:57:52 2018-09-29 12:07:36 0.772
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3304 entries, 0 to 3303
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Email      3304 non-null   object        
 1   Earned     3304 non-null   float64       
 2   Milestone  3304 non-null   object        
 3   Points     3303 non-null   float64       
 4   Start      3304 non-null   datetime64[ns]
 5   Submit     3304 non-null   datetime64[ns]
 6   Hours      3304 non-null   float64       
dtypes: datetime64[ns](2), float64(3), object(2)
memory usage: 180.8+ KB

Ex. 9 (2 pts): concat_exams

Exercise 8 produces a (merged) dataframe for each exam. But when we have multiple exams, we may want to analyze them all at once.

Suppose you are given a dictionary of exams, df_by_exid, whose keys are exam IDs (strings) and values are dataframes. Complete the function,

def concat_exams(df_by_exid):
    ...

to "concatenate" these as explained below.

Input: df_by_exid. If exid is a string exam ID, then df_by_exid[exid] is a dataframe associated with that exam. Assume that all dataframes in df_by_exid have the same columns.

Output:

  • Return a new combined tibble formed by "stacking" these dataframes one on top of the other.
  • To tell which rows came from which dataframe, include a column named 'Exam' whose value records the originating exid key.

Comments/hints:

  • You are asked to return a tibble, so be sure your returned output has a clean index.
  • As long as the 'Exam' value is set correctly, the ordering of rows and columns does not matter.
  • Remember that you should not modify any of the input dataframes! If you do, you may observe strange behavior or cause the autograder to fail.
  • You may assume that all input dataframes are nonempty. However, you should not assume anything else about what keys exist or what the columns of the input dataframe are, other than all dataframes will have the same columns.
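
As an aside (not required for the exercise), `pd.concat` can attach the originating key itself when you pass it a dict, via its `keys`/`names` machinery; a minimal sketch on invented toy frames:

```python
import pandas as pd

# Two toy "exam" frames with identical columns (hypothetical data)
df_by_exid = {
    'exam1': pd.DataFrame({'Email': ['a@x.com'], 'Points': [45.0]}),
    'exam2': pd.DataFrame({'Email': ['b@x.com'], 'Points': [30.0]}),
}

# Passing a dict makes its keys the outer index level; naming that
# level 'Exam' lets reset_index turn it into a regular column.
stacked = (pd.concat(df_by_exid, names=['Exam', 'orig_row'])
             .reset_index(level='Exam')
             .reset_index(drop=True))
print(stacked)
```

Since the exercise says row order does not matter as long as `'Exam'` is set correctly, either this one-liner or an explicit copy-and-append loop over the dict is acceptable.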

Ex. 9 demo input

In [43]:
### Define demo inputs ###
demo_df_by_exid_ex9 = cse6040.utils.load_obj_from_file('demo_df_by_exid_ex9.dill')

for exid, df in demo_df_by_exid_ex9.items():
    print(f"\n*** exid = '{exid}' ***")
    display(demo_df_by_exid_ex9[exid])
*** exid = 'exam3' ***
Start Points Earned Email Submit Milestone Hours
0 2018-12-09 08:55:36 2.0 38.0 chelsea.dilann@gatech.edu 2018-12-10 11:17:14 problem2 8.557
1 2018-12-08 12:27:33 5.5 25.5 kristopher.erandi@gmail.com 2018-12-09 16:09:17 problem0 2.831
*** exid = 'exam1' ***
Start Points Earned Email Submit Milestone Hours
0 2018-09-30 01:05:14 10.0 45.0 meredith.rakyla@hotmail.com 2018-09-30 19:11:33 problem2 3.386
1 2018-09-28 16:21:04 10.0 45.0 stacy.loneta@gmail.com 2018-09-28 17:51:52 problem0 1.513
*** exid = 'exam2' ***
Start Points Earned Email Submit Milestone Hours
0 2018-11-03 20:05:45 2.0 22.0 leta.kartier@gmail.com 2018-11-05 05:16:51 problem1 2.097
1 2018-11-04 20:02:24 5.0 16.0 christopher.jullianne@gatech.edu 2018-11-05 10:51:45 problem0 14.822
2 2018-11-03 06:59:28 10.0 50.0 lola.fysher@gmail.com 2018-11-03 08:06:09 problem1 1.111
3 2018-11-03 10:19:14 10.0 39.0 susan.tedrina@gatech.edu 2018-11-04 18:28:46 problem3 1.337

For the preceding inputs, a valid concatenated output is:

Start Points Earned Email Submit Milestone Hours Exam
0 2018-12-09 08:55:36 2 38 chelsea.dilann@gatech.edu 2018-12-10 11:17:14 problem2 8.557 exam3
1 2018-12-08 12:27:33 5.5 25.5 kristopher.erandi@gmail.com 2018-12-09 16:09:17 problem0 2.831 exam3
2 2018-09-30 01:05:14 10 45 meredith.rakyla@hotmail.com 2018-09-30 19:11:33 problem2 3.386 exam1
3 2018-09-28 16:21:04 10 45 stacy.loneta@gmail.com 2018-09-28 17:51:52 problem0 1.513 exam1
4 2018-11-03 20:05:45 2 22 leta.kartier@gmail.com 2018-11-05 05:16:51 problem1 2.097 exam2
5 2018-11-04 20:02:24 5 16 christopher.jullianne@gatech.edu 2018-11-05 10:51:45 problem0 14.822 exam2
6 2018-11-03 06:59:28 10 50 lola.fysher@gmail.com 2018-11-03 08:06:09 problem1 1.111 exam2
7 2018-11-03 10:19:14 10 39 susan.tedrina@gatech.edu 2018-11-04 18:28:46 problem3 1.337 exam2

Ex. 9 your code

In [44]:
### Exercise 9 solution
def concat_exams(df_by_exid):
    '''
    INPUT:
    -A dictionary df_by_exid. 
    -If exid is a string exam ID, then df_by_exid[exid] is a dataframe associated with that exam. 
    -All dataframes in df_by_exid have the same columns.
    
    GOAL:
    -Return a new combined tibble formed by "stacking" these dataframes one on top of the other.
    -Include a column named 'Exam' whose value records the originating exid key.
    
    STRATEGY:
    1. Create an empty list to hold all of our data frames we want to stack. Let's call this list_of_all_dfs
    2. Iterate over exid in df_by_exid
            3. Grab the associated data frame using df_by_exid[exid]. Let's call this 'df'
            4. Make a copy of 'df'. Let's call this 'df_copy'
            5. Add an 'Exam' column to 'df_copy' where all rows = exid
            6. Append 'df_copy' to our list_of_all_dfs
    7. Concatenate all of the data frames in our list_of_all_dfs. 
    8. Reset index and return this data frame
    '''
    
    # SOLUTION:
    list_of_all_dfs = []
    
    for exid in df_by_exid:
        df = df_by_exid[exid]
        df_copy = df.copy()
        df_copy['Exam'] = exid
        list_of_all_dfs.append(df_copy)
    
    # https://pandas.pydata.org/docs/reference/api/pandas.concat.html
    tall_df = pd.concat(list_of_all_dfs)
    return tall_df.reset_index(drop=True)
        
    
    
### demo function call ###
concat_exams(demo_df_by_exid_ex9)
Out[44]:
Start Points Earned Email Submit Milestone Hours Exam
0 2018-12-09 08:55:36 2.0 38.0 chelsea.dilann@gatech.edu 2018-12-10 11:17:14 problem2 8.557 exam3
1 2018-12-08 12:27:33 5.5 25.5 kristopher.erandi@gmail.com 2018-12-09 16:09:17 problem0 2.831 exam3
2 2018-09-30 01:05:14 10.0 45.0 meredith.rakyla@hotmail.com 2018-09-30 19:11:33 problem2 3.386 exam1
3 2018-09-28 16:21:04 10.0 45.0 stacy.loneta@gmail.com 2018-09-28 17:51:52 problem0 1.513 exam1
4 2018-11-03 20:05:45 2.0 22.0 leta.kartier@gmail.com 2018-11-05 05:16:51 problem1 2.097 exam2
5 2018-11-04 20:02:24 5.0 16.0 christopher.jullianne@gatech.edu 2018-11-05 10:51:45 problem0 14.822 exam2
6 2018-11-03 06:59:28 10.0 50.0 lola.fysher@gmail.com 2018-11-03 08:06:09 problem1 1.111 exam2
7 2018-11-03 10:19:14 10.0 39.0 susan.tedrina@gatech.edu 2018-11-04 18:28:46 problem3 1.337 exam2

Ex. 9 test

The cell below will test your solution for Exercise 9. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [45]:
### test_cell_ex9
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_9', 
    'func': concat_exams, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'df_by_exid':{
            'dtype':'dict', # data type of param.
            'check_modified':False, # @TODO: FIXME
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 9 RUNME (precomputed solutions)

The cell below creates a precomputed object, concatdf. It is the result of a correct implementation of concat_exams when run on merged_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.

In [46]:
concatdf = cse6040.utils.load_obj_from_file('concat.dill')
concatdf
Out[46]:
Email Earned Milestone Points Start Submit Hours Exam
0 aaliyah.tayviona@gatech.edu 39.0 problem5 9.0 2018-12-07 11:09:35 2018-12-07 18:54:54 7.755 exam3
1 aaliyah.tayviona@gatech.edu 39.0 problem1 10.0 2018-12-07 11:09:35 2018-12-08 07:18:05 12.387 exam3
2 aaliyah.tayviona@gatech.edu 39.0 problem0 10.0 2018-12-07 11:09:35 2018-12-08 08:40:39 1.376 exam3
3 aaliyah.tayviona@gatech.edu 39.0 problem4 5.0 2018-12-07 11:09:35 2018-12-08 11:25:49 2.753 exam3
4 aaliyah.tayviona@gatech.edu 39.0 problem3 2.0 2018-12-07 11:09:35 2018-12-08 16:01:59 4.602 exam3
... ... ... ... ... ... ... ... ...
9708 zuri.eilleen@gatech.edu 50.0 problem0 10.0 2018-11-03 08:26:06 2018-11-03 14:24:54 5.980 exam2
9709 zuri.eilleen@gatech.edu 50.0 problem1 10.0 2018-11-03 08:26:06 2018-11-03 15:51:20 1.441 exam2
9710 zuri.eilleen@gatech.edu 50.0 problem2 10.0 2018-11-03 08:26:06 2018-11-03 19:48:11 3.947 exam2
9711 zuri.eilleen@gatech.edu 50.0 problem3 10.0 2018-11-03 08:26:06 2018-11-03 21:21:42 1.559 exam2
9712 zuri.eilleen@gatech.edu 50.0 problem4 10.0 2018-11-03 08:26:06 2018-11-04 01:52:15 4.509 exam2

9713 rows × 8 columns

Ex. 10 (2 pts): summarize_exams

Suppose you are given df, a concatenated dataframe like the one produced in Ex. 9. Complete the function,

def summarize_exams(df):
    ...

so that it determines the mean exam score and mean time to complete the exam for every exam, returning the result as a dataframe (see below).

Input: A dataframe, similar to the output of Exercise 9. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:

  • 'Email' (string): a student email address
  • 'Exam' (string): the ID of an exam
  • 'Milestone' (string): the problem number
  • 'Points' (float): the points scored on that problem (milestone) for that exam
  • 'Hours' (float): the estimated number of hours spent on that problem for that exam

Your task: For each student and exam, you need to determine their total points and total time. Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.

Output: Return a pandas dataframe with three columns:

  • 'Exam' (string): the exam ID
  • 'Points' (float): the total points for that exam averaged over all students
  • 'Hours' (float): the total time in hours for that exam averaged over all students

Comments/hints: The averages you report are per exam, taken over all students, not over all (student, problem) pairs!
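
The two-stage aggregation described above (total per student and exam, then mean per exam) can be sketched as a single groupby chain on invented toy data:

```python
import pandas as pd

# Toy (student, exam, problem) rows -- hypothetical numbers
df = pd.DataFrame({
    'Email':     ['a@x.com', 'a@x.com', 'b@x.com', 'b@x.com'],
    'Exam':      ['exam1',   'exam1',   'exam1',   'exam1'],
    'Milestone': ['problem0', 'problem1', 'problem0', 'problem1'],
    'Points':    [10.0, 5.0, 8.0, 7.0],
    'Hours':     [1.0, 2.0, 3.0, 4.0],
})

# Stage 1: total Points/Hours per (student, exam).
# Stage 2: average those per-student totals within each exam.
summary = (df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum()
             .groupby('Exam').mean()
             .reset_index())
print(summary)
# exam1 mean points: ((10+5) + (8+7)) / 2 = 15.0
```

Note the column selection `[['Points', 'Hours']]` before `.sum()`: it keeps the string column `'Milestone'` out of the aggregation.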

Ex. 10 demo input

In [47]:
### Define demo inputs ###

demo_df_ex10 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex10
Out[47]:
Email Exam Milestone Points Hours
0 matthew.monise@hotmail.com exam3 problem0 10.0 13.555
1 matthew.monise@hotmail.com exam3 problem1 10.0 1.844
2 matthew.monise@hotmail.com exam3 problem4 3.0 0.497
3 matthew.monise@hotmail.com exam3 problem5 8.0 14.469
4 matthew.monise@hotmail.com exam3 problem2 3.0 2.111
5 matthew.monise@hotmail.com exam3 problem3 4.0 3.072
6 paul.laconya@gmail.com exam3 problem1 10.0 3.553
7 paul.laconya@gmail.com exam3 problem2 2.0 0.992
8 paul.laconya@gmail.com exam3 problem4 8.0 17.808
9 paul.laconya@gmail.com exam3 problem5 10.0 1.229
10 paul.laconya@gmail.com exam3 problem3 6.0 1.693
11 paul.laconya@gmail.com exam3 problem0 10.0 1.687
12 matthew.monise@hotmail.com exam1 problem1 10.0 2.404
13 matthew.monise@hotmail.com exam1 problem2 10.0 5.234
14 matthew.monise@hotmail.com exam1 problem3 10.0 7.109
15 matthew.monise@hotmail.com exam1 problem0 10.0 0.212
16 matthew.monise@hotmail.com exam1 problem4 5.0 1.390
17 paul.laconya@gmail.com exam1 problem0 10.0 2.165
18 paul.laconya@gmail.com exam1 problem1 10.0 1.138
19 paul.laconya@gmail.com exam1 problem2 10.0 5.228
20 paul.laconya@gmail.com exam1 problem3 5.0 0.644
21 paul.laconya@gmail.com exam1 problem4 10.0 3.093
The above input should produce:

|    | Exam  |   Points |   Hours |
|---:|:------|---------:|--------:|
|  0 | exam1 |       45 | 14.3085 |
|  1 | exam3 |       42 | 31.255  |

As a spot check, observe that the input dataset has two students. Consider their `'exam3'` scores: student `'matthew.monise@hotmail.com'` got a total of 38 points and `'paul.laconya@gmail.com'` got 46 points, yielding an average of (38+46)/2 = 42 points.

Ex. 10 your code

In [48]:
### Exercise 10 solution
def summarize_exams(df):
    '''
    INPUT:
    A dataframe, where each row has the data for a (student, exam, problem) triple, but the only columns provided are:
    1. 'Email' (string): a student email address
    2. 'Exam' (string): the ID of an exam
    3. 'Milestone' (string): the problem number
    4. 'Points' (float): the points scored on that problem (milestone) for that exam
    5. 'Hours' (float): the estimated number of hours spent on that problem for that exam
    
    GOAL:
    For each student and exam, determine their total points and total time. 
    Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.
    
    Return a pandas dataframe with three columns:
    1. 'Exam' (string): the exam ID
    2. 'Points' (float): the total points for that exam averaged over all students
    3. 'Hours' (float): the total time in hours for that exam averaged over all students
    
    Hint: These averages you are to report are by exam taken over all students, not over all (student, problem) pairs!
    
    STRATEGY:
    1. Let's first determine total points and total time for each student and exam. 
       So group by 'Email' and 'Exam' and sum. Let's call this df_sum.
    2. Reset index to bring index ('Email' and 'Exam') back as regular columns
    3. Now let's find the mean of 'Hours' and 'Points' by Exam. 
       So group by 'Exam' and find mean.
    4. Reset index to bring index ('Exam') back as a regular column
    5. Return this data frame.    
    '''
    
    # SOLUTION: 
    df_sum = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum()  # restrict to the numeric columns before summing
    df_sum = df_sum.reset_index()
    
    df_mean = df_sum.groupby('Exam')[['Points', 'Hours']].mean()
    df_mean = df_mean.reset_index()
    
    return df_mean
    
    
### demo function call ###
summarize_exams(demo_df_ex10)
Out[48]:
Exam Points Hours
0 exam1 45.0 14.3085
1 exam3 42.0 31.2550

Ex. 10 test

The cell below will test your solution for Exercise 10. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [49]:
### test_cell_ex10
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_10', 
    'func': summarize_exams, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'df':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 10 RUNME (precomputed solutions)

The cell below creates a precomputed object, exam_summary. It is the result of a correct implementation of summarize_exams when run on concatdf from the full dataset.

In [50]:
exam_summary = cse6040.utils.load_obj_from_file('exam_summary.dill')
exam_summary
Out[50]:
Exam Points Hours
0 exam1 43.303779 14.253057
1 exam2 41.165123 27.490836
2 exam3 37.701169 23.046157

Final task: A simple regression

For the case of three exams, the instructor would like to know whether the first two scores and the timing numbers can predict the third exam score. To carry out this analysis, the instructor needs to build a data matrix and wants your help.

Ex. 11 (3 pts): form_data_table

Suppose you are given the same input as Ex. 10, that is, a concatenated dataframe df like the one produced in Ex. 9. Complete the function,

def form_data_table(df):
    ...

so that it constructs a dataframe whose values will (later) become the entries of a linear-regression data matrix.

Input: The same dataframe as Ex. 10. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:

  • 'Email' (string): a student email address
  • 'Exam' (string): the ID of an exam
  • 'Milestone' (string): the problem number
  • 'Points' (float): the points scored on that problem (milestone) for that exam
  • 'Hours' (float): the estimated number of hours spent on that problem for that exam

Output: Construct and return a new dataframe where each row corresponds with a student, and the columns are as follows:

  1. 'Email' (string): The student email. There will be one row per student in the output.
  2. 'Points#' (float): One column of total points scored for each exam present in the input. For example, if df has data for 'exam2' and 'exam3', then the output should have 'Points2' and 'Points3' columns, with the values being the total points earned by the student on the corresponding exam.
  3. 'Hours#' (float): Similarly, one column of total exam time for each exam present in the input.

As a final step, if there are any NaNs, drop the corresponding rows.

The idea behind this format is that each student will be an "observation" or "sample" for linear regression, and the columns will serve as predictors or features.

Comments/hints: As usual, you should not modify the input dataframe, df.
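
One way to picture the required "wide" layout is pandas' `pivot_table`, which aggregates and spreads in one step; a sketch on invented toy data (the column-name flattening shown is illustrative, not mandated by the exercise):

```python
import pandas as pd

# Toy (student, exam, problem) rows -- hypothetical values
df = pd.DataFrame({
    'Email':     ['a@x.com', 'a@x.com', 'a@x.com', 'a@x.com'],
    'Exam':      ['exam1', 'exam1', 'exam3', 'exam3'],
    'Milestone': ['problem0', 'problem1', 'problem0', 'problem1'],
    'Points':    [10.0, 5.0, 8.0, 7.0],
    'Hours':     [1.0, 2.0, 3.0, 4.0],
})

# Sum per (student, exam), then spread exams across the columns.
wide = df.pivot_table(index='Email', columns='Exam',
                      values=['Points', 'Hours'], aggfunc='sum')

# Flatten ('Points', 'exam1') -> 'Points1', etc., then clean up.
wide.columns = [f"{value}{exam[4:]}" for value, exam in wide.columns]
wide = wide.reset_index().dropna()
print(wide)
```

The final `dropna()` discards any student who is missing an exam, matching the NaN-dropping step in the spec.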

Ex. 11 demo input

In [51]:
### Define demo inputs ###
demo_df_ex11 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex11
Out[51]:
Email Exam Milestone Points Hours
0 matthew.monise@hotmail.com exam3 problem0 10.0 13.555
1 matthew.monise@hotmail.com exam3 problem1 10.0 1.844
2 matthew.monise@hotmail.com exam3 problem4 3.0 0.497
3 matthew.monise@hotmail.com exam3 problem5 8.0 14.469
4 matthew.monise@hotmail.com exam3 problem2 3.0 2.111
5 matthew.monise@hotmail.com exam3 problem3 4.0 3.072
6 paul.laconya@gmail.com exam3 problem1 10.0 3.553
7 paul.laconya@gmail.com exam3 problem2 2.0 0.992
8 paul.laconya@gmail.com exam3 problem4 8.0 17.808
9 paul.laconya@gmail.com exam3 problem5 10.0 1.229
10 paul.laconya@gmail.com exam3 problem3 6.0 1.693
11 paul.laconya@gmail.com exam3 problem0 10.0 1.687
12 matthew.monise@hotmail.com exam1 problem1 10.0 2.404
13 matthew.monise@hotmail.com exam1 problem2 10.0 5.234
14 matthew.monise@hotmail.com exam1 problem3 10.0 7.109
15 matthew.monise@hotmail.com exam1 problem0 10.0 0.212
16 matthew.monise@hotmail.com exam1 problem4 5.0 1.390
17 paul.laconya@gmail.com exam1 problem0 10.0 2.165
18 paul.laconya@gmail.com exam1 problem1 10.0 1.138
19 paul.laconya@gmail.com exam1 problem2 10.0 5.228
20 paul.laconya@gmail.com exam1 problem3 5.0 0.644
21 paul.laconya@gmail.com exam1 problem4 10.0 3.093
For the sample input above, a correct implementation would produce:

|    | Email                      |   Hours1 |   Hours3 |   Points1 |   Points3 |
|---:|:---------------------------|---------:|---------:|----------:|----------:|
|  0 | matthew.monise@hotmail.com |   16.349 |   35.548 |        45 |        38 |
|  1 | paul.laconya@gmail.com     |   12.268 |   26.962 |        45 |        46 |

This format makes it easy to, say, try to apply one of our models, like using linear regression to predict `'Points3'` from `'Hours2'` and `'Points2'`—assuming that is even a good model, which it might not be!

Ex. 11 your code

In [52]:
### Exercise 11 solution
def form_data_table(df):
    '''
    INPUT:
    A data frame df, where each row gives the data for a (student, exam, problem) triple, 
    but the only columns provided are:
    -'Email' (string): a student email address
    -'Exam' (string): the ID of an exam
    -'Milestone' (string): the problem number
    -'Points' (float): the points scored on that problem (milestone) for that exam
    -'Hours' (float): the estimated number of hours spent on that problem for that exam
    
    GOAL:
    Return a new dataframe where each row corresponds with a student, and the columns are as follows:
    -'Email' (string): The student email. There will be one row per student in the output.
    -'Points#' (float): One column of total points scored for each exam present in the input. 
      For example, if df has data for 'exam2' and 'exam3', then the output should have 'Points2' and 'Points3' columns, 
      with the values being the total points earned by the student on the corresponding exam.
    -'Hours#' (float): Similarly, one column of total exam time for each exam present in the input.
    
    If there are any NaNs, drop the corresponding rows.
    
    STRATEGY:
    1. This looks like slightly more complicated casting (need to cast twice as there are multiple value columns you 
       need to create with new names)
    2. *Borrow* casting function from NB7 solution. Add as a helper function.
    3. Find the sum of 'Points' and 'Hours' grouped by 'Email'. Let's call this df_sum.
    4. Reset index to bring Email index back as an actual column
    5. Make 2 data frames: one containing Email + Exam + Points info and the other containing Email + Exam + Hours info
    6. Call Cast helper function on each of these data frames (passing in correct parameters for each)
    7. Merge the two resulting data frames together (on 'Email') to get our final data frame
    8. Drop any rows that contain na
    9. Return this data frame    
    '''
    
    # SOLUTION:
    def cast(df, key, value, join_how='outer'):
        fixed_vars = df.columns.difference([key, value])
        tibble = pd.DataFrame(columns=fixed_vars) 
    
        new_vars = df[key].unique()
        for v in new_vars:
            df_v = df[df[key] == v].copy()  # copy so the `del` below can't touch the caller's frame
            del df_v[key]
            df_v = df_v.rename(columns={value: value+v[4:]})
            tibble = tibble.merge(df_v, on=list(fixed_vars),how=join_how)  
        return tibble

    df_sum = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum()  # restrict to the numeric columns before summing
    df_sum = df_sum.reset_index()
    
    df_points = df_sum[['Email', 'Exam', 'Points']]
    df_points = cast(df_points, 'Exam', 'Points', 'outer')
    
    df_hours = df_sum[['Email', 'Exam', 'Hours']]
    df_hours = cast(df_hours, 'Exam', 'Hours', 'outer')
    
    df_final = df_points.merge(df_hours, on='Email')
    df_final = df_final.dropna()
    
    return df_final
    
    
### demo function call ###
form_data_table(demo_df_ex11)
Out[52]:
Email Points1 Points3 Hours1 Hours3
0 matthew.monise@hotmail.com 45.0 38.0 16.349 35.548
1 paul.laconya@gmail.com 45.0 46.0 12.268 26.962

Ex. 11 test

The cell below will test your solution for Exercise 11. The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [53]:
### test_cell_ex11
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_11', 
    'func': form_data_table, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'df':{
            'dtype':'df', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Passed! Please submit.

Ex. 11 RUNME (precomputed solutions)

The cell below creates a precomputed object, data_table. It is the result of a correct implementation of form_data_table when run on concatdf from the full dataset.

In [54]:
data_table = cse6040.utils.load_obj_from_file('data_table.dill')
data_table
Out[54]:
Email Hours1 Hours2 Hours3 Points1 Points2 Points3
0 aaliyah.tayviona@gatech.edu 22.749 36.018 32.916 23.0 33.0 39.0
1 aaron.kyerra@foxmail.com 7.689 23.262 8.649 47.0 50.0 31.0
2 abigail.alizah@umich.edu 20.258 35.771 28.797 45.0 32.0 29.5
3 abigail.lanty@gatech.edu 19.671 31.802 31.423 45.0 47.0 33.5
4 abigail.lynnon@gmail.com 20.527 30.615 29.204 0.0 7.0 12.0
... ... ... ... ... ... ... ...
688 william.sanvi@gmail.com 18.791 27.572 7.296 44.0 33.0 27.5
689 winston.wymon@gmail.com 16.476 32.701 13.935 47.0 45.0 39.0
690 wm.charrisse@gatech.edu 21.344 29.959 22.292 50.0 50.0 32.5
692 zitlaly.nester@valensdatalabs.com 9.432 15.756 12.448 45.0 48.0 37.0
693 zuri.eilleen@gatech.edu 10.640 17.436 24.853 45.0 50.0 45.0

590 rows × 7 columns

Fin (+ postscript)

If you have made it this far, congratulations on completing the exam. Don't forget to submit!

Postscript: We went through a lot of work in this exam to get the data table. So if we try to regress, say, 'Points3' on the other columns, how well does that work? Here is a code fragment to check. What does the result say?

In [55]:
# Predictors:
data_cols = data_table[['Points1', 'Hours1', 'Points2', 'Hours2']]
X = data_cols.values
X /= X.max(axis=0) # Normalize columns (note: `.values` may be a view, so this can also scale data_table's columns in place)

# Response:
resp_col = 'Points3'
y = data_table[resp_col].values
y /= y.max()

from sklearn.linear_model import LinearRegression
model_type = LinearRegression(fit_intercept=True, positive=True)
model = model_type.fit(X, y)

print("* Predictors:", data_cols)
print("* Response:", resp_col)
print("* Model coefficients:", model.coef_)
print("* Model intercept:", model.intercept_)

y_pred = model.predict(X)
print(f"* Mean relative error: {abs((y_pred - y)/y).mean()*100:.1f}%")
* Predictors:      Points1    Hours1  Points2    Hours2
0       0.46  0.946337     0.66  0.996459
1       0.94  0.319855     1.00  0.643557
2       0.90  0.842714     0.64  0.989625
3       0.90  0.818295     0.94  0.879821
4       0.00  0.853904     0.14  0.846982
..       ...       ...      ...       ...
688     0.88  0.781688     0.66  0.762795
689     0.94  0.685386     0.90  0.904692
690     1.00  0.887891     1.00  0.828833
692     0.90  0.392362     0.96  0.435899
693     0.90  0.442614     1.00  0.482377

[590 rows x 4 columns]
* Response: Points3
* Model coefficients: [0.28781907 0.         0.20827215 0.        ]
* Model intercept: 0.19770122252806183
* Mean relative error: 25.4%