Version 1.0
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of pandas, Numpy, and linear regression. It has 12 exercises, numbered 0 to 11. There are 22 available points. However, to earn 100%, the threshold is 15 points. (Therefore, once you hit 15 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs ### load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
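For instance, here is one way to peek at a large dataframe piecewise rather than printing it whole (a toy illustration; big_df stands in for whichever debugging variable you are inspecting):
import pandas as pd
big_df = pd.DataFrame({'x': range(10_000)})  # stand-in for a large debugging object
print(big_df.head(10))       # just the first 10 rows
print(big_df.iloc[100:120])  # or an arbitrary chunk of rows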
Exercise point breakdown:
Exercises 0, 3, and 6 involve reading only (no coding), so they are "free" points. However, you still need to run the associated test cells and submit to get those points.
Final reminders:
An instructor for another course has asked you to help analyze some of their exam data. The exams were completed electronically, in a format similar to our class's, but with some twists:
- There are several exams, identified as 'exam1', 'exam2', and so on.
- Exams were timed, but with different time limits for each exam.

The dataset itself has two main parts: grading data and submission-timing data. Each exam has its own grading and timing data, as you'll see.
Your overall task: You will help clean up and analyze this data, culminating in an assessment, based on linear regression, to see whether or not the last exam score can be predicted from the scores and timing information of earlier exams.
To get started, run the following code cells. Take note of preloaded modules, as it's likely you'll want to use them.
### Global Imports
###
### AUTOGRADER TEST - DO NOT REMOVE
###
# Modules you may find useful:
import pandas as pd
import numpy as np
# Some extra functions that this notebook needs:
import cse6040.utils
from pprint import pprint # Pretty-printer
# Load the dataset:
print("\n==> Loading the dataset...")
from cse6040.examdata import load_dataset
grades_by_exid, times_by_exid = load_dataset()
print("\n==> Done executing the setup and ready for take-off!")
Exercise 0: The grades dataframe

This exercise involves looking at the grades data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

The global variable grades_by_exid is a dictionary of pandas dataframes. Each key is an exam ID (e.g., 'exam2'), and each value is a dataframe holding the grading data for that exam. Let's take a look at an example: the grades for the exam with exam ID 'exam1'.
list(grades_by_exid.keys())
Evidently, there are three exams in this dataset. However, the code you write later should not assume three exams unless otherwise specified.
grades_by_exid['exam1']
Note: The data has been anonymized with fake email addresses. Any resemblance to actual email addresses is purely coincidental.
Observations:
- Each value of grades_by_exid holds the grading data for one exam, e.g., 'exam1' in the example above. There are similar dataframes for 'exam2' and 'exam3'.
- When a student did not submit a problem, it is recorded as eight hyphens, '--------'. For example, see row 1, where juan.quay@aol.com did not submit problem4.

### test_cell_ex0 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
Exercise 1: remove_invalid_grades

Given a grades dataframe, gdf, complete the function,

def remove_invalid_grades(gdf):
    ...

so that it returns a clean copy of gdf.
Input: A grades dataframe, gdf. It will have a column named 'Email' and one or more columns named 'problem#', where # is an integer.
Your task: Copy this dataframe and clean it as follows:
- Any unsubmitted problem, recorded as '--------', should be replaced with a not-a-number (NaN) value, i.e., a Numpy np.nan.
- Any row in which the student submitted no problems at all (i.e., every problem value is NaN after the replacement) should be removed.
- Each problem column should have a .dtype of float.

Output: Return a clean tibble copy, without modifying the input gdf. It should contain exactly the same columns as the input, even if the result is an empty dataframe, and the .dtypes should be as specified above.
Caveat(s)/comment(s)/hint(s):
- The .replace() and .filter() methods may be helpful (but are not required).
- The input might have problem columns named 'problem0' through 'problem4', as shown in Exercise 0, but it might also only have, say, two columns named 'problem1' and 'problem3', or it might have four problem columns named 'problem0', 'problem1', 'problem25', and 'problem234'. Your code should robustly handle these cases.
- The dataframe's .dropna() method may be helpful.

### Define demo inputs ###
demo_gdf_ex1 = grades_by_exid['exam1'].loc[[0, 1, 2, 683, 3, 957, 4, 1546]].reset_index(drop=True)
demo_gdf_ex1
Consider the demo input, demo_gdf_ex1
, shown above. A correct output is:
 | Email | problem0 | problem1 | problem2 | problem3 | problem4
---|---|---|---|---|---|---
0 | sammie.jaydaa@gmail.com | 7 | 10 | 10 | 10 | 10
1 | juan.quay@aol.com | 10 | 10 | 6.5 | 0 | nan
2 | gino.branston@gatech.edu | 10 | 5 | 10 | 10 | 10
3 | danny.jameriah@gmail.com | 7 | nan | nan | nan | nan
4 | robert.gioanna@gatech.edu | 10 | 10 | 10 | 10 | 10
5 | adam.evemarie@gmail.com | 10 | 10 | 10 | 10 | 10
Observe that rows 5 and 7 from the input are removed entirely since all of their problem values are missing. By contrast, rows 1 and 3 from the input have some non-NaN values and thus are retained.
In addition, the output types (inspected via .info()
) should match the following:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Email 6 non-null object
1 problem0 6 non-null float64
2 problem1 5 non-null float64
3 problem2 5 non-null float64
4 problem3 5 non-null float64
5 problem4 4 non-null float64
dtypes: float64(5), object(1)
### Exercise 1 solution
def remove_invalid_grades(gdf):
'''
INPUT:
A grades dataframe called gdf. Columns are 'Email' and one or more columns named 'problem#' where # is an integer.
GOAL:
Copy this dataframe and clean it as follows:
-Replace 8 hyphens '--------' with np.nan
-Remove rows where no problems were submitted
-Convert problem columns to be float
Return this cleaned data frame.
STRATEGY:
1. Make copy of input data frame. Let's call this 'gdf_final'
2. Get a list of all of the columns that start with 'problem'
3. Replace any values in those problem columns that contain 8 hyphens with np.nan
4. Remove any rows where the problem columns are all np.nan now
5. Convert problem columns to be float
6. Reset index and return this data frame
'''
# SOLUTION:
gdf_final = gdf.copy()
# Most robust (safest) way is to use Filter + Regex, specify digit after problem:
problem_columns = list(gdf_final.filter(regex=r'^problem\d+', axis=1))
# Alternative way: find all columns that start with 'problem':
# https://stackoverflow.com/questions/27275236/how-to-select-all-columns-whose-names-start-with-x-in-a-pandas-dataframe
#problem_columns = [col for col in gdf_final if col.startswith('problem')]
# Super Risky Way (may or may not work depending on column input/order):
#problem_columns = list(gdf_final.columns[1:])
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
gdf_final[problem_columns] = gdf_final[problem_columns].replace('--------', np.nan)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
gdf_final = gdf_final.dropna(how='all', subset=problem_columns)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
gdf_final[problem_columns] = gdf_final[problem_columns].astype('float64')
return gdf_final.reset_index(drop=True)
### demo function call ###
demo_result_ex1 = remove_invalid_grades(demo_gdf_ex1)
display(demo_result_ex1)
demo_result_ex1.info()
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex1
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_1',
'func': remove_invalid_grades, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'gdf':{
'dtype': 'df', # data type of param.
'check_modified': True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, valid_grades_by_exid. It is the result of a correct implementation of remove_invalid_grades when run on grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
valid_grades_by_exid = cse6040.utils.load_obj_from_file('valid_grades.dill')
# Example:
display(valid_grades_by_exid['exam1'].iloc[:4])
valid_grades_by_exid['exam1'].info()
Exercise 2: calc_earned_points

Suppose you have a grades dataframe, valid_gdf, having only valid rows as calculated in Exercise 1. Complete the function,

def calc_earned_points(valid_gdf):
    ...

so that it returns a new copy of valid_gdf with an additional column holding the total points earned.

Input: A dataframe valid_gdf with only valid rows, per Exercise 1.

Your task: Create a copy and sum the points earned by each student.

Output: Return a new tibble with the same columns as valid_gdf plus an additional column, 'Earned', of .dtype float. This column should contain the student's total score, that is, the sum of points on their problems. Any NaN values should be ignored. You should not modify the input valid_gdf.
### Define demo inputs ###
demo_valid_gdf_ex2 = valid_grades_by_exid['exam1'].sample(6, random_state=3_456_789_012).reset_index(drop=True)
demo_valid_gdf_ex2
The demo included in the solution cell below should display the following output:
 | Email | problem0 | problem1 | problem2 | problem3 | problem4 | Earned
---|---|---|---|---|---|---|---
0 | rene.amritpal@gmail.com | 10 | 10 | 10 | 10 | 10 | 50
1 | albert.averley@gmail.com | 10 | 2 | 8 | 10 | 10 | 40
2 | mary.yahshua@gatech.edu | 7 | 2 | 6.5 | 5 | nan | 20.5
3 | jillian.myler@gmail.com | 10 | 10 | 10 | 10 | 1 | 41
4 | raina.sheehan@gmail.com | 10 | 10 | 10 | 5 | 10 | 45
5 | allison.link@yahoo.com | 7 | 2 | nan | nan | nan | 9
Note: Although the values appear to include integers, in fact, the .dtype of the columns should be float.
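One pandas fact that makes this exercise straightforward: row-wise sums skip NaNs by default. A tiny self-contained check on toy data (not the exam data):
import pandas as pd
import numpy as np

toy = pd.DataFrame({'problem0': [1.0, np.nan], 'problem1': [2.0, 3.0]})
print(toy.sum(axis=1))  # row sums: 3.0 and 3.0 -- the NaN is simply ignored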
### Exercise 2 solution ###
def calc_earned_points(valid_gdf):
'''
INPUT:
A dataframe 'valid_gdf'
GOAL:
Create a copy and sum the points earned by each student.
Return new data frame with same columns + Earned column.
Earned column should be a float. Ignore NaNs.
STRATEGY:
1. Make a copy of the input data frame. Let's call this 'valid_gdf_final'
2. Sum the problem columns and create a new column called 'Earned'
3. Convert 'Earned' column to be float
4. Return the data frame
'''
# SOLUTION:
valid_gdf_final = valid_gdf.copy()
# Using same method as in Exercise 1 to identify problem columns
problem_columns = list(valid_gdf_final.filter(regex=r'^problem\d+', axis=1))
valid_gdf_final['Earned'] = valid_gdf_final[problem_columns].sum(axis=1)
valid_gdf_final['Earned'] = valid_gdf_final['Earned'].astype('float64')
return valid_gdf_final
### demo function call ###
calc_earned_points(demo_valid_gdf_ex2)
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex2 ###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_2',
'func': calc_earned_points, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'valid_gdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, earned_by_exid. It is the result of a correct implementation of calc_earned_points when run on valid_grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
earned_by_exid = cse6040.utils.load_obj_from_file('earned_grades.dill')
# Example:
display(earned_by_exid['exam1'].iloc[:4])
earned_by_exid['exam1'].info()
Exercise 3: The timings dataset

This exercise involves looking at the timing data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

Similar to the grades dataset (see Ex. 0), each exam has a timings dataset. Here's what it looks like for 'exam3':
demo_times3_ex3 = times_by_exid['exam3']
demo_times3_ex3
Observations:
- The 'start date-time' and 'submit date-time' columns hold timestamps, and their .dtypes are special pandas Timestamp objects. The mini-tutorial below shows you how they work.
- A 'start date-time' value is when the student began the exam. It will have the same value for all rows of a given student.
- A 'submit date-time' value is the timestamp of the student's last submission for that problem. That is, only one submission is recorded per student and problem.

Implications: One consequence of the data we have is that we cannot really know how the student navigated the exam: we know when they started, but if they jumped around or submitted a problem multiple times, the only information we have is the last submission for each problem. Nevertheless, we will use the information we have to guess at their timeline, after some clean up.
Timestamps: The printed timestamps have the following form:
2018-12-07 18:35:23
That corresponds to December 7, 2018, at 6:35 pm (and 23 seconds).
Luckily, you don't need to parse timestamps! These objects make it easy to sort and calculate time differences, which is what we'll need. For instance, we can check whether one date is larger than another using the usual comparison operators, like > or <=:
(demo_times3_ex3['submit date-time'] > demo_times3_ex3['start date-time']).head()
And we can subtract times using the usual - operator, as we would for integers or floats:
(demo_times3_ex3['submit date-time'] - demo_times3_ex3['start date-time']).head()
The difference is stored as a timedelta64 object. The way to read printed item 0 above is "20 hours, 45 minutes, and 19 seconds."
Time-delta objects can be converted into a regular float representing the number of seconds using a method called .total_seconds():
(demo_times3_ex3['submit date-time'].iloc[0] - demo_times3_ex3['start date-time'].iloc[0]).total_seconds()
We will need this transformation to simplify some calculations later on, so tuck this fact away.
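The same conversion works on whole columns. Below is a minimal sketch (with made-up timestamps, not the exam data) of two equivalent ways to turn a timedelta Series into float hours; the second form reappears in a solution later on:
import pandas as pd

start = pd.Series(pd.to_datetime(['2018-12-07 18:35:23', '2018-12-07 19:00:00']))
stop = pd.Series(pd.to_datetime(['2018-12-07 20:35:23', '2018-12-07 19:30:00']))
delta = stop - start                    # a timedelta64 Series
print(delta.dt.total_seconds() / 3600)  # hours via total seconds: 2.0, 0.5
print(delta / pd.Timedelta(hours=1))    # hours via timedelta division: same result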
Some messiness: The data has some "impurities." In particular, check out the data for 'kenneth.bryland@gatech.edu':
demo_times3_ex3_messy = demo_times3_ex3.loc[[206, 1111, 1780, 2511, 3375]]
demo_times3_ex3_messy
Observe that their 'problem0' and 'problem1' submission times occur before their start times, which we confirm next:
demo_times3_ex3_messy['submit date-time'] > demo_times3_ex3_messy['start date-time']
While it turns out there is a reason for this, we will want to omit such cases to simplify our analysis.
### test_cell_ex3 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
Exercise 4: remove_causality_violators

Suppose you are given a timings dataframe, tdf, per Exercise 3. Complete the function,

def remove_causality_violators(tdf):
    ...

so that it returns a new, cleaned copy that omits records for any student who has timestamps that "violate causality." A causality violation means that their submission time is strictly less than their start time.

Input: A timings dataframe, tdf, with the columns 'email', 'milestone', 'start date-time', and 'submit date-time'.

Your task: Identify all students who have any causality violations and omit their rows from the output dataframe.

Output: Return a new tibble that includes only rows from tdf for students whose problem-records do not violate causality. The output column .dtypes should match those of the input. You should not modify the input, tdf.
Comment(s): If a student has even one problem that violates causality, you should omit all of their rows. That is, we will consider their entire exam data invalid for the purpose of our subsequent analysis.
### Define demo inputs ###
demo_tdf_ex4 = times_by_exid['exam3'].loc[[206, 279, 439, 1111, 1344, 1780, 2511, 3224, 3375]].reset_index(drop=True)
demo_tdf_ex4
The demo included in the solution cell below should display the following output:
 | email | milestone | start date-time | submit date-time
---|---|---|---|---
0 | daniel.tashera@gmail.com | problem0 | 2018-12-09 01:12:56 | 2018-12-09 02:29:33
1 | barbara.wilhemina@gatech.edu | problem0 | 2018-12-10 02:43:28 | 2018-12-10 06:21:30
2 | barbara.wilhemina@gatech.edu | problem2 | 2018-12-10 02:43:28 | 2018-12-10 09:09:43
3 | barbara.wilhemina@gatech.edu | problem5 | 2018-12-10 02:43:28 | 2018-12-10 09:04:35
Observe that kenneth.bryland@gatech.edu is omitted from this output because their 'problem0' and 'problem1' records violate causality. (Even though their remaining problems obey causality, we still drop them because of the problematic rows.)
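Before reading the reference solution below, note that there is more than one reasonable approach. For instance, here is a sketch on toy data that uses a per-group "any" reduction to flag violators (the reference solution uses an equivalent isin-based method instead):
import pandas as pd

tdf = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'milestone': ['problem0', 'problem1', 'problem0'],
    'start date-time': pd.to_datetime(['2018-12-09 01:00:00'] * 3),
    'submit date-time': pd.to_datetime(
        ['2018-12-09 00:30:00',   # violates causality!
         '2018-12-09 02:00:00',
         '2018-12-09 03:00:00']),
})
violates = tdf['submit date-time'] < tdf['start date-time']
# True for every row of any student with at least one violation:
student_violates = violates.groupby(tdf['email']).transform('any')
print(tdf[~student_violates].reset_index(drop=True))  # only b@x.com remains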
### Exercise 4 solution ###
def remove_causality_violators(tdf):
'''
INPUT:
A timings dataframe 'tdf' with the columns 'email', 'milestone', 'start date-time', and 'submit date-time'
GOAL:
Find rows where submit date_time < start date-time.
Return a new dataframe with these bad rows omitted.
Omit all rows for that student if one or more bad rows exist.
STRATEGY:
1. Make a copy of input data frame. Let's call this 'tdf_final'
2. Find rows where 'submit date-time' < 'start date-time'. Let's call these 'tdf_bad_rows'.
3. Get a list of all of the student emails in the bad rows identified in Step 2. Let's call this 'bad_emails_list'
4. Grab the rows from tdf_final where the email address is NOT in 'bad_emails_list'
5. Reset index and return tdf_final
'''
# SOLUTION:
tdf_final = tdf.copy()
tdf_bad_rows = tdf_final[tdf_final['submit date-time'] < tdf_final['start date-time']]
bad_emails_list = list(tdf_bad_rows['email'])
tdf_final = tdf_final[~tdf_final['email'].isin(bad_emails_list)]
return tdf_final.reset_index(drop=True)
### demo function call ###
remove_causality_violators(demo_tdf_ex4)
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex4
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_4',
'func': remove_causality_violators, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, valid_times_by_exid. It is the result of a correct implementation of remove_causality_violators when run on times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
valid_times_by_exid = cse6040.utils.load_obj_from_file('valid_times.dill')
# Example:
assert (valid_times_by_exid['exam3']['submit date-time'] >= valid_times_by_exid['exam3']['start date-time']).all()
display(valid_times_by_exid['exam1'].iloc[:4])
valid_times_by_exid['exam1'].info()
Exercise 5: clean_times_colnames

Let's tidy up the column names for the timings data. Complete the function,

def clean_times_colnames(tdf):
    ...

to return a new tibble from tdf with columns renamed as follows:
- 'email' => 'Email'
- 'milestone' => 'Milestone'
- 'start date-time' => 'Start'
- 'submit date-time' => 'Submit'

Element types for the columns (.dtype) should be preserved, and the input dataframe should not be modified.
### define demo inputs ###
demo_tdf_ex5 = valid_times_by_exid['exam2'].sample(5, random_state=567_890_123, replace=False).reset_index(drop=True)
demo_tdf_ex5
The demo included in the solution cell below should display the following output:
 | Email | Milestone | Start | Submit
---|---|---|---|---
0 | jazlyn.florin@gmail.com | problem4 | 2018-11-04 12:45:10 | 2018-11-05 20:50:41
1 | christian.miyisha@yahoo.com | problem2 | 2018-11-03 07:38:00 | 2018-11-04 12:21:12
2 | taryn.raneshia@yahoo.co.id | problem0 | 2018-11-04 08:25:24 | 2018-11-05 17:59:02
3 | joseph.sherburne@gmail.com | problem1 | 2018-11-02 16:45:06 | 2018-11-03 01:47:58
4 | thomas.dyanni@gmail.com | problem3 | 2018-11-03 14:20:56 | 2018-11-04 23:31:18
### Exercise 5 solution
def clean_times_colnames(tdf):
'''
INPUT:
a data frame 'tdf'
GOAL:
Copy data frame and rename columns from:
'email' => 'Email'
'milestone' => 'Milestone'
'start date-time' => 'Start'
'submit date-time' => 'Submit'
STRATEGY:
1. Make a copy of input data frame. Let's call this 'tdf_final'
2. Rename columns as instructed
3. Return 'tdf_final'
'''
#SOLUTION:
tdf_final = tdf.copy()
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
tdf_final = tdf_final.rename(columns={'email':'Email', 'milestone':'Milestone', 'start date-time':'Start',
'submit date-time':'Submit'})
return tdf_final
### demo function call ###
clean_times_colnames(demo_tdf_ex5)
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex5
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_5',
'func': clean_times_colnames, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, clean_times_by_exid. It is the result of a correct implementation of clean_times_colnames when run on valid_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
clean_times_by_exid = cse6040.utils.load_obj_from_file('clean_times.dill')
# Example:
display(clean_times_by_exid['exam1'].head())
clean_times_by_exid['exam1'].info()
Exercise 6: Submission sequences

Our next analysis will estimate the test-takers' behavior by analyzing their submission times.
For example, consider the submission times for 'sherry.ayleigh@aol.com', stored in the dataframe demo_ex6_tdf below:
demo_ex6_tdf = clean_times_by_exid['exam3'].loc[[34, 593, 1624, 2189, 2344, 3122]]
demo_ex6_tdf
Based on the submission times, ordering the rows by the 'Submit' timestamp would yield a sensible timeline of submission events:
 | Email | Milestone | Start | Submit
---|---|---|---|---
34 | sherry.ayleigh@aol.com | problem0 | 2018-12-06 08:32:21 | 2018-12-06 09:07:16
593 | sherry.ayleigh@aol.com | problem1 | 2018-12-06 08:32:21 | 2018-12-06 10:04:10
1624 | sherry.ayleigh@aol.com | problem2 | 2018-12-06 08:32:21 | 2018-12-06 11:30:28
3122 | sherry.ayleigh@aol.com | problem5 | 2018-12-06 08:32:21 | 2018-12-06 13:34:13
2344 | sherry.ayleigh@aol.com | problem4 | 2018-12-06 08:32:21 | 2018-12-06 17:53:48
2189 | sherry.ayleigh@aol.com | problem3 | 2018-12-06 08:32:21 | 2018-12-06 20:19:21
We refer to this ordering of submission times as the submission sequence. It is our best guess as to how they moved through the exam if we assume they did one problem at a time. From this submission sequence for this student, we might conclude the following.
First, the student started the exam at 2018-12-06 08:32:21 and submitted 'problem0' at 2018-12-06 09:07:16. That would be:
demo_ex6_tdf['Submit'].iloc[0] - demo_ex6_tdf['Start'].iloc[0]
or 34 minutes and 55 seconds, which is also 2,095 seconds = (2095/3600) ~ 0.582 hours, rounding to three decimal digits. In other words, we will assume that the student started the exam and then spent 0.582 hours on Problem 0.
Next, the student submitted 'problem1' at 2018-12-06 10:04:10. Relative to the previous submission time, that is 56 minutes and 54 seconds, or 3,414 seconds = (3414/3600) ~ 0.948 hours, again rounded to three digits. So we might guess that the student spent that much time on Problem 1.
Continuing, the sequence of problems solved was Problem 0, 1, 2, 5, 4, and finally, 3. The approximate numbers of hours spent on each are 0.582, 0.948, 1.438, 2.062, 4.326, and 2.426 hours, respectively.
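If you'd like to verify the arithmetic above, here is a small self-contained check that restates the narrative's timestamps:
import pandas as pd

start = pd.Timestamp('2018-12-06 08:32:21')
submit0 = pd.Timestamp('2018-12-06 09:07:16')
submit1 = pd.Timestamp('2018-12-06 10:04:10')
print((submit0 - start).total_seconds())                     # 2095.0 seconds
print(round((submit0 - start).total_seconds() / 3600, 3))    # ~0.582 hours on Problem 0
print(round((submit1 - submit0).total_seconds() / 3600, 3))  # ~0.948 hours on Problem 1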
### test_cell_ex6 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
Exercise 7: sequence_submissions

Suppose you are given a (clean) timings tibble, tdf, such as computed in Exercise 5. Complete the function,

def sequence_submissions(tdf):
    ...

so that it sequences the submissions for each student and returns a new tibble matching the format described below.

Input: A clean timings tibble, tdf, with the columns 'Email', 'Milestone', 'Start', and 'Submit', per Ex. 5.

Your task: For each student, order their submissions and determine the amount of time spent on each problem, as illustrated in Ex. 6.

Output: Without modifying the input tibble, return a new tibble with the following columns:
- 'Email', 'Milestone', 'Start', and 'Submit', which keep the same values as given in tdf.
- 'Hours', the time in hours (a floating-point value) rounded to 3 decimal digits, corresponding to the estimated time spent on each problem.

That is, each row of the output will indicate the estimated time spent by the student on the given problem based on the estimated timeline.

Additional requirements: The rows of this new tibble should be sorted by 'Email' and 'Submit' timestamp in ascending order.
### Define demo inputs ###
demo_tdf_ex7 = cse6040.utils.load_df_from_file('demo_tdf_ex7.df')
demo_tdf_ex7
In the preceding demo input, there are three students: 'sherry.ayleigh@aol.com', 'chelsea.jaretzi@utexas.edu', and 'jessica.rinda@gatech.edu'. Observe that
- 'sherry.ayleigh@aol.com' submitted Problems 0-5 inclusive.
- 'chelsea.jaretzi@utexas.edu' submitted Problems 0, 1, 2, 3, and 5. (So they skipped Problem 4.)
- 'jessica.rinda@gatech.edu' submitted Problems 1, 2, and 5. (So they skipped Problems 0, 3, and 4.)

A correct implementation of sequence_submissions on this input would produce:
 | Email | Milestone | Start | Submit | Hours
---|---|---|---|---|---
0 | chelsea.jaretzi@utexas.edu | problem1 | 2018-12-11 09:50:18 | 2018-12-12 02:34:35 | 16.738
1 | chelsea.jaretzi@utexas.edu | problem0 | 2018-12-11 09:50:18 | 2018-12-12 02:40:16 | 0.095
2 | chelsea.jaretzi@utexas.edu | problem5 | 2018-12-11 09:50:18 | 2018-12-12 03:45:49 | 1.092
3 | chelsea.jaretzi@utexas.edu | problem3 | 2018-12-11 09:50:18 | 2018-12-12 05:49:20 | 2.059
4 | chelsea.jaretzi@utexas.edu | problem2 | 2018-12-11 09:50:18 | 2018-12-12 06:17:25 | 0.468
5 | jessica.rinda@gatech.edu | problem1 | 2018-12-07 12:05:41 | 2018-12-07 18:04:19 | 5.977
6 | jessica.rinda@gatech.edu | problem5 | 2018-12-07 12:05:41 | 2018-12-07 18:11:47 | 0.125
7 | jessica.rinda@gatech.edu | problem2 | 2018-12-07 12:05:41 | 2018-12-07 18:21:51 | 0.167
8 | sherry.ayleigh@aol.com | problem0 | 2018-12-06 08:32:21 | 2018-12-06 09:07:16 | 0.582
9 | sherry.ayleigh@aol.com | problem1 | 2018-12-06 08:32:21 | 2018-12-06 10:04:10 | 0.948
10 | sherry.ayleigh@aol.com | problem2 | 2018-12-06 08:32:21 | 2018-12-06 11:30:28 | 1.439
11 | sherry.ayleigh@aol.com | problem5 | 2018-12-06 08:32:21 | 2018-12-06 13:34:13 | 2.062
12 | sherry.ayleigh@aol.com | problem4 | 2018-12-06 08:32:21 | 2018-12-06 17:53:48 | 4.326
13 | sherry.ayleigh@aol.com | problem3 | 2018-12-06 08:32:21 | 2018-12-06 20:19:21 | 2.426
Observe the ordering by 'Email' and 'Submit' in ascending order. Also observe that the values in 'Hours' are rounded to 3 decimal digits and correspond to differences in submission times (and submission-to-start time for the first entry).
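The trickiest step is differencing submission times within each student's group. pandas' per-group shift handles this; here is a toy demonstration with made-up values (the same pattern appears in the reference solution below):
import pandas as pd

toy = pd.DataFrame({'Email': ['a', 'a', 'a', 'b', 'b'],
                    'HoursFromStart': [1.0, 3.0, 6.0, 2.0, 5.0]})
prev = toy.groupby('Email')['HoursFromStart'].shift(periods=1, fill_value=0)
toy['Hours'] = toy['HoursFromStart'] - prev
print(toy)  # per-problem hours: 1, 2, 3 for 'a' and 2, 3 for 'b'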
### Exercise 7 solution
def sequence_submissions(tdf):
'''
INPUT:
    A data frame 'tdf' with the columns 'Email', 'Milestone', 'Start', and 'Submit'
GOAL:
-Do not modify input.
-Sort by 'Email' and 'Submit' timestamp in ascending order.
-Return new data frame with columns 'Email', 'Milestone', 'Start', and 'Submit' with same values as given in tdf.
-Add new column 'Hours', the time in hours (a floating-point value) rounded to 3 decimal digits,
corresponding to the estimated time spent on each problem.
STRATEGY (not using groupby + apply + helper function):
1. Make a copy of the input data frame. Let's call this 'df'
2. Sort by 'Email' and 'Submit' in ascending order.
3. Reset index.
4. Create a new column that holds the difference between the Submit time and the Start time.
Let's call this 'Hours from Start'
5. Convert this column to hours (use hint above or something you find online)
6. Now let's create a new column that's the same as 'Hours from Start' but shifted down one row for each GROUP
(so we should group by Email). Let's call this 'Shifted Hours'
7. Now calculate 'Hours from Start' - 'Shifted Hours'. Make this a new column called 'Hours'
8. Round the 'Hours' column to 3 decimal places
9. Drop the columns 'Hours from Start' and 'Shifted Hours'
10. Return the data frame
'''
# SOLUTION:
df = tdf.copy()
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
df = df.sort_values(by=['Email','Submit'])
df = df.reset_index(drop=True)
df['Hours from Start'] = df['Submit'] - df['Start']
# https://www.statology.org/pandas-convert-timedelta-to-int/
df['Hours from Start'] = df['Hours from Start'] / pd.Timedelta(hours=1)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
g = df.groupby('Email')
df['Shifted Hours'] = g['Hours from Start'].shift(periods=1, fill_value=0)
df['Hours'] = df['Hours from Start'] - df['Shifted Hours']
df['Hours'] = round(df['Hours'], 3)
df = df.drop(columns=['Hours from Start', 'Shifted Hours'])
return df
### demo function call ###
sequence_submissions(demo_tdf_ex7)
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex7
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_7',
'func': sequence_submissions, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': True, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 2*10 ** (-3)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, hours_by_exid. It is the result of a correct implementation of sequence_submissions when run on clean_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
hours_by_exid = cse6040.utils.load_obj_from_file('hours.dill')
# Example:
display(hours_by_exid['exam1'].head())
hours_by_exid['exam1'].info()
Exercise 8: merge_gt

Given a tibble pointsdf of earned points (per Ex. 2) and another tibble hoursdf of sequenced hours-per-problem (per Ex. 7), complete the function,

def merge_gt(pointsdf, hoursdf):
    ...

so that it merges these into a single tibble formatted as described below.

Inputs:
- pointsdf: A tibble of earned points, as computed by Exercise 2. It should have the columns 'Email', one or more columns named 'problem#' where # is an integer string, and 'Earned'.
- hoursdf: A tibble of sequenced hours-per-problem, as computed in Exercise 7. It should have the columns 'Email', 'Milestone', 'Start', 'Submit', and 'Hours'.

Your tasks:
- Melt pointsdf so that its 'problem#' columns become values of a new column named 'Milestone' and the point values become a new column named 'Points'. The 'Points' column should hold floats.
- Merge this result with hoursdf on the common 'Email' and 'Milestone' key-columns. (Use an inner-join/merge so that only keys that match in both inputs are retained.)
- Sort the rows by 'Email' and 'Submit' in ascending order.

Output: Return the result as a new tibble with the columns 'Email', 'Earned', 'Start', 'Submit', 'Milestone', 'Points', and 'Hours'. You should not modify either input, pointsdf or hoursdf.
### Define demo inputs ###
demo_pointsdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['pointsdf']
demo_hoursdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['hoursdf']
print("* `demo_pointsdf_ex8`:")
display(demo_pointsdf_ex8)
print("* `demo_hoursdf_ex8`:")
display(demo_hoursdf_ex8)
In the above inputs, observe that only 'justin.jernee@gatech.edu' has data in both the points and hours tibbles. Therefore, the merged result will only include that student's data, since we are using an inner merge. For this input, the result is:
 | Email | Earned | Milestone | Points | Start | Submit | Hours
---|---|---|---|---|---|---|---
0 | justin.jernee@gatech.edu | 40 | problem2 | 2 | 2018-12-08 10:53:50 | 2018-12-08 15:12:42 | 4.314
1 | justin.jernee@gatech.edu | 40 | problem0 | 10 | 2018-12-08 10:53:50 | 2018-12-08 19:21:39 | 4.15
2 | justin.jernee@gatech.edu | 40 | problem1 | 10 | 2018-12-08 10:53:50 | 2018-12-08 20:16:38 | 0.916
3 | justin.jernee@gatech.edu | 40 | problem5 | 5 | 2018-12-08 10:53:50 | 2018-12-09 13:53:43 | 17.618
4 | justin.jernee@gatech.edu | 40 | problem3 | 10 | 2018-12-08 10:53:50 | 2018-12-09 20:06:24 | 6.211
5 | justin.jernee@gatech.edu | 40 | problem4 | 3 | 2018-12-08 10:53:50 | 2018-12-09 22:08:52 | 2.042
Observe the ordering of the rows. The problems appear in ascending order of submission time, per the sorting requirement of this exercise.
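Note that the built-in pd.melt handles the reshaping step directly. Here is a minimal sketch on toy data (the reference solution below also demonstrates a hand-rolled alternative):
import pandas as pd

toy = pd.DataFrame({'Email': ['a@x.com'], 'Earned': [12.0],
                    'problem0': [10.0], 'problem1': [2.0]})
melted = pd.melt(toy, id_vars=['Email', 'Earned'],
                 var_name='Milestone', value_name='Points')
print(melted)  # one row per (Email, problem), with 'Milestone' and 'Points' columns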
### Exercise 8 solution
def merge_gt(pointsdf, hoursdf):
'''
INPUT:
1. pointsdf: has columns 'Email', 'problem#' columns where # is an integer string, and 'Earned'
2. hoursdf: has columns 'Email', 'Milestone', 'Start', 'Submit', and 'Hours'
GOAL:
Melt pointdf so that the 'problem#' columns become rows (in a column called 'Milestone'),
and the points become floats in a column called 'Points'.
Inner merge this with hoursdf on 'Email' and 'Milestone'
Final data frame should have columns: 'Email', 'Earned', 'Milestone', 'Points', 'Start', 'Submit', and 'Hours'
STRATEGY:
1. Melt pointsdf. Keep 'Email' and 'Earned' as is, all other columns (aka the problem ones) get melted.
Problem column should be named 'Milestone', and point values should be called 'Points'.
Let's call this new data frame melted_df.
2. Inner merge melted_df with hoursdf on 'Email' and 'Milestone'. Let's call this new data frame merged_df.
3. Sort merged_df by 'Email' and 'Submit' columns in ascending order
4. Reset index
5. Return this data frame
'''
# SOLUTION:
# Easiest Way to Melt:
# https://pandas.pydata.org/docs/reference/api/pandas.melt.html
#melted_df = pd.melt(pointsdf, id_vars=['Email','Earned'], var_name='Milestone', value_name='Points')
# Alternative Way to Melt using NB7 Melt Function (Changed to be helper function per Office Hours feedback):
# We know our keep_vars, not our col_vars, so reverse difference line in the function below and input parameter to be
# keep_vars instead of col_vars
def melt(df, keep_vars, key, value):
col_vals = df.columns.difference(keep_vars)
melted_sections = []
for c in col_vals:
melted_c = df[keep_vars].copy()
melted_c[key] = c
melted_c[value] = df[c]
melted_sections.append(melted_c)
melted = pd.concat(melted_sections)
return melted
melted_df = melt(pointsdf, ['Email', 'Earned'], 'Milestone', 'Points')
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
merged_df = melted_df.merge(hoursdf, how='inner', on=['Email', 'Milestone'])
sorted_df = merged_df.sort_values(by=['Email','Submit'])
return sorted_df.reset_index(drop=True)
### demo function call ###
merge_gt(demo_pointsdf_ex8, demo_hoursdf_ex8)
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex8
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_8',
'func': merge_gt, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'pointsdf':{
'dtype':'df', # data type of param.
'check_modified':True,
},
'hoursdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': True, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, merged_by_exid. It is the result of a correct implementation of merge_gt when run on earned_by_exid and hours_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
merged_by_exid = cse6040.utils.load_obj_from_file('merged.dill')
# Example:
display(merged_by_exid['exam1'].head())
merged_by_exid['exam1'].info()
Exercise 9: concat_exams

Exercise 8 produces a (merged) dataframe for each exam. But when we have multiple exams, we may want to analyze them all at once.

Suppose you are given a dictionary of exams, df_by_exid, whose keys are exam IDs (strings) and values are dataframes. Complete the function,

def concat_exams(df_by_exid):
    ...

to "concatenate" these as explained below.

Input: df_by_exid. If exid is a string exam ID, then df_by_exid[exid] is a dataframe associated with that exam. Assume that all dataframes in df_by_exid have the same columns.

Output: Return a new tibble formed by "stacking" these dataframes one on top of the other, with an additional column named 'Exam' whose value records the originating exid key.

Comments/hints: As long as each row's 'Exam' value is set correctly, the ordering of rows and columns does not matter.

### Define demo inputs ###
demo_df_by_exid_ex9 = cse6040.utils.load_obj_from_file('demo_df_by_exid_ex9.dill')
for exid, df in demo_df_by_exid_ex9.items():
print(f"\n*** exid = '{exid}' ***")
display(demo_df_by_exid_ex9[exid])
For the preceding inputs, a valid concatenated output is:
 | Start | Points | Earned | Email | Submit | Milestone | Hours | Exam
---|---|---|---|---|---|---|---|---
0 | 2018-12-09 08:55:36 | 2 | 38 | chelsea.dilann@gatech.edu | 2018-12-10 11:17:14 | problem2 | 8.557 | exam3
1 | 2018-12-08 12:27:33 | 5.5 | 25.5 | kristopher.erandi@gmail.com | 2018-12-09 16:09:17 | problem0 | 2.831 | exam3
2 | 2018-09-30 01:05:14 | 10 | 45 | meredith.rakyla@hotmail.com | 2018-09-30 19:11:33 | problem2 | 3.386 | exam1
3 | 2018-09-28 16:21:04 | 10 | 45 | stacy.loneta@gmail.com | 2018-09-28 17:51:52 | problem0 | 1.513 | exam1
4 | 2018-11-03 20:05:45 | 2 | 22 | leta.kartier@gmail.com | 2018-11-05 05:16:51 | problem1 | 2.097 | exam2
5 | 2018-11-04 20:02:24 | 5 | 16 | christopher.jullianne@gatech.edu | 2018-11-05 10:51:45 | problem0 | 14.822 | exam2
6 | 2018-11-03 06:59:28 | 10 | 50 | lola.fysher@gmail.com | 2018-11-03 08:06:09 | problem1 | 1.111 | exam2
7 | 2018-11-03 10:19:14 | 10 | 39 | susan.tedrina@gatech.edu | 2018-11-04 18:28:46 | problem3 | 1.337 | exam2
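As an aside, pd.concat can also tag rows by their originating dictionary key in one step. A sketch on toy inputs (equivalent in spirit to the reference solution below, which builds the 'Exam' column explicitly):
import pandas as pd

df_by_exid = {'exam1': pd.DataFrame({'Points': [10.0]}),
              'exam2': pd.DataFrame({'Points': [7.0]})}
tall = (pd.concat(df_by_exid, names=['Exam'])  # dict keys become an index level
          .reset_index(level='Exam')           # move that level into an 'Exam' column
          .reset_index(drop=True))
print(tall)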
### Exercise 9 solution
def concat_exams(df_by_exid):
'''
INPUT:
-A dictionary df_by_exid.
-If exid is a string exam ID, then df_by_exid[exid] is a dataframe associated with that exam.
-All dataframes in df_by_exid have the same columns.
GOAL:
-Return a new combined tibble formed by "stacking" these dataframes one on top of the other.
-Include a column named 'Exam' whose value records the originating exid key.
STRATEGY:
1. Create an empty list to hold all of our data frames we want to stack. Let's call this list_of_all_dfs
2. Iterate over exid in df_by_exid
3. Grab the associated data frame using df_by_exid[exid]. Let's call this 'df'
4. Make a copy of 'df'. Let's call this 'df_copy'
5. Add an 'Exam' column to 'df_copy' where all rows = exid
6. Append 'df_copy' to our list_of_all_dfs
7. Concatenate all of the data frames in our list_of_all_dfs.
8. Reset index and return this data frame
'''
# SOLUTION:
list_of_all_dfs = []
for exid in df_by_exid:
df = df_by_exid[exid]
df_copy = df.copy()
df_copy['Exam'] = exid
list_of_all_dfs.append(df_copy)
# https://pandas.pydata.org/docs/reference/api/pandas.concat.html
tall_df = pd.concat(list_of_all_dfs)
return tall_df.reset_index(drop=True)
### demo function call ###
concat_exams(demo_df_by_exid_ex9)
The cell below will test your solution for Exercise 9. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex9
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_9',
'func': concat_exams, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df_by_exid':{
'dtype':'dict', # data type of param.
'check_modified':False, # @TODO: FIXME
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, concatdf. It is the result of a correct implementation of concat_exams when run on merged_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
concatdf = cse6040.utils.load_obj_from_file('concat.dill')
concatdf
Exercise 10: summarize_exams

Suppose you are given df, a concatenated dataframe like the one produced in Ex. 9. Complete the function,

def summarize_exams(df):
    ...

so that it determines the mean exam score and mean time to complete the exam for every exam, returning the result as a dataframe (see below).

Input: A dataframe, similar to the output of Exercise 9. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:
- 'Email' (string): a student email address
- 'Exam' (string): the ID of an exam
- 'Milestone' (string): the problem number
- 'Points' (float): the points scored on that problem (milestone) for that exam
- 'Hours' (float): the estimated number of hours spent on that problem for that exam

Your task: For each student and exam, you need to determine their total points and total time. Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.

Output: Return a pandas dataframe with three columns:
- 'Exam' (string): the exam ID
- 'Points' (float): the total points for that exam averaged over all students
- 'Hours' (float): the total time in hours for that exam averaged over all students

Comments/hints: The averages you are to report are by exam taken over all students, not over all (student, problem) pairs! (See the sketch just below.)
### Define demo inputs ###
demo_df_ex10 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex10
### Exercise 10 solution
def summarize_exams(df):
'''
INPUT:
A dataframe, where each row has the data for a (student, exam, problem) triple, but the only columns provided are:
1. 'Email' (string): a student email address
2. 'Exam' (string): the ID of an exam
3. 'Milestone' (string): the problem number
4. 'Points' (float): the points scored on that problem (milestone) for that exam
5. 'Hours' (float): the estimated number of hours spent on that problem for that exam
GOAL:
For each student and exam, determine their total points and total time.
Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.
Return a pandas dataframe with three columns:
1. 'Exam' (string): the exam ID
2. 'Points' (float): the total points for that exam averaged over all students
3. 'Hours' (float): the total time in hours for that exam averaged over all students
Hint: These averages you are to report are by exam taken over all students, not over all (student, problem) pairs!
STRATEGY:
1. Let's first determine total points and total time for each student and exam.
So group by 'Email' and 'Exam' and sum. Let's call this df_sum.
2. Reset index to bring index ('Email' and 'Exam') back as regular columns
3. Now let's find the mean of 'Hours' and 'Points' by Exam.
So group by 'Exam' and find mean.
4. Reset index to bring index ('Exam') back as a regular column
5. Return this data frame.
'''
# SOLUTION:
    df_sum = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum()  # sum only the numeric columns (avoids aggregating the 'Milestone' strings)
df_sum = df_sum.reset_index()
df_mean = df_sum.groupby('Exam').mean()
df_mean = df_mean.reset_index()
return df_mean
### demo function call ###
summarize_exams(demo_df_ex10)
The cell below will test your solution for Exercise 10. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex10
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_10',
'func': summarize_exams, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, exam_summary. It is the result of a correct implementation of summarize_exams when run on concatdf from the full dataset.
exam_summary = cse6040.utils.load_obj_from_file('exam_summary.dill')
exam_summary
For the case of three exams, the instructor would like to know whether the first two scores and the timing numbers can predict the third exam score. To carry out this analysis, the instructor needs to build a data matrix and wants your help.
Exercise 11: form_data_table

Suppose you are given the same input as Ex. 10, that is, a concatenated dataframe df like the one produced in Ex. 9. Complete the function,

def form_data_table(df):
    ...

so that it constructs a dataframe whose values will (later) become the entries of a linear-regression data matrix.

Input: The same dataframe as Ex. 10. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:
- 'Email' (string): a student email address
- 'Exam' (string): the ID of an exam
- 'Milestone' (string): the problem number
- 'Points' (float): the points scored on that problem (milestone) for that exam
- 'Hours' (float): the estimated number of hours spent on that problem for that exam

Output: Construct and return a new dataframe where each row corresponds with a student, and the columns are as follows:
- 'Email' (string): The student email. There will be one row per student in the output.
- 'Points#' (float): One column of total points scored for each exam present in the input. For example, if df has data for 'exam2' and 'exam3', then the output should have 'Points2' and 'Points3' columns, with the values being the total points earned by the student on the corresponding exam.
- 'Hours#' (float): Similarly, one column of total exam time for each exam present in the input.

As a final step, if there are any NaNs, drop the corresponding rows.

The idea behind this format is that each student will be an "observation" or "sample" for linear regression, and the columns will serve as predictors or features.

Comments/hints: As usual, you should not modify the input dataframe, df.
### Define demo inputs ###
demo_df_ex11 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex11
### Exercise 11 solution
def form_data_table(df):
'''
INPUT:
A data frame df, where each row gives the data for a (student, exam, problem) triple,
but the only columns provided are:
-'Email' (string): a student email address
-'Exam' (string): the ID of an exam
-'Milestone' (string): the problem number
-'Points' (float): the points scored on that problem (milestone) for that exam
-'Hours' (float): the estimated number of hours spent on that problem for that exam
GOAL:
Return a new dataframe where each row corresponds with a student, and the columns are as follows:
-'Email' (string): The student email. There will be one row per student in the output.
-'Points#' (float): One column of total points scored for each exam present in the input.
    For example, if df has data for 'exam2' and 'exam3', then the output should have 'Points2' and 'Points3' columns,
with the values being the total points earned by the student on the corresponding exam.
-'Hours#' (float): Similarly, one column of total exam time for each exam present in the input.
If there are any NaNs, drop the corresponding rows.
STRATEGY:
1. This looks like slightly more complicated casting (need to cast twice as there are multiple value columns you
need to create with new names)
2. *Borrow* casting function from NB7 solution. Add as a helper function.
3. Find the sum of 'Points' and 'Hours' grouped by 'Email'. Let's call this df_sum.
4. Reset index to bring Email index back as an actual column
5. Make 2 data frames: one containing Email + Exam + Points info and the other containing Email + Exam + Hours info
6. Call Cast helper function on each of these data frames (passing in correct parameters for each)
7. Merge the two resulting data frames together (on 'Email') to get our final data frame
8. Drop any rows that contain na
9. Return this data frame
'''
# SOLUTION:
def cast(df, key, value, join_how='outer'):
fixed_vars = df.columns.difference([key, value])
tibble = pd.DataFrame(columns=fixed_vars)
new_vars = df[key].unique()
for v in new_vars:
            df_v = df[df[key] == v].copy()  # copy so deleting a column doesn't touch a view of the input
            del df_v[key]
df_v = df_v.rename(columns={value: value+v[4:]})
tibble = tibble.merge(df_v, on=list(fixed_vars),how=join_how)
return tibble
    df_sum = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum()  # restrict to the numeric columns before summing
df_sum = df_sum.reset_index()
df_points = df_sum[['Email', 'Exam', 'Points']]
df_points = cast(df_points, 'Exam', 'Points', 'outer')
df_hours = df_sum[['Email', 'Exam', 'Hours']]
df_hours = cast(df_hours, 'Exam', 'Hours', 'outer')
df_final = df_points.merge(df_hours, on='Email')
df_final = df_final.dropna()
return df_final
### demo function call ###
form_data_table(demo_df_ex11)
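For comparison, here is an alternative sketch using DataFrame.pivot in place of the hand-rolled cast. The name form_data_table_pivot is hypothetical, and like the reference solution it assumes exam IDs have the form 'examN':
def form_data_table_pivot(df):
    # Per-student, per-exam totals (numeric columns only):
    totals = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum().reset_index()
    # Widen: one ('Points', exid) and ('Hours', exid) column per exam:
    wide = totals.pivot(index='Email', columns='Exam', values=['Points', 'Hours'])
    # Flatten ('Points', 'exam1') -> 'Points1', etc.:
    wide.columns = [name + exid[4:] for name, exid in wide.columns]
    return wide.reset_index().dropna()

# Calling form_data_table_pivot(demo_df_ex11) should agree with
# form_data_table(demo_df_ex11) up to row and column order.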
The cell below will test your solution for Exercise 11. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex11
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_11',
'func': form_data_table, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, data_table. It is the result of a correct implementation of form_data_table when run on concatdf from the full dataset.
data_table = cse6040.utils.load_obj_from_file('data_table.dill')
data_table
If you have made it this far, congratulations on completing the exam. Don't forget to submit!
Postscript: We went through a lot of work in this exam to get the data table. So if we try to regress, say, 'Points3' from the other columns, how well does that work? Here is a code fragment to check. What does the result say?
# Predictors:
data_cols = data_table[['Points1', 'Hours1', 'Points2', 'Hours2']]
X = data_cols.values
X /= X.max(axis=0) # Normalize columns
# Response:
resp_col = 'Points3'
y = data_table[resp_col].values
y /= y.max()
from sklearn.linear_model import LinearRegression
model_type = LinearRegression(fit_intercept=True, positive=True)
model = model_type.fit(X, y)
print("* Predictors:", data_cols)
print("* Response:", resp_col)
print("* Model coefficients:", model.coef_)
print("* Model intercept:", model.intercept_)
y_pred = model.predict(X)
print(f"* Mean relative error: {abs((y_pred - y)/y).mean()*100:.1f}%")