Version 1.0
All of the header information is important. Please read it..
Topics, number of exercises: This problem builds on your knowledge of pandas, Numpy, and linear regression. It has 12 exercises exercises, numbered 0 to 11. There are 22 available points. However, to earn 100% the threshold is 15 points. (Therefore, once you hit 15 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs ###
load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
Debugging you code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (careful when printing large objects, you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Exercises 0, 3, and 6 involve reading only (no coding), so they are "free" points. However, you still need to run the associated test cells and submit to get those points.
Final reminders:
An instructor for another course has asked you to help analyze some of their exam data. The exams were completed electronically, in a format similar to our class's, but with some twists:
'exam1'
, 'exam2'
, and so on. Exams were timed, but with different time limits for each exam.The dataset itself has two main parts:
Each exam has its own grading and timing data, as you'll see.
Your overall task: You will help clean up and analyze this data, culminating in an assessment, based on linear regression, to see whether or not the last exam score can be predicted from the scores and timing information of earlier exams.
To get started, run the following code cells. Take note of preloaded modules, as it's likely you'll want to use them.
### Global Imports
### BEGIN HIDDEN TESTS
%load_ext autoreload
%autoreload 2
if False: # set to True to set up
REGENERATE_OUTPUTS = False
import dill
import hashlib
def hash_check(f1, f2, verbose=True):
with open(f1, 'rb') as f:
h1 = hashlib.md5(f.read()).hexdigest()
with open(f2, 'rb') as f:
h2 = hashlib.md5(f.read()).hexdigest()
if verbose:
print(h1)
print(h2)
assert h1 == h2, f'The file "{f1}" has been modified'
with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
dill.dump(hash_check, f)
del hash_check
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
for fname in ['__init__.py', 'utils.py']:
hash_check(f'cse6040/{fname}', f'resource/asnlib/public/cse6040/{fname}')
del hash_check
### END HIDDEN TESTS
# Modules you may find useful:
import pandas as pd
import numpy as np
# Some extra functions that this notebook needs:
import cse6040.utils
from pprint import pprint # Pretty-printer
# Load the dataset:
print("\n==> Loading the dataset...")
from cse6040.examdata import load_dataset
grades_by_exid, times_by_exid = load_dataset()
print("\n==> Done executing the setup and ready for take-off!")
grades
dataframe¶This exercise involves looking at the grades data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).
The global variable grades_by_exam
is a dictionary of pandas dataframes. Each key is an exam ID (e.g., 'exam2'
), and each value is a dataframe holding the grading data for that exam. Let's take a look at an example, which are the grades for the exam with exam id 'exam1'
.
list(grades_by_exid.keys())
Evidently, there are three exams in this dataset. However, the code you write later should not assume three exams unless otherwise specified.
grades_by_exid['exam1']
Note: The data has been anonymized with fake email addresses. Any resemblance to actual email addresses is purely coincidental.
Observations:
'exam1'
, in the example above. There are similar dataframes for 'exam2'
and 'exam3'
.'--------'
. For example, see row 1, where juan.quay@aol.com
did not submit problem4
.### test_cell_ex0 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
remove_invalid_grades
¶Given a grades dataframe, gdf
, complete the function,
def remove_invalid_grades(gdf):
...
so that it returns clean copy of gdf
.
Input: A grades dataframe, gdf
. It will have a column named 'Email'
and one or more columns named 'problem#'
where #
is an integer.
Your task: Copy this dataframe and clean it as follows:
'--------'
, should be replaced with a not-a-number (NaN) value, i.e., a Numpy np.nan
..dtype
of float
.Output: Return a clean tibble copy, without modifying the input gdf
. It should contain exactly the same columns as the input, even if the result is an empty dataframe, and the .dtype
should be as specified above.
Caveat(s)/comment(s)/hint(s):
.replace()
and .filter()
methods may be helpful (but are not required).'problem0'
through 'problem4'
as shown in Exercise 0, but it might also only have, say, two columns named 'problem1'
and 'problem3'
, or it might have four problem columns named 'problem0'
, 'problem1'
, 'problem25'
, and 'problem234'
. Your code should robustly handle these cases. The dataframe's .dropna()
method may be helpful.### Define demo inputs ###
demo_gdf_ex1 = grades_by_exid['exam1'].loc[[0, 1, 2, 683, 3, 957, 4, 1546]].reset_index(drop=True)
demo_gdf_ex1
Consider the demo input, demo_gdf_ex1
, shown above. A correct output is:
problem0 | problem1 | problem2 | problem3 | problem4 | ||
---|---|---|---|---|---|---|
0 | sammie.jaydaa@gmail.com | 7 | 10 | 10 | 10 | 10 |
1 | juan.quay@aol.com | 10 | 10 | 6.5 | 0 | nan |
2 | gino.branston@gatech.edu | 10 | 5 | 10 | 10 | 10 |
3 | danny.jameriah@gmail.com | 7 | nan | nan | nan | nan |
4 | robert.gioanna@gatech.edu | 10 | 10 | 10 | 10 | 10 |
5 | adam.evemarie@gmail.com | 10 | 10 | 10 | 10 | 10 |
Observe that rows 5 and 7 from the input are removed entirely since all of their problem values are missing. By contrast, rows 1 and 3 from the input have some non-NaN values and thus are retained.
In addition, the output types (inspected via .info()
) should match the following:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Email 6 non-null object
1 problem0 6 non-null float64
2 problem1 5 non-null float64
3 problem2 5 non-null float64
4 problem3 5 non-null float64
5 problem4 4 non-null float64
dtypes: float64(5), object(1)
### Exercise 1 solution
def remove_invalid_grades(gdf):
pass # Can be removed
### BEGIN SOLUTION
from numpy import nan
gdf = gdf.replace(to_replace='--------', value=nan)
problem_cols = gdf.filter(regex=r'problem\d+').columns
gdf = gdf.dropna(how='all', subset=problem_cols)
gdf[problem_cols] = gdf[problem_cols].astype(float)
gdf = gdf.reset_index(drop=True)
return gdf
### END SOLUTION
### demo function call ###
demo_result_ex1 = remove_invalid_grades(demo_gdf_ex1)
display(demo_result_ex1)
demo_result_ex1.info()
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex1
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_1',
'func': remove_invalid_grades, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'gdf':{
'dtype': 'df', # data type of param.
'check_modified': True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, valid_grades_by_exid
. It is the result of a correct implementation of remove_invalid_grades
when run on grades_by_exid
from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
valid_grades_by_exid = cse6040.utils.load_obj_from_file('valid_grades.dill')
# Example:
display(valid_grades_by_exid['exam1'].iloc[:4])
valid_grades_by_exid['exam1'].info()
calc_earned_points
¶Suppose you have a grades dataframe, valid_gdf
, having only valid rows as calculated in Exercise 1. Complete the function,
def calc_earned_points(valid_gdf):
...
so that it returns a new copy of valid_gdf
with an additional column holding the total points earned.
Input: A dataframe valid_gdf
with only valid rows, per Exercise 1.
Your task: Create a copy and sum the points earned by each student.
Output: Return a new tibble with the same columns as valid_gdf
plus an additional column, 'Earned'
, of .dtype
float
. This column should contain the student's total score, that is, the sum of points on their problems. Any NaN values should be ignored. You should not modify the input valid_gdf
.
### Define demo inputs ###
demo_valid_gdf_ex2 = valid_grades_by_exid['exam1'].sample(6, random_state=3_456_789_012).reset_index(drop=True)
demo_valid_gdf_ex2
The demo included in the solution cell below should display the following output:
problem0 | problem1 | problem2 | problem3 | problem4 | Earned | ||
---|---|---|---|---|---|---|---|
0 | rene.amritpal@gmail.com | 10 | 10 | 10 | 10 | 10 | 50 |
1 | albert.averley@gmail.com | 10 | 2 | 8 | 10 | 10 | 40 |
2 | mary.yahshua@gatech.edu | 7 | 2 | 6.5 | 5 | nan | 20.5 |
3 | jillian.myler@gmail.com | 10 | 10 | 10 | 10 | 1 | 41 |
4 | raina.sheehan@gmail.com | 10 | 10 | 10 | 5 | 10 | 45 |
5 | allison.link@yahoo.com | 7 | 2 | nan | nan | nan | 9 |
Note: Although the values appear to include integers, in fact, the
.dtype
of the columns should befloat
.
### Exercise 2 solution ###
def calc_earned_points(valid_gdf):
pass # Can be removed
### BEGIN SOLUTION
gdf = valid_gdf.copy()
gdf['Earned'] = gdf.filter(regex=r'problem\d+').sum(axis='columns')
return gdf.reset_index(drop=True)
### END SOLUTION
### demo function call ###
calc_earned_points(demo_valid_gdf_ex2)
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex2 ###
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_2',
'func': calc_earned_points, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'valid_gdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, earned_by_exid
. It is the result of a correct implementation of calc_earned_grades
when run on valid_grades_by_exid
from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
earned_by_exid = cse6040.utils.load_obj_from_file('earned_grades.dill')
# Example:
display(earned_by_exid['exam1'].iloc[:4])
earned_by_exid['exam1'].info()
timings
dataset¶This exercise involves looking at the grades data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).
Similar to the grades dataset (see Ex. 0), each exam has a timings dataset. Here's what it looks like for 'exam1'
:
demo_times3_ex3 = times_by_exid['exam3']
demo_times3_ex3
Observations:
'start date-time'
and 'submit date-time'
columns hold timestamps, and their .dtypes
are special pandas Timestamp
objects. The mini-tutorial below shows you how they work.'start date-time'
value is when the student began the exam. It will have the same value for all rows of a given student.'submit date-time'
value is the timestamp of the student's last submission for that problem. That is, only one submission is recorded per student and problem.Implications: One consequence of the data we have is that we cannot really know how the student navigated the exam: we know when they started, but if they jumped around or submitted a problem multiple times, the only information we have is the last submission for each problem. Nevertheless, we will use the information we have to guess at their timeline, after some clean up.
Timestamps: The printed timestamps have the following form:
2018-12-07 18:35:23
That corresponds to December 7, 2018, at 6:35 pm (and 23 seconds).
Luckily, you don't need to parse timestamps! These objects make it easy to sort and calculate time differences, which is what we'll need. For instance, we can check if one date is larger than another using the usual comparison-operators, like >
or <=
:
(demo_times3_ex3['submit date-time'] > demo_times3_ex3['start date-time']).head()
And we can subtract times using the usual -
operator, as we would for integers or floats:
(demo_times3_ex3['submit date-time'] - demo_times3_ex3['start date-time']).head()
The difference is stored as a timedelta64
object. The way to read printed item 0
above is "20 hours, 45 minutes, and 19 seconds."
Time-delta objects can be converted into a regular float representing the number of seconds using a method called .total_seconds()
:
(demo_times3_ex3['submit date-time'].iloc[0] - demo_times3_ex3['start date-time'].iloc[0]).total_seconds()
We will need this transformation to simplify some calculations later on, so tuck this fact away.
Some messiness: The data has some "impurities." In particular, check out the data for 'kenneth.bryland@gatech.edu'
:
demo_times3_ex3_messy = demo_times3_ex3.loc[[206, 1111, 1780, 2511, 3375]]
demo_times3_ex3_messy
Observe that their 'problem0'
and 'problem1'
submission times occur before their start times, which we confirm next:
demo_times3_ex3_messy['submit date-time'] > demo_times3_ex3_messy['start date-time']
While it turns out there is a reason for that, we will want to omit cases like that to simplify our analysis.
### test_cell_ex3 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
remove_causality_violators
¶Suppose you are given a timings dataframe, tdf
, per Exercise 3. Complete the function,
def remove_causality_violators(tdf):
...
so that it returns a new, cleaned copy that omits records for any student who has timestamps that "violate causality." A causality violation means that their submission time is strictly less than their start time.
Input: A timings dataframe, tdf
, with the columns 'email'
, 'milestone'
, 'start date-time'
, and 'submit date-time'
.
Your task: Identify all students who have any causality violations and omit their rows from the output dataframe.
Output: Return a new tibble that includes only rows from tdf
for students whose problem-records do not violate causality. The output column .dtypes
should match those of the input. You should not modify the input, tdf
.
Comment(s): If a student has even one problem that violates causality, you should omit all of their rows. That is, we will consider their entire exam data invalid for the purpose of our subsequent analysis.
### Define demo inputs ###
demo_tdf_ex4 = times_by_exid['exam3'].loc[[206, 279, 439, 1111, 1344, 1780, 2511, 3224, 3375]].reset_index(drop=True)
demo_tdf_ex4
The demo included in the solution cell below should display the following output:
milestone | start date-time | submit date-time | ||
---|---|---|---|---|
0 | daniel.tashera@gmail.com | problem0 | 2018-12-09 01:12:56 | 2018-12-09 02:29:33 |
1 | barbara.wilhemina@gatech.edu | problem0 | 2018-12-10 02:43:28 | 2018-12-10 06:21:30 |
2 | barbara.wilhemina@gatech.edu | problem2 | 2018-12-10 02:43:28 | 2018-12-10 09:09:43 |
3 | barbara.wilhemina@gatech.edu | problem5 | 2018-12-10 02:43:28 | 2018-12-10 09:04:35 |
Observe that kenneth.bryland@gatech.edu
is omitted from this output because their 'problem0'
and 'problem1'
records violate causality. (Even though their remaining problems obey causality, we still drop them because of the problematic rows.)
### Exercise 4 solution ###
def remove_causality_violators(tdf):
pass # Can be removed
### BEGIN SOLUTION
tdf = tdf.copy()
violation = tdf['submit date-time'] < tdf['start date-time']
violators = set(tdf[violation]['email'].unique())
tdf = tdf[~tdf['email'].isin(violators)]
tdf = tdf.reset_index(drop=True)
return tdf
# Example of an incorrect solution
def remove_causality_violators__ERROR(tdf):
df = tdf.copy()
return df[df["submit date-time"] > df["start date-time"]]
### END SOLUTION
### demo function call ###
remove_causality_violators(demo_tdf_ex4)
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex4
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_4',
'func': remove_causality_violators, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, valid_times_by_exid
. It is the result of a correct implementation of remove_causality_violaters
when run on times_by_exid
from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
valid_times_by_exid = cse6040.utils.load_obj_from_file('valid_times.dill')
# Example:
assert (valid_times_by_exid['exam3']['submit date-time'] >= valid_times_by_exid['exam3']['start date-time']).all()
display(valid_times_by_exid['exam1'].iloc[:4])
valid_times_by_exid['exam1'].info()
clean_times_colnames
¶Let's tidy up the column names for the timings data. Complete the function,
def clean_times_colnames(tdf):
...
to return a new tibble from tdf
with columns renamed as follows.
'email'
=> 'Email'
'milestone'
=> 'Milestone'
'start date-time'
=> 'Start'
'submit date-time'
=> 'Submit'
Element types for the columns (.dtype
) should be preserved, and the input dataframe should not be modified.
### define demo inputs ###
demo_tdf_ex5 = valid_times_by_exid['exam2'].sample(5, random_state=567_890_123, replace=False).reset_index(drop=True)
demo_tdf_ex5
The demo included in the solution cell below should display the following output:
Milestone | Start | Submit | ||
---|---|---|---|---|
0 | jazlyn.florin@gmail.com | problem4 | 2018-11-04 12:45:10 | 2018-11-05 20:50:41 |
1 | christian.miyisha@yahoo.com | problem2 | 2018-11-03 07:38:00 | 2018-11-04 12:21:12 |
2 | taryn.raneshia@yahoo.co.id | problem0 | 2018-11-04 08:25:24 | 2018-11-05 17:59:02 |
3 | joseph.sherburne@gmail.com | problem1 | 2018-11-02 16:45:06 | 2018-11-03 01:47:58 |
4 | thomas.dyanni@gmail.com | problem3 | 2018-11-03 14:20:56 | 2018-11-04 23:31:18 |
### Exercise 5 solution
def clean_times_colnames(tdf):
pass # Can be removed
### BEGIN SOLUTION
return tdf.rename(columns={'email': 'Email',
'milestone': 'Milestone',
'start date-time': 'Start',
'submit date-time': 'Submit',
'dt': 'Dt'})
### END SOLUTION
### demo function call ###
clean_times_colnames(demo_tdf_ex5)
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex5
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_5',
'func': clean_times_colnames, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, clean_times_by_exid
. It is the result of a correct implementation of clean_times_colnames
when run on valid_times_by_exid
from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
clean_times_by_exid = cse6040.utils.load_obj_from_file('clean_times.dill')
# Example:
display(clean_times_by_exid['exam1'].head())
clean_times_by_exid['exam1'].info()
Our next analysis will estimate the test-takers' behavior by analyzing their submission times.
For example, consider the submission times for 'sherry.ayleigh@aol.com'
, stored in the dataframe demo_tdf_ex6
below:
demo_ex6_tdf = clean_times_by_exid['exam3'].loc[[34, 593, 1624, 2189, 2344, 3122]]
demo_ex6_tdf
Based on the submission times, ordering the rows by the 'Submit'
timestamp would yield a sensible timeline of submission events:
Milestone | Start | Submit | ||
---|---|---|---|---|
34 | sherry.ayleigh@aol.com | problem0 | 2018-12-06 08:32:21 | 2018-12-06 09:07:16 |
593 | sherry.ayleigh@aol.com | problem1 | 2018-12-06 08:32:21 | 2018-12-06 10:04:10 |
1624 | sherry.ayleigh@aol.com | problem2 | 2018-12-06 08:32:21 | 2018-12-06 11:30:28 |
3122 | sherry.ayleigh@aol.com | problem5 | 2018-12-06 08:32:21 | 2018-12-06 13:34:13 |
2344 | sherry.ayleigh@aol.com | problem4 | 2018-12-06 08:32:21 | 2018-12-06 17:53:48 |
2189 | sherry.ayleigh@aol.com | problem3 | 2018-12-06 08:32:21 | 2018-12-06 20:19:21 |
We refer to this ordering of submission times as the submission sequence. It is our best guess as to how they moved through the exam if we assume they did one problem at a time. From this submission sequence for this student, we might conclude the following.
First, the student started the exam at 2018-12-06 08:32:21
and submitted 'problem0'
at 2018-12-06 09:07:16
. That would be:
demo_ex6_tdf['Submit'].iloc[0] - demo_ex6_tdf['Start'].iloc[0]
or 34 minutes and 55 seconds, which is also 2,095 seconds = (2095/3600) ~ 0.582 hours, rounding to three decimal digits. In other words, we will assume that the student started the exam and then spent 0.582 hours on Problem 0.
Next, the student submitted 'problem1'
at 2018-12-06 10:04:10
. Relative to the previous submission time, that is 56 minutes and 54 seconds, or 3,414 seconds = (3414/3600) ~ 0.948 hours, again rounded to three digits. So we might guess that the student spent that much time on Problem 1.
Continuing, the sequence of problems solved was Problem 0, 1, 2, 5, 4, and finally, 3. The approximate number of hours spent on each are 0.58, 0.948, 1.438, 2.062, 4.326, and 2.426 hours, respectively.
### test_cell_ex6 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
sequence_submissions
¶Suppose you are given a (clean) timings tibble, tdf
, such as computed in Exercise 5. Complete the function,
def sequence_submissions(tdf):
...
so that it sequences the submissions for each student and returns a new tibble matching the format described below.
Input: A clean timings tibble, tdf
, with the columns 'Email'
, 'Milestone'
, 'Start'
, and 'Stop'
, per Ex. 5.
Your task: For each student, order their submissions and determine the amount of time spent on each problem, as illustrated in Ex. 6.
Output: Without modifying the input tibble, return a new tibble with the following columns:
'Email'
, 'Milestone'
, 'Start'
, and 'Submit'
, which keep the same values as given in tdf
.'Hours'
, the time in hours (a floating-point value) rounded to 3 decimal digits, corresponding to the estimated time spent on each problem.That is, each row of the output will indicate the estimated time spent by the student on the given problem based on the estimated timeline.
Additional requirements: The rows of this new tibble should be sorted by 'Email'
and 'Submit'
timestamp in ascending order.
### Define demo inputs ###
demo_tdf_ex7 = cse6040.utils.load_df_from_file('demo_tdf_ex7.df')
demo_tdf_ex7
In the preceding demo input, there are three students: 'sherry.ayleigh@aol.com'
, 'chelsea.jaretzi@utexas.edu'
, and 'jessica.rinda@gatech.edu'
. Observe that
'sherry.ayleigh@aol.com'
submitted Problems 0-5 inclusive.'chelsea.jaretzi@utexas.edu'
submitted Problems 0, 1, 2, 3, and 5. (So they skipped Problem 4.)'jessica.rinda@gatech.edu'
submitted Problems 1, 2, and 5. (So they skipped Problems 0, 3, and 4.)A correct implementation of sequence_submissions
on this input would produce:
Milestone | Start | Submit | Hours | ||
---|---|---|---|---|---|
0 | chelsea.jaretzi@utexas.edu | problem1 | 2018-12-11 09:50:18 | 2018-12-12 02:34:35 | 16.738 |
1 | chelsea.jaretzi@utexas.edu | problem0 | 2018-12-11 09:50:18 | 2018-12-12 02:40:16 | 0.095 |
2 | chelsea.jaretzi@utexas.edu | problem5 | 2018-12-11 09:50:18 | 2018-12-12 03:45:49 | 1.092 |
3 | chelsea.jaretzi@utexas.edu | problem3 | 2018-12-11 09:50:18 | 2018-12-12 05:49:20 | 2.059 |
4 | chelsea.jaretzi@utexas.edu | problem2 | 2018-12-11 09:50:18 | 2018-12-12 06:17:25 | 0.468 |
5 | jessica.rinda@gatech.edu | problem1 | 2018-12-07 12:05:41 | 2018-12-07 18:04:19 | 5.977 |
6 | jessica.rinda@gatech.edu | problem5 | 2018-12-07 12:05:41 | 2018-12-07 18:11:47 | 0.125 |
7 | jessica.rinda@gatech.edu | problem2 | 2018-12-07 12:05:41 | 2018-12-07 18:21:51 | 0.167 |
8 | sherry.ayleigh@aol.com | problem0 | 2018-12-06 08:32:21 | 2018-12-06 09:07:16 | 0.582 |
9 | sherry.ayleigh@aol.com | problem1 | 2018-12-06 08:32:21 | 2018-12-06 10:04:10 | 0.948 |
10 | sherry.ayleigh@aol.com | problem2 | 2018-12-06 08:32:21 | 2018-12-06 11:30:28 | 1.439 |
11 | sherry.ayleigh@aol.com | problem5 | 2018-12-06 08:32:21 | 2018-12-06 13:34:13 | 2.062 |
12 | sherry.ayleigh@aol.com | problem4 | 2018-12-06 08:32:21 | 2018-12-06 17:53:48 | 4.326 |
13 | sherry.ayleigh@aol.com | problem3 | 2018-12-06 08:32:21 | 2018-12-06 20:19:21 | 2.426 |
Observe the ordering by 'Email'
and 'Submit'
in ascending order. Also observe that the values in 'Hours'
are rounded to 3 decimal digits and correspond to differences in submission times (and submission to start time for the first entry).
### Exercise 7 solution
def sequence_submissions(tdf):
pass # Can be removed
### BEGIN SOLUTION
tdf = tdf.sort_values(['Email', 'Submit']).reset_index(drop=True)
tf = tdf.groupby('Email').apply(calc_hours_one).reset_index(drop=True)
tdf['Hours'] = tf
return tdf
def calc_hours_one(tdf): # `tdf` for **one** student; assume sorted by `'Submit'`
def to_hours(e, d=3):
return round(e.total_seconds() / 3600, 3)
dt = (tdf['Submit'] - tdf['Start']).apply(to_hours)
t0 = dt.iloc[0]
tf = dt.diff().fillna(t0)
return tf
def sequence_submissions__waldron(tdf):
def calc_hours(group):
group = group.sort_values('Submit').reset_index(drop=True)
seq = pd.Series([group['Start'].iloc[0], *group['Submit']])
group['Hours'] = seq.diff().dropna()\
.apply(lambda td: td.total_seconds()/3600)\
.round(3)\
.reset_index(drop=True)
return group
return tdf.groupby('Email', as_index=False).apply(calc_hours).reset_index(drop=True)
### END SOLUTION
### demo function call ###
sequence_submissions(demo_tdf_ex7)
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex7
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_7',
'func': sequence_submissions, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': True, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 2*10 ** (-3)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, hours_by_exid
. It is the result of a correct implementation of clean_times_colnames
when run on hours_by_exid
from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
hours_by_exid = cse6040.utils.load_obj_from_file('hours.dill')
# Example:
display(hours_by_exid['exam1'].head())
hours_by_exid['exam1'].info()
merge_gt
¶Given a tibble pointsdf
of earned points (per Ex. 2) and another tibble hoursdf
of sequenced hours-per-problem (per Ex. 7), complete the function,
def merge_gt(pointsdf, hoursdf):
...
so that it merges these into a single tibble formatted as described below.
Inputs:
pointsdf
: A tibble of earned points, as computed by Exercise 2. It should have the columns, 'Email'
, one or more columns named 'problem#'
where #
is an integer string, and 'Earned'
.'hoursdf'
: A tibble of sequenced hours-per-problem, as computed in Exercise 7. It should have the columns, 'Email'
, 'Milestone'
, 'Start'
, 'Submit'
, and 'Hours'
.Your tasks:
pointsdf
so that its 'problem#'
columns become values of a new column named 'Milestone'
and the point values become a new column named 'Points'
. The 'Points'
column should hold floats.hoursdf
on the common 'Email'
and 'Milestone'
key-columns. (Use an inner-join/merge so that only keys that match both inputs are retained.)'Email'
and 'Submit'
in ascending order.Output: Return the result as a new tibble with the columns 'Email'
, 'Start'
, 'Submit'
, 'Milestone'
, 'Points'
, and 'Hours'
. You should not modify either input, pointsdf
or hoursdf
.
### Define demo inputs ###
demo_pointsdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['pointsdf']
demo_hoursdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['hoursdf']
print("* `demo_pointsdf_ex8`:")
display(demo_pointsdf_ex8)
print("* `demo_hoursdf_ex8`:")
display(demo_hoursdf_ex8)
In the above inputs, observe that only 'justin.jernee@gatech.edu'
has data in both the points
and hours
tibbles. Therefore, the merged result will only include that student's data, since we are using an inner merge. For this input, the result is:
Earned | Milestone | Points | Start | Submit | Hours | ||
---|---|---|---|---|---|---|---|
0 | justin.jernee@gatech.edu | 40 | problem2 | 2 | 2018-12-08 10:53:50 | 2018-12-08 15:12:42 | 4.314 |
1 | justin.jernee@gatech.edu | 40 | problem0 | 10 | 2018-12-08 10:53:50 | 2018-12-08 19:21:39 | 4.15 |
2 | justin.jernee@gatech.edu | 40 | problem1 | 10 | 2018-12-08 10:53:50 | 2018-12-08 20:16:38 | 0.916 |
3 | justin.jernee@gatech.edu | 40 | problem5 | 5 | 2018-12-08 10:53:50 | 2018-12-09 13:53:43 | 17.618 |
4 | justin.jernee@gatech.edu | 40 | problem3 | 10 | 2018-12-08 10:53:50 | 2018-12-09 20:06:24 | 6.211 |
5 | justin.jernee@gatech.edu | 40 | problem4 | 3 | 2018-12-08 10:53:50 | 2018-12-09 22:08:52 | 2.042 |
Observe the ordering of the rows. The problems appear in ascending order of submission time, per the sorting requirement of this exercise.
### Exercise 8 solution
def merge_gt(pointsdf, hoursdf):
pass # Can be removed
### BEGIN SOLUTION
df = pointsdf.melt(id_vars=['Email', 'Earned'], var_name='Milestone', value_name='Points')
df = df.merge(hoursdf, on=['Email', 'Milestone'], how='inner')
df = df.sort_values(['Email', 'Submit']).reset_index(drop=True)
return df
### END SOLUTION
### demo function call ###
merge_gt(demo_pointsdf_ex8, demo_hoursdf_ex8)
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex8
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_8',
'func': merge_gt, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'pointsdf':{
'dtype':'df', # data type of param.
'check_modified':True,
},
'hoursdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': True, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, merged_by_exid
. It is the result of a correct implementation of merge_gt
when run on earned_by_exid
and hours_by_exid
from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
merged_by_exid = cse6040.utils.load_obj_from_file('merged.dill')
# Example:
display(merged_by_exid['exam1'].head())
merged_by_exid['exam1'].info()
concat_exams
¶Exercise 8 produces a (merged) dataframe for each exam. But when we have multiple exams, we may want to analyze them all at once.
Suppose you are given a dictionary of exams, df_by_exid
, whose keys are exam IDs (strings) and values are dataframes. Complete the function,
def concat_exams(df_by_exid):
...
to "concatenate" these as explained below.
Input: df_by_exid
. If exid
is a string exam ID, then df_by_exid[exid]
is a dataframe associated with that exam. Assume that all dataframes in df_by_exid
have the same columns.
Output:
'Exam'
whose value records the originating exid
key.Comments/hints:
'Exam'
value is set correctly, the ordering of rows and columns does not matter.### Define demo inputs ###
demo_df_by_exid_ex9 = cse6040.utils.load_obj_from_file('demo_df_by_exid_ex9.dill')
for exid, df in demo_df_by_exid_ex9.items():
print(f"\n*** exid = '{exid}' ***")
display(demo_df_by_exid_ex9[exid])
For the preceding inputs, a valid concatenated output is:
Start | Points | Earned | Submit | Milestone | Hours | Exam | ||
---|---|---|---|---|---|---|---|---|
0 | 2018-12-09 08:55:36 | 2 | 38 | chelsea.dilann@gatech.edu | 2018-12-10 11:17:14 | problem2 | 8.557 | exam3 |
1 | 2018-12-08 12:27:33 | 5.5 | 25.5 | kristopher.erandi@gmail.com | 2018-12-09 16:09:17 | problem0 | 2.831 | exam3 |
2 | 2018-09-30 01:05:14 | 10 | 45 | meredith.rakyla@hotmail.com | 2018-09-30 19:11:33 | problem2 | 3.386 | exam1 |
3 | 2018-09-28 16:21:04 | 10 | 45 | stacy.loneta@gmail.com | 2018-09-28 17:51:52 | problem0 | 1.513 | exam1 |
4 | 2018-11-03 20:05:45 | 2 | 22 | leta.kartier@gmail.com | 2018-11-05 05:16:51 | problem1 | 2.097 | exam2 |
5 | 2018-11-04 20:02:24 | 5 | 16 | christopher.jullianne@gatech.edu | 2018-11-05 10:51:45 | problem0 | 14.822 | exam2 |
6 | 2018-11-03 06:59:28 | 10 | 50 | lola.fysher@gmail.com | 2018-11-03 08:06:09 | problem1 | 1.111 | exam2 |
7 | 2018-11-03 10:19:14 | 10 | 39 | susan.tedrina@gatech.edu | 2018-11-04 18:28:46 | problem3 | 1.337 | exam2 |
### Exercise 9 solution
def concat_exams(df_by_exid):
pass # Can be removed
### BEGIN SOLUTION
from pandas import concat
dfs = []
for exid, df in df_by_exid.items():
dfs.append(df.copy())
dfs[-1]['Exam'] = exid
return concat(dfs, ignore_index=True)
### END SOLUTION
### demo function call ###
concat_exams(demo_df_by_exid_ex9)
The cell below will test your solution for Exercise 9. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex9
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_9',
'func': concat_exams, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df_by_exid':{
'dtype':'dict', # data type of param.
'check_modified':False, # @TODO: FIXME
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, concatdf
. It is the result of a correct implementation of concat_exams
when run on merged_by_exid
from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
concatdf = cse6040.utils.load_obj_from_file('concat.dill')
concatdf
summarize_exams
¶Suppose you are given df
, a concatenated dataframe like the one that produced in Ex. 9. Complete the function,
def summarize_exams(df):
...
so that it determines the mean exam score and mean time to complete the exam for every exam, returning the result as a dataframe (see below).
Input: A dataframe, similar to the output of Exercise 9. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:
'Email'
(string): a student email address'Exam'
(string): the ID of an exam'Milestone'
(string): the problem number'Points'
(float): the points scored on that problem (milestone) for that exam'Hours'
(float): the estimated number of hours spent on that problem for that examYour task: For each student and exam, you need to determine their total points and total time. Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.
Output: Return a pandas dataframe with three columns:
'Exam'
(string): the exam ID'Points'
(float): the total points for that exam averaged over all students'Hours'
(float): the total time in hours for that exam averaged over all studentsComments/hints: These averages you are to report are by exam taken over all students, not over all (student, problem) pairs!
### Define demo inputs ###
demo_df_ex10 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex10
### Exercise 10 solution
def summarize_exams(df):
pass # Can be removed
### BEGIN SOLUTION
df = sum_by_student_exam(df)
df = avg_by_student(df)
return df
def sum_by_student_exam(df):
return df.groupby(['Exam', 'Email'])[['Points', 'Hours']].sum().reset_index()
def avg_by_student(df_stats_by_student_exam):
df = df_stats_by_student_exam
df = df.groupby('Exam')[['Points', 'Hours']].mean().reset_index()
return df
### END SOLUTION
### demo function call ###
summarize_exams(demo_df_ex10)
The cell below will test your solution for Exercise 10. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex10
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_10',
'func': summarize_exams, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, exam_summary
. It is the result of a correct implementation of summarize_exams
when run on concatdf
from the full dataset.
exam_summary = cse6040.utils.load_obj_from_file('exam_summary.dill')
exam_summary
For the case of three exams, the instructor would like to know whether the first two scores and the timing numbers can predict the third exam score. To carry out this analysis, the instructor needs to build a data matrix and wants your help.
form_data_table
¶Suppose you are given the same input as Ex. 10, that is, a concatenated dataframe df
like the one that produced in Ex. 9. Complete the function,
def form_data_table(df):
...
so that it constructs a dataframe whose values will (later) become the entries of a linear-regression data matrix.
Input: The same dataframe as Ex. 10. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:
'Email'
(string): a student email address'Exam'
(string): the ID of an exam'Milestone'
(string): the problem number'Points'
(float): the points scored on that problem (milestone) for that exam'Hours'
(float): the estimated number of hours spent on that problem for that examOutput: Construct and return a new dataframe where each row corresponds with a student, and the columns are as follows:
'Email'
(string): The student email. There will be one row per student in the output.'Points#'
(float): One column of total points scored for each exam present in the input. For example, if df
has data for 'exam2'
and 'exam3'
, then the output should have 'Points2'
and 'Points3'
colums, with the values being the total points earned by the student on the corresponding exam.'Hours#'
(float): Similarly, one column of total exam time for each exam present in the input.As a final step, if there are any NaNs, drop the corresponding rows.
The idea behind this format is that each student will be an "observation" or "sample" for linear regression, and the columns will serve as predictors or features.
Comments/hints: As usual, you should not modify the input dataframe, df
.
### Define demo inputs ###
demo_df_ex11 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex11
### Exercise 11 solution
def form_data_table(df):
pass # Can be removed
### BEGIN SOLUTION
def clean_colname(t):
assert isinstance(t, tuple)
return t[0] + t[1].replace('exam', '')
df = df.pivot_table(index='Email', values=['Points', 'Hours'], columns='Exam', aggfunc='sum')
df = df.reset_index()
df.columns = [clean_colname(c) for c in df.columns.to_flat_index()]
return df.dropna()
### END SOLUTION
### demo function call ###
form_data_table(demo_df_ex11)
The cell below will test your solution for Exercise 10. The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. These should be the same as input_vars
- otherwise the inputs were modified by your solution.returned_output_vars
- Outputs returned by your solution.true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### test_cell_ex11
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
hash_check = dill.load(f)
for fname in ['testers.py', '__init__.py', 'test_utils.py']:
hash_check(f'tester_fw/{fname}', f'resource/asnlib/public/{fname}')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
from tester_fw.testers import Tester
conf = {
'case_file':'tc_11',
'func': form_data_table, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(conf, key=b'dRaQLzH9Dx27yR5MBt5Iy3HBuRaSpg1bXvN8NFqZ8j4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
The cell below creates a precomputed object, data_table
. It is the result of a correct implementation of form_data_table
when run on concatdf
from the full dataset.
data_table = cse6040.utils.load_obj_from_file('data_table.dill')
data_table
If you have made it this far, congratulations on completing the exam. Don't forget to submit!
Postscript: We went through a lot of work in this exam to get the data table. So if we try to regress, say, 'Points3'
from the other columns, how well does that work? Here is a code fragment to check. What does the result say?
# Predictors:
data_cols = data_table[['Points1', 'Hours1', 'Points2', 'Hours2']]
X = data_cols.values
X /= X.max(axis=0) # Normalize columns
# Response:
resp_col = 'Points3'
y = data_table[resp_col].values
y /= y.max()
from sklearn.linear_model import LinearRegression
model_type = LinearRegression(fit_intercept=True, positive=True)
model = model_type.fit(X, y)
print("* Predictors:", data_cols)
print("* Response:", resp_col)
print("* Model coefficients:", model.coef_)
print("* Model intercept:", model.intercept_)
y_pred = model.predict(X)
print(f"* Mean relative error: {abs((y_pred - y)/y).mean()*100:.1f}%")