Version 1.0
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of pandas, Numpy, and linear regression. It has 12 exercises, numbered 0 to 11. There are 22 available points. However, to earn 100%, the threshold is 15 points. (Therefore, once you hit 15 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs ### load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
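For instance, here is one way to peek at a large dataframe piecewise rather than printing it whole (a toy illustration; big_df stands in for whichever debugging variable you are inspecting):
import pandas as pd
big_df = pd.DataFrame({'x': range(10_000)})  # stand-in for a large debugging object
print(big_df.head(10))       # just the first 10 rows
print(big_df.iloc[100:120])  # or an arbitrary chunk of rows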
Exercise point breakdown:
Exercises 0, 3, and 6 involve reading only (no coding), so they are "free" points. However, you still need to run the associated test cells and submit to get those points.
Final reminders:
An instructor for another course has asked you to help analyze some of their exam data. The exams were completed electronically, in a format similar to our class's, but with some twists:
- There are several exams, identified as 'exam1', 'exam2', and so on.
- Exams were timed, but with different time limits for each exam.

The dataset itself has two main parts: grading data and submission-timing data. Each exam has its own grading and timing data, as you'll see.
Your overall task: You will help clean up and analyze this data, culminating in an assessment, based on linear regression, to see whether or not the last exam score can be predicted from the scores and timing information of earlier exams.
To get started, run the following code cells. Take note of preloaded modules, as it's likely you'll want to use them.
### Global Imports
###
### AUTOGRADER TEST - DO NOT REMOVE
###
# Modules you may find useful:
import pandas as pd
import numpy as np
# Some extra functions that this notebook needs:
import cse6040.utils
from pprint import pprint # Pretty-printer
# Load the dataset:
print("\n==> Loading the dataset...")
from cse6040.examdata import load_dataset
grades_by_exid, times_by_exid = load_dataset()
print("\n==> Done executing the setup and ready for take-off!")
Exercise 0: The grades dataframe

This exercise involves looking at the grades data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

The global variable grades_by_exid is a dictionary of pandas dataframes. Each key is an exam ID (e.g., 'exam2'), and each value is a dataframe holding the grading data for that exam. Let's take a look at an example: the grades for the exam with exam ID 'exam1'.
list(grades_by_exid.keys())
Evidently, there are three exams in this dataset. However, the code you write later should not assume three exams unless otherwise specified.
grades_by_exid['exam1']
Note: The data has been anonymized with fake email addresses. Any resemblance to actual email addresses is purely coincidental.
Observations:
- Each value of grades_by_exid holds the grading data for one exam, e.g., 'exam1' in the example above. There are similar dataframes for 'exam2' and 'exam3'.
- When a student did not submit a problem, it is recorded as eight hyphens, '--------'. For example, see row 1, where juan.quay@aol.com did not submit problem4.

### test_cell_ex0 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
Exercise 1: remove_invalid_grades

Given a grades dataframe, gdf, complete the function,

def remove_invalid_grades(gdf):
    ...

so that it returns a clean copy of gdf.
Input: A grades dataframe, gdf. It will have a column named 'Email' and one or more columns named 'problem#', where # is an integer.
Your task: Copy this dataframe and clean it as follows:
- Any unsubmitted problem, recorded as '--------', should be replaced with a not-a-number (NaN) value, i.e., a Numpy np.nan.
- Any row in which the student submitted no problems at all (i.e., every problem value is NaN after the replacement) should be removed.
- Each problem column should have a .dtype of float.

Output: Return a clean tibble copy, without modifying the input gdf. It should contain exactly the same columns as the input, even if the result is an empty dataframe, and the .dtypes should be as specified above.
Caveat(s)/comment(s)/hint(s):
- The .replace() and .filter() methods may be helpful (but are not required).
- The input might have problem columns named 'problem0' through 'problem4', as shown in Exercise 0, but it might also only have, say, two columns named 'problem1' and 'problem3', or it might have four problem columns named 'problem0', 'problem1', 'problem25', and 'problem234'. Your code should robustly handle these cases.
- The dataframe's .dropna() method may be helpful.

### Define demo inputs ###
demo_gdf_ex1 = grades_by_exid['exam1'].loc[[0, 1, 2, 683, 3, 957, 4, 1546]].reset_index(drop=True)
demo_gdf_ex1
Consider the demo input, demo_gdf_ex1
, shown above. A correct output is:
 | Email | problem0 | problem1 | problem2 | problem3 | problem4
---|---|---|---|---|---|---
0 | sammie.jaydaa@gmail.com | 7 | 10 | 10 | 10 | 10
1 | juan.quay@aol.com | 10 | 10 | 6.5 | 0 | nan
2 | gino.branston@gatech.edu | 10 | 5 | 10 | 10 | 10
3 | danny.jameriah@gmail.com | 7 | nan | nan | nan | nan
4 | robert.gioanna@gatech.edu | 10 | 10 | 10 | 10 | 10
5 | adam.evemarie@gmail.com | 10 | 10 | 10 | 10 | 10
Observe that rows 5 and 7 from the input are removed entirely since all of their problem values are missing. By contrast, rows 1 and 3 from the input have some non-NaN values and thus are retained.
In addition, the output types (inspected via .info()
) should match the following:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Email 6 non-null object
1 problem0 6 non-null float64
2 problem1 5 non-null float64
3 problem2 5 non-null float64
4 problem3 5 non-null float64
5 problem4 4 non-null float64
dtypes: float64(5), object(1)
### Exercise 1 solution
def remove_invalid_grades(gdf):
'''
INPUT:
A grades dataframe called gdf. Columns are 'Email' and one or more columns named 'problem#' where # is an integer.
GOAL:
Copy this dataframe and clean it as follows:
-Replace 8 hyphens '--------' with np.nan
-Remove rows where no problems were submitted
-Convert problem columns to be float
Return this cleaned data frame.
STRATEGY:
1. Make copy of input data frame. Let's call this 'gdf_final'
2. Get a list of all of the columns that start with 'problem'
3. Replace any values in those problem columns that contain 8 hyphens with np.nan
4. Remove any rows where the problem columns are all np.nan now
5. Convert problem columns to be float
6. Reset index and return this data frame
'''
# SOLUTION:
gdf_final = gdf.copy()
# Most robust (safest) way is to use Filter + Regex, specify digit after problem:
problem_columns = list(gdf_final.filter(regex=r'^problem\d+', axis=1))
# Alternative way: find all columns that start with 'problem':
# https://stackoverflow.com/questions/27275236/how-to-select-all-columns-whose-names-start-with-x-in-a-pandas-dataframe
#problem_columns = [col for col in gdf_final if col.startswith('problem')]
# Super Risky Way (may or may not work depending on column input/order):
#problem_columns = list(gdf_final.columns[1:])
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
gdf_final[problem_columns] = gdf_final[problem_columns].replace('--------', np.nan)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
gdf_final = gdf_final.dropna(how='all', subset=problem_columns)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
gdf_final[problem_columns] = gdf_final[problem_columns].astype('float64')
return gdf_final.reset_index(drop=True)
### demo function call ###
demo_result_ex1 = remove_invalid_grades(demo_gdf_ex1)
display(demo_result_ex1)
demo_result_ex1.info()
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex1
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_1',
'func': remove_invalid_grades, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'gdf':{
'dtype': 'df', # data type of param.
'check_modified': True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, valid_grades_by_exid. It is the result of a correct implementation of remove_invalid_grades when run on grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
valid_grades_by_exid = cse6040.utils.load_obj_from_file('valid_grades.dill')
# Example:
display(valid_grades_by_exid['exam1'].iloc[:4])
valid_grades_by_exid['exam1'].info()
Exercise 2: calc_earned_points

Suppose you have a grades dataframe, valid_gdf, having only valid rows as calculated in Exercise 1. Complete the function,

def calc_earned_points(valid_gdf):
    ...

so that it returns a new copy of valid_gdf with an additional column holding the total points earned.

Input: A dataframe valid_gdf with only valid rows, per Exercise 1.

Your task: Create a copy and sum the points earned by each student.

Output: Return a new tibble with the same columns as valid_gdf plus an additional column, 'Earned', of .dtype float. This column should contain the student's total score, that is, the sum of points on their problems. Any NaN values should be ignored. You should not modify the input valid_gdf.
### Define demo inputs ###
demo_valid_gdf_ex2 = valid_grades_by_exid['exam1'].sample(6, random_state=3_456_789_012).reset_index(drop=True)
demo_valid_gdf_ex2
The demo included in the solution cell below should display the following output:
 | Email | problem0 | problem1 | problem2 | problem3 | problem4 | Earned
---|---|---|---|---|---|---|---
0 | rene.amritpal@gmail.com | 10 | 10 | 10 | 10 | 10 | 50
1 | albert.averley@gmail.com | 10 | 2 | 8 | 10 | 10 | 40
2 | mary.yahshua@gatech.edu | 7 | 2 | 6.5 | 5 | nan | 20.5
3 | jillian.myler@gmail.com | 10 | 10 | 10 | 10 | 1 | 41
4 | raina.sheehan@gmail.com | 10 | 10 | 10 | 5 | 10 | 45
5 | allison.link@yahoo.com | 7 | 2 | nan | nan | nan | 9
Note: Although the values appear to include integers, in fact, the .dtype of the columns should be float.
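One pandas fact that makes this exercise straightforward: row-wise sums skip NaNs by default. A tiny self-contained check on toy data (not the exam data):
import pandas as pd
import numpy as np

toy = pd.DataFrame({'problem0': [1.0, np.nan], 'problem1': [2.0, 3.0]})
print(toy.sum(axis=1))  # row sums: 3.0 and 3.0 -- the NaN is simply ignored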
### Exercise 2 solution ###
def calc_earned_points(valid_gdf):
'''
INPUT:
A dataframe 'valid_gdf'
GOAL:
Create a copy and sum the points earned by each student.
Return new data frame with same columns + Earned column.
Earned column should be a float. Ignore NaNs.
STRATEGY:
1. Make a copy of the input data frame. Let's call this 'valid_gdf_final'
2. Sum the problem columns and create a new column called 'Earned'
3. Convert 'Earned' column to be float
4. Return the data frame
'''
# SOLUTION:
valid_gdf_final = valid_gdf.copy()
# Using same method as in Exercise 1 to identify problem columns
problem_columns = list(valid_gdf_final.filter(regex=r'^problem\d+', axis=1))
valid_gdf_final['Earned'] = valid_gdf_final[problem_columns].sum(axis=1)
valid_gdf_final['Earned'] = valid_gdf_final['Earned'].astype('float64')
return valid_gdf_final
### demo function call ###
calc_earned_points(demo_valid_gdf_ex2)
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex2 ###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_2',
'func': calc_earned_points, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'valid_gdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, earned_by_exid. It is the result of a correct implementation of calc_earned_points when run on valid_grades_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
earned_by_exid = cse6040.utils.load_obj_from_file('earned_grades.dill')
# Example:
display(earned_by_exid['exam1'].iloc[:4])
earned_by_exid['exam1'].info()
Exercise 3: The timings dataset

This exercise involves looking at the timing data; you don't need to write any code. However, you do need to run the test cell and submit to get the free point(s).

Similar to the grades dataset (see Ex. 0), each exam has a timings dataset. Here's what it looks like for 'exam3':
demo_times3_ex3 = times_by_exid['exam3']
demo_times3_ex3
Observations:
- The 'start date-time' and 'submit date-time' columns hold timestamps, and their .dtypes are special pandas Timestamp objects. The mini-tutorial below shows you how they work.
- A 'start date-time' value is when the student began the exam. It will have the same value for all rows of a given student.
- A 'submit date-time' value is the timestamp of the student's last submission for that problem. That is, only one submission is recorded per student and problem.

Implications: One consequence of the data we have is that we cannot really know how the student navigated the exam: we know when they started, but if they jumped around or submitted a problem multiple times, the only information we have is the last submission for each problem. Nevertheless, we will use the information we have to guess at their timeline, after some clean up.
Timestamps: The printed timestamps have the following form:
2018-12-07 18:35:23
That corresponds to December 7, 2018, at 6:35 pm (and 23 seconds).
Luckily, you don't need to parse timestamps! These objects make it easy to sort and calculate time differences, which is what we'll need. For instance, we can check whether one date is larger than another using the usual comparison operators, like > or <=:
(demo_times3_ex3['submit date-time'] > demo_times3_ex3['start date-time']).head()
And we can subtract times using the usual - operator, as we would for integers or floats:
(demo_times3_ex3['submit date-time'] - demo_times3_ex3['start date-time']).head()
The difference is stored as a timedelta64 object. The way to read printed item 0 above is "20 hours, 45 minutes, and 19 seconds."
Time-delta objects can be converted into a regular float representing the number of seconds using a method called .total_seconds():
(demo_times3_ex3['submit date-time'].iloc[0] - demo_times3_ex3['start date-time'].iloc[0]).total_seconds()
We will need this transformation to simplify some calculations later on, so tuck this fact away.
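The same conversion works on whole columns. Below is a minimal sketch (with made-up timestamps, not the exam data) of two equivalent ways to turn a timedelta Series into float hours; the second form reappears in a solution later on:
import pandas as pd

start = pd.Series(pd.to_datetime(['2018-12-07 18:35:23', '2018-12-07 19:00:00']))
stop = pd.Series(pd.to_datetime(['2018-12-07 20:35:23', '2018-12-07 19:30:00']))
delta = stop - start                    # a timedelta64 Series
print(delta.dt.total_seconds() / 3600)  # hours via total seconds: 2.0, 0.5
print(delta / pd.Timedelta(hours=1))    # hours via timedelta division: same result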
Some messiness: The data has some "impurities." In particular, check out the data for 'kenneth.bryland@gatech.edu':
demo_times3_ex3_messy = demo_times3_ex3.loc[[206, 1111, 1780, 2511, 3375]]
demo_times3_ex3_messy
Observe that their 'problem0' and 'problem1' submission times occur before their start times, which we confirm next:
demo_times3_ex3_messy['submit date-time'] > demo_times3_ex3_messy['start date-time']
While it turns out there is a reason for this, we will want to omit such cases to simplify our analysis.
### test_cell_ex3 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
Exercise 4: remove_causality_violators

Suppose you are given a timings dataframe, tdf, per Exercise 3. Complete the function,

def remove_causality_violators(tdf):
    ...

so that it returns a new, cleaned copy that omits records for any student who has timestamps that "violate causality." A causality violation means that their submission time is strictly less than their start time.

Input: A timings dataframe, tdf, with the columns 'email', 'milestone', 'start date-time', and 'submit date-time'.

Your task: Identify all students who have any causality violations and omit their rows from the output dataframe.

Output: Return a new tibble that includes only rows from tdf for students whose problem-records do not violate causality. The output column .dtypes should match those of the input. You should not modify the input, tdf.
Comment(s): If a student has even one problem that violates causality, you should omit all of their rows. That is, we will consider their entire exam data invalid for the purpose of our subsequent analysis.
### Define demo inputs ###
demo_tdf_ex4 = times_by_exid['exam3'].loc[[206, 279, 439, 1111, 1344, 1780, 2511, 3224, 3375]].reset_index(drop=True)
demo_tdf_ex4
The demo included in the solution cell below should display the following output:
 | email | milestone | start date-time | submit date-time
---|---|---|---|---
0 | daniel.tashera@gmail.com | problem0 | 2018-12-09 01:12:56 | 2018-12-09 02:29:33
1 | barbara.wilhemina@gatech.edu | problem0 | 2018-12-10 02:43:28 | 2018-12-10 06:21:30
2 | barbara.wilhemina@gatech.edu | problem2 | 2018-12-10 02:43:28 | 2018-12-10 09:09:43
3 | barbara.wilhemina@gatech.edu | problem5 | 2018-12-10 02:43:28 | 2018-12-10 09:04:35
Observe that kenneth.bryland@gatech.edu is omitted from this output because their 'problem0' and 'problem1' records violate causality. (Even though their remaining problems obey causality, we still drop them because of the problematic rows.)
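Before reading the reference solution below, note that there is more than one reasonable approach. For instance, here is a sketch on toy data that uses a per-group "any" reduction to flag violators (the reference solution uses an equivalent isin-based method instead):
import pandas as pd

tdf = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'milestone': ['problem0', 'problem1', 'problem0'],
    'start date-time': pd.to_datetime(['2018-12-09 01:00:00'] * 3),
    'submit date-time': pd.to_datetime(
        ['2018-12-09 00:30:00',   # violates causality!
         '2018-12-09 02:00:00',
         '2018-12-09 03:00:00']),
})
violates = tdf['submit date-time'] < tdf['start date-time']
# True for every row of any student with at least one violation:
student_violates = violates.groupby(tdf['email']).transform('any')
print(tdf[~student_violates].reset_index(drop=True))  # only b@x.com remains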
### Exercise 4 solution ###
def remove_causality_violators(tdf):
'''
INPUT:
A timings dataframe 'tdf' with the columns 'email', 'milestone', 'start date-time', and 'submit date-time'
GOAL:
Find rows where submit date_time < start date-time.
Return a new dataframe with these bad rows omitted.
Omit all rows for that student if one or more bad rows exist.
STRATEGY:
1. Make a copy of input data frame. Let's call this 'tdf_final'
2. Find rows where 'submit date-time' < 'start date-time'. Let's call these 'tdf_bad_rows'.
3. Get a list of all of the student emails in the bad rows identified in Step 2. Let's call this 'bad_emails_list'
4. Grab the rows from tdf_final where the email address is NOT in 'bad_emails_list'
5. Reset index and return tdf_final
'''
# SOLUTION:
tdf_final = tdf.copy()
tdf_bad_rows = tdf_final[tdf_final['submit date-time'] < tdf_final['start date-time']]
bad_emails_list = list(tdf_bad_rows['email'])
tdf_final = tdf_final[~tdf_final['email'].isin(bad_emails_list)]
return tdf_final.reset_index(drop=True)
### demo function call ###
remove_causality_violators(demo_tdf_ex4)
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex4
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_4',
'func': remove_causality_violators, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, valid_times_by_exid. It is the result of a correct implementation of remove_causality_violators when run on times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
valid_times_by_exid = cse6040.utils.load_obj_from_file('valid_times.dill')
# Example:
assert (valid_times_by_exid['exam3']['submit date-time'] >= valid_times_by_exid['exam3']['start date-time']).all()
display(valid_times_by_exid['exam1'].iloc[:4])
valid_times_by_exid['exam1'].info()
Exercise 5: clean_times_colnames

Let's tidy up the column names for the timings data. Complete the function,

def clean_times_colnames(tdf):
    ...

to return a new tibble from tdf with columns renamed as follows:
- 'email' => 'Email'
- 'milestone' => 'Milestone'
- 'start date-time' => 'Start'
- 'submit date-time' => 'Submit'

Element types for the columns (.dtype) should be preserved, and the input dataframe should not be modified.
### define demo inputs ###
demo_tdf_ex5 = valid_times_by_exid['exam2'].sample(5, random_state=567_890_123, replace=False).reset_index(drop=True)
demo_tdf_ex5
The demo included in the solution cell below should display the following output:
 | Email | Milestone | Start | Submit
---|---|---|---|---
0 | jazlyn.florin@gmail.com | problem4 | 2018-11-04 12:45:10 | 2018-11-05 20:50:41
1 | christian.miyisha@yahoo.com | problem2 | 2018-11-03 07:38:00 | 2018-11-04 12:21:12
2 | taryn.raneshia@yahoo.co.id | problem0 | 2018-11-04 08:25:24 | 2018-11-05 17:59:02
3 | joseph.sherburne@gmail.com | problem1 | 2018-11-02 16:45:06 | 2018-11-03 01:47:58
4 | thomas.dyanni@gmail.com | problem3 | 2018-11-03 14:20:56 | 2018-11-04 23:31:18
### Exercise 5 solution
def clean_times_colnames(tdf):
'''
INPUT:
a data frame 'tdf'
GOAL:
Copy data frame and rename columns from:
'email' => 'Email'
'milestone' => 'Milestone'
'start date-time' => 'Start'
'submit date-time' => 'Submit'
STRATEGY:
1. Make a copy of input data frame. Let's call this 'tdf_final'
2. Rename columns as instructed
3. Return 'tdf_final'
'''
#SOLUTION:
tdf_final = tdf.copy()
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
tdf_final = tdf_final.rename(columns={'email':'Email', 'milestone':'Milestone', 'start date-time':'Start',
'submit date-time':'Submit'})
return tdf_final
### demo function call ###
clean_times_colnames(demo_tdf_ex5)
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex5
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_5',
'func': clean_times_colnames, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, clean_times_by_exid. It is the result of a correct implementation of clean_times_colnames when run on valid_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
clean_times_by_exid = cse6040.utils.load_obj_from_file('clean_times.dill')
# Example:
display(clean_times_by_exid['exam1'].head())
clean_times_by_exid['exam1'].info()
Exercise 6: Submission sequences

Our next analysis will estimate the test-takers' behavior by analyzing their submission times.
For example, consider the submission times for 'sherry.ayleigh@aol.com', stored in the dataframe demo_ex6_tdf below:
demo_ex6_tdf = clean_times_by_exid['exam3'].loc[[34, 593, 1624, 2189, 2344, 3122]]
demo_ex6_tdf
Based on the submission times, ordering the rows by the 'Submit' timestamp would yield a sensible timeline of submission events:
 | Email | Milestone | Start | Submit
---|---|---|---|---
34 | sherry.ayleigh@aol.com | problem0 | 2018-12-06 08:32:21 | 2018-12-06 09:07:16
593 | sherry.ayleigh@aol.com | problem1 | 2018-12-06 08:32:21 | 2018-12-06 10:04:10
1624 | sherry.ayleigh@aol.com | problem2 | 2018-12-06 08:32:21 | 2018-12-06 11:30:28
3122 | sherry.ayleigh@aol.com | problem5 | 2018-12-06 08:32:21 | 2018-12-06 13:34:13
2344 | sherry.ayleigh@aol.com | problem4 | 2018-12-06 08:32:21 | 2018-12-06 17:53:48
2189 | sherry.ayleigh@aol.com | problem3 | 2018-12-06 08:32:21 | 2018-12-06 20:19:21
We refer to this ordering of submission times as the submission sequence. It is our best guess as to how they moved through the exam if we assume they did one problem at a time. From this submission sequence for this student, we might conclude the following.
First, the student started the exam at 2018-12-06 08:32:21 and submitted 'problem0' at 2018-12-06 09:07:16. That would be:
demo_ex6_tdf['Submit'].iloc[0] - demo_ex6_tdf['Start'].iloc[0]
or 34 minutes and 55 seconds, which is also 2,095 seconds = (2095/3600) ~ 0.582 hours, rounding to three decimal digits. In other words, we will assume that the student started the exam and then spent 0.582 hours on Problem 0.
Next, the student submitted 'problem1' at 2018-12-06 10:04:10. Relative to the previous submission time, that is 56 minutes and 54 seconds, or 3,414 seconds = (3414/3600) ~ 0.948 hours, again rounded to three digits. So we might guess that the student spent that much time on Problem 1.
Continuing, the sequence of problems solved was Problem 0, 1, 2, 5, 4, and finally, 3. The approximate numbers of hours spent on each are 0.582, 0.948, 1.438, 2.062, 4.326, and 2.426 hours, respectively.
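If you'd like to verify the arithmetic above, here is a small self-contained check that restates the narrative's timestamps:
import pandas as pd

start = pd.Timestamp('2018-12-06 08:32:21')
submit0 = pd.Timestamp('2018-12-06 09:07:16')
submit1 = pd.Timestamp('2018-12-06 10:04:10')
print((submit0 - start).total_seconds())                     # 2095.0 seconds
print(round((submit0 - start).total_seconds() / 3600, 3))    # ~0.582 hours on Problem 0
print(round((submit1 - submit0).total_seconds() / 3600, 3))  # ~0.948 hours on Problem 1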
### test_cell_ex6 ###
print('This exercise is free. To get the points, please run this cell _and_ submit.')
Exercise 7: sequence_submissions

Suppose you are given a (clean) timings tibble, tdf, such as computed in Exercise 5. Complete the function,

def sequence_submissions(tdf):
    ...

so that it sequences the submissions for each student and returns a new tibble matching the format described below.

Input: A clean timings tibble, tdf, with the columns 'Email', 'Milestone', 'Start', and 'Submit', per Ex. 5.

Your task: For each student, order their submissions and determine the amount of time spent on each problem, as illustrated in Ex. 6.

Output: Without modifying the input tibble, return a new tibble with the following columns:
- 'Email', 'Milestone', 'Start', and 'Submit', which keep the same values as given in tdf.
- 'Hours', the time in hours (a floating-point value) rounded to 3 decimal digits, corresponding to the estimated time spent on each problem.

That is, each row of the output will indicate the estimated time spent by the student on the given problem based on the estimated timeline.

Additional requirements: The rows of this new tibble should be sorted by 'Email' and 'Submit' timestamp in ascending order.
### Define demo inputs ###
demo_tdf_ex7 = cse6040.utils.load_df_from_file('demo_tdf_ex7.df')
demo_tdf_ex7
In the preceding demo input, there are three students: 'sherry.ayleigh@aol.com', 'chelsea.jaretzi@utexas.edu', and 'jessica.rinda@gatech.edu'. Observe that
- 'sherry.ayleigh@aol.com' submitted Problems 0-5 inclusive.
- 'chelsea.jaretzi@utexas.edu' submitted Problems 0, 1, 2, 3, and 5. (So they skipped Problem 4.)
- 'jessica.rinda@gatech.edu' submitted Problems 1, 2, and 5. (So they skipped Problems 0, 3, and 4.)

A correct implementation of sequence_submissions on this input would produce:
 | Email | Milestone | Start | Submit | Hours
---|---|---|---|---|---
0 | chelsea.jaretzi@utexas.edu | problem1 | 2018-12-11 09:50:18 | 2018-12-12 02:34:35 | 16.738
1 | chelsea.jaretzi@utexas.edu | problem0 | 2018-12-11 09:50:18 | 2018-12-12 02:40:16 | 0.095
2 | chelsea.jaretzi@utexas.edu | problem5 | 2018-12-11 09:50:18 | 2018-12-12 03:45:49 | 1.092
3 | chelsea.jaretzi@utexas.edu | problem3 | 2018-12-11 09:50:18 | 2018-12-12 05:49:20 | 2.059
4 | chelsea.jaretzi@utexas.edu | problem2 | 2018-12-11 09:50:18 | 2018-12-12 06:17:25 | 0.468
5 | jessica.rinda@gatech.edu | problem1 | 2018-12-07 12:05:41 | 2018-12-07 18:04:19 | 5.977
6 | jessica.rinda@gatech.edu | problem5 | 2018-12-07 12:05:41 | 2018-12-07 18:11:47 | 0.125
7 | jessica.rinda@gatech.edu | problem2 | 2018-12-07 12:05:41 | 2018-12-07 18:21:51 | 0.167
8 | sherry.ayleigh@aol.com | problem0 | 2018-12-06 08:32:21 | 2018-12-06 09:07:16 | 0.582
9 | sherry.ayleigh@aol.com | problem1 | 2018-12-06 08:32:21 | 2018-12-06 10:04:10 | 0.948
10 | sherry.ayleigh@aol.com | problem2 | 2018-12-06 08:32:21 | 2018-12-06 11:30:28 | 1.439
11 | sherry.ayleigh@aol.com | problem5 | 2018-12-06 08:32:21 | 2018-12-06 13:34:13 | 2.062
12 | sherry.ayleigh@aol.com | problem4 | 2018-12-06 08:32:21 | 2018-12-06 17:53:48 | 4.326
13 | sherry.ayleigh@aol.com | problem3 | 2018-12-06 08:32:21 | 2018-12-06 20:19:21 | 2.426
Observe the ordering by 'Email' and 'Submit' in ascending order. Also observe that the values in 'Hours' are rounded to 3 decimal digits and correspond to differences in submission times (and submission-to-start time for the first entry).
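The trickiest step is differencing submission times within each student's group. pandas' per-group shift handles this; here is a toy demonstration with made-up values (the same pattern appears in the reference solution below):
import pandas as pd

toy = pd.DataFrame({'Email': ['a', 'a', 'a', 'b', 'b'],
                    'HoursFromStart': [1.0, 3.0, 6.0, 2.0, 5.0]})
prev = toy.groupby('Email')['HoursFromStart'].shift(periods=1, fill_value=0)
toy['Hours'] = toy['HoursFromStart'] - prev
print(toy)  # per-problem hours: 1, 2, 3 for 'a' and 2, 3 for 'b'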
### Exercise 7 solution
def sequence_submissions(tdf):
'''
INPUT:
    A data frame 'tdf' with the columns 'Email', 'Milestone', 'Start', and 'Submit'
GOAL:
-Do not modify input.
-Sort by 'Email' and 'Submit' timestamp in ascending order.
-Return new data frame with columns 'Email', 'Milestone', 'Start', and 'Submit' with same values as given in tdf.
-Add new column 'Hours', the time in hours (a floating-point value) rounded to 3 decimal digits,
corresponding to the estimated time spent on each problem.
STRATEGY (not using groupby + apply + helper function):
1. Make a copy of the input data frame. Let's call this 'df'
2. Sort by 'Email' and 'Submit' in ascending order.
3. Reset index.
4. Create a new column that holds the difference between the Submit time and the Start time.
Let's call this 'Hours from Start'
5. Convert this column to hours (use hint above or something you find online)
6. Now let's create a new column that's the same as 'Hours from Start' but shifted down one row for each GROUP
(so we should group by Email). Let's call this 'Shifted Hours'
7. Now calculate 'Hours from Start' - 'Shifted Hours'. Make this a new column called 'Hours'
8. Round the 'Hours' column to 3 decimal places
9. Drop the columns 'Hours from Start' and 'Shifted Hours'
10. Return the data frame
'''
# SOLUTION:
df = tdf.copy()
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
df = df.sort_values(by=['Email','Submit'])
df = df.reset_index(drop=True)
df['Hours from Start'] = df['Submit'] - df['Start']
# https://www.statology.org/pandas-convert-timedelta-to-int/
df['Hours from Start'] = df['Hours from Start'] / pd.Timedelta(hours=1)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
g = df.groupby('Email')
df['Shifted Hours'] = g['Hours from Start'].shift(periods=1, fill_value=0)
df['Hours'] = df['Hours from Start'] - df['Shifted Hours']
df['Hours'] = round(df['Hours'], 3)
df = df.drop(columns=['Hours from Start', 'Shifted Hours'])
return df
### demo function call ###
sequence_submissions(demo_tdf_ex7)
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex7
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_7',
'func': sequence_submissions, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'tdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': True, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 2*10 ** (-3)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, hours_by_exid. It is the result of a correct implementation of sequence_submissions when run on clean_times_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
hours_by_exid = cse6040.utils.load_obj_from_file('hours.dill')
# Example:
display(hours_by_exid['exam1'].head())
hours_by_exid['exam1'].info()
Exercise 8: merge_gt

Given a tibble pointsdf of earned points (per Ex. 2) and another tibble hoursdf of sequenced hours-per-problem (per Ex. 7), complete the function,

def merge_gt(pointsdf, hoursdf):
    ...

so that it merges these into a single tibble formatted as described below.

Inputs:
- pointsdf: A tibble of earned points, as computed by Exercise 2. It should have the columns 'Email', one or more columns named 'problem#' where # is an integer string, and 'Earned'.
- hoursdf: A tibble of sequenced hours-per-problem, as computed in Exercise 7. It should have the columns 'Email', 'Milestone', 'Start', 'Submit', and 'Hours'.

Your tasks:
- Melt pointsdf so that its 'problem#' columns become values of a new column named 'Milestone' and the point values become a new column named 'Points'. The 'Points' column should hold floats.
- Merge this result with hoursdf on the common 'Email' and 'Milestone' key-columns. (Use an inner-join/merge so that only keys that match in both inputs are retained.)
- Sort the rows by 'Email' and 'Submit' in ascending order.

Output: Return the result as a new tibble with the columns 'Email', 'Earned', 'Start', 'Submit', 'Milestone', 'Points', and 'Hours'. You should not modify either input, pointsdf or hoursdf.
### Define demo inputs ###
demo_pointsdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['pointsdf']
demo_hoursdf_ex8 = cse6040.utils.load_obj_from_file('demo_xdf_ex8.dill')['hoursdf']
print("* `demo_pointsdf_ex8`:")
display(demo_pointsdf_ex8)
print("* `demo_hoursdf_ex8`:")
display(demo_hoursdf_ex8)
In the above inputs, observe that only 'justin.jernee@gatech.edu' has data in both the points and hours tibbles. Therefore, the merged result will only include that student's data, since we are using an inner merge. For this input, the result is:
 | Email | Earned | Milestone | Points | Start | Submit | Hours
---|---|---|---|---|---|---|---
0 | justin.jernee@gatech.edu | 40 | problem2 | 2 | 2018-12-08 10:53:50 | 2018-12-08 15:12:42 | 4.314
1 | justin.jernee@gatech.edu | 40 | problem0 | 10 | 2018-12-08 10:53:50 | 2018-12-08 19:21:39 | 4.15
2 | justin.jernee@gatech.edu | 40 | problem1 | 10 | 2018-12-08 10:53:50 | 2018-12-08 20:16:38 | 0.916
3 | justin.jernee@gatech.edu | 40 | problem5 | 5 | 2018-12-08 10:53:50 | 2018-12-09 13:53:43 | 17.618
4 | justin.jernee@gatech.edu | 40 | problem3 | 10 | 2018-12-08 10:53:50 | 2018-12-09 20:06:24 | 6.211
5 | justin.jernee@gatech.edu | 40 | problem4 | 3 | 2018-12-08 10:53:50 | 2018-12-09 22:08:52 | 2.042
Observe the ordering of the rows. The problems appear in ascending order of submission time, per the sorting requirement of this exercise.
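Note that the built-in pd.melt handles the reshaping step directly. Here is a minimal sketch on toy data (the reference solution below also demonstrates a hand-rolled alternative):
import pandas as pd

toy = pd.DataFrame({'Email': ['a@x.com'], 'Earned': [12.0],
                    'problem0': [10.0], 'problem1': [2.0]})
melted = pd.melt(toy, id_vars=['Email', 'Earned'],
                 var_name='Milestone', value_name='Points')
print(melted)  # one row per (Email, problem), with 'Milestone' and 'Points' columns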
### Exercise 8 solution
def merge_gt(pointsdf, hoursdf):
'''
INPUT:
1. pointsdf: has columns 'Email', 'problem#' columns where # is an integer string, and 'Earned'
2. hoursdf: has columns 'Email', 'Milestone', 'Start', 'Submit', and 'Hours'
GOAL:
Melt pointdf so that the 'problem#' columns become rows (in a column called 'Milestone'),
and the points become floats in a column called 'Points'.
Inner merge this with hoursdf on 'Email' and 'Milestone'
Final data frame should have columns: 'Email', 'Earned', 'Milestone', 'Points', 'Start', 'Submit', and 'Hours'
STRATEGY:
1. Melt pointsdf. Keep 'Email' and 'Earned' as is, all other columns (aka the problem ones) get melted.
Problem column should be named 'Milestone', and point values should be called 'Points'.
Let's call this new data frame melted_df.
2. Inner merge melted_df with hoursdf on 'Email' and 'Milestone'. Let's call this new data frame merged_df.
3. Sort merged_df by 'Email' and 'Submit' columns in ascending order
4. Reset index
5. Return this data frame
'''
# SOLUTION:
# Easiest Way to Melt:
# https://pandas.pydata.org/docs/reference/api/pandas.melt.html
#melted_df = pd.melt(pointsdf, id_vars=['Email','Earned'], var_name='Milestone', value_name='Points')
# Alternative Way to Melt using NB7 Melt Function (Changed to be helper function per Office Hours feedback):
# We know our keep_vars, not our col_vars, so reverse difference line in the function below and input parameter to be
# keep_vars instead of col_vars
def melt(df, keep_vars, key, value):
col_vals = df.columns.difference(keep_vars)
melted_sections = []
for c in col_vals:
melted_c = df[keep_vars].copy()
melted_c[key] = c
melted_c[value] = df[c]
melted_sections.append(melted_c)
melted = pd.concat(melted_sections)
return melted
melted_df = melt(pointsdf, ['Email', 'Earned'], 'Milestone', 'Points')
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
merged_df = melted_df.merge(hoursdf, how='inner', on=['Email', 'Milestone'])
sorted_df = merged_df.sort_values(by=['Email','Submit'])
return sorted_df.reset_index(drop=True)
### demo function call ###
merge_gt(demo_pointsdf_ex8, demo_hoursdf_ex8)
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex8
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_8',
'func': merge_gt, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'pointsdf':{
'dtype':'df', # data type of param.
'check_modified':True,
},
'hoursdf':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': True, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, merged_by_exid. It is the result of a correct implementation of merge_gt when run on earned_by_exid and hours_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
merged_by_exid = cse6040.utils.load_obj_from_file('merged.dill')
# Example:
display(merged_by_exid['exam1'].head())
merged_by_exid['exam1'].info()
Exercise 9: concat_exams

Exercise 8 produces a (merged) dataframe for each exam. But when we have multiple exams, we may want to analyze them all at once.

Suppose you are given a dictionary of exams, df_by_exid, whose keys are exam IDs (strings) and values are dataframes. Complete the function,

def concat_exams(df_by_exid):
    ...

to "concatenate" these as explained below.

Input: df_by_exid. If exid is a string exam ID, then df_by_exid[exid] is a dataframe associated with that exam. Assume that all dataframes in df_by_exid have the same columns.

Output: Return a new tibble formed by "stacking" these dataframes one on top of the other, with an additional column named 'Exam' whose value records the originating exid key.

Comments/hints: As long as each row's 'Exam' value is set correctly, the ordering of rows and columns does not matter.

### Define demo inputs ###
demo_df_by_exid_ex9 = cse6040.utils.load_obj_from_file('demo_df_by_exid_ex9.dill')
for exid, df in demo_df_by_exid_ex9.items():
print(f"\n*** exid = '{exid}' ***")
display(demo_df_by_exid_ex9[exid])
For the preceding inputs, a valid concatenated output is:
 | Start | Points | Earned | Email | Submit | Milestone | Hours | Exam
---|---|---|---|---|---|---|---|---
0 | 2018-12-09 08:55:36 | 2 | 38 | chelsea.dilann@gatech.edu | 2018-12-10 11:17:14 | problem2 | 8.557 | exam3
1 | 2018-12-08 12:27:33 | 5.5 | 25.5 | kristopher.erandi@gmail.com | 2018-12-09 16:09:17 | problem0 | 2.831 | exam3
2 | 2018-09-30 01:05:14 | 10 | 45 | meredith.rakyla@hotmail.com | 2018-09-30 19:11:33 | problem2 | 3.386 | exam1
3 | 2018-09-28 16:21:04 | 10 | 45 | stacy.loneta@gmail.com | 2018-09-28 17:51:52 | problem0 | 1.513 | exam1
4 | 2018-11-03 20:05:45 | 2 | 22 | leta.kartier@gmail.com | 2018-11-05 05:16:51 | problem1 | 2.097 | exam2
5 | 2018-11-04 20:02:24 | 5 | 16 | christopher.jullianne@gatech.edu | 2018-11-05 10:51:45 | problem0 | 14.822 | exam2
6 | 2018-11-03 06:59:28 | 10 | 50 | lola.fysher@gmail.com | 2018-11-03 08:06:09 | problem1 | 1.111 | exam2
7 | 2018-11-03 10:19:14 | 10 | 39 | susan.tedrina@gatech.edu | 2018-11-04 18:28:46 | problem3 | 1.337 | exam2
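As an aside, pd.concat can also tag rows by their originating dictionary key in one step. A sketch on toy inputs (equivalent in spirit to the reference solution below, which builds the 'Exam' column explicitly):
import pandas as pd

df_by_exid = {'exam1': pd.DataFrame({'Points': [10.0]}),
              'exam2': pd.DataFrame({'Points': [7.0]})}
tall = (pd.concat(df_by_exid, names=['Exam'])  # dict keys become an index level
          .reset_index(level='Exam')           # move that level into an 'Exam' column
          .reset_index(drop=True))
print(tall)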
### Exercise 9 solution
def concat_exams(df_by_exid):
'''
INPUT:
-A dictionary df_by_exid.
-If exid is a string exam ID, then df_by_exid[exid] is a dataframe associated with that exam.
-All dataframes in df_by_exid have the same columns.
GOAL:
-Return a new combined tibble formed by "stacking" these dataframes one on top of the other.
-Include a column named 'Exam' whose value records the originating exid key.
STRATEGY:
1. Create an empty list to hold all of our data frames we want to stack. Let's call this list_of_all_dfs
2. Iterate over exid in df_by_exid
3. Grab the associated data frame using df_by_exid[exid]. Let's call this 'df'
4. Make a copy of 'df'. Let's call this 'df_copy'
5. Add an 'Exam' column to 'df_copy' where all rows = exid
6. Append 'df_copy' to our list_of_all_dfs
7. Concatenate all of the data frames in our list_of_all_dfs.
8. Reset index and return this data frame
'''
# SOLUTION:
list_of_all_dfs = []
for exid in df_by_exid:
df = df_by_exid[exid]
df_copy = df.copy()
df_copy['Exam'] = exid
list_of_all_dfs.append(df_copy)
# https://pandas.pydata.org/docs/reference/api/pandas.concat.html
tall_df = pd.concat(list_of_all_dfs)
return tall_df.reset_index(drop=True)
### demo function call ###
concat_exams(demo_df_by_exid_ex9)
The cell below will test your solution for Exercise 9. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex9
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_9',
'func': concat_exams, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df_by_exid':{
'dtype':'dict', # data type of param.
'check_modified':False, # @TODO: FIXME
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, concatdf. It is the result of a correct implementation of concat_exams when run on merged_by_exid from the full dataset. Subsequent cells use it, so please run this cell regardless of whether you completed this exercise.
concatdf = cse6040.utils.load_obj_from_file('concat.dill')
concatdf
Exercise 10: summarize_exams

Suppose you are given df, a concatenated dataframe like the one produced in Ex. 9. Complete the function,

def summarize_exams(df):
    ...

so that it determines the mean exam score and mean time to complete the exam for every exam, returning the result as a dataframe (see below).

Input: A dataframe, similar to the output of Exercise 9. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:
- 'Email' (string): a student email address
- 'Exam' (string): the ID of an exam
- 'Milestone' (string): the problem number
- 'Points' (float): the points scored on that problem (milestone) for that exam
- 'Hours' (float): the estimated number of hours spent on that problem for that exam

Your task: For each student and exam, you need to determine their total points and total time. Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.

Output: Return a pandas dataframe with three columns:
- 'Exam' (string): the exam ID
- 'Points' (float): the total points for that exam averaged over all students
- 'Hours' (float): the total time in hours for that exam averaged over all students

Comments/hints: The averages you are to report are by exam taken over all students, not over all (student, problem) pairs! (See the sketch just below.)
### Define demo inputs ###
demo_df_ex10 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex10
### Exercise 10 solution
def summarize_exams(df):
'''
INPUT:
A dataframe, where each row has the data for a (student, exam, problem) triple, but the only columns provided are:
1. 'Email' (string): a student email address
2. 'Exam' (string): the ID of an exam
3. 'Milestone' (string): the problem number
4. 'Points' (float): the points scored on that problem (milestone) for that exam
5. 'Hours' (float): the estimated number of hours spent on that problem for that exam
GOAL:
For each student and exam, determine their total points and total time.
Then for each exam, you need to calculate the average (mean) total points and total time taken over all students.
Return a pandas dataframe with three columns:
1. 'Exam' (string): the exam ID
2. 'Points' (float): the total points for that exam averaged over all students
3. 'Hours' (float): the total time in hours for that exam averaged over all students
Hint: These averages you are to report are by exam taken over all students, not over all (student, problem) pairs!
STRATEGY:
1. Let's first determine total points and total time for each student and exam.
So group by 'Email' and 'Exam' and sum. Let's call this df_sum.
2. Reset index to bring index ('Email' and 'Exam') back as regular columns
3. Now let's find the mean of 'Hours' and 'Points' by Exam.
So group by 'Exam' and find mean.
4. Reset index to bring index ('Exam') back as a regular column
5. Return this data frame.
'''
# SOLUTION:
    df_sum = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum()  # sum only the numeric columns (avoids aggregating the 'Milestone' strings)
df_sum = df_sum.reset_index()
df_mean = df_sum.groupby('Exam').mean()
df_mean = df_mean.reset_index()
return df_mean
### demo function call ###
summarize_exams(demo_df_ex10)
The cell below will test your solution for Exercise 10. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex10
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_10',
'func': summarize_exams, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, exam_summary. It is the result of a correct implementation of summarize_exams when run on concatdf from the full dataset.
exam_summary = cse6040.utils.load_obj_from_file('exam_summary.dill')
exam_summary
For the case of three exams, the instructor would like to know whether the first two scores and the timing numbers can predict the third exam score. To carry out this analysis, the instructor needs to build a data matrix and wants your help.
Exercise 11: form_data_table

Suppose you are given the same input as Ex. 10, that is, a concatenated dataframe df like the one produced in Ex. 9. Complete the function,

def form_data_table(df):
    ...

so that it constructs a dataframe whose values will (later) become the entries of a linear-regression data matrix.

Input: The same dataframe as Ex. 10. Each row gives the data for a (student, exam, problem) triple, but the only columns provided are:
- 'Email' (string): a student email address
- 'Exam' (string): the ID of an exam
- 'Milestone' (string): the problem number
- 'Points' (float): the points scored on that problem (milestone) for that exam
- 'Hours' (float): the estimated number of hours spent on that problem for that exam

Output: Construct and return a new dataframe where each row corresponds with a student, and the columns are as follows:
- 'Email' (string): The student email. There will be one row per student in the output.
- 'Points#' (float): One column of total points scored for each exam present in the input. For example, if df has data for 'exam2' and 'exam3', then the output should have 'Points2' and 'Points3' columns, with the values being the total points earned by the student on the corresponding exam.
- 'Hours#' (float): Similarly, one column of total exam time for each exam present in the input.

As a final step, if there are any NaNs, drop the corresponding rows.

The idea behind this format is that each student will be an "observation" or "sample" for linear regression, and the columns will serve as predictors or features.

Comments/hints: As usual, you should not modify the input dataframe, df.
### Define demo inputs ###
demo_df_ex11 = cse6040.utils.load_obj_from_file('demo_df_ex10.dill')
demo_df_ex11
### Exercise 11 solution
def form_data_table(df):
'''
INPUT:
A data frame df, where each row gives the data for a (student, exam, problem) triple,
but the only columns provided are:
-'Email' (string): a student email address
-'Exam' (string): the ID of an exam
-'Milestone' (string): the problem number
-'Points' (float): the points scored on that problem (milestone) for that exam
-'Hours' (float): the estimated number of hours spent on that problem for that exam
GOAL:
Return a new dataframe where each row corresponds with a student, and the columns are as follows:
-'Email' (string): The student email. There will be one row per student in the output.
-'Points#' (float): One column of total points scored for each exam present in the input.
    For example, if df has data for 'exam2' and 'exam3', then the output should have 'Points2' and 'Points3' columns,
with the values being the total points earned by the student on the corresponding exam.
-'Hours#' (float): Similarly, one column of total exam time for each exam present in the input.
If there are any NaNs, drop the corresponding rows.
STRATEGY:
1. This looks like slightly more complicated casting (need to cast twice as there are multiple value columns you
need to create with new names)
2. *Borrow* casting function from NB7 solution. Add as a helper function.
3. Find the sum of 'Points' and 'Hours' grouped by 'Email'. Let's call this df_sum.
4. Reset index to bring Email index back as an actual column
5. Make 2 data frames: one containing Email + Exam + Points info and the other containing Email + Exam + Hours info
6. Call Cast helper function on each of these data frames (passing in correct parameters for each)
7. Merge the two resulting data frames together (on 'Email') to get our final data frame
8. Drop any rows that contain na
9. Return this data frame
'''
# SOLUTION:
def cast(df, key, value, join_how='outer'):
fixed_vars = df.columns.difference([key, value])
tibble = pd.DataFrame(columns=fixed_vars)
new_vars = df[key].unique()
for v in new_vars:
            df_v = df[df[key] == v].copy()  # copy so deleting a column doesn't touch a view of the input
            del df_v[key]
df_v = df_v.rename(columns={value: value+v[4:]})
tibble = tibble.merge(df_v, on=list(fixed_vars),how=join_how)
return tibble
    df_sum = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum()  # restrict to the numeric columns before summing
df_sum = df_sum.reset_index()
df_points = df_sum[['Email', 'Exam', 'Points']]
df_points = cast(df_points, 'Exam', 'Points', 'outer')
df_hours = df_sum[['Email', 'Exam', 'Hours']]
df_hours = cast(df_hours, 'Exam', 'Hours', 'outer')
df_final = df_points.merge(df_hours, on='Email')
df_final = df_final.dropna()
return df_final
### demo function call ###
form_data_table(demo_df_ex11)
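For comparison, here is an alternative sketch using DataFrame.pivot in place of the hand-rolled cast. The name form_data_table_pivot is hypothetical, and like the reference solution it assumes exam IDs have the form 'examN':
def form_data_table_pivot(df):
    # Per-student, per-exam totals (numeric columns only):
    totals = df.groupby(['Email', 'Exam'])[['Points', 'Hours']].sum().reset_index()
    # Widen: one ('Points', exid) and ('Hours', exid) column per exam:
    wide = totals.pivot(index='Email', columns='Exam', values=['Points', 'Hours'])
    # Flatten ('Points', 'exam1') -> 'Points1', etc.:
    wide.columns = [name + exid[4:] for name, exid in wide.columns]
    return wide.reset_index().dropna()

# Calling form_data_table_pivot(demo_df_ex11) should agree with
# form_data_table(demo_df_ex11) up to row and column order.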
The cell below will test your solution for Exercise 11. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.

### test_cell_ex11
###
### AUTOGRADER TEST - DO NOT REMOVE
###
from tester_fw.testers import Tester
conf = {
'case_file':'tc_11',
'func': form_data_table, # replace this with the function defined above
'inputs':{ # input config dict. keys are parameter names
'df':{
'dtype':'df', # data type of param.
'check_modified':True,
}
},
'outputs':{
'output_0':{
'index':0,
'dtype':'df',
'check_dtype': True,
'check_col_dtypes': True, # Ignored if dtype is not df
'check_col_order': False, # Ignored if dtype is not df
'check_row_order': False, # Ignored if dtype is not df
'check_column_type': True, # Ignored if dtype is not df
'float_tolerance': 10 ** (-6)
}
}
}
tester = Tester(conf, key=b'It2jBzT52UbvyXCvc19bZBXTaLtU-OurcEtWwuHet8M=', path='resource/asnlib/publicdata/')
for _ in range(70):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
The cell below creates a precomputed object, data_table. It is the result of a correct implementation of form_data_table when run on concatdf from the full dataset.
data_table = cse6040.utils.load_obj_from_file('data_table.dill')
data_table
If you have made it this far, congratulations on completing the exam. Don't forget to submit!
Postscript: We went through a lot of work in this exam to get the data table. So if we try to regress, say, 'Points3' from the other columns, how well does that work? Here is a code fragment to check. What does the result say?
# Predictors:
data_cols = data_table[['Points1', 'Hours1', 'Points2', 'Hours2']]
X = data_cols.values
X /= X.max(axis=0) # Normalize columns
# Response:
resp_col = 'Points3'
y = data_table[resp_col].values
y /= y.max()
from sklearn.linear_model import LinearRegression
model_type = LinearRegression(fit_intercept=True, positive=True)
model = model_type.fit(X, y)
print("* Predictors:", data_cols)
print("* Response:", resp_col)
print("* Model coefficients:", model.coef_)
print("* Model intercept:", model.intercept_)
y_pred = model.predict(X)
print(f"* Mean relative error: {abs((y_pred - y)/y).mean()*100:.1f}%")