Version 1.0
This exam will test your understanding of PCA and Linear Regression along with manipulation of Pandas, Numpy, and base Python data structures. There are 8 exercises, numbered 0 to 7. There are 15 available points. However, to earn 100%, the threshold is just 13 points. (Therefore, once you hit 13 points, you can stop. There is no extra credit for exceeding this threshold.)
Each exercise builds logically on the previous one, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. One demo cell calls a function defined in an earlier demo cell - the rest are self-contained. If you see the comment # demo - used in later demo cell, be sure to run that cell before moving on.
For more information see the "Final Exam Release Notes" post on the discussion forum.
Good luck!
###
### AUTOGRADER TEST - DO NOT REMOVE
###
import pandas as pd
import numpy as np
import run_tests
from time import time
from importlib import reload
time_start = time()
Today, we will be working with data on home sales. We will try to use information about the individual homes to build a linear regression model that predicts the sale prices. There are challenges to this approach, which will need to be addressed.
Before we can go about choosing a model from many options, we must decide how to compare them. One common metric for evaluating linear regression models is the "coefficient of determination", often referred to as $R^2$.
$R^2$ reveals the share of variance in the response which is accounted for by the model. However, this metric is not effective at comparing models with different numbers of parameters. Specifically, adding parameters will always result in an increase in $R^2$. To compare models of different sizes, there is a modified metric, the adjusted $R^2$ (written $R^2_{\mathrm{adj}}$), which penalizes larger models.
Exercise 0: (1 point) - Given Numpy arrays representing the observed responses (y) and predicted responses (y_hat), complete the function signature for r_sq to compute $R^2$. Return your answer as a Python float.
For reference, $R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}$, where $\mathrm{SSR} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares and $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares ($\bar{y}$ is the mean of the observed responses).
def r_sq(y, y_hat):
###
SSR = np.sum((y - y_hat) ** 2)
y_bar = np.mean(y)
SST = np.sum((y - y_bar) ** 2)
R_sq = 1 - SSR / SST
return R_sq
###
The demo cell below should output approximately 0.93690140. The intermediate values for this example are recomputed in the optional check below.
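This optional cell is not part of the exam; it is a small sketch that recomputes the intermediate quantities (the residual sum of squares, SSR, and the total sum of squares, SST) for the demo inputs used below.
# optional check (not part of the exam) - intermediate values for the demo inputs
chk_y_0 = np.array([5, 10, 3, 5, 7, 9])
chk_y_hat_0 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
chk_SSR = np.sum((chk_y_0 - chk_y_hat_0) ** 2)        # residual sum of squares: 2.24
chk_SST = np.sum((chk_y_0 - np.mean(chk_y_0)) ** 2)   # total sum of squares: 35.5
print(chk_SSR, chk_SST, 1 - chk_SSR / chk_SST)        # the last value is approximately 0.9369014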
# demo
demo_y_0 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_0 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
r_sq(demo_y_0, demo_y_hat_0)
# exercise 0 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
run_tests.chk_ex0(r_sq)
print('Passed. Please submit your work now.')
time() - time_start
# load testing variables for `y`, `y_hat`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, r_sq_true, r_sq_output
Exercise 1: (1 point) - Given Numpy arrays representing the observed responses (y), predicted responses (y_hat), and the number of parameters in the model (p), complete the function signature for r_sq_adj to compute $R^2_{\mathrm{adj}}$. If you successfully solved Exercise 0 you can use r_sq in your solution to this exercise; however, you can also pass by directly implementing the formula.
Here's the formula again, where $n$ is the number of observations and $p$ is the number of parameters:
$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
def r_sq_adj(y, y_hat, p):
###
# Option 1)
n = len(y)
R2 = r_sq(y, y_hat)
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
# Option 2)
# n = len(y)
# ssr = np.sum((y - y_hat) ** 2)
# y_bar = np.mean(y)
# sst = np.sum((y - y_bar) ** 2)
# R2_adj = 1 - (ssr * (n - 1)) / (sst * (n - p - 1))
return R2_adj
###
The demo cell below should output approximately 0.89483568. The intermediate values for this example are recomputed in the optional check below.
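This optional cell is not part of the exam; it recomputes the adjusted value by hand using n = 6 observations, p = 2 parameters, and the R-squared value from the Exercise 0 demo.
# optional check (not part of the exam) - intermediate values for the demo inputs
chk_n, chk_p = 6, 2
chk_R2 = 0.9369014    # R-squared value from the Exercise 0 demo
print(1 - (1 - chk_R2) * (chk_n - 1) / (chk_n - chk_p - 1))   # approximately 0.8948357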
# demo
demo_y_1 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_1 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
demo_p_1 = 2
r_sq_adj(demo_y_1, demo_y_hat_1, demo_p_1)
# exercise 1 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
run_tests.chk_ex1(r_sq_adj)
print('Passed. Please submit your work now.')
time() - time_start
# load testing variables for `y`, `y_hat`, `p`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, p_check, r_sq_adj_true, r_sq_adj_output
Exercise 2: (2 points) - The data set has some missing values, columns which are the wrong type, and columns which we are not interested in. Complete the function signature for clean to return a new DataFrame that is a copy of df with the following modifications based on the Python dictionary, params. The values of params are all Python lists of column names.
- Drop the columns in params['cols_to_drop'].
- Convert the columns in params['cols_to_string'] to Python strings.
- Set any missing values in the columns in params['cols_to_zero'] to 0.0, while leaving all other values intact.
- Set any missing values in the columns in params['cols_to_none'] to the Python string 'None' - Don't forget the quotation marks!
- Drop any rows that still contain missing values (i.e. missing values in columns not covered by the lists above).

Notes:
- params will always have all of the keys mentioned above.
- params['cols_to_string'] does not have missing values.
- The params lists are mutually exclusive - i.e. a column included in params['cols_to_string'] will not be in any of the other lists.

For example, given the following DataFrame input:
foo_drop bar_drop foo_zero foo_none bar_none foo_str foo bar
0 1.0 0.0 2.0 foo1 bar 1 foo bar
1 2.0 NaN 4.0 foo2 bax 2 foo bar
2 3.0 NaN NaN foo3 qux 3 foo bar
3 4.0 NaN NaN foo4 bar 4 foo bar
4 5.0 NaN 3.0 NaN bax 5 foo bar
5 6.0 1.0 4.0 NaN qux 6 foo bar
6 NaN 2.0 5.0 foo1 bar 7 foo bar
7 NaN 3.0 6.0 foo2 bax 8 foo NaN
8 NaN 4.0 7.0 foo3 qux 9 NaN bar
9 NaN 5.0 1.0 foo4 NaN 10 foo bar
With these params:
{
'cols_to_drop': ['foo_drop', 'bar_drop'],
'cols_to_zero': ['foo_zero'],
'cols_to_none': ['foo_none', 'bar_none'],
'cols_to_string': ['foo_str']
}
The function call to clean
should return these results.
foo_zero foo_none bar_none foo_str foo bar
0 2.0 foo1 bar 1 foo bar
1 4.0 foo2 bax 2 foo bar
2 0.0 foo3 qux 3 foo bar
3 0.0 foo4 bar 4 foo bar
4 3.0 None bax 5 foo bar
5 4.0 None qux 6 foo bar
6 5.0 foo1 bar 7 foo bar
9 1.0 foo4 None 10 foo bar
Notice that the columns 'foo_drop'
, 'bar_drop'
were dropped entirely, 'foo_zero'
had missing values replaced with 0.0
in rows 2 and 3, 'foo_none'
and 'bar_none'
had missing values replaced with 'None'
in rows 4,5, and 9, the numbers in 'foo_str'
were converted to strings, and rows 7 and 8 were dropped because of missing values in 'foo'
and 'bar'
.
The demo cell below should generate the same output.
def clean(df, params):
###
res = df.copy()
res = res.drop(params['cols_to_drop'], axis=1)
res[params['cols_to_string']] = res[params['cols_to_string']].astype('str')
res[params['cols_to_zero']] = res[params['cols_to_zero']].fillna(0.0)
res[params['cols_to_none']] = res[params['cols_to_none']].fillna('None')
res.dropna(inplace=True)
return res
###
# demo
demo_params = {
'cols_to_drop': ['foo_drop', 'bar_drop'],
'cols_to_zero': ['foo_zero'],
'cols_to_none': ['foo_none', 'bar_none'],
'cols_to_string': ['foo_str']
}
demo_df_2 = pd.DataFrame(
{
'foo_drop': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan, np.nan],
'bar_drop': [0, np.nan, np.nan, np.nan, np.nan, 1, 2, 3, 4, 5],
'foo_zero': [2., 4., np.nan, np.nan, 3., 4., 5., 6., 7., 1],
'foo_none': ['foo1', 'foo2', 'foo3', 'foo4', np.nan, np.nan, 'foo1', 'foo2', 'foo3', 'foo4'],
'bar_none': ['bar', 'bax', 'qux', 'bar', 'bax', 'qux', 'bar', 'bax', 'qux', np.nan],
'foo_str': [1,2,3,4,5,6,7,8,9,10],
'foo': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', np.nan, 'foo'],
'bar': ['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', np.nan, 'bar', 'bar' ]
}
)
print(clean(demo_df_2, demo_params))
# exercise 2 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
run_tests.chk_ex2(clean)
print('Passed. Please submit your work now.')
time() - time_start
# load testing variables for `df`, `params`, `true_output`, `your_output`
from run_tests import df_check_in, params_check, df_check_out, clean_check
The data has categorical predictors as well as numerical predictors. We have to treat each type separately to get meaningful predictions from our model. As such, we need to partition the data separating the numerical predictors, the categorical predictors, and the target variable. Additionally, there is sometimes not a linear relationship between predictors and the target variable. It may be appropriate to transform one or both to obtain the desired relationship. The most common transformation is the log transformation, where the natural logarithm of a particular variable is used for modeling in lieu of the variable itself.
Exercise 3: (2 points) - Given a DataFrame (df), the name of the column containing the responses (target), and a boolean value (log), complete the function signature for partition_cols. The function should output the following:
- Y - Numpy array - Vector holding the response values. Y should be log-transformed based on the truth-value of log.
- X_num - Numpy array - Matrix holding the values of all numerical columns of df aside from the target column.
- X_cat_df - DataFrame - A copy of all the categorical columns of df.

Implementation details:
- If log is True, Y should be log-transformed. Otherwise, no transformation should be performed.
- A column should be considered numerical if it has a numerical dtype (i.e. some flavor of int or float). All columns which do not have numerical dtypes should be considered categorical.

For example, given the following DataFrame input:
foo bar baz kag tar
0 abf 1 2.0 148.413159 zyx
1 def 3 4.0 54.598150 wvu
2 ghi 5 6.0 20.085537 tsr
3 jkl 7 8.0 1096.633158 qpo
With 'kag'
as the target
and log = True
, a call to partition_cols
should return the following output:
(array([5., 4., 3., 7.]),
array([[1., 2.],
[3., 4.],
[5., 6.],
[7., 8.]]),
foo tar
0 abf zyx
1 def wvu
2 ghi tsr
3 jkl qpo)
Notice that the output is a tuple with three elements. The first is an array of the log transformed values in the target column, 'kag'
. The second is an array of the values in the numerical columns 'bar'
and 'baz'
. The third is a DataFrame with the categorical columns 'foo'
and 'tar'
.
The demo cell below should produce the same output.
def partition_cols(df, target, log=True):
###
Y = np.log(df[target]) if log else df[target]
Y = Y.to_numpy()
X_num = df.drop(target, axis=1).select_dtypes(include='number').to_numpy()
X_cat_df = df.select_dtypes(include='object').copy()
###
return Y, X_num, X_cat_df
# demo
demo_df_3 = pd.DataFrame(
{
'foo': ['abf', 'def', 'ghi', 'jkl'],
'bar': [1,3,5,7],
'baz': [2,4.,6,8],
'kag': [np.exp(5), np.exp(4), np.exp(3), np.exp(7)],
'tar': ['zyx', 'wvu', 'tsr', 'qpo']
}
)
demo_target = 'kag'
demo_log = True
partition_cols(demo_df_3, demo_target, demo_log)
# exercise 3 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
run_tests.chk_ex3(partition_cols)
print('Passed. Please submit your work now.')
time() - time_start
# load testing variables for `df`, `target`, `log`, `true_Y_output`, `your_Y_output`, `true_X_num_output`, `true_X_cat_df_output`, `your_X_cat_df_output`
from run_tests import df_check_in, target_check, do_log, Y_check, Y_out, X_num_check, X_num_out, X_cat_df_check, X_cat_df_out
To use categorical variables in regression, they must be converted into some numerical form. One strategy for doing this is "one-hot" encoding. To accomplish this, several new variables are created - one for each unique value of the original variable. For each observation, a 1 is assigned to the new variable corresponding to the original variable's value, and the rest are assigned 0. This strategy is preferred over assigning each unique original value a number because the categorical values don't have any inherent "order" or "value". A small optional illustration is given in the cell below.
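As a small optional illustration (not part of any exercise), the sketch below one-hot encodes a single made-up column by comparing each entry against the column's unique values; the column name and values here are hypothetical.
# optional illustration (not part of the exam) - one-hot encoding a single made-up column
colors = pd.Series(['red', 'blue', 'red'], name='color')
uniques = np.sort(colors.unique())                            # ['blue', 'red']
one_hot = (colors.values[:, None] == uniques).astype('int64')
print(pd.DataFrame(one_hot, columns=['color_' + u for u in uniques]))
# each row has a single 1, in the column matching its original value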
Exercise 4: (2 points) - Given a DataFrame (df), complete the function signature for one_hot_all to return a new DataFrame which is a copy of df with all categorical columns replaced with a one-hot encoded version of the column (note that a single column will be replaced by multiple columns). Use the following guidelines for performing the one-hot encoding.
- Any column which does not have a numerical dtype should be considered categorical.

For each categorical column:
- Create one new column for each unique value of the original column, named <original column name>_<unique value>.
- If an observation has the value <unique value> in the original column, then <original column name>_<unique value> should have an entry of 1; all other new columns should have entries of 0.
- All of the new columns should have a dtype of 'int64' - the columns which are not one-hot encoded should keep their original dtype.

Hint - There is a Pandas method which will accomplish most of what needs to be done in this exercise. We won't name it, but feel free to search the web and use whatever you find.
For example, given the input DataFrame:
foo bar baz
0 baz tar 2.3
1 qux kag 3.0
2 baz kag 4.0
3 qux tuk 1.1
4 baz tar 6.0
5 qux kag 8.0
Notice that there are two categorical columns (you can check for yourself using dtypes
).
- Since 'foo' has unique values 'baz' and 'qux', the output should have columns 'foo_baz' and 'foo_qux'.
- Since 'bar' has unique values 'kag', 'tar', and 'tuk', the output should have columns 'bar_kag', 'bar_tar', and 'bar_tuk'.

The result has replaced the columns 'foo' and 'bar' with the one-hot encoded versions:
baz foo_baz foo_qux bar_kag bar_tar bar_tuk
0 2.3 1 0 0 1 0
1 3.0 0 1 1 0 0
2 4.0 1 0 1 0 0
3 1.1 0 1 0 0 1
4 6.0 1 0 0 1 0
5 8.0 0 1 1 0 0
The demo cell below should produce the same output.
def one_hot_all(df):
###
return pd.get_dummies(data=df, dtype='int64')
###
# demo
demo_df_4 = pd.DataFrame(
{
'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'],
'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag'],
'baz': [2.3, 3, 4, 1.1, 6, 8]
}
)
print(demo_df_4)
print(one_hot_all(demo_df_4))
# exercise 4 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
run_tests.chk_ex4(one_hot_all)
print('Passed. Please submit your work now.')
time() - time_start
# load testing variables for `df`, `true_output`, and `your_output`
from run_tests import df_check, oh_check, oh_out
It is common practice to randomly partition available data into a training and test set to more accurately gauge how well a model will perform on new data. It is possible (and somewhat likely) that models will fit some of the randomness or "noise" in the data, which is not actually accounted for in the predictive variables.
The cell below defines a function ind_splitter. It takes two inputs:
- m - the number of rows being split.
- pct - a float indicating the percentage of rows to be assigned to the test set.

The function ind_splitter returns two lists: the first contains the indices of the test set, and the second contains the indices of the training set.
# demo - used in later demo cell
def ind_splitter(m, pct):
n = int(pct*m)
rng = np.random.default_rng(6040)
inds = np.arange(m)
rng.shuffle(inds)
return inds[:n], inds[n:] # test indices, training indices
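As an optional aside (not part of any exercise), ind_splitter uses a fixed seed, so its output is reproducible; the indices printed below should match the Exercise 5 example output further down.
# optional illustration (not part of the exam) - the fixed seed makes the split reproducible
demo_test_ind, demo_train_ind = ind_splitter(6, 0.4)
print(demo_test_ind, demo_train_ind)
# based on the Exercise 5 example below: test indices [3 0], training indices [1 5 4 2]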
Exercise 5: (2 points) - Given a tuple (data) containing some number of either DataFrames or Numpy arrays, the percentage desired in the test partition (pct), and a function (splitter) which will decide the indices belonging to each partition, fill out the function definition for train_test_split. The function should do the following:
- Call splitter to determine which indices (rows) belong to which partition. splitter will take the same arguments as ind_splitter and return two list-like structures. The first is the indices of the test set, the second is the indices of the training set.
- Use the output of splitter to return a tuple that has the test and training partitions for each data structure in data. There will be two entries in the output for each entry in data.
- If data is of the form (data0, data1, data2), then the output should be of the form (data0_test, data0_train, data1_test, data1_train, data2_test, data2_train). There can be any number of entries in data, but the output should give the test and then the training partition of each entry before moving on to the next entry.
- All entries in data will have the same number of rows. This is checked with the assert statement inside the for-loop given in the starter code. Feel free to use the variable m in your solution.

Notes:
- splitter takes the same inputs/outputs as the ind_splitter example above. However, the test code will not use ind_splitter. Your solution must call splitter to pass.
- Partition the rows only according to the indices returned by splitter. Do not reset any DataFrame indexes or shuffle any columns. This requirement is necessary to ensure that the relation between each piece of data is maintained properly when partitioning.

For example, consider the following tuple:
(array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]]),
foo bar
0 baz tar
1 qux kag
2 baz kag
3 qux tuk
4 baz tar
5 qux kag)
It has two entries - the first is an array and the second is a DataFrame. Notice how they both have the same number of rows. Using pct = 0.4 and ind_splitter as our splitter, a call to train_test_split should produce the following output.
(array([[ 9, 10, 11],
[ 0, 1, 2]]),
array([[ 3, 4, 5],
[15, 16, 17],
[12, 13, 14],
[ 6, 7, 8]]),
foo bar
3 qux tuk
0 baz tar,
foo bar
1 qux kag
5 qux kag
4 baz tar
2 baz kag)
Notice that rows 3 and 0 are assigned to the test partition for both entries in data. The ordering of the rows has been changed, but it is consistent between the array and the DataFrame as well.
The demo cell below should produce the same output.
def train_test_split(data, pct, splitter=ind_splitter):
m = data[0].shape[0]
for d in data:
assert d.shape[0] == m
###
test_ind, train_ind = splitter(m, pct)
res = []
for dataset in data:
if isinstance(dataset, pd.DataFrame):
res.append(dataset.iloc[test_ind, :])
res.append(dataset.iloc[train_ind, :])
else:
res.append(dataset[test_ind])
res.append(dataset[train_ind])
return tuple(res)
###
# demo - calls function defined in earlier demo cell
demo_df_5 = pd.DataFrame(
{
'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'], 'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag']
}
)
demo_arr_5 = np.arange(18).reshape((6,3))
demo_pct_5 = 0.4
demo_data_5 = (demo_arr_5, demo_df_5)
train_test_split(demo_data_5, demo_pct_5, ind_splitter)
# exercise 5 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
run_tests.chk_ex5(train_test_split)
print('Passed. Please submit your work now.')
time() - time_start
# load testing variables for `data`, `pct`, `splitter`, `true_output`, `your_output`
from run_tests import data_check, pct_check, splitter_check, true_output, your_output
To attempt to reduce our data's dimensionality, we will use Principal Component Analysis (PCA). However, PCA works best when each data column is "standardized", i.e. has a mean of 0 and a standard deviation of 1.
To standardize a matrix $X$, suppose that $x$ is a column of $X$. Also, suppose that $\mu$ and $s$ are the mean and sample standard deviation of $x$. To calculate the standardized column, use this formula: $x_{\text{std}} = \frac{x - \mu}{s}$.
To refresh your memory on PCA: Suppose our (standardized) data is in a matrix $X$ with $m$ rows and $d$ columns. Recall the Singular Value Decomposition (SVD) of $X$: $X = U \Sigma V^T$. The rows of $V^T$ are the principal components, and the fraction of the total variance captured by the $i$-th principal component is $\sigma_i^2 / \sum_j \sigma_j^2$, where the $\sigma_i$ are the singular values (the diagonal entries of $\Sigma$).
To use the same column standardization and PCA transformation on new data, we need the vector of column means, the vector of column standard deviations, and the principal components. With these, we can evaluate models based on our transformed data with new observations.
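The optional sketch below (not part of any exercise) illustrates the two ideas above on a small made-up matrix: it standardizes the columns and then reads the fraction of variance captured by each principal component off the singular values.
# optional refresher (not part of the exam) - standardize the columns, then use the SVD
X_small = np.array([[1., 2.], [3., 5.], [5., 4.], [7., 9.]])   # made-up data
mu_small = np.mean(X_small, axis=0)
s_small = np.std(X_small, axis=0, ddof=1)                      # sample standard deviation
X_small_std = (X_small - mu_small) / s_small
_, sing_vals, VT_small = np.linalg.svd(X_small_std)
print(sing_vals ** 2 / np.sum(sing_vals ** 2))                 # fraction of variance per component
print(VT_small)                                                # rows are the principal components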
Exercise 6: (3 points) - Complete the function prin_comps to meet the following requirements.
- Standardize the data matrix X and compute its SVD.
- prin_comps should return the column mean vector (Numpy array), the column sample standard deviation vector (Numpy array), and a number of principal components to be determined by the rules below.
- If pct is not None, use the SVD to determine the minimum value of $k$ such that the first $k$ principal components account for at least pct of the variance. Return the first $k$ principal components.
- If n is not None, return the first n principal components.
- If both pct and n are None, return all of the principal components.
- The case where both pct and n are not None will not occur - this is checked with an assertion in the starter code.

Notes:
- The principal components are the rows of the $V^T$ matrix returned by numpy.linalg.svd (which you should use for the SVD calculation).
- The sample standard deviation uses a different setting for the degrees of freedom than the default in np.std. Check the documentation for details on how to set the degrees of freedom. There are other libraries (like scipy.stats) which can also calculate column sample standard deviations, but you will need to check the documentation to ensure that the degrees of freedom are what's intended.

There is no simplified example or demo for this exercise (an optional sanity check appears after the solution cell below). However, this exercise incorporates concepts almost directly from Notebook 15.
.def prin_comps(X, pct=None, n=None):
assert not (pct is not None and n is not None), 'Can\'t set both of these'
assert pct is None or (0 < pct <= 1), 'Invalid pct. Must be between 0 and 1 or `None`.'
assert n is None or (type(n) is int and 1 <= n <= X.shape[1]), 'n must be integer between 1 and number of columns in X'
###
m = np.mean(X, axis=0)
s = np.std(X, axis=0, ddof=1)
X_std = (X - m) / s
U, S, Vh = np.linalg.svd(X_std)
if pct:
frac_var = S ** 2 / np.sum(S ** 2)
acc = 0
for i, frac in enumerate(frac_var):
acc += frac
if acc >= pct:
k = i + 1
break
VTk = Vh[:k, :]
elif n:
VTk = Vh[:n, :]
else:
VTk = Vh
###
return m, s, VTk # means, standard deviations, and principal components
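Since there is no official demo for this exercise, here is an optional sanity check (not part of the exam, and assuming the solution above has been run): it calls prin_comps on a small random matrix and inspects the shapes of the returned arrays.
# optional check (not part of the exam) - shapes returned by `prin_comps`
rng_chk = np.random.default_rng(0)
X_chk6 = rng_chk.normal(size=(20, 4))
m_chk, s_chk, VTk_chk = prin_comps(X_chk6, pct=0.9)
print(m_chk.shape, s_chk.shape, VTk_chk.shape)   # (4,), (4,), and (k, 4) for some k <= 4
_, _, VT_all_chk = prin_comps(X_chk6)            # neither pct nor n -> all components
print(VT_all_chk.shape)                          # (4, 4)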
# exercise 6 test cell
reload(run_tests)
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
run_tests.chk_ex6(prin_comps, 50)
print('Passed. Please submit your work now.')
time() - time_start
# load testing variables `pca_test`, `your_m_output`, `your_s_output`, `your_VTk_output`
# `pca test` is a dictionary with keys ['X', 'pct', 'n', 'm', 's', 'VTk'] corresponding to
# inputs and required outputs given in the problem description
from run_tests import pca_test, m_out, s_out, VTk_out
We have defined pca_transformer and calc_theta below. Take the time to look at them, because Exercise 7 will incorporate elements from both.
pca_transformer - You have seen above that functions are able to take other functions as arguments. Well, they can also return functions! This function returns the function transform, which uses the parameters passed to pca_transformer as constants and itself takes a single parameter. The idea is that pca_transformer can be used to create a function that projects a new matrix (X) into the same Principal Component space that was used to create m, s, and VTk.
# demo
def pca_transformer(m, s, VTk):
def transform(X):
X_std = ((X-m)/(s))
return X_std.dot(VTk.T)
return transform
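As an optional usage sketch (not part of any exercise), pca_transformer composes naturally with prin_comps from Exercise 6; the matrices below are made up purely to show the shapes involved.
# optional usage sketch (not part of the exam) - composing `prin_comps` with `pca_transformer`
X_fit = np.random.default_rng(1).normal(size=(10, 3))   # made-up "training" data
X_new = np.random.default_rng(2).normal(size=(4, 3))    # made-up "new" data
project = pca_transformer(*prin_comps(X_fit, n=2))
print(project(X_new).shape)                              # (4, 2): new rows in 2-component space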
calc_theta - This function takes a data matrix X and a response vector y and computes the regression coefficient vector theta. Notice how the "column of ones" (the intercept term) is added to the left of X when X_p is created.
# demo
def calc_theta(X, y):
from scipy.linalg import solve_triangular
# add intercept
m = X.shape[0]
X_p = np.append(np.ones((m,1)), X, axis=1)
# compute regression parameter
Q, R = np.linalg.qr(X_p)
b = Q.T.dot(y)
theta = solve_triangular(R, b)
return theta
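As an optional check (not part of any exercise), calc_theta should agree with NumPy's least-squares solver on a small made-up problem, since the QR approach solves the same least-squares problem.
# optional check (not part of the exam) - `calc_theta` vs. NumPy's least-squares solver
X_lr = np.random.default_rng(3).normal(size=(8, 2))
y_lr = np.random.default_rng(4).normal(size=8)
X_lr_p = np.append(np.ones((8, 1)), X_lr, axis=1)        # same intercept column as calc_theta
theta_qr = calc_theta(X_lr, y_lr)
theta_ls, *_ = np.linalg.lstsq(X_lr_p, y_lr, rcond=None)
print(np.allclose(theta_qr, theta_ls))                   # should print True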
Exercise 7: (2 points) - Given a linear regression parameter vector (theta), complete the function lr_predictor to meet these requirements.
- lr_predictor should return a function which makes predictions on a new data matrix using theta.

To compute predictions $\hat{y}$ based on the data matrix $X$ and a coefficient vector $\theta$, complete the following steps:
- Prepend a "column of ones" to the left of $X$ (as is done in calc_theta) to form $X_p$.
- Compute $\hat{y} = X_p \theta$.

For example, if given the input np.array([1,2,3]) for theta, a call to lr_predictor should produce a function which makes predictions on new data.
demo_predictor = lr_predictor(np.array([1,2,3]))
Here demo_predictor
is a function that will make predictions based on theta
. We can make predictions on the following data.
X0 = np.array([[112.89440365, 65.62227416],
[111.84947876, 58.21425899],
[ 99.58399277, 72.10462519],
[ 89.75919282, 80.56690485],
[106.24550229, 79.71511156]])
X1 = np.array([[102.37686976, 73.0361458 ],
[108.56479013, 70.62355242],
[ 96.92681804, 56.06983186],
[101.33747938, 66.66655047],
[ 93.75484396, 43.88275971]])
X2 = np.array([[106.42275221, 62.33434278],
[108.38137453, 67.76061615],
[112.14494972, 64.73565864],
[110.38170377, 64.6506213],
[113.28106798, 75.74780834]])
The demo below makes the following calls and should produce the same output.
demo_predictor(X0)
[423.65562978 399.34173449 416.48186111 422.21910019 452.63633926]
demo_predictor(X1)
[424.86217692 430.00023752 363.06313166 403.67461017 320.15796705]
demo_predictor(X2)
[400.84853276 421.04459751 419.49687536 415.71527144 454.80556098]
def lr_predictor(theta):
def predict(X):
###
ones = np.ones((X.shape[0], 1))
X_aug = np.hstack((ones, X))
return X_aug @ theta
###
return predict
#demo
demo_predictor = lr_predictor(np.array([1,2,3]))
X0 = np.array([[112.89440365, 65.62227416],
[111.84947876, 58.21425899],
[ 99.58399277, 72.10462519],
[ 89.75919282, 80.56690485],
[106.24550229, 79.71511156]])
X1 = np.array([[102.37686976, 73.0361458 ],
[108.56479013, 70.62355242],
[ 96.92681804, 56.06983186],
[101.33747938, 66.66655047],
[ 93.75484396, 43.88275971]])
X2 = np.array([[106.42275221, 62.33434278],
[108.38137453, 67.76061615],
[112.14494972, 64.73565864],
[110.38170377, 64.6506213],
[113.28106798, 75.74780834]])
print('demo_predictor(X0) ->', demo_predictor(X0), '\ndemo_predictor(X1) ->', demo_predictor(X1), '\ndemo_predictor(X2) ->', demo_predictor(X2))
# exercise 7 test cell
reload(run_tests)
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
run_tests.chk_ex7(lr_predictor, 50)
print('Passed. Please submit your work now.')
time() - time_start
from run_tests import lr_test, pred_check
Fin. You have reached the end of the exam. Make sure you have submitted your work and congratulations on finishing the semester!
The additional content below is optional.
We defined 8 useful functions in the exercises above. Here, we will show how to use these functions to perform a simple regression analysis on some real estate sales data. Set the variable run_demo
to True
in order to run the demonstration code in the cells below.
Note: the demonstration below will not run correctly if you have not solved all of the exercises in the exam.
# set `run_demo` to `True` if you want to enable the demo epilogue in your work area
run_demo = True
Loading the data
Below we will load the data from a file and take a look at the columns.
if run_demo:
df_raw = pd.read_csv('resource/asnlib/publicdata/train.csv')
print(df_raw.columns)
Determining age
One step in our preprocessing that is not covered by the functions we wrote above is computing the age of each home. In the raw data, we have the year in which the home was last remodeled. It makes sense to compute the age by subtracting the remodel year from the year the home was sold and to use this value instead of the raw years. We will do this in the cell below.
if run_demo:
df_raw['Age'] = df_raw['YrSold'] - df_raw['YearRemodAdd']
Cleaning the data
Based on an examination of the data descriptions we have elected to drop the columns 'GarageYrBlt', 'Id', and 'MoSold' because they won't add useful information to our model. Additionally, we are dropping 'YearBuilt' and 'YearRemodAdd' because they are incorporated into our age calculation.
The column 'MSSubClass' has all numerical values, but it is actually categorical. We need to change the type of this column so that it will be interpreted as such.
Now we will check which columns have missing values so that we can deal with them appropriately.
if run_demo:
print(df_raw.loc[:, df_raw.isnull().sum() > 0].columns)
The columns 'LotFrontage' and 'MasVnrArea' are numerical and have missing values. The remaining columns with missing values are categorical. We will replace the numerical missing values with 0.0 and the categorical ones with 'None'. We do not want to include homes without electricity in our analysis at all.
Below, we set our cleaning parameters to meet our requirements and run clean
. Note that 'Electrical' is not included because we want to drop any rows with missing values in that column.
if run_demo:
params = {
'cols_to_drop': ['GarageYrBlt', 'Id', 'MoSold', 'YearBuilt', 'YearRemodAdd'],
'cols_to_zero': ['LotFrontage', 'MasVnrArea'],
'cols_to_none': ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond',
'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
'PoolQC', 'Fence', 'MiscFeature'],
'cols_to_string': ['MSSubClass']
}
df_clean = clean(df_raw, params)
Partitioning the data
Next we need to separate our data into the response, the numerical variables, and the categorical variables. We will be performing the log transform on the response variable. We should also split each of these into training and test sets. We will hold out roughly 30% for testing.
There are a lot of categorical variables. One-hot encoding them all would give us a very high dimensional model. We used an off-the-shelf recursive feature elimination with cross-validation algorithm to select the most informative categorical variables - 'LotConfig', 'HouseStyle' and 'ExterQual'. Selecting these three categorical variables will give us 17 one-hot variables.
if run_demo:
# separate response, numerical, and categorical data
y, X_num, X_cat = partition_cols(df_clean, 'SalePrice', True)
# one-hot encode categorical data
X_cat_oh = one_hot_all(X_cat[['LotConfig', 'HouseStyle', 'ExterQual']])
# construct `data` tuple and partition into training and test
data = (y.reshape((-1,1)), X_num, X_cat_oh)
y_test, y_train, X_num_test, X_num_train, X_cat_test, X_cat_train = train_test_split(data, 0.3, ind_splitter)
Linear regression with categorical variables
Here we will use our categorical training set and the training response to train a linear regression model.
if run_demo:
cat_lr_predict = lr_predictor(calc_theta(X_cat_train, y_train))
In the cell below, we use our model to make predictions on the training and test sets.
if run_demo:
pred_cat_train = cat_lr_predict(X_cat_train)
pred_cat_test = cat_lr_predict(X_cat_test)
Now we can evaluate the predictions. We will use the metrics r_sq
and r_sq_adj
. We will also plot the predictions against the observations. Blue points have observed values as the "x" coordinate and predicted values as the "y". The orange dots are on the line where predicted values are the same as observed values. You should interpret the horizontal distance between the observations and predictions as the prediction error.
Note: due to the log transformation, the sale prices shown here will be between roughly 11 and 13.
if run_demo:
def plot_predictions(pred, y, p, title):
from matplotlib import pyplot as plt
fig = plt.figure()
fig.suptitle(f'{title} - R_sq: {r_sq(y, pred)}; R_sq_adj: {r_sq_adj(y, pred, p)}')
fig.supxlabel('Observations')
fig.supylabel('Predictions')
plt.scatter(y, pred)
plt.scatter(pred, pred)
plot_predictions(pred_cat_train, y_train, 18, 'Training Set')
plot_predictions(pred_cat_test, y_test, 18, 'Test Set')
These predictions are not great, but it is interesting that the lot configuration, house style, and exterior quality can explain around half of the variance in the sale price. It's also worth noting that there is not much difference in accuracy between the training and test data. The model actually performed a little bit better on the test data!
Linear regression with the first 3 principal components of the numerical variables
Now we will see how the model does with the numerical data. We will determine the parameters necessary to transform the data into its first 3 principal components, then use those to train a linear regression model and make predictions.
if run_demo:
# determine transformation parameters
pca_params = prin_comps(X_num_train, n=3)
# instantiate pca transformer
transformer = pca_transformer(*pca_params)
# transform training and test data
X_num_train_pc = transformer(X_num_train)
X_num_test_pc = transformer(X_num_test)
# train regression model
pca_lr_predict = lr_predictor(calc_theta(X_num_train_pc, y_train))
# make predictions
pred_num_train = pca_lr_predict(X_num_train_pc)
pred_num_test = pca_lr_predict(X_num_test_pc)
Below we will calculate the metrics and plot the predictions and observations in the same way we did for the categorical variables.
if run_demo:
plot_predictions(pred_num_train, y_train, 18, 'Training Set')
plot_predictions(pred_num_test, y_test, 18, 'Test Set')
The numerical predictors performed a bit better, even though we only used the first 3 principal components. Again, it is worth mentioning that there does not appear to be any overfitting, as the model made only slightly better predictions on the training data.
Linear regression using both types of predictors
Below we will combine the numerical and categorical predictors into one data set. If the information contained in the categorical variables captures different qualities than the information in the numerical variables, we may be able to improve model performance even more! However, there is a chance that there is correlation between the two and not much information will be gained by combining.
if run_demo:
# Combine categorical and numerical variables
X_train = np.append(X_num_train_pc, X_cat_train.values, axis=1)
X_test = np.append(X_num_test_pc, X_cat_test.values, axis=1)
# Train regression model
full_predictor = lr_predictor(calc_theta(X_train, y_train))
# Make predictions
pred_train = full_predictor(X_train)
pred_test = full_predictor(X_test)
# Plot predictions
plot_predictions(pred_train, y_train, 18, 'Training Set')
plot_predictions(pred_test, y_test, 18, 'Test Set')
The models with the combined data performed marginally better, even accounting for the additional variables. There was not much improvement, though, which indicates that the categorical variables did not contain much information beyond what the numerical variables already captured.
Fin! - for real this time.