Final Exam: Fall 2021 - Housing Prices

Version 1.0

This exam will test your understanding of PCA and Linear Regression along with manipulation of Pandas, Numpy, and base Python data structures. There are 8 exercises, numbered 0 to 7. There are 15 available points. However, to earn 100%, the threshold is just 13 points. (Therefore, once you hit 13 points, you can stop. There is no extra credit for exceeding this threshold.)

  • Exercise 0: 1 point
  • Exercise 1: 1 point
  • Exercise 2: 2 points
  • Exercise 3: 2 points
  • Exercise 4: 2 points
  • Exercise 5: 2 points
  • Exercise 6: 3 points
  • Exercise 7: 2 points

Each exercise builds logically on the previous one, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. One demo cell calls a function defined in an earlier demo - the rest are self-contained. If you see the comment # demo - used in later demo cell, be sure to run that cell before moving on.

For more information see the "Final Exam Release Notes" post on the discussion forum.

Good luck!

In [4]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###

import pandas as pd
import numpy as np
import run_tests
from time import time
from importlib import reload

time_start = time()

Introduction

Today, we will be working with data on home sales. We will try to use information about the individual homes to build a linear regression model to predict the sale prices. There are challenges to this approach, which will need to be addressed.

  • There are missing values in several columns of the data. Our linear regression strategy will not be able to handle this, so we will have to assign values to these observations or remove rows which contain missing values.
  • There are categorical variables in the data. The data will have to be manipulated so that all values are numerical.
  • The data has a large number of columns (especially after handling categorical data). This can add instability to the model, so we want to use fewer columns if possible.
  • We will have to compare several models and determine which is the "best".

Metrics for evaluating model fit

Before we can go about choosing a model from many options, we must decide how to compare them. One common metric for evaluating linear regression models is the "coefficient of determination", often referred to as $R^2$. We define it in terms of the following quantities:

  • $y$ - column vector of the observed responses (for our purposes, home sale price).
  • $\hat{y}$ - vector of predicted responses (for our purposes, predicted home sale price).
  • $n$ - number of observations.
  • $\bar{y}$ - mean of observed responses - $\frac{1}{n}\sum_{i=0}^{n-1} y_i$
  • $SSR$ - Sum of Squares of Residuals - $\sum_{i=0}^{n-1}(y_i - \hat{y}_i)^2$
  • $SST$ - Sum of Squares Total - $\sum_{i=0}^{n-1}(y_i - \bar{y})^2$
  • $R^2$ - Coefficient of determination - $1 - \frac{SSR}{SST}$

$R^2$ measures the share of variance in the response which is accounted for by the model. However, this metric is not effective for comparing models with different numbers of parameters: adding parameters can never decrease $R^2$. To compare models of different sizes, there is a modified metric, $R^2_{adj}$, which penalizes larger models.

  • $p$ - number of model parameters
  • $R^2_{adj}$ - adjusted coefficient of determination - $1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - \frac{SSR\,(n-1)}{SST\,(n-p-1)}$

Exercise 0: (1 point) - Given Numpy arrays representing the observed responses (y) and predicted responses (y_hat), complete the function signature for r_sq to compute $R^2$. Return your answer as a Python float.

In [5]:
def r_sq(y, y_hat):
    ###
    
    SSR = np.sum((y - y_hat) ** 2)
    
    y_bar = np.mean(y)
    SST = np.sum((y - y_bar) ** 2)
    
    R_sq = 1 - SSR / SST
    
    return R_sq
    ###

The demo cell below should output approximately 0.93690140. These are the values for the intermediate calculations.

  • $\bar{y} = 6.5$
  • $SST = 35.5$
  • $SSR = 2.24$
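
Plugging these intermediate values into the formula reproduces the expected demo output:

$$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{2.24}{35.5} \approx 0.93690$$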
In [6]:
# demo
demo_y_0 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_0 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
r_sq(demo_y_0, demo_y_hat_0)
Out[6]:
0.9369014084507041
In [7]:
# exercise 0 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###

reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex0(r_sq)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[7]:
2.2715511322021484
In [8]:
# load testing variables for `y`, `y_hat`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, r_sq_true, r_sq_output

Exercise 1: (1 point) - Given Numpy arrays representing the observed responses (y), predicted responses (y_hat), and the number of parameters in the model (p), complete the function signature for r_sq_adj to compute $R^2_{adj}$. If you successfully solved Exercise 0, you can use r_sq in your solution to this exercise; however, you can also pass by directly implementing the formula.

Here's the formula again:

  • $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - \frac{SSR\,(n-1)}{SST\,(n-p-1)}$
In [9]:
def r_sq_adj(y, y_hat, p):
    ###
    # Option 1)
    n = len(y)
    R2 = r_sq(y, y_hat)
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)    
    
    # Option 2)
#     n = len(y)
#     ssr = np.sum((y - y_hat) ** 2)
#     y_bar = np.mean(y)
#     sst = np.sum((y - y_bar) ** 2)
#     R2_adj = 1 - (ssr * (n - 1)) / (sst * (n - p - 1))
    
    return R2_adj
    ###

The demo cell below should output approximately 0.89483568. These are the values for the intermediate calculations.

  • $\bar{y} = 6.5$
  • $SST = 35.5$
  • $SSR = 2.24$
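
Plugging these values in, with $n = 6$ observations and $p = 2$ parameters, reproduces the expected demo output:

$$R^2 = 1 - \frac{2.24}{35.5} \approx 0.93690, \qquad R^2_{adj} = 1 - (1 - 0.93690)\frac{6 - 1}{6 - 2 - 1} \approx 0.89484$$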
In [10]:
# demo
demo_y_1 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_1 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
demo_p_1 = 2
r_sq_adj(demo_y_1, demo_y_hat_1, demo_p_1)
Out[10]:
0.8948356807511736
In [11]:
# exercise 1 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex1(r_sq_adj)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[11]:
34.014039039611816
In [12]:
# load testing variables for `y`, `y_hat`, `p`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, p_check, r_sq_adj_true, r_sq_adj_output

Cleaning the Data

Exercise 2: (2 points) - The data set has some missing values, columns which are the wrong type, and columns which we are not interested in. Complete the function signature for clean to return a new DataFrame that is a copy of df with the following modifications based on the Python dictionary, params. The values of params are all Python lists of column names.

  • Drop all columns in params['cols_to_drop'].
  • Convert the values in all columns in params['cols_to_string'] to Python strings.
  • Convert missing values in all columns in params['cols_to_zero'] to 0.0, while leaving all other values intact.
  • Convert missing values in all columns in params['cols_to_none'] to the Python string 'None' - Don't forget the quotation marks!
  • Drop any rows which still have missing values after all of the above modifications have been made.

Notes:

  • You can assume that params will always have all of the keys mentioned above.
  • You can assume that any column included in params['cols_to_string'] does not have missing values.
  • You can assume that the params lists are mutually exclusive - i.e. a column included in params['cols_to_string'] will not be in any of the other lists.

For example, given the following DataFrame input:

   foo_drop  bar_drop  foo_zero foo_none bar_none  foo_str  foo  bar
0       1.0       0.0       2.0     foo1      bar        1  foo  bar
1       2.0       NaN       4.0     foo2      bax        2  foo  bar
2       3.0       NaN       NaN     foo3      qux        3  foo  bar
3       4.0       NaN       NaN     foo4      bar        4  foo  bar
4       5.0       NaN       3.0      NaN      bax        5  foo  bar
5       6.0       1.0       4.0      NaN      qux        6  foo  bar
6       NaN       2.0       5.0     foo1      bar        7  foo  bar
7       NaN       3.0       6.0     foo2      bax        8  foo  NaN
8       NaN       4.0       7.0     foo3      qux        9  NaN  bar
9       NaN       5.0       1.0     foo4      NaN       10  foo  bar

With these params:

{
    'cols_to_drop': ['foo_drop', 'bar_drop'],
    'cols_to_zero': ['foo_zero'],
    'cols_to_none': ['foo_none', 'bar_none'],
    'cols_to_string': ['foo_str']
}

The function call to clean should return these results.

   foo_zero foo_none bar_none foo_str  foo  bar
0       2.0     foo1      bar       1  foo  bar
1       4.0     foo2      bax       2  foo  bar
2       0.0     foo3      qux       3  foo  bar
3       0.0     foo4      bar       4  foo  bar
4       3.0     None      bax       5  foo  bar
5       4.0     None      qux       6  foo  bar
6       5.0     foo1      bar       7  foo  bar
9       1.0     foo4     None      10  foo  bar

Notice that the columns 'foo_drop' and 'bar_drop' were dropped entirely, 'foo_zero' had missing values replaced with 0.0 in rows 2 and 3, 'foo_none' and 'bar_none' had missing values replaced with 'None' in rows 4, 5, and 9, the numbers in 'foo_str' were converted to strings, and rows 7 and 8 were dropped because of missing values in 'foo' and 'bar'.

The demo cell below should generate the same output.

In [13]:
def clean(df, params):
    ###
    res = df.copy()
    res = res.drop(params['cols_to_drop'], axis=1)
    res[params['cols_to_string']] = res[params['cols_to_string']].astype('str')
    res[params['cols_to_zero']] = res[params['cols_to_zero']].fillna(0.0)
    res[params['cols_to_none']] = res[params['cols_to_none']].fillna('None')
    res.dropna(inplace=True)
    return res
    ###
In [14]:
# demo
demo_params = {
    'cols_to_drop': ['foo_drop', 'bar_drop'],
    'cols_to_zero': ['foo_zero'],
    'cols_to_none': ['foo_none', 'bar_none'],
    'cols_to_string': ['foo_str']
}

demo_df_2 = pd.DataFrame(
    {
        'foo_drop': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan, np.nan],
        'bar_drop': [0, np.nan, np.nan, np.nan, np.nan, 1, 2, 3, 4, 5],
        'foo_zero': [2., 4., np.nan, np.nan, 3., 4., 5., 6., 7., 1],
        'foo_none': ['foo1', 'foo2', 'foo3', 'foo4', np.nan, np.nan, 'foo1', 'foo2', 'foo3', 'foo4'],
        'bar_none': ['bar', 'bax', 'qux', 'bar', 'bax', 'qux', 'bar', 'bax', 'qux', np.nan],
        'foo_str': [1,2,3,4,5,6,7,8,9,10],
        'foo': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', np.nan, 'foo'],
        'bar': ['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', np.nan, 'bar', 'bar' ]
    }
)
print(clean(demo_df_2, demo_params))
   foo_zero foo_none bar_none foo_str  foo  bar
0       2.0     foo1      bar       1  foo  bar
1       4.0     foo2      bax       2  foo  bar
2       0.0     foo3      qux       3  foo  bar
3       0.0     foo4      bar       4  foo  bar
4       3.0     None      bax       5  foo  bar
5       4.0     None      qux       6  foo  bar
6       5.0     foo1      bar       7  foo  bar
9       1.0     foo4     None      10  foo  bar
In [15]:
# exercise 2 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex2(clean)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[15]:
77.06055521965027
In [16]:
# load testing variables for `df`, `params`, `true_output`, `your_output`
from run_tests import df_check_in, params_check, df_check_out, clean_check

Partitioning the Data

The data has categorical predictors as well as numerical predictors. We have to treat each type separately to get meaningful predictions from our model. As such, we need to partition the data, separating the numerical predictors, the categorical predictors, and the target variable. Additionally, the relationship between a predictor and the target variable is sometimes not linear. It may be appropriate to transform one or both to obtain the desired relationship. The most common transformation is the log transformation, where the natural logarithm of a particular variable is used for modeling in lieu of the variable itself.

Exercise 3: (2 points) - Given a DataFrame (df), the name of the column containing the responses (target), and a boolean value (log), complete the function signature for partition_cols. The function should output the following:

  • Y - Numpy array - Vector holding the response values. Y should be log-transformed based on the truth-value of log.
  • X_num - Numpy array - Matrix holding the values of all numerical columns of df aside from the target column.
  • X_cat_df - DataFrame - A copy of all the categorical columns of df.

Implementation details:

  • "Log transformed" means that each individual value, yi, is transformed to log(yi).
  • If log is True, Y should be log transformed. Otherwise, no transformation should be performed.
  • Numerical columns will have a "numerical" dtype (i.e. some flavor of int or float). All columns which do not have numerical dtypes should be considered categorical.

For example, given the following DataFrame input:

   foo  bar  baz          kag  tar
0  abf    1  2.0   148.413159  zyx
1  def    3  4.0    54.598150  wvu
2  ghi    5  6.0    20.085537  tsr
3  jkl    7  8.0  1096.633158  qpo

With 'kag' as the target and log = True, a call to partition_cols should return the following output:

(array([5., 4., 3., 7.]),
 array([[1., 2.],
        [3., 4.],
        [5., 6.],
        [7., 8.]]), 
    foo  tar 
 0  abf  zyx
 1  def  wvu
 2  ghi  tsr
 3  jkl  qpo)

Notice that the output is a tuple with three elements. The first is an array of the log transformed values in the target column, 'kag'. The second is an array of the values in the numerical columns 'bar' and 'baz'. The third is a DataFrame with the categorical columns 'foo' and 'tar'.

The demo cell below should return the same results.

In [64]:
def partition_cols(df, target, log=True):
    ###
    # log-transform the response if requested
    Y = np.log(df[target]) if log else df[target]
    Y = Y.to_numpy()
    
    # numerical predictor columns (dropping the target) as a Numpy array
    X_num = df.drop(target, axis=1).select_dtypes(include='number').to_numpy()
    # everything that is not numerical is treated as categorical
    X_cat_df = df.select_dtypes(exclude='number').copy()
    ###
    return Y, X_num, X_cat_df
In [65]:
# demo
demo_df_3 = pd.DataFrame(
    {
        'foo': ['abf', 'def', 'ghi', 'jkl'],
        'bar': [1,3,5,7],
        'baz': [2,4.,6,8],
        'kag': [np.exp(5), np.exp(4), np.exp(3), np.exp(7)],
        'tar': ['zyx', 'wvu', 'tsr', 'qpo']
    }
)
demo_target = 'kag'
demo_log = True
partition_cols(demo_df_3, demo_target, demo_log)
Out[65]:
(array([5., 4., 3., 7.]),
 array([[1., 2.],
        [3., 4.],
        [5., 6.],
        [7., 8.]]),
    foo  tar
 0  abf  zyx
 1  def  wvu
 2  ghi  tsr
 3  jkl  qpo)
In [66]:
# exercise 3 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex3(partition_cols)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[66]:
786.5038559436798
In [67]:
# load testing variables for `df`, `target`, `log`, `true_Y_output`, `your_Y_output`, `true_X_num_output`, `true_X_cat_df_output`, `your_X_cat_df_output`
from run_tests import df_check_in, target_check, do_log, Y_check, Y_out, X_num_check, X_num_out, X_cat_df_check, X_cat_df_out

One-hot Encoding

To use categorical variables in regression, they must be converted into some numerical form. One strategy for doing this is "one-hot" encoding. To accomplish this, several new variables are created - one for each unique value of the original variable. For each observation, a 1 is assigned to the new variable corresponding to the observation's original value, and the rest are assigned 0. This strategy is preferred over assigning each unique original value a number because the categorical variables don't have any "order" or "value".

Exercise 4: (2 points) - Given a DataFrame (df), complete the function signature for one_hot_all to return a new DataFrame which is a copy of df with all categorical columns replaced with a one-hot encoded version of the column (note that a single column will be replaced by multiple columns). Use the guidelines below for performing the one-hot encoding for each categorical column.

  • As in exercise 3 - all columns with a non-numerical dtype should be considered categorical.

For each categorical column:

  • Determine the unique values in the column.
  • Create a new column for each unique value named <original column name>_<unique value>.
  • In each row, if the original value is <unique value>, then <original column name>_<unique value> should have an entry of 1, and all other new columns should have entries of 0.

All of the new columns should have a dtype of 'int64' - the columns which are not one-hot encoded should keep their original dtype.

Hint - There is a Pandas method which will accomplish most of what needs to be done in this exercise. We won't name it, but feel free to search the web and use whatever you find.

For example, given the input DataFrame:

   foo  bar  baz
0  baz  tar  2.3
1  qux  kag  3.0
2  baz  kag  4.0
3  qux  tuk  1.1
4  baz  tar  6.0
5  qux  kag  8.0

Notice that there are two categorical columns (you can check for yourself using dtypes).

  • 'foo' has unique values 'baz' and 'qux', so the output should have columns 'foo_baz' and 'foo_qux'.
  • 'bar' has unique values 'kag', 'tar', and 'tuk', so the output should have columns 'bar_kag', 'bar_tar', and 'bar_tuk'.

The result has replaced the columns 'foo' and 'bar' with the one-hot encoded versions:

   baz  foo_baz  foo_qux  bar_kag  bar_tar  bar_tuk
0  2.3        1        0        0        1        0
1  3.0        0        1        1        0        0
2  4.0        1        0        1        0        0
3  1.1        0        1        0        0        1
4  6.0        1        0        0        1        0
5  8.0        0        1        1        0        0

The demo cell below should produce the same output.

In [60]:
def one_hot_all(df):
    ###
    return pd.get_dummies(data=df, dtype='int64')
    ###
In [61]:
# demo
demo_df_4 = pd.DataFrame(
    {
        'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'], 
        'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag'], 
        'baz': [2.3, 3, 4, 1.1, 6, 8]
    }
)
print(demo_df_4)
print(one_hot_all(demo_df_4))
   foo  bar  baz
0  baz  tar  2.3
1  qux  kag  3.0
2  baz  kag  4.0
3  qux  tuk  1.1
4  baz  tar  6.0
5  qux  kag  8.0
   baz  foo_baz  foo_qux  bar_kag  bar_tar  bar_tuk
0  2.3        1        0        0        1        0
1  3.0        0        1        1        0        0
2  4.0        1        0        1        0        0
3  1.1        0        1        0        0        1
4  6.0        1        0        0        1        0
5  8.0        0        1        1        0        0
In [62]:
# exercise 4 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex4(one_hot_all)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[62]:
733.012992143631
In [63]:
# load testing variables for `df`, `true_output`, and `your_output`
from run_tests import df_check, oh_check, oh_out

Train-test split

It is common practice to randomly partition the available data into a training set and a test set to more accurately gauge how well a model will perform on new data. It is possible (and somewhat likely) that models will fit some of the randomness or "noise" in the data which is not actually explained by the predictive variables.

The cell below defines a function ind_splitter. It takes two inputs:

  • m - the number of rows being split.
  • pct - a float indicating the percentage of rows to be assigned to the test set.

The function ind_splitter returns two lists:

  • test_inds - a Python list of the indices which will belong to the test set.
  • train_inds - a Python list of the indices which will belong to the training set.
In [25]:
# demo - used in later demo cell
def ind_splitter(m, pct):
    n = int(pct*m)
    rng = np.random.default_rng(6040)
    inds = np.arange(m)
    rng.shuffle(inds)
    return inds[:n], inds[n:] # test indices, training indices

Exercise 5: (2 points) - Given a tuple (data) containing some number of either DataFrames or Numpy arrays, the percentage desired in the test partition, and a function (splitter) which will decide the indices belonging to each partition, fill out the function definition for train_test_split. The function should do the following:

  • Make a single call to splitter to determine which indices (rows) belong to which partition. splitter will take the same arguments as ind_splitter and return two list-like structures. The first is the indices of the test set, the second is the indices of the training set.
  • Use the outputs of splitter to return a tuple that has the test and training partitions for each data structure in data. There will be two entries in the output for each entry in data.
  • If the input data is of the form (data0, data1, data2), then the output should be of the form (data0_test, data0_train, data1_test, data1_train, data2_test, data2_train). There can be any number of entries in data, but the output should give the test then the training partition of each entry before moving on to the next entry.
  • You can assume that all elements of data will have the same number of rows. This is checked with the assert statement inside the for-loop given in the starter code. Feel free to use the variable m in your solution.

Note:

  • splitter has the same inputs and outputs as the ind_splitter example above. However, the test code will not use ind_splitter. Your solution must call splitter to pass.
  • Additionally, the order of the data in the test and training sets (both column and row) must match the ordering determined by splitter. Do not reset any DataFrame indexes or shuffle any columns. This requirement is necessary to ensure that the relation between each piece of data is maintained properly when partitioning.

For example, consider the following tuple:

(array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]]),
    foo  bar
0  baz  tar
1  qux  kag
2  baz  kag
3  qux  tuk
4  baz  tar
5  qux  kag)

It has two entries - the first is an array and the second is a DataFrame. Notice how they both have the same number of rows. Using pct = 0.4 and ind_splitter as our splitter a call to train_test_split should produce the following output.

(array([[ 9, 10, 11],
        [ 0,  1,  2]]),
 array([[ 3,  4,  5],
        [15, 16, 17],
        [12, 13, 14],
        [ 6,  7,  8]]),
    foo  bar
 3  qux  tuk
 0  baz  tar,
    foo  bar
 1  qux  kag
 5  qux  kag
 4  baz  tar
 2  baz  kag)

Notice that rows 3 and 0 are assigned to the test partition for both entries in data. The ordering of the rows has been changed, but it is consistent between the array and the DataFrame.

The demo cell below should produce the same output.

In [26]:
def train_test_split(data, pct, splitter=ind_splitter):
    m = data[0].shape[0]
    for d in data:
        assert d.shape[0] == m
    ###
    test_ind, train_ind = splitter(m, pct)
    res = []
    
    for dataset in data:
        if isinstance(dataset, pd.DataFrame):
            res.append(dataset.iloc[test_ind, :])
            res.append(dataset.iloc[train_ind, :])
        else:
            res.append(dataset[test_ind])
            res.append(dataset[train_ind])
    
    return tuple(res)
    ###
In [27]:
# demo - calls function defined in earlier demo cell
demo_df_5 = pd.DataFrame(
    {
        'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'], 'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag']
    }
)
demo_arr_5 = np.arange(18).reshape((6,3))
demo_pct_5 = 0.4
demo_data_5 = (demo_arr_5, demo_df_5)
train_test_split(demo_data_5, demo_pct_5, ind_splitter)
Out[27]:
(array([[ 9, 10, 11],
        [ 0,  1,  2]]),
 array([[ 3,  4,  5],
        [15, 16, 17],
        [12, 13, 14],
        [ 6,  7,  8]]),
    foo  bar
 3  qux  tuk
 0  baz  tar,
    foo  bar
 1  qux  kag
 5  qux  kag
 4  baz  tar
 2  baz  kag)
In [28]:
# exercise 5 test cell
###
### AUTOGRADER TEST - DO NOT REMOVE
###
reload(run_tests)
for _ in np.arange(10):    
    run_tests.chk_ex5(train_test_split)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[28]:
169.05935788154602
In [29]:
# load testing variables for `data`, `pct`, `splitter`, `true_output`, `your_output`
from run_tests import data_check, pct_check, splitter_check, true_output, your_output

Column Standardization, PCA and Linear Regression

To attempt to reduce our data's dimensionality, we will use Principal Component Analysis (PCA). However, PCA works best when each data column is "standardized", i.e. has a mean of 0 and a standard deviation of 1.

To standardize a matrix $X$, suppose that $x_i$ is a column of $X$. Also, suppose that $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $x_i$. To calculate the standardized column $x_i^{std}$, use this formula.

$$x_i^{std} = \frac{x_i - \mu_i}{\sigma_i}$$

To refresh your memory on PCA: Suppose our data is in a matrix $X$ with $m$ rows and $n$ columns. Recall the Singular Value Decomposition (SVD) of $X$:

$$X = U \Sigma V^T$$

Each row of $V^T$ (column of $V$) is a principal component of $X$. We can use the diagonal of $\Sigma$ to compute the fraction of the variance explained by each principal component as well. Let $\sigma_i$ be the $i$-th entry on the diagonal of $\Sigma$. Then to compute the fraction of the variance of $X$ accounted for by the $i$-th principal component, use the formula below.

$$\text{Fraction of variance} = \frac{\sigma_i^2}{\sum_{k=0}^{n-1} \sigma_k^2}$$

To transform our data into the principal component space, we simply multiply $X V_k$, where $V_k$ is the matrix of the first $k$ principal components.

To use the same column standardization and PCA transformation on new data, we need the vector of column means, the vector of column standard deviations, and the principal components. With these, we can apply the same transformation to new observations and use them to evaluate models built on the transformed data.

Exercise 6: (3 points) - Complete the function prin_comps to meet the following requirements.

  • Use the formulas above to standardize the columns of X and compute its SVD.
  • prin_comps should return the column mean vector (Numpy array), the column sample standard deviation vector (Numpy array), and a number of principal components to be determined by the rules below.
    • If pct is not None use the SVD to determine the minimum value of k such that the first k principal components account for at least pct of the variance. Return the first k principal components.
    • If n is not None return the first n principal components.
    • If both pct and n are None return all of the principal components.
    • You can assume the case where both pct and n are not None will not occur - this is checked with an assertion in the starter code.
  • Notes: Return the principal components as a slice of $V^T$ - do not use a transpose to modify the output from numpy.linalg.svd (which you should use for the SVD calculation).
  • Your solution must compute the sample standard deviations (i.e. 1 degree of freedom) for each column - this is not the default for np.std. Check the documentation for details on how to set the degrees of freedom. There are other libraries (like scipy.stats) which can also calculate column sample standard deviations, but you will need to check the documentation to ensure that the degrees of freedom are what's intended.

There is no simplified example or demo for this exercise. However, this exercise incorporates concepts almost directly from Notebook 15.

  • SVD calculation - 15.1.1
  • Calculating number of components necessary to account for a fraction of variance - 15.1.5
  • Note: 15.1.5 involves finding the number of components necessary to have an error less than a specified amount. Here we want the fraction of variance accounted for (1 - error) to be at least pct.
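
To make those steps concrete, here is a minimal sketch (not part of the exam; it uses a small made-up matrix and a 90% variance threshold, and the names X_demo, frac_var, and cum_var are introduced just for this illustration) of standardizing columns with sample standard deviations, computing the SVD, and finding the smallest k that reaches the threshold - the same ingredients prin_comps needs.

# illustrative sketch only - small made-up matrix, 90% variance threshold
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(8, 4))                  # 8 observations, 4 columns

mu = X_demo.mean(axis=0)                               # column means
sigma = X_demo.std(axis=0, ddof=1)                     # column *sample* standard deviations
X_std_demo = (X_demo - mu) / sigma                     # standardized columns

U_demo, S_demo, VT_demo = np.linalg.svd(X_std_demo)    # rows of VT_demo are the principal components

frac_var = S_demo**2 / np.sum(S_demo**2)               # fraction of variance per component
cum_var = np.cumsum(frac_var)                          # cumulative fraction of variance
k = int(np.argmax(cum_var >= 0.9)) + 1                 # smallest k reaching the 90% threshold
print(k, VT_demo[:k, :].shape)                         # the first k principal components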
In [30]:
def prin_comps(X, pct=None, n=None):
    assert not (pct is not None and n is not None), 'Can\'t set both of these'
    assert pct is None or (0 < pct <= 1), 'Invalid pct. Must be between 0 and 1 or `None`.'
    assert n is None or (type(n) is int and 1 <= n <= X.shape[1]), 'n must be integer between 1 and number of columns in X'
    ###
    m = np.mean(X, axis=0)
    s = np.std(X, axis=0, ddof=1)

    X_std = (X - m) / s

    U, S, Vh = np.linalg.svd(X_std)
    
    if pct is not None:
        # fraction of variance explained by each principal component
        frac_var = S ** 2 / np.sum(S ** 2)
        # accumulate until we reach the requested fraction of variance
        acc = 0
        for i, frac in enumerate(frac_var):
            acc += frac
            if acc >= pct:
                k = i + 1
                break
                
        VTk = Vh[:k, :]
    elif n is not None:
        VTk = Vh[:n, :]
    else:
        VTk = Vh
    ###
    return m, s, VTk # means, standard deviations, and principal components
    
In [31]:
# exercise 6 test cell
reload(run_tests)
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
run_tests.chk_ex6(prin_comps, 50)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[31]:
396.2844936847687
In [32]:
# load testing variables `pca_test`, `your_m_output`, `your_s_output`, `your_VTk_output`
# `pca test` is a dictionary with keys ['X', 'pct', 'n', 'm', 's', 'VTk'] corresponding to 
# inputs and required outputs given in the problem description
from run_tests import pca_test, m_out, s_out, VTk_out

We have defined pca_transformer and calc_theta below. Take the time to look at them, because Exercise 7 will incorporate elements from both.

  • pca_transformer - You have seen above that functions are able to take other functions as arguments. Well, they can also return functions! This function returns the function transform, which uses the parameters from pca_transformer as constants and takes a single parameter itself. The idea is that pca_transformer can be used to create a function that projects a new matrix (X) into the same principal component space that was used to create m, s, and VTk.
In [33]:
# demo
def pca_transformer(m, s, VTk):
    def transform(X):
        # standardize with the stored means/standard deviations, then project
        # onto the principal components
        X_std = (X - m) / s
        return X_std.dot(VTk.T)
    return transform 
  • calc_theta - This function takes a data matrix X and a response vector y and computes the regression coefficient vector theta. Notice how the "column of ones" is added to the left of X when X_p is created.
In [34]:
# demo
def calc_theta(X, y):
    from scipy.linalg import solve_triangular
    # add intercept
    m = X.shape[0]
    X_p = np.append(np.ones((m,1)), X, axis=1)
    # compute regression parameter
    Q, R = np.linalg.qr(X_p)
    b = Q.T.dot(y)
    theta = solve_triangular(R, b)
    return theta
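
To see how these two helpers fit together, here is a small illustrative sketch on made-up data (the names X_toy, y_toy, to_pc, and theta_toy are ours, not part of the exam; the epilogue at the end of the notebook does the same thing with the real housing data). Exercise 7 supplies the last missing piece: a function that turns theta into a predictor.

# illustrative sketch only - chain prin_comps, pca_transformer, and calc_theta on made-up data
rng_toy = np.random.default_rng(1)
X_toy = rng_toy.normal(size=(20, 5))                               # 20 observations, 5 predictors
y_toy = X_toy @ rng_toy.normal(size=5) + rng_toy.normal(size=20)   # made-up responses

m_toy, s_toy, VTk_toy = prin_comps(X_toy, n=2)                     # keep the first 2 components
to_pc = pca_transformer(m_toy, s_toy, VTk_toy)                     # function that projects new data
X_toy_pc = to_pc(X_toy)                                            # 20 x 2 matrix in PC space

theta_toy = calc_theta(X_toy_pc, y_toy)                            # intercept plus 2 coefficients
print(theta_toy.shape)                                             # (3,)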

Exercise 7: (2 points) - Given a linear regression parameter (theta), complete the function lr_predictor to meet these requirements.

  • Return a function which takes one parameter (a data matrix as a Numpy array). This function will augment the data matrix and then compute regression estimates based on theta.

To compute predictions based on the data matrix X and a coefficient vector θ, complete the following steps:

  • Create an augmented data matrix $X_{augment}$ by concatenating a column of ones to the left of X. This is not the same orientation as seen in Notebook 12 or the lecture videos.
  • Perform the matrix-vector multiplication $X_{augment}\,\theta$. The prediction vector is the result.

For example, if given the input np.array([1,2,3]) for theta, a call to lr_predictor should produce a function which makes predictions on new data.

demo_predictor = lr_predictor(np.array([1,2,3]))

Here demo_predictor is a function that will make predictions based on theta. We can make predictions on the following data.

X0 = np.array([[112.89440365,  65.62227416],
 [111.84947876,  58.21425899],
 [ 99.58399277,  72.10462519],
 [ 89.75919282,  80.56690485],
 [106.24550229,  79.71511156]])

X1 = np.array([[102.37686976,  73.0361458 ],
 [108.56479013,  70.62355242],
 [ 96.92681804,  56.06983186],
 [101.33747938,  66.66655047],
 [ 93.75484396, 43.88275971]])

X2 = np.array([[106.42275221,  62.33434278],
 [108.38137453,  67.76061615],
 [112.14494972,  64.73565864],
 [110.38170377,  64.6506213],
 [113.28106798,  75.74780834]])

The demo below makes the following calls and should produce the same output.

demo_predictor(X0) -> [423.65562978 399.34173449 416.48186111 422.21910019 452.63633926]
demo_predictor(X1) -> [424.86217692 430.00023752 363.06313166 403.67461017 320.15796705]
demo_predictor(X2) -> [400.84853276 421.04459751 419.49687536 415.71527144 454.80556098]

In [35]:
def lr_predictor(theta):
    def predict(X):
        ###
        ones = np.ones((X.shape[0], 1))
       
        X_aug = np.hstack((ones, X))
        
        return X_aug @ theta
        ###
    return predict
In [36]:
#demo
demo_predictor = lr_predictor(np.array([1,2,3]))

X0 = np.array([[112.89440365,  65.62227416],
 [111.84947876,  58.21425899],
 [ 99.58399277,  72.10462519],
 [ 89.75919282,  80.56690485],
 [106.24550229,  79.71511156]])

X1 = np.array([[102.37686976,  73.0361458 ],
 [108.56479013,  70.62355242],
 [ 96.92681804,  56.06983186],
 [101.33747938,  66.66655047],
 [ 93.75484396, 43.88275971]])

X2 = np.array([[106.42275221,  62.33434278],
 [108.38137453,  67.76061615],
 [112.14494972,  64.73565864],
 [110.38170377,  64.6506213],
 [113.28106798,  75.74780834]])
print('demo_predictor(X0) ->', demo_predictor(X0), '\ndemo_predictor(X1) ->', demo_predictor(X1), '\ndemo_predictor(X2) ->', demo_predictor(X2))
demo_predictor(X0) -> [423.65562978 399.34173449 416.48186111 422.21910019 452.63633926] 
demo_predictor(X1) -> [424.86217692 430.00023752 363.06313166 403.67461017 320.15796705] 
demo_predictor(X2) -> [400.84853276 421.04459751 419.49687536 415.71527144 454.80556098]
In [37]:
# exercise 7 test cell
reload(run_tests)
###
### AUTOGRADER TEST - DO NOT REMOVE
###
###
### AUTOGRADER TEST - DO NOT REMOVE
###
run_tests.chk_ex7(lr_predictor, 50)
print('Passed. Please submit your work now.')
time() - time_start
Passed. Please submit your work now.
Out[37]:
432.3785603046417
In [38]:
from run_tests import lr_test, pred_check

Fin. You have reached the end of the exam. Make sure you have submitted your work and congratulations on finishing the semester!

The additional content below is optional.

Epilogue

We defined 8 useful functions in the exercises above. Here, we will show how to use these functions to perform a simple regression analysis on some real estate sales data. Set the variable run_demo to True in order to run the demonstration code in the cells below.

Note: the demonstration below will not run correctly if you have not solved all of the exercises in the exam.

In [68]:
# set `run_demo` to `True` if you want to enable the demo epilogue in your work area
run_demo = True

Loading the data
Below we will load the data from a file and take a look at the columns.

In [69]:
if run_demo:
    df_raw = pd.read_csv('resource/asnlib/publicdata/train.csv')
    print(df_raw.columns)
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

Determining age
One preprocessing step that is not covered by the functions we wrote above is computing the age of the home. In the raw data, we have the year that the home was last remodeled. It makes sense to compute the age by subtracting the remodel year from the year the home was sold and to use this value instead of the raw years. We will do this in the cell below.

In [70]:
if run_demo:
    df_raw['Age'] = df_raw['YrSold'] - df_raw['YearRemodAdd']

Cleaning the data
Based on an examination of the data descriptions we have elected to drop the columns 'GarageYrBlt', 'Id', and 'MoSold' because they won't add useful information to our model. Additionally, we are dropping 'YearBuilt' and 'YearRemodAdd' because they are incorporated into our age calculation.

The column 'MSSubClass' has all numerical values, but it is actually categorical. We need to change the type of this column so that it will be interpreted as such.

Now we will check which columns have missing values so that we can deal with them appropriately.

In [71]:
if run_demo:
    print(df_raw.loc[:, df_raw.isnull().sum() > 0].columns)
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
       'MiscFeature'],
      dtype='object')

The columns 'LotFrontage' and 'MasVnrArea' are numerical and have missing values. Aside from 'GarageYrBlt' (which we are dropping), the remaining columns with missing values are categorical. We will replace the numerical missing values with 0.0 and the categorical ones with 'None'. We do not want to include homes without electricity in our analysis at all.

Below, we set our cleaning parameters to meet our requirements and run clean. Note that 'Electrical' is not included because we want to drop any rows with missing values in that column.

In [72]:
if run_demo:
    params = {
        'cols_to_drop': ['GarageYrBlt', 'Id', 'MoSold', 'YearBuilt', 'YearRemodAdd'],
        'cols_to_zero': ['LotFrontage', 'MasVnrArea'],
        'cols_to_none': ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond',
            'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 
            'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 
            'PoolQC', 'Fence', 'MiscFeature'],
        'cols_to_string': ['MSSubClass']
    }
    df_clean = clean(df_raw, params)

Partitioning the data
Next we need to separate our data into the response, the numerical variables, and the categorical variables. We will be performing the log transform on the response variable. We should also split each of these into training and test sets. We will hold out roughly 30% for testing.

There are a lot of categorical variables. One-hot encoding them all would give us a very high dimensional model. We used an off-the-shelf recursive feature elimination with cross-validation algorithm to select the most informative categorical variables - 'LotConfig', 'HouseStyle' and 'ExterQual'. Selecting these three categorical variables will give us 17 one-hot variables.
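
As an aside, the sketch below shows one way such a selection could be done with scikit-learn's RFECV, using a plain linear regression estimator over all one-hot encoded categorical columns. This is only an illustration, not necessarily the exact procedure we used; it assumes scikit-learn is available, and the names y_sel, X_cat_sel, and X_cat_sel_oh are introduced here just for the sketch.

# illustrative sketch only - recursive feature elimination with cross-validation
# over the one-hot encoded categorical columns (may take a little while to run)
if run_demo:
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LinearRegression

    y_sel, _, X_cat_sel = partition_cols(df_clean, 'SalePrice', True)
    X_cat_sel_oh = one_hot_all(X_cat_sel)                    # one-hot encode every categorical column
    selector = RFECV(LinearRegression(), step=5, cv=5)       # step=5 keeps the search reasonably fast
    selector.fit(X_cat_sel_oh.values, y_sel)

    kept = X_cat_sel_oh.columns[selector.support_]           # one-hot columns the search kept
    kept_original = sorted({c.rsplit('_', 1)[0] for c in kept})  # map back to original columns
    print(kept_original)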

In [73]:
if run_demo:
    # separate response, numerical, and categorical data
    y, X_num, X_cat = partition_cols(df_clean, 'SalePrice', True)
    # one-hot encode categorical data
    X_cat_oh = one_hot_all(X_cat[['LotConfig', 'HouseStyle', 'ExterQual']])
    # construct `data` tuple and partition into training and test
    data = (y.reshape((-1,1)), X_num, X_cat_oh)
    y_test, y_train, X_num_test, X_num_train, X_cat_test, X_cat_train = train_test_split(data, 0.3, ind_splitter)

Linear regression with categorical variables
Here we will use our categorical training set and the training response to train a linear regression model.

In [74]:
if run_demo:
    cat_lr_predict = lr_predictor(calc_theta(X_cat_train, y_train))

In the cell below, we use our model to make predictions on the training and test sets.

In [75]:
if run_demo:
    pred_cat_train = cat_lr_predict(X_cat_train)
    pred_cat_test = cat_lr_predict(X_cat_test)

Now we can evaluate the predictions. We will use the metrics r_sq and r_sq_adj. We will also plot the predictions against the observations. Blue points have observed values as the "x" coordinate and predicted values as the "y". The orange dots are on the line where predicted values are the same as observed values. You should interpret the horizontal distance between the observations and predictions as the prediction error.

Note: due to the log transformation, the plotted sale prices will fall roughly between 11 and 13 (about $60,000 to $440,000 on the original scale).

In [76]:
if run_demo:
    def plot_predictions(pred, y, p, title):
        from matplotlib import pyplot as plt
        fig = plt.figure()
        fig.suptitle(f'{title} - R_sq: {r_sq(y, pred)}; R_sq_adj: {r_sq_adj(y, pred, p)}')
        fig.supxlabel('Observations')
        fig.supylabel('Predictions')
        plt.scatter(y, pred)
        plt.scatter(pred, pred)
    plot_predictions(pred_cat_train, y_train, 18, 'Training Set')
    plot_predictions(pred_cat_test, y_test, 18, 'Test Set')

These predictions are not great, but it is interesting that the lot configuration, house style, and exterior quality alone can explain around half of the variance in the sale price. It's also worth noting that there is not much difference in accuracy between the training and test data. The model actually performed a little bit better on the test data!

Linear regression with the first 3 principal components of the numerical variables
Now we will see how the model does with the numerical data. We will determine the parameters necessary to transform the data into its first 3 principal components, then use those to train a linear regression model and make predictions.

In [77]:
if run_demo:
    # determine transformation parameters
    pca_params = prin_comps(X_num_train, n=3)
    # instantiate pca transformer
    transformer = pca_transformer(*pca_params)
    # transform training and test data
    X_num_train_pc = transformer(X_num_train)
    X_num_test_pc = transformer(X_num_test)
    # train regression model
    pca_lr_predict = lr_predictor(calc_theta(X_num_train_pc, y_train))
    # make predictions
    pred_num_train = pca_lr_predict(X_num_train_pc)
    pred_num_test = pca_lr_predict(X_num_test_pc)

Below we will calculate the metrics and plot the predictions and observations in the same way we did for the categorical variables.

In [78]:
if run_demo:
    plot_predictions(pred_num_train, y_train, 18, 'Training Set')
    plot_predictions(pred_num_test, y_test, 18, 'Test Set')

The numerical predictors performed a bit better, even though we only used the first 3 principal components. Again, it is worth mentioning that there does not appear to be any overfitting, as the model made slightly better predictions on the test data!

Linear regression using both types of predictors
Below we will combine the numerical and categorical predictors into one data set. If the information contained in the categorical variables captures different qualities than the information in the numerical variables, we may be able to improve model performance even more! However, there is a chance that the two are correlated, in which case not much information will be gained by combining them.

In [79]:
if run_demo:
    # Combine categorical and numerical variables
    X_train = np.append(X_num_train_pc, X_cat_train.values, axis=1)
    X_test = np.append(X_num_test_pc, X_cat_test.values, axis=1)
    # Train regression model
    full_predictor = lr_predictor(calc_theta(X_train, y_train))
    # Make predictions
    pred_train = full_predictor(X_train)
    pred_test = full_predictor(X_test)
    # Plot predictions
    plot_predictions(pred_train, y_train, 18, 'Training Set')
    plot_predictions(pred_test, y_test, 18, 'Test Set')

The model with the combined data performed marginally better, even accounting for the additional variables. There was not much improvement, though, which indicates that the categorical variables did not contain much information beyond what was already in the numerical variables.

Fin! - for real this time.