Midterm 2, Fall 2024: Netflix and Bills
Version 1.0.3
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of General Python, Tabular Data, Pandas, SQL and SQLite, and NumPy. It has 10 exercises, numbered 0 to 9. There are 17 available points; however, the threshold to earn 100% is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we do not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or a chunk of rows at a time).
Exercise point breakdown:
Exercise 0 - load_data: 1 point(s)
Exercise 1 - preview_data: 1 point(s) - FREE
Exercise 2 - calculate_daily_growth: 3 point(s)
Exercise 3 - calculate_annual_growth: 2 point(s)
Exercise 4 - golden_cross_strategy: 2 point(s)
Exercise 5 - normalize_stock_growths: 0 point(s) - FREE
Exercise 6 - summarize_netflix_original_genres: 1 point(s)
Exercise 7 - calc_netflix_top10_scores: 2 point(s)
Exercise 8 - longterm_avg_score: 3 point(s)
Exercise 9 - construct_model_inputs: 2 point(s)
Final reminders:
Background. Netflix is a firm in the entertainment industry which has grown to prominence in the streaming era of movies and television. The company distributes both acquired and original programming via a streaming service.
Your overall task. Your goal is to attempt to evaluate whether there is a relationship between the programming that is distributed by Netflix and the financial performance of the firm. You will do this in two main steps: first, you will process the firms' stock data to compute growth metrics (exercises 0-5); second, you will query the database of Netflix programming to score its shows (exercises 6-8).
Once you have obtained the results from both steps, you will use the results to build a model which will attempt to evaluate whether there is a relationship between the programs and the firm's financial standing.
The datasets. There are three sources of data which you will use to solve the following exercises.
The financial data (source 1) is provided as a Python list of dictionaries. You will load this data into a Pandas DataFrame. The daily Netflix data (source 2) and original programming information (source 3) are contained in a SQLite database.
SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
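For example, here is a small, standalone illustration of a dialect difference (run against a throwaway in-memory database; nothing here touches the exam data):
### Optional aside: SQLite dialect check (standalone)
import sqlite3
demo = sqlite3.connect(':memory:')
# String concatenation is '||' in SQLite; T-SQL's '+' operator would not work here
print(demo.execute("SELECT 'net' || 'flix'").fetchone())  # ('netflix',)
demo.close()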
### Global imports
import dill
from cse6040_devkit import plugins, utils
from pprint import pprint, pformat
import numpy as np
import pandas as pd
import sqlite3
import random
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
# Load starting data and define a database connection
stock_data = utils.load_object_from_publicdata('raw_stock_data_nflx.dill')
conn = sqlite3.connect('resource/asnlib/publicdata/NetflixDb.sqlite_db')
load_data
Your task: define load_data as follows:
Load the data contained in records into a Pandas DataFrame. See the instructions below for the requirements.
Inputs:
records: A list of dictionaries. Each dictionary contains information for a single row of data.
Return: full_data: A Pandas DataFrame containing the data held in records. The DataFrame should contain the following columns:
Firm (renamed from the Stock key in records)
Date (converted to dtype datetime64[ns, America/New_York])
Open
Close
Volume
Requirements/steps:
Do not drop any NaN values!
Hints:
We have provided the helper function convert_timezone_types() to make the date conversion easier. convert_timezone_types() takes an input of a pd.Series with dtype "string" and then outputs a pd.Series with dtype "datetime64[ns, America/New_York]". See the demo code for more examples.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
stock_subsample_df_demo_CORRECT = utils.load_object_from_publicdata('stock_subsample_df_demo.dill')
convert_timezone_types = utils.load_object_from_publicdata('convert_timezone_types.dill')
display(stock_subsample_df_demo_CORRECT)
stock_data[:10]
### Solution - Exercise 0
def load_data(records: list) -> pd.DataFrame:
'''
Steps:
1. Examine the input (records) to get a sense of what the data looks like.
2. Create a dataframe with the list of dictionaries.
3. Convert the date column to the data type "datetime64[ns, America/New_York]".
4. Rename the 'Stock' column to be 'Firm'.
5. Return the dataframe.
Note:
If you want to not use the built in function, this is how you would convert the dates:
my_df['Date'] = my_df['Date'].apply(lambda x: pd.to_datetime(x).tz_convert('America/New_York'))
'''
my_df = pd.DataFrame(records)
my_df['Date'] = convert_timezone_types(my_df['Date'])
my_df = my_df.rename(columns={'Stock': 'Firm'})
return my_df
### Demo function call
### `convert_timezone_types` demo ---------------------------------------------
# This helper function will help you change the data-type
# -- Input: pd.Series, with dtype "string"
# -- Output: pd.Series, with dtype "datetime64[ns, America/New_York]"
type_conversion_demo = pd.DataFrame({
'Date': ['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'],
'DemoColumn': [1, 2]
})
# Convert the type!
type_conversion_demo['Date'] = convert_timezone_types(type_conversion_demo['Date'])
assert type_conversion_demo['Date'].dtype == 'datetime64[ns, America/New_York]' # It works!
### `load_data` demo ----------------------------------------------------------
# We'll sample the dataset down to 10 records, just to make debugging easier.
random.seed(6040)
stock_data_subsample = random.sample(stock_data, k=10)
# Here's what your solution produces!
print('Here is what your loaded data looks like in Pandas:')
stock_subsample_df = load_data(stock_data_subsample)
display(stock_subsample_df)
The cell below will test your solution for load_data (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 0
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(load_data),
ex_name='load_data',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to load_data did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
stock_df = utils.load_object_from_publicdata('stock_df')
preview_data
Example: we have defined preview_data as follows:
Preview the three data sources as outlined below.
Inputs:
records: A Pandas DataFrame containing the results of the full_data output generated by Exercise 0.
conn: A SQLite connection to our database.
Return: A tuple, containing the following elements:
records_preview: A Pandas DataFrame containing 10 entries.
nflx_originals_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_originals SQL database table.
nflx_top_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_top SQL database table.
Requirements/steps:
records_preview: If there are fewer than 10 entries in records, all entries should be returned.
nflx_originals_preview: The selected entries may be any 5 rows from the original table.
nflx_top_preview: The selected entries may be any 5 rows from the original table.
A solution is provided for you below. Run it and the test cell to earn your free point, then submit the assignment!
### Solution - Exercise 1
def preview_data(records: list, conn: sqlite3.Connection) -> tuple:
# Keep only the first 10 records
records_preview = records.head(n=10)
# Select everything from nflx_originals and randomly sample 5 rows
nflx_originals_preview = pd\
.read_sql('SELECT * FROM nflx_originals', conn)\
.sample(5)
# Select everything from nflx_top and randomly sample 5 rows
nflx_top_preview = pd\
.read_sql('SELECT * FROM nflx_top', conn)\
.sample(5)
return records_preview, nflx_originals_preview, nflx_top_preview
### Demo function call
records_preview, nflx_originals, nflx_top = preview_data(stock_df, conn)
print('Preview of Stock Data')
display(records_preview)
print('------------------------------------------------------------------------')
print('Preview of Netflix Originals Table')
display(nflx_originals)
print('------------------------------------------------------------------------')
print('Preview of Netflix Top Programs Table')
display(nflx_top)
The test cell below will always pass. Please submit to collect your free points for preview_data (exercise 1).
### Test Cell - Exercise 1
print('Passed! Please submit.')
### Run Me!!!
calculate_daily_growth_demo_input = utils.load_object_from_publicdata('calculate_daily_growth_demo_input')
calculate_daily_growth
Your task: define calculate_daily_growth as follows:
Calculate the daily growth for a given firm, as outlined below.
Inputs:
stock_records: a Pandas DataFrame. It contains the full stock data loaded in exercise 0.
firm: a string specifying a particular firm's stock.
Return: stock_growth: a Pandas DataFrame with the following columns:
Date
Firm
Open
Close
Volume
Day_Growth
Requirements/steps:
Your result should have Day_Growth appended as a new column. The growth should be specific to the firm specified by the firm parameter.
The growth for day $t$ is defined as $(Close_t / Close_{t-1}) - 1$, where $Close_{t-1}$ is the value of Close, offset by 1 position. (A short sketch of this calculation appears after the demo input below.)
Watch out for NaN values when you initially calculate the growth.
Your output (stock_growth) should contain the same number of records as stock_records for that firm.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print("Demo output for Netflix:")
daily_growth_df_CORRECT = utils.load_object_from_publicdata('daily_growth_df.dill')
display(daily_growth_df_CORRECT)
calculate_daily_growth_demo_input.head()
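Before tackling the solution, note that the growth formula above is exactly what pandas' pct_change computes. Here is a minimal sketch on made-up prices (a hint about the formula, not the required approach):
### Hint sketch: daily growth is a percent change (toy prices, not exam data)
toy_close = pd.Series([100.0, 102.0, 99.0])
manual = (toy_close / toy_close.shift(1)) - 1   # the formula above
print(manual.equals(toy_close.pct_change()))    # True; the first entry is NaN in both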
### Solution - Exercise 2
def calculate_daily_growth(stock_records: pd.DataFrame, firm: str) -> pd.DataFrame:
'''
Steps:
1. Examine the input dataframe, stock_records, to understand what the data looks like. For us, that is
calculate_daily_growth_demo_input.
2. Create a new dataframe called df.
a. Create a copy of the stock_records df, so we don't accidentally modify the original dataframe.
b. Filter so we are only looking at the firm specified in the "firm" input string.
c. Create a new column called Day_Growth. I decided to initialize it so every value was 0.0, since this will
help with the days with NaN Open and Close values.
3. Isolate the rows that have valid open / close prices.
a. Create one dataframe that only contains rows with NaN values for Open. We want to be able to merge these
back with our other dataframe in the end.
b. Create another dataframe with NaN rows dropped (for the columns open and close).
Note: for this, I looked at the Pandas documentation for dropna to remember the syntax for looking
at a subset.
4. Utilize the shift() function to easily get the last closing value.
a. Look at the linked documentation for shift and some examples.
b. Use the shift function and print the results to confirm that it is doing what we want it to.
c. Create a new column called "last_close" to hold these shifted values.
Note: I kept getting the SettingWithCopyWarning when doing this. I ignored it until the end and came
back to it. There were two ways that I found to fix it: either adding a .copy() at the end of the
statement creating df_no_na or by using an "assign" function that I found when looking through
StackOverflow. I have included both below.
5. Calculate the Day_Growth column using the formula above: (Close_t / Close_{t-1}) - 1.
a. Since we will want to use multiple columns, I automatically knew that we would want to use an apply
function along the non-default axis. I always forget which axis to use, so I typically try both to
determine which is correct. In this case, it would be axis=1.
b. Use the apply function and then set the Day_Growth column equal to these calculated values.
6. Combine the NaN rows with the rows that we just used (df_no_na).
a. Delete the column "last_close" that we created for temporary storage.
b. Use the concat function to add the na_rows to our dataframe, df_no_na.
c. Sort the values by date to fix the order.
7. Reset the index of stock_growth and return.
When I followed these steps, I had an error. I compared my output with the expected output to realize that the
first row of my df_no_na dataframe had a Day_Growth of NaN due to the nature of the shift function. So, I filled
the NaN rows in df_no_na to fix this problem.
'''
# step 2
df = stock_records.copy()
df = df[df['Firm']==firm]
df['Day_Growth'] = 0.000000
# step 3
na_mask = df['Open'].isna()
na_rows = df[na_mask]
# print(na_rows.head())
df_no_na = df.dropna(subset=['Open', 'Close']).copy()
# step 4
shifted_vals = df_no_na['Close'].shift(periods=1)
# print(shifted_vals)
df_no_na = df_no_na.assign(last_close = shifted_vals.values)
# if using .copy(), could just say df_no_na['last_close'] = list(shifted_vals.values)
# step 5
new_day_growth = df_no_na.apply(lambda x: (x['Close']/x['last_close'])-1, axis=1).copy()
df_no_na = df_no_na.assign(Day_Growth = new_day_growth.values)
# if using .copy(), could just say df_no_na['Day_Growth'] = list(new_day_growth.values)
# step 6
del df_no_na['last_close']
df_no_na.fillna(0, inplace=True) # fill the NaN introduced by shift (see the note above)
stock_growth = pd.concat([na_rows, df_no_na])
stock_growth = stock_growth.sort_values(by='Date')
# step 7
return stock_growth.reset_index(drop=True)
### Demo function call
print("Demo daily growth for Netflix's stock:")
daily_growth_nflx_demo = calculate_daily_growth(
stock_records=calculate_daily_growth_demo_input,
firm='NFLX'
)
display(daily_growth_nflx_demo)
The cell below will test your solution for calculate_daily_growth (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 2
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_daily_growth),
ex_name='calculate_daily_growth',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_daily_growth did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
stock_daily_growths_df = utils.load_object_from_publicdata('stock_daily_growths_df')
stock_daily_growths_reduced_df = utils.load_object_from_publicdata('stock_daily_growths_reduced_df')
calculate_annual_growth
Your task: define calculate_annual_growth as follows:
Calculate the annual growth for every firm and join the results to a copy of the original stock_records DataFrame.
Inputs:
stock_records: a Pandas DataFrame. It contains the results calculated by exercise 2.
Return: annual_growth: a Pandas DataFrame with the same columns as stock_records, along with columns for:
Year
Annual_Growth
Requirements/steps:
Extract Year from the Date column of stock_records. Year is expected to be a string in the final output.
Annual_Growth is the product of each (Day_Growth + 1), grouped by firm and year. Make sure to subtract 1 at the end.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
We have also provided annual_growth_grouped_df_demo, which contains information useful for checking your work. If you wish to use it, uncomment the code below to view it.
You should notice in this result that the same value for Annual_Growth appears in each row for a given Firm and Year. For example, rows where AAPL is the Firm and Year is '2018' all have an Annual_Growth of -0.053902, and NFLX rows for '2020' have 0.671138.
The Annual_Growth is calculated by adding 1 to each Day_Growth observation, multiplying all of these together, and then subtracting 1 from the product.
Mathematically, let $g_{f,y,i}$ represent the Day_Growth value for firm $f$, year $y$, and observation $i$. Then the Annual_Growth value for the same firm and year, $A_{f,y}$, is given by:
$$A_{f,y} = \left( \prod_{i} \left( 1 + g_{f,y,i} \right) \right) - 1$$
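As a quick numeric check of this formula on made-up growth values (not exam data):
### Optional numeric check (toy values, not exam data)
g = np.array([0.01, -0.02, 0.03])   # three hypothetical Day_Growth observations
print(np.prod(1 + g) - 1)           # 1.01 * 0.98 * 1.03 - 1 ≈ 0.019494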
print("Demo annual growth:")
annual_growth_demo_df_CORRECT = utils.load_object_from_publicdata('annual_growth_demo_df.dill')
display(annual_growth_demo_df_CORRECT)
### Uncomment these lines for additional information!
# print("Annual growth, grouped together:")
# annual_growth_grouped_df_demo = utils.load_object_from_publicdata('annual_growth_grouped_df_demo.dill')
# display(annual_growth_grouped_df_demo)
### Solution - Exercise 3
def calculate_annual_growth(stock_records: pd.DataFrame) -> pd.DataFrame:
'''
Steps:
1. Examine the stock_records dataframe to remind us visually the difference between our input and expected
output.
2. Create a new dataframe called annual_growth with a new "Year" column.
a. Create a copy of the input dataframe, stock_records.
b. Extract the year from the Date column.
Googled: "get year from datetime pandas"
Referenced: https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-separately-from-pandas-datetime-column
c. Convert Year to a string, as instructed above.
3. Look at the instructions to get an understanding of how to calculate Annual Growth. The below instruction helps
break it down into steps:
"The Annual_Growth is calculated by adding 1 to each Day_Growth observation. Multiplying all of them
together. Then subtracting 1 from the product."
4. Calculate Annual_Growth.
a. Create a new column called "updated_day" that holds the value of Day_Growth + 1.
b. Group by Firm and Year, multiplying the values of updated_day together.
Googled: "pandas groupby multiply values from group set to new column"
Referenced: https://stackoverflow.com/questions/69471371/pandas-multiply-row-value-by-groupby-of-another-column-as-a-new-column
c. Delete the temporary updated_day column.
d. Subtract 1 from the Annual_Growth.
5. Return the final dataframe after reordering the columns and resetting the index.
'''
# step 2
annual_growth = stock_records.copy()
annual_growth['Year'] = annual_growth['Date'].dt.year
annual_growth['Year'] = annual_growth['Year'].astype(str)
# step 4
annual_growth['updated_day'] = annual_growth.apply(lambda x: x['Day_Growth']+1, axis=1)
annual_growth['Annual_Growth'] = annual_growth.groupby(by=['Firm', 'Year'])['updated_day'].transform('prod')
del annual_growth['updated_day']
annual_growth['Annual_Growth'] = annual_growth['Annual_Growth'].apply(lambda x: x-1)
# step 5
annual_growth = annual_growth[['Date', 'Year', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth', 'Annual_Growth']]
return annual_growth.reset_index(drop=True)
### Demo function call
annual_growth_demo_df = calculate_annual_growth(stock_records=stock_daily_growths_reduced_df)
display(annual_growth_demo_df)
The cell below will test your solution for calculate_annual_growth (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_annual_growth),
ex_name='calculate_annual_growth',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_annual_growth did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
annual_growth_df = utils.load_object_from_publicdata('annual_growth_df')
golden_cross_strategy
Your task: define golden_cross_strategy as follows:
Implement a function which calculates a Golden Cross strategy, as detailed below.
Inputs:
stockdf: a Pandas DataFrame. It contains the results calculated by exercise 3.
firm: a string. It specifies which firm to calculate the golden cross results for.
short_average: an integer. It defines the window for the short moving average. The number represents the number of days to include in the window.
long_average: an integer. It defines the window for the long moving average. The number represents the number of days to include in the window.
Return: golden_cross_results: a Pandas DataFrame with all of the columns from stockdf, plus:
Short_Average
Long_Average
Golden_Cross
Requirements/steps:
Filter the DataFrame to include only records for the firm given by the firm argument.
Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
Add the columns Short_Average and Long_Average to the Pandas DataFrame. These are rolling averages of the Close column, with windows defined by the short_average and long_average arguments.
Add the boolean column Golden_Cross, which is True wherever Short_Average is strictly greater than Long_Average and False otherwise.
You may assume stockdf is already ordered by date in ascending order.
Notes
The Golden Cross (https://www.investopedia.com/terms/g/goldencross.asp) is a technical analysis pattern that indicates a potential bullish trend reversal. It occurs when a faster, short-term moving average crosses above a slower, longer-term moving average. For example, a 50-day moving average (the faster MA) crossing above a 200-day moving average (the slower MA) creates a Golden Cross signal.
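To make the window mechanics concrete, here is a minimal sketch on made-up prices; note that rolling(n).mean() is equivalent to the rolling(n).sum() / n used in the solution below:
### Mechanics sketch: rolling averages on toy prices (not exam data)
toy_close = pd.Series([10.0, 11.0, 12.0, 11.5, 13.0])
short_ma = toy_close.rolling(2).mean()
long_ma = toy_close.rolling(3).mean()
# NaN comparisons evaluate to False, so the leading window-1 rows are never crosses
print((short_ma > long_ma).tolist())  # [False, False, True, True, True]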
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print("Example Golden Cross Demo:")
golden_cross_demo_df_CORRECT = utils.load_object_from_publicdata('golden_cross_demo_df.dill')
display(golden_cross_demo_df_CORRECT.head(15))
### Solution - Exercise 4
def golden_cross_strategy(stockdf: pd.DataFrame, firm: str, short_average: int, long_average: int) -> pd.DataFrame:
'''
Steps:
1. Filter the DataFrame to include records for the firm parameter.
a. Create a copy of the stockdf dataframe.
b. Only include rows where the 'Firm' is equal to the input string firm.
2. Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
a. Use dropna to filter out rows where Close values are NaN.
3. Add the columns Short_Average and Long_Average to the Pandas DataFrame.
a. Look at the linked documentation for how to use the rolling function and look at some examples.
b. Since a Rolling object has no average attribute that I found, manually calculate the average values.
c. Use the assign method to avoid the SettingWithCopyWarning.
4. Add the column 'Golden_Cross', a boolean, to the Pandas DataFrame.
Method 1:
a. Create a Series where values are True if the Short_Average is greater than the Long_Average.
b. Create a new column called 'Golden_Cross' and assign these values to it.
Method 2:
Googled: "pandas new column based on condition of other columns"
Referenced: https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition
a. Use the example from StackOverflow to set the Golden_Cross column to be True if Short_Average >
Long_Average and False if Short_Average <= Long_Average.
b. When checking this, rows where Short_Average and/or Long_Average still have NaN.
Googled: "fill na for one column pandas"
Referenced: https://stackoverflow.com/questions/38134012/pandas-dataframe-fillna-only-some-columns-in-place
c. Fill the NaN values in the Golden_Cross column with False.
5. Finally, reset the indexes before returning the resulting dataframe.
'''
# step 1
df = stockdf.copy()
df = df[df['Firm']==firm]
# step 2
df_notna = df.dropna(subset=['Close'])
# step 3
short_window = (df_notna['Close'].rolling(short_average).sum())/short_average
long_window = (df_notna['Close'].rolling(long_average).sum())/long_average
df_notna = df_notna.assign(Short_Average = short_window.values)
df_notna = df_notna.assign(Long_Average = long_window.values)
# step 4, method 1
golden_cross = df_notna['Short_Average']>df_notna['Long_Average']
df_notna = df_notna.assign(Golden_Cross = golden_cross.values)
# step 4, method 2
# df_notna.loc[df_notna['Short_Average']>df_notna['Long_Average'], 'Golden_Cross'] = True
# df_notna.loc[df_notna['Short_Average']<=df_notna['Long_Average'], 'Golden_Cross'] = False
# df_notna = df_notna.fillna({'Golden_Cross': False})
# step 5
golden_cross_results = df_notna.copy()
return golden_cross_results.reset_index(drop=True)
### Demo function call
golden_cross_demo_df = golden_cross_strategy(
stockdf=annual_growth_df,
firm='NFLX',
short_average=3,
long_average=7)
# Let's only look at the first 15 days for our input
display(golden_cross_demo_df.head(15))
The cell below will test your solution for golden_cross_strategy (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(golden_cross_strategy),
ex_name='golden_cross_strategy',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to golden_cross_strategy did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
golden_crosses_df = utils.load_object_from_publicdata('golden_crosses_df')
golden_crosses_reduced_df = utils.load_object_from_publicdata('golden_crosses_reduced_df')
normalize_stock_growths
Example: we have defined normalize_stock_growths as follows:
Normalize the daily growth of a given firm by comparing it to the growth of the other provided firms.
Inputs:
daily_growth_by_firm: a Pandas DataFrame. It contains many of the variables we have already calculated.
firm: a string. It specifies which firm to calculate the Relative_Growth for.
Return: normalized_results: a Pandas DataFrame with the following columns:
Date, which contains every unique date value in the input.
One column per firm in daily_growth_by_firm, where the value is equal to the growth of that firm's stock for the relevant Date. These values come from the Day_Growth column of daily_growth_by_firm.
Non_<FIRM>_Average_Growth (as defined below)
Requirements/steps:
For Non_<FIRM>_Average_Growth:
For each Date, take the mean of the growths for every firm which is not equal to firm.
For example, if the firms are AAPL, NFLX, and AMZN, and firm=NFLX, then Non_<FIRM>_Average_Growth would be the average growth of AAPL and AMZN for each day.
Replace "<FIRM>" in the column name with the value given by firm. For example, if firm='AAPL', then your column name should be Non_AAPL_Average_Growth.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it. A toy sketch of the pivot-and-average idea follows it.
print('Properly Averaged Industry Growth:')
normalized_returns_demo_df_CORRECT = utils.load_object_from_publicdata('normalized_returns_demo_df.dill')
display(normalized_returns_demo_df_CORRECT)
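To see the pivot-and-average idea on a toy input before reading the provided solution (made-up numbers, not the exam data):
### Mechanics sketch: pivot by firm, then average the other firms (toy data)
toy = pd.DataFrame({
    'Date': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2'],
    'Firm': ['AAPL', 'NFLX', 'AMZN'] * 2,
    'Day_Growth': [0.01, 0.02, -0.01, 0.00, 0.03, 0.01],
})
wide = toy.pivot(index='Date', columns='Firm', values='Day_Growth').reset_index()
wide['Non_NFLX_Average_Growth'] = wide[['AAPL', 'AMZN']].mean(axis=1)
print(wide)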
### Solution - Exercise 5
def normalize_stock_growths(daily_growth_by_firm: pd.DataFrame, firm: str) -> pd.DataFrame:
# Create a list of every firm, except for the one we specified in the inputs
non_chosen_firms = set(daily_growth_by_firm['Firm']) - set([firm])
non_chosen_firms = list(non_chosen_firms)
# Pivot the data for easy manipulation
pivoted_data = daily_growth_by_firm.pivot(
index='Date',
columns='Firm',
values='Day_Growth'
).reset_index()
# Calculate the average growth across every firm that isn't our firm
pivoted_data[f'Non_{firm}_Average_Growth'] = pivoted_data[non_chosen_firms]\
.sum(axis=1) / len(non_chosen_firms)
return pivoted_data
### Demo function call
normalized_returns_demo_df = normalize_stock_growths(golden_crosses_reduced_df, 'NFLX')
display(normalized_returns_demo_df)
The test cell below will always pass. Please submit to collect your free points for normalize_stock_growths (exercise 5).
### Test Cell - Exercise 5
print('Passed! Please submit.')
summarize_netflix_original_genres
Your task: define summarize_netflix_original_genres as follows:
Define the function to return a SQL query which summarizes the trends of Netflix originals by year and genre, as detailed below.
Inputs: None
Return: query
: a Python string, containing a SQLite query. It should query the database to obtain a table with three columns:
year
genre
genre_count
Requirements/steps:
year
variable should be extracted from the premiere
variable. Note that in this case, year will be a string. We suggest you use the STRFTIME SQLite function to extract year
from the premiere
variable.genre_count
(descending)year
(ascending)genre
(ascending)nflx_originals
.Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper summary of genres for the demo:')
netflix_original_genre_summary_df_CORRECT = utils.load_object_from_publicdata('netflix_original_genre_summary_df.dill')
display(netflix_original_genre_summary_df_CORRECT)
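If STRFTIME is unfamiliar, here is a one-line illustration against a throwaway in-memory database (it does not touch the exam's connection):
### STRFTIME illustration (standalone)
demo = sqlite3.connect(':memory:')
print(demo.execute("SELECT STRFTIME('%Y', '2021-03-05')").fetchone()[0])  # '2021', a string
demo.close()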
### Solution - Exercise 6
def summarize_netflix_original_genres()-> str:
'''
Steps:
1. Look at the database table in question: 'nflx_originals'. This helps us understand the structure of the
table that we are working with.
2. Look at the linked documentation for STRFTIME and test a small query with it to make sure it works.
3. Add in the group by and count() statements.
Count the number of Netflix original shows, grouped by year and by genre.
4. Add in the order by statements.
Order them by:
genre_count (descending)
year (ascending)
genre (ascending)
'''
query_1 = 'SELECT * FROM nflx_originals LIMIT 10'
query_2 = f'''SELECT premiere, STRFTIME('%Y', premiere) AS year FROM nflx_originals LIMIT 10'''
query_3 = f'''
SELECT STRFTIME('%Y', premiere) AS year, genre, COUNT(*) AS genre_count
FROM nflx_originals
GROUP BY year, genre
LIMIT 10
'''
query_final = f'''
SELECT STRFTIME('%Y', premiere) AS year, genre, COUNT(*) AS genre_count
FROM nflx_originals
GROUP BY year, genre
ORDER BY genre_count DESC, year, genre
LIMIT 10
'''
return query_final
### Demo function call
summary_query = summarize_netflix_original_genres()
netflix_original_genres = pd.read_sql(summary_query, conn)
display(netflix_original_genres)
The cell below will test your solution for summarize_netflix_original_genres (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(summarize_netflix_original_genres),
ex_name='summarize_netflix_original_genres',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to summarize_netflix_original_genres did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calc_netflix_top10_scores
Your task: define calc_netflix_top10_scores as follows:
Define the function to return a SQL query which calculates the number of times each show appeared in the top 10, the show's score, and the show's average score when it was in the top 10. The concept of a 'score' is defined below.
Inputs: None
Return: query
: a Python string, containing a SQLite query. It should query the database table nflx_top
to obtain a table with the following attributes:
title
: the title of the showtotal_score
: the sum of the show's score.occurrence
: the number of times a show appeared in the top 10.avg_score
: the total_score
divided by the occurrence
value for a given show.Requirements/steps:
avg_score
to produce a floating-point valuetotal_score
in descending ordertitle
in descending orderHints:
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper summary of show score metrics:')
netflix_top_10_scores_df_CORRECT = utils.load_object_from_publicdata('netflix_top_10_scores_df.dill')
display(netflix_top_10_scores_df_CORRECT)
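Why the floating-point cast matters: with two integer operands, SQLite performs integer division. A standalone sketch:
### Integer division sketch (standalone)
demo = sqlite3.connect(':memory:')
print(demo.execute('SELECT 7 / 2, CAST(7 AS float) / 2').fetchone())  # (3, 3.5)
demo.close()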
### Solution - Exercise 7
def calc_netflix_top10_scores() -> str:
'''
Steps:
1. Examine the nflx_top table.
2. Calculate scores for each title.
Note: I did this very inefficiently. The better way to do this (without a CASE WHEN statement) is:
(11 - rank) AS scores
3. Get the sum and average of all of the scores by title.
4. Get the number of times a show appeared in the top 10.
Note: COUNT(expr) counts the rows where the expression is non-NULL. Since the CASE WHEN assigns every row a score,
COUNT(scores>0) here simply counts every top-10 appearance. With the 11-rank approach, COUNT(scores) does the same.
5. Add in the order by statements:
total_score in descending order
title in descending order
Note: I include LIMIT 10 from the beginning to avoid excess printing, since the final query will be limited to 10.
'''
query_1 = 'SELECT * FROM nflx_top LIMIT 10'
query_2 = f'''
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
'''
query_3 = f'''
SELECT title, SUM(scores) AS total_score, AVG(scores) AS avg_score
FROM (
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
)
GROUP BY title
LIMIT 10
'''
query_4 = f'''
SELECT title, SUM(scores) AS total_score, COUNT(scores>0) AS occurrence, AVG(scores) AS avg_score
FROM (
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
)
GROUP BY title
LIMIT 10
'''
query_final = f'''
SELECT title, SUM(scores) AS total_score, COUNT(scores>0) AS occurrence, AVG(scores) AS avg_score
FROM (
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
)
GROUP BY title
ORDER BY total_score DESC, title DESC
LIMIT 10
'''
return query_final
### Demo function call
scores_query = calc_netflix_top10_scores()
netflix_top10_scores = pd.read_sql(scores_query, conn)
display(netflix_top10_scores)
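As an optional sanity check (a sketch, not part of the exam; it assumes nflx_top has the title and rank columns used above), you can recompute the same metrics in pandas and compare against the SQL result:
### Optional cross-check in pandas (assumes conn is open)
top_df = pd.read_sql('SELECT * FROM nflx_top', conn)
top_df['score'] = 11 - top_df['rank']
check = (top_df.groupby('title')['score']
               .agg(total_score='sum', occurrence='count', avg_score='mean')
               .reset_index()
               .sort_values(['total_score', 'title'], ascending=[False, False])
               .head(10))
display(check)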
The cell below will test your solution for calc_netflix_top10_scores (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(calc_netflix_top10_scores),
ex_name='calc_netflix_top10_scores',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calc_netflix_top10_scores did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
longterm_avg_score
Your task: define longterm_avg_score as follows:
Define the function to return a SQL query which calculates the long-term score for each Netflix Original, as defined below.
Inputs: None
Return: query
: a Python string, containing a SQLite query. It should query the database to obtain a table with the following attributes:
title
- (string): the titles of the Netflix Originalstotal_score
- (int): the sum of all scores for each title given in nflx_originals
date_count
- (int): the number of dates that a given title is available, as detailed belowlongterm_avg_score
- (float): the long-term average score, defined as:longterm_avg_score = total_score/date_count
Requirements/steps:
nflx_originals
nflx_top
title
on a given date, where the score is defined as:date_count
, count the number of date
values in nflx_top
that exist for a title
, where the date
is greater than or equal to the premiere
for that title
<distinct_dates>
: Identify all distinct date values in nflx_top
.<title_dates>
: For each title in nflx_originals
, identify any dates in <distinct_dates>
which are greater than or equal to the title's premiere.<title_date_score>
: For each title/date pair in <title_dates>
, identify any records from nflx_top
which have the same title/date pair.rank
field from nflx_top
to calculate a given score.<title_date_score>
but not in nflx_top
should not be dropped and should have a NULL (score) value in <title_date_score>
.<title_date_score>
, calculate:total_score
- compute the sum of all the score values sharing a common title.date_count
- compute the count of all date values sharing a common title.longterm_avg_score
- compute the quotient of the total_score over date_count.longterm_avg_score
that are NULL.longterm_avg_score
in descending ordertitle
in ascending orderHints:
An Example:
Let's take a deeper look specifically at the show Vikings: Valhalla.
According to the nflx_originals table, the show premiered on 2022-02-25.
The DataFrame longterm_avg_vikings_breakout_df_CORRECT below uses the results from the nflx_top table for the show titled Vikings: Valhalla.
In nflx_top for Vikings: Valhalla, the dates greater than or equal to the premiere date range from 2022-02-25 to 2022-03-11.
The show was not ranked on its premiere date, so Vikings: Valhalla does not have a score for that date. Vikings: Valhalla received its first ranking on 2022-02-26. Therefore, its score for 2022-02-26 would be 11 minus its rank on that date.
longterm_avg_vikings_breakout_df_CORRECT below shows a table with the show title, premiere, date, rank, and score for all dates in the range from 2022-02-25 to 2022-03-11 for Vikings: Valhalla.
We can then calculate total_score, date_count, and longterm_avg_score for the show Vikings: Valhalla:
longterm_avg_vikings_breakout_df_CORRECT = utils.load_object_from_publicdata('longterm_avg_vikings_breakout_df.dill')
display(longterm_avg_vikings_breakout_df_CORRECT)
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
longterm_avg_scores_df = utils.load_object_from_publicdata('longterm_avg_scores_df.dill')
display(longterm_avg_scores_df)
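If it helps to reason about the recommended approach outside SQL, here is a rough pandas translation (an illustrative sketch only; it assumes pandas >= 1.2 for how='cross' and ISO-formatted date/premiere strings, the same assumption the SQL comparison makes):
### Optional: rough pandas analogue of the recommended approach (sketch only)
originals = pd.read_sql('SELECT title, premiere FROM nflx_originals', conn)
top = pd.read_sql('SELECT title, date, rank FROM nflx_top', conn)
distinct_dates = top[['date']].drop_duplicates()                           # <distinct_dates>
title_dates = originals.merge(distinct_dates, how='cross')
title_dates = title_dates[title_dates['date'] >= title_dates['premiere']]  # <title_dates>
tds = title_dates.merge(top, on=['title', 'date'], how='left')             # <title_date_score>
tds['score'] = 11 - tds['rank']                                            # NaN where never ranked
result = (tds.groupby('title')
             .agg(total_score=('score', lambda s: s.sum(min_count=1)),  # all-NaN -> NaN, like SQL SUM
                  date_count=('date', 'count'))
             .reset_index())
result['longterm_avg_score'] = result['total_score'] / result['date_count']
result = (result.dropna(subset=['longterm_avg_score'])
                .sort_values(['longterm_avg_score', 'title'], ascending=[False, True])
                .head(10))
display(result)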
# NOTE: This is my original query before the recommended approach was added.
query_final = f'''
WITH total_score_calc AS (
SELECT nflx_originals.title, SUM(scores) AS total_score, nflx_release_date
FROM (
SELECT
title,
(11 - rank) AS scores,
nflx_release_date
FROM nflx_top
) AS top_scores
JOIN nflx_originals ON nflx_originals.title = top_scores.title
GROUP BY nflx_originals.title, nflx_release_date
),
poss_days AS (
SELECT
premiere,
COUNT(DISTINCT date) AS num_poss
FROM nflx_top
CROSS JOIN nflx_originals
WHERE date >= premiere
GROUP BY premiere
ORDER BY premiere DESC
)
SELECT title, total_score, num_poss AS date_count,
CAST(total_score AS float) / CAST(num_poss AS float) AS longterm_avg_score
FROM total_score_calc
JOIN poss_days ON premiere = nflx_release_date
ORDER BY longterm_avg_score DESC, title ASC
LIMIT 10
'''
### Solution - Exercise 8
def longterm_avg_score():
'''
Steps:
1. Look at the two tables that we will be querying: nflx_originals and nflx_top.
2. Obtain <distinct_dates>: Identify all distinct date values in nflx_top.
3. Obtain <title_dates>: For each title in nflx_originals, identify any dates in <distinct_dates> which are
greater than or equal to the title's premiere.
4. Obtain <title_date_score>: For each title/date pair in <title_dates>, identify any records from nflx_top
which have the same title/date pair.
a. Use the rank field from nflx_top to calculate a given score.
b. Any title/date pairs which are in <title_date_score> but not in nflx_top should not be dropped and should
have a NULL (score) value in <title_date_score>.
5. Calculate the final result.
a. For each title in <title_date_score>, calculate:
total_score - compute the sum of all the score values sharing a common title.
date_count - compute the count of all date values sharing a common title.
longterm_avg_score - compute the quotient of the total_score over date_count.
b. Filter out all records with longterm_avg_score that are NULL.
c. Finally, the results should be ordered by:
longterm_avg_score in descending order
The show's title in ascending order
d. Return the top 10 rows
'''
query_1a = 'SELECT * FROM nflx_top LIMIT 10'
query_1b = 'SELECT * FROM nflx_originals LIMIT 10'
query_2 = f''' SELECT DISTINCT date FROM nflx_top '''
query_3 = f'''
WITH distinct_dates AS (
SELECT DISTINCT date FROM nflx_top
)
SELECT
title,
premiere,
date
FROM nflx_originals
CROSS JOIN distinct_dates
WHERE premiere <= date
LIMIT 10
'''
query_4 = f'''
WITH title_dates AS (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
)
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
'''
query_5a = f'''
WITH title_date_score AS (
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
) title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
)
SELECT
title,
SUM(scores) AS total_score,
COUNT(date) AS date_count,
CAST(SUM(scores) AS DECIMAL)/COUNT(date) AS longterm_avg_score
FROM title_date_score
GROUP BY title
'''
query_5b = f'''
WITH title_date_score AS (
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
) title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
)
SELECT
title,
SUM(scores) AS total_score,
COUNT(date) AS date_count,
CAST(SUM(scores) AS DECIMAL)/COUNT(date) AS longterm_avg_score
FROM title_date_score
GROUP BY title
HAVING longterm_avg_score IS NOT NULL
'''
query_5c = f'''
WITH title_date_score AS (
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
) title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
)
SELECT
title,
SUM(scores) AS total_score,
COUNT(date) AS date_count,
CAST(SUM(scores) AS float)/COUNT(date) AS longterm_avg_score
FROM title_date_score
GROUP BY title
HAVING longterm_avg_score IS NOT NULL
ORDER BY longterm_avg_score DESC, title ASC
LIMIT 10
'''
return query_5c
### Demo function call
query = longterm_avg_score()
longterm_avg_scores = pd.read_sql(query, conn)
display(longterm_avg_scores)
The cell below will test your solution for longterm_avg_score (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(longterm_avg_score),
ex_name='longterm_avg_score',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to longterm_avg_score did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
nflx_scores_demo_df = utils.load_object_from_publicdata('nflx_scores_demo_df')
normalized_growths_df = utils.load_object_from_publicdata('normalized_growths_df')
construct_model_inputs
Your task: define construct_model_inputs as follows:
Define the model matrix and response for our regression model, as shown below.
Inputs:
normalized_growths_df: a Pandas DataFrame. It contains the results from exercise 5.
show_scores_df: a Pandas DataFrame. It contains the pre-computed scores for each show on a given date.
Return: model_matrix: a Pandas DataFrame containing the independent variables (show scores) and response variable (Relative_Growth).
Requirements/steps:
Pivot show_scores_df so that each show is represented as its own column. The index should reflect the dates. The values should be set to the show's score. Fill any resulting NaN values with 0. (A toy pivot sketch appears after the example output below.)
Join the pivoted scores with the Relative_Growth values from normalized_growths_df by date; neither date column should appear in model_matrix.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper model matrix for the demo:')
model_matrix_demo_df_CORRECT = utils.load_object_from_publicdata('model_matrix_demo_df.dill')
display(model_matrix_demo_df_CORRECT)
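The pivot is the crux of this exercise; here is a tiny sketch on made-up scores (hypothetical titles, not exam data):
### Mechanics sketch: pivoting toy show scores
toy_scores = pd.DataFrame({
    'date': ['d1', 'd1', 'd2'],
    'title': ['Show A', 'Show B', 'Show A'],
    'score': [10, 9, 8],
})
# Missing (date, title) pairs become NaN after the pivot; fillna(0) zeroes them
print(toy_scores.pivot(index='date', columns='title', values='score').fillna(0))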
### Solution - Exercise 9
def construct_model_inputs(normalized_growths_df: pd.DataFrame, show_scores_df: pd.DataFrame) -> pd.DataFrame:
'''
Steps:
1. Examine the two input dataframes and read the linked documentation for pivot.
2. Pivot the show_scores_df as described above.
a. Each show is represented as its own column. The index should reflect the dates. The values should be set
to the show's score.
columns: shows (title)
index: dates
values: score
b. Fill in the NaN values with 0s.
c. Reset the index of the pivoted dataframe keeping the index as a column.
3. Create the dataframe "model_matrix".
a. Keep only the Date and Relative Growth values from the normalized_growths_df dataframe.
b. Join the pivoted dataframe with normalized_growths_df on date.
c. Delete the unnecessary columns 'date' and 'Date'.
4. Return the final dataframe.
'''
# step 2
pivoted_df = show_scores_df.pivot(index='date', columns='title', values='score')
pivoted_df.fillna(0, inplace=True)
pivoted_df = pivoted_df.reset_index(drop=False)
# step 3
normalized_small_df = normalized_growths_df[['Date', 'Relative_Growth']]
model_matrix = pd.merge(normalized_small_df, pivoted_df, left_on = 'Date', right_on='date')
del model_matrix['date']
del model_matrix['Date']
# step 4
return model_matrix
### Demo function call
model_matrix = construct_model_inputs(normalized_growths_df, nflx_scores_demo_df)
display(model_matrix)
The cell below will test your solution for construct_model_inputs (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 9
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=construct_model_inputs,
ex_name='construct_model_inputs',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to construct_model_inputs did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
If you have made it this far, congratulations! You are done. Don't forget to submit the assignment!
If you run the regression model below with all of the data collected, you would obtain summary model results like the following:
##############################################
### The code below is provided for illustration and reproducibility but is commented out due to the additional processing time it adds to the auto-grader. Note: the model was run on all NFLX growth data and top shows data.
##############################################
# Let's use step-wise regression to do feature selection
mm=model_matrix.copy(deep=True)
y = mm['Relative_Growth']
del mm['Relative_Growth']
X = mm
import statsmodels.api as sm
model = LinearRegression()
feature_selector = SequentialFeatureSelector(model, n_features_to_select=min(len(X.columns)-1,10))
feature_selector.fit(X, y)
selected_X = X[feature_selector.get_feature_names_out()]
selected_X = sm.add_constant(selected_X)
print(selected_X.columns)
selected_model = sm.OLS(y, selected_X)
results = selected_model.fit()
results.summary()
So, how does our model do?
Step-wise regression is a greedy algorithm which iteratively selects the best possible features for improving the performance of our model. We tried to find the top-10 shows which were most predictive of the normalized stock values and found the following shows:
Note that these associations don't necessarily mean that the shows caused the stock value to change. For example, The Royal Treatment was significantly associated with a negative trend in Netflix's stock growth.
A better question might be: "how reliable are these results?"
Ultimately, the performance of the model constructed from the input provided by exercise 9 is relatively poor. This model exhibits an adjusted R² value of only 0.042. Predicting stock prices is intrinsically hard, as there is an exceptionally high amount of variance in each observation and there are many reasons for fluctuations in market prices. Therefore, it is unsurprising to see that a lot of the variance remains.
We are also fighting "the curse of dimensionality," which means the excess data we have can be more of a hindrance than a help. However, the F-statistic for our model is quite good (with a p-value of 0.000684), which suggests that our model does a much better job at predicting stock value changes than a naive model.
You will learn more about various modeling techniques and how to use them during the rest of the course. Working to improve the regression results with the results of this exam is left as an exercise to the students; for high-quality regression outputs, you may find the statsmodels package particularly useful. Here's one approach you might find worth exploring:
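The original example is not reproduced here, but one hedged sketch of such an approach is L1 (lasso) regularization, which shrinks most show coefficients to zero and directly targets the dimensionality problem noted above. The alpha value below is purely illustrative and would need tuning:
### One possible follow-up (a sketch, not the official solution)
import statsmodels.api as sm
y = model_matrix['Relative_Growth']
X = sm.add_constant(model_matrix.drop(columns=['Relative_Growth']))
# L1-regularized least squares; alpha is a hypothetical penalty strength to tune
lasso = sm.OLS(y, X).fit_regularized(method='elastic_net', alpha=0.001, L1_wt=1.0)
coef = pd.Series(np.asarray(lasso.params), index=X.columns)
print(coef[coef != 0].sort_values(ascending=False))  # the shows that survive the penalty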