Midterm 2, Fall 2024: Netflix and Bills
Version 1.0.3
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of General Python, Tabular Data, Pandas, SQL and SQLite, and NumPy. It has 10 exercises, numbered 0 to 9. There are 17 available points; however, the threshold to earn 100% is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Exercise 0: 1 point(s)
Exercise 1: 1 point(s) - FREE
Exercise 2: 3 point(s)
Exercise 3: 2 point(s)
Exercise 4: 2 point(s)
Exercise 5: 0 point(s) - FREE
Exercise 6: 1 point(s)
Exercise 7: 2 point(s)
Exercise 8: 3 point(s)
Exercise 9: 2 point(s)
Final reminders:
Background. Netflix is a firm in the entertainment industry which has grown to prominence in the streaming era of movies and television. The company distributes both acquired and original programming via a streaming service.
Your overall task. Your goal is to attempt to evaluate whether there is a relationship between the programming that is distributed by Netflix and the financial performance of the firm. You will do this in two main steps:
Once you have obtained the results from both steps, you will use the results to build a model which will attempt to evaluate whether there is a relationship between the programs and the firm's financial standing.
The datasets. There are three sources of data which you will use to solve the following exercises.
The financial data (source 1) is provided as a Python list of dictionaries. You will load this data into a Pandas DataFrame. The daily Netflix data (source 2) and original programming information (source 3) are contained in a SQLite database.
SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
### Global imports
import dill
from cse6040_devkit import plugins, utils
from pprint import pprint, pformat
import numpy as np
import pandas as pd
import sqlite3
import random
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
# Load starting data and define a database connection
stock_data = utils.load_object_from_publicdata('raw_stock_data_nflx.dill')
conn = sqlite3.connect('resource/asnlib/publicdata/NetflixDb.sqlite_db')
load_data
Your task: define load_data as follows:
Load the data contained in records into a Pandas DataFrame. See the instructions below for the requirements.
Inputs:
- records: A list of dictionaries. Each dictionary contains information for a single row of data.
Return: full_data: A Pandas DataFrame containing the data held in records. The DataFrame should contain the following columns:
- Firm (renamed from the Stock key)
- Date, with dtype datetime64[ns, America/New_York]
- Open
- Close
- Volume
Requirements/steps:
- Do not drop any NaN values!
Hints:
- We have provided a helper, convert_timezone_types(), to make this easier. convert_timezone_types() takes an input of a pd.Series with dtype "string" and then outputs a pd.Series with dtype "datetime64[ns, America/New_York]". See the demo code for more examples.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
stock_subsample_df_demo_CORRECT = utils.load_object_from_publicdata('stock_subsample_df_demo.dill')
convert_timezone_types = utils.load_object_from_publicdata('convert_timezone_types.dill')
display(stock_subsample_df_demo_CORRECT)
### Solution - Exercise 0
def load_data(records: list) -> pd.DataFrame:
### BEGIN SOLUTION
# A four-line solution, for the TAs --------------------------------------------
# full_data = pd.DataFrame(records)
# full_data['Date'] = convert_timezone_types(full_data['Date'])
# full_data = full_data.rename(columns={'Stock': 'Firm'})
# return full_data
# ------------------------------------------------------------------------------
# Build the full DataFrame
full_data = pd.DataFrame(records)
# Convert the data type (Here are two suggested approaches)
### APPROACH 1: Pass in a series
date_vector_1 = convert_timezone_types(full_data['Date'])
### APPROACH 2: Use .apply and timezone conversion ---------------------
# date_vector_2 = full_data['Date'].apply(convert_timezone_types)
### They should be the same thing!
# assert (date_vector_1 == date_vector_2).all()
full_data['Date'] = date_vector_1
# Rename the columns
full_data = full_data.rename(columns={'Stock': 'Firm'})
# Return our dataframe
return full_data
### END SOLUTION
### Demo function call
### `convert_timezone_types` demo ---------------------------------------------
# This helper function will help you change the data-type
# -- Input: pd.Series, with dtype "string"
# -- Output: pd.Series, with dtype "datetime64[ns, America/New_York]"
type_conversion_demo = pd.DataFrame({
'Date': ['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'],
'DemoColumn': [1, 2]
})
# Convert the type!
type_conversion_demo['Date'] = convert_timezone_types(type_conversion_demo['Date'])
assert type_conversion_demo['Date'].dtype == 'datetime64[ns, America/New_York]' # It works!
### `load_data` demo ----------------------------------------------------------
# We'll sample the dataset down to 10 records, just to make debugging easier.
random.seed(6040)
stock_data_subsample = random.sample(stock_data, k=10)
# Here's what your solution produces!
print('Here is what your loaded data looks like in Pandas:')
stock_subsample_df = load_data(stock_data_subsample)
display(stock_subsample_df)
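If you are curious how a helper like convert_timezone_types might work internally, here is a minimal sketch. The name convert_timezone_types_sketch and the parsing approach are assumptions for illustration only; use the provided helper in your solution.

```python
import pandas as pd

def convert_timezone_types_sketch(s: pd.Series) -> pd.Series:
    # Parse the offset-bearing date strings into UTC timestamps,
    # then convert them to the America/New_York timezone.
    return pd.to_datetime(s, utc=True).dt.tz_convert('America/New_York')

demo = pd.Series(['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'])
converted = convert_timezone_types_sketch(demo)
print(converted.dtype)  # datetime64[ns, America/New_York]
```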
The cell below will test your solution for load_data (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars: Input variables for your solution.
- original_input_vars: Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars: Outputs returned by your solution.
- true_output_vars: The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 0
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(load_data),
ex_name='load_data',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to load_data did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(load_data),
ex_name='load_data',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to load_data did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
stock_df = utils.load_object_from_publicdata('stock_df')
preview_data
Example: we have defined preview_data as follows:
Preview the three data sources as outlined below.
Inputs:
- records: A Pandas DataFrame containing the full_data output generated by Exercise 0.
- conn: A SQLite connection to our database.
Return: A tuple, containing the following elements:
- records_preview: A Pandas DataFrame containing 10 entries.
- nflx_originals_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_originals SQL database table.
- nflx_top_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_top SQL database table.
Requirements/steps:
- records_preview: If there are fewer than 10 entries in records, all entries should be returned.
- nflx_originals_preview: The selected entries may be any 5 rows from the original table.
- nflx_top_preview: The selected entries may be any 5 rows from the original table.
A solution is provided for you below. Run it and the test cell to earn your free point, then submit the assignment!
### Solution - Exercise 1
def preview_data(records: list, conn: sqlite3.Connection) -> tuple:
    # Take the first 10 rows of the records
records_preview = records.head(n=10)
# Select everything from nflx_originals and randomly sample 5 rows
nflx_originals_preview = pd\
.read_sql('SELECT * FROM nflx_originals', conn)\
.sample(5)
# Select everything from nflx_top and randomly sample 5 rows
nflx_top_preview = pd\
.read_sql('SELECT * FROM nflx_top', conn)\
.sample(5)
return records_preview, nflx_originals_preview, nflx_top_preview
### Demo function call
records_preview, nflx_originals, nflx_top = preview_data(stock_df, conn)
print('Preview of Stock Data')
display(records_preview)
print('------------------------------------------------------------------------')
print('Preview of Netflix Originals Table')
display(nflx_originals)
print('------------------------------------------------------------------------')
print('Preview of Netflix Top Programs Table')
display(nflx_top)
The test cell below will always pass. Please submit to collect your free points for preview_data (exercise 1).
### Test Cell - Exercise 1
print('Passed! Please submit.')
### Run Me!!!
calculate_daily_growth_demo_input = utils.load_object_from_publicdata('calculate_daily_growth_demo_input')
calculate_daily_growth
Your task: define calculate_daily_growth as follows:
Calculate the daily growth for a given firm, as outlined below.
Inputs:
- stock_records: a Pandas DataFrame. It contains the full stock data loaded in exercise 0.
- firm: a string specifying a particular firm's stock.
Return: stock_growth: a Pandas DataFrame with the following columns:
- Date
- Firm
- Open
- Close
- Volume
- Day_Growth
Requirements/steps:
- The output is the input data for the given firm, with Day_Growth appended as a new column. The growth should be specific to the firm specified by the firm parameter.
- Let $\text{close}_{-x}$ denote the value of Close, offset by $x$. So, $\text{close}_{-1}$ is the value of Close, offset by 1 position (that is, the previous day's close).
- Fill with 0 any NaN values that appear when you initially calculate the growth.
- The output (stock_growth) should contain the same number of records as stock_records for that firm.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
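The lag-and-ratio idea can be seen in isolation on a tiny, hypothetical price series (these numbers are made up, not the exam data), using shift to build the previous close:

```python
import pandas as pd

# Hypothetical closing prices, in date order
closes = pd.Series([100.0, 110.0, 99.0])
yest_close = closes.shift(1)              # Close offset by 1 position
day_growth = (closes / yest_close) - 1    # first value is NaN (no prior close)
day_growth = day_growth.fillna(0)         # fill the initial NaN with 0
print(day_growth.round(4).tolist())  # [0.0, 0.1, -0.1]
```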
print("Demo output for Netflix:")
daily_growth_df_CORRECT = utils.load_object_from_publicdata('daily_growth_df.dill')
display(daily_growth_df_CORRECT)
### Solution - Exercise 2
def calculate_daily_growth(stock_records: pd.DataFrame, firm: str) -> pd.DataFrame:
### BEGIN SOLUTION
# Filter the DataFrame for the stock
stockdf = stock_records[stock_records['Firm']==firm].copy(deep=True)
# Remove NA values in a copy.
# We'll join everything back together at the end.
stockdf2 = stockdf.copy(deep=True)
stockdf2 = stockdf2.dropna()
# Lag the close and calculate the daily growth
stockdf2['Yest_Close'] = stockdf2.groupby('Firm')['Close'].shift(1)
stockdf2['Day_Growth'] = (stockdf2['Close']/stockdf2['Yest_Close']) - 1
# Merge everything back together and fill missing values with 0
output = pd.merge(
stockdf,
stockdf2,
how='outer',
left_index=True,
right_index=True,
suffixes=(None, '_y'))[
['Date', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth']
]
output['Day_Growth'] = output['Day_Growth'].fillna(0)
# Return the result
return output.reset_index(drop=True)
### END SOLUTION
### Demo function call
print("Demo daily growth for Netflix's stock:")
daily_growth_nflx_demo = calculate_daily_growth(
stock_records=calculate_daily_growth_demo_input,
firm='NFLX'
)
display(daily_growth_nflx_demo)
The cell below will test your solution for calculate_daily_growth (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars: Input variables for your solution.
- original_input_vars: Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars: Outputs returned by your solution.
- true_output_vars: The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 2
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_daily_growth),
ex_name='calculate_daily_growth',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_daily_growth did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_daily_growth),
ex_name='calculate_daily_growth',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_daily_growth did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
stock_daily_growths_df = utils.load_object_from_publicdata('stock_daily_growths_df')
stock_daily_growths_reduced_df = utils.load_object_from_publicdata('stock_daily_growths_reduced_df')
calculate_annual_growth
Your task: define calculate_annual_growth as follows:
Calculate the annual growth for every firm and join the results to a copy of the original stock_records DataFrame.
Inputs:
- stock_records: a Pandas DataFrame. It contains the results calculated by exercise 2.
Return: annual_growth: a Pandas DataFrame with the same columns as stock_records, along with columns for:
- Year
- Annual_Growth
Requirements/steps:
- Year should be extracted from the Date column of stock_records. Year is expected to be a string in the final output.
- Annual_Growth is the product of (Day_Growth + 1), grouped by firm and year. Make sure to subtract 1 at the end.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
We have also provided annual_growth_grouped_df_demo, which contains information useful for checking your work. If you wish to use it, uncomment the code below to view it.
You should notice in this result that the same value for Annual_Growth appears in each row for a given Firm and Year. For example, rows where AAPL is the Firm and Year is '2018' all have Annual_Growth of -0.053902, and NFLX rows for '2020' have 0.671138.
The Annual_Growth is calculated by adding 1 to each Day_Growth observation, multiplying all of them together, then subtracting 1 from the product.
Mathematically, let $g^{(f,y)}_i$ represent the Day_Growth value for firm $f$, year $y$, and observation $i$. Then the Annual_Growth value for the same firm ($G^{(f, y)}$) is given by:
$$G^{(f, y)} = \left[\prod_i \left(1 + g^{(f,y)}_i\right)\right] - 1$$
print("Demo annual growth:")
annual_growth_demo_df_CORRECT = utils.load_object_from_publicdata('annual_growth_demo_df.dill')
display(annual_growth_demo_df_CORRECT)
### Uncomment these lines for additional information!
# print("Annual growth, grouped together:")
# annual_growth_grouped_df_demo = utils.load_object_from_publicdata('annual_growth_grouped_df_demo.dill')
# display(annual_growth_grouped_df_demo)
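The grouped-product step can be sketched with groupby(...).prod() on toy numbers (the firm, years, and growth values below are made up for illustration):

```python
import pandas as pd

# Toy daily growths for one firm across two years
df = pd.DataFrame({
    'Firm': ['NFLX'] * 4,
    'Year': ['2019', '2019', '2020', '2020'],
    'Day_Growth': [0.10, -0.05, 0.02, 0.03],
})
# Annual growth: product of (1 + Day_Growth) within each (Firm, Year), minus 1
annual = (1 + df['Day_Growth']).groupby([df['Firm'], df['Year']]).prod() - 1
print(annual.round(4))
```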
### Solution - Exercise 3
def calculate_annual_growth(stock_records: pd.DataFrame) -> pd.DataFrame:
### BEGIN SOLUTION
# Create a copy
stockdf = stock_records.copy(deep=True)
# Extract the year
stockdf['Year'] = stockdf.Date.dt.strftime('%Y')
# Create the growth ratio and calculate the annual growth
stockdf['Growth_Ratio'] = 1 + stockdf['Day_Growth']
stockdf2 = stockdf[['Year', 'Growth_Ratio', 'Firm']].copy(deep=True)
resultdf = stockdf2.groupby(['Year', 'Firm'])\
.prod('Growth_Ratio') - 1
# Merge everything back together
resultdf = resultdf.reset_index()
resultdf.columns = ['Year', 'Firm', 'Annual_Growth']
output = pd.merge(stockdf, resultdf)[
['Date', 'Year', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth', 'Annual_Growth']
]
# Return the output
return output
### END SOLUTION
### Demo function call
annual_growth_demo_df = calculate_annual_growth(stock_records=stock_daily_growths_reduced_df)
display(annual_growth_demo_df)
The cell below will test your solution for calculate_annual_growth (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars: Input variables for your solution.
- original_input_vars: Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars: Outputs returned by your solution.
- true_output_vars: The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_annual_growth),
ex_name='calculate_annual_growth',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_annual_growth did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_annual_growth),
ex_name='calculate_annual_growth',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_annual_growth did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
annual_growth_df = utils.load_object_from_publicdata('annual_growth_df')
golden_cross_strategy
Your task: define golden_cross_strategy as follows:
Implement a function which calculates a Golden Cross strategy, as detailed below.
Inputs:
- stockdf: a Pandas DataFrame. It contains the results calculated by exercise 3.
- firm: a string. It specifies which firm to calculate the golden cross results for.
- short_average: an integer. It defines the window for the short moving average. The number represents the number of days to include in the window.
- long_average: an integer. It defines the window for the long moving average. The number represents the number of days to include in the window.
Return: golden_cross_results: a Pandas DataFrame with all of the columns from stockdf, plus:
- Short_Average
- Long_Average
- Golden_Cross
Requirements/steps:
- Filter the data to the firm given by the firm parameter.
- Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
- Append Short_Average and Long_Average to the Pandas DataFrame. Each is a rolling mean of the Close column, with window sizes given by the short_average and long_average arguments.
- Golden_Cross is a boolean column indicating whether Short_Average is greater than Long_Average.
- You may assume stockdf is already ordered by date in ascending order.
Notes:
The Golden Cross (https://www.investopedia.com/terms/g/goldencross.asp) is a technical analysis pattern that indicates a potential bullish trend reversal. It occurs when a faster, short-term moving average crosses above a slower, longer-term moving average. For example, a 50-day moving average (the faster MA) crossing above a 200-day moving average (the slower MA) creates a Golden Cross signal.
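The rolling-mean comparison can be previewed on a short, hypothetical price series (made-up numbers, with illustrative 2- and 3-day windows):

```python
import pandas as pd

# Hypothetical closing prices, already in date order
close = pd.Series([10.0, 12.0, 14.0, 13.0, 11.0])
short = close.rolling(2).mean()   # NaN until the 2-day window fills
long_ = close.rolling(3).mean()   # NaN until the 3-day window fills
golden = short > long_            # comparisons involving NaN yield False
print(golden.tolist())  # [False, False, True, True, False]
```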
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print("Example Golden Cross Demo:")
golden_cross_demo_df_CORRECT = utils.load_object_from_publicdata('golden_cross_demo_df.dill')
display(golden_cross_demo_df_CORRECT.head(15))
### Solution - Exercise 4
def golden_cross_strategy(stockdf: pd.DataFrame, firm: str, short_average: int, long_average: int) -> pd.DataFrame:
### BEGIN SOLUTION
    # Copy the inputs and filter where necessary
stockdf = stockdf.copy(deep=True)
stockdf = stockdf[stockdf['Firm'] == firm]
stockdf = stockdf[stockdf.notna()['Close']]
# Calculate the moving averages
stockdf['Short_Average'] = stockdf['Close'].rolling(short_average).mean()
stockdf['Long_Average'] = stockdf['Close'].rolling(long_average).mean()
# Calculate the boolean indicator
stockdf['Golden_Cross'] = stockdf['Short_Average'] > stockdf['Long_Average']
# Return the result
return stockdf.reset_index(drop=True)
### END SOLUTION
### Demo function call
golden_cross_demo_df = golden_cross_strategy(
stockdf=annual_growth_df,
firm='NFLX',
short_average=3,
long_average=7)
# Let's only look at the first 15 days for our input
display(golden_cross_demo_df.head(15))
The cell below will test your solution for golden_cross_strategy (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars: Input variables for your solution.
- original_input_vars: Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars: Outputs returned by your solution.
- true_output_vars: The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(golden_cross_strategy),
ex_name='golden_cross_strategy',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to golden_cross_strategy did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(golden_cross_strategy),
ex_name='golden_cross_strategy',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to golden_cross_strategy did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
golden_crosses_df = utils.load_object_from_publicdata('golden_crosses_df')
golden_crosses_reduced_df = utils.load_object_from_publicdata('golden_crosses_reduced_df')
normalize_stock_growths
Example: we have defined normalize_stock_growths as follows:
Normalize the daily growth of a given firm by comparing it to the growth of the other provided firms.
Inputs:
- daily_growth_by_firm: a Pandas DataFrame. It contains many of the variables we have already calculated.
- firm: a string. It specifies which firm to calculate the Relative_Growth for.
Return: normalized_results: a Pandas DataFrame with the following columns:
- Date, which contains every unique date value in the input.
- One column per firm in daily_growth_by_firm, where the value is equal to the growth of that firm's stock for the relevant Date. These values come from the Day_Growth column of daily_growth_by_firm.
- Non_<FIRM>_Average_Growth (as defined below)
Requirements/steps:
- To calculate Non_<FIRM>_Average_Growth: for each Date, take the mean of the growths for every firm which is not equal to firm.
- For example, if the firms are AAPL, NFLX, and AMZN, and firm=NFLX, then Non_<FIRM>_Average_Growth would be the average growth of AAPL and AMZN for each day.
- Replace "<FIRM>" in the column name with the value given by firm. For example, if firm='AAPL', then your column name should be Non_AAPL_Average_Growth.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
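The pivot-then-average pattern used in the provided solution can be sketched on a toy long-format frame (the dates and growth values are made up for illustration):

```python
import pandas as pd

# Toy long-format daily growths for three firms over two dates
long_df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2'],
    'Firm': ['AAPL', 'NFLX', 'AMZN'] * 2,
    'Day_Growth': [0.01, 0.03, 0.05, -0.02, 0.00, 0.04],
})
# Pivot to one column per firm, then average the columns excluding NFLX
wide = long_df.pivot(index='Date', columns='Firm', values='Day_Growth').reset_index()
others = [f for f in ['AAPL', 'NFLX', 'AMZN'] if f != 'NFLX']
wide['Non_NFLX_Average_Growth'] = wide[others].mean(axis=1)
print(wide)
```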
print('Properly Averaged Industry Growth:')
normalized_returns_demo_df_CORRECT = utils.load_object_from_publicdata('normalized_returns_demo_df.dill')
display(normalized_returns_demo_df_CORRECT)
### Solution - Exercise 5
def normalize_stock_growths(daily_growth_by_firm: pd.DataFrame, firm: str) -> pd.DataFrame:
# Create a list of every firm, except for the one we specified in the inputs
non_chosen_firms = set(daily_growth_by_firm['Firm']) - set([firm])
non_chosen_firms = list(non_chosen_firms)
# Pivot the data for easy manipulation
pivoted_data = daily_growth_by_firm.pivot(
index='Date',
columns='Firm',
values='Day_Growth'
).reset_index()
# Calculate the total growth for every firm that isn't our firm
pivoted_data[f'Non_{firm}_Average_Growth'] = pivoted_data[non_chosen_firms]\
.sum(axis=1) / len(non_chosen_firms)
return pivoted_data
### Demo function call
normalized_returns_demo_df = normalize_stock_growths(golden_crosses_reduced_df, 'NFLX')
display(normalized_returns_demo_df)
The test cell below will always pass. Please submit to collect your free points for normalize_stock_growths (exercise 5).
### Test Cell - Exercise 5
print('Passed! Please submit.')
summarize_netflix_original_genres
Your task: define summarize_netflix_original_genres as follows:
Define the function to return a SQL query which summarizes the trends of Netflix originals by year and genre, as detailed below.
Inputs: None
Return: query: a Python string, containing a SQLite query. It should query the database to obtain a table with three columns:
- year
- genre
- genre_count
Requirements/steps:
- The year variable should be extracted from the premiere variable. Note that in this case, year will be a string. We suggest you use the STRFTIME SQLite function to extract year from the premiere variable.
- Order the results by genre_count (descending), then year (ascending), then genre (ascending).
- Query the table nflx_originals.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
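Here is a minimal, self-contained illustration of SQLite's STRFTIME on an in-memory table (the table name and values are made up; this is not the exam database):

```python
import sqlite3

# In-memory demo of extracting a year string with STRFTIME
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute("CREATE TABLE shows (title TEXT, premiere TEXT)")
conn_demo.executemany("INSERT INTO shows VALUES (?, ?)",
                      [('A', '2019-07-04'), ('B', '2020-01-15')])
rows = conn_demo.execute(
    "SELECT title, STRFTIME('%Y', premiere) AS year FROM shows ORDER BY title"
).fetchall()
print(rows)  # [('A', '2019'), ('B', '2020')]
conn_demo.close()
```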
print('A proper summary of genres for the demo:')
netflix_original_genre_summary_df_CORRECT = utils.load_object_from_publicdata('netflix_original_genre_summary_df.dill')
display(netflix_original_genre_summary_df_CORRECT)
### Solution - Exercise 6
def summarize_netflix_original_genres()-> str:
### BEGIN SOLUTION
query = '''
SELECT
STRFTIME("%Y", premiere) AS year,
genre,
COUNT(genre) AS genre_count
FROM nflx_originals no2
GROUP BY genre, year
ORDER BY
genre_count DESC,
year,
genre
LIMIT 10
'''
return query
### END SOLUTION
### Demo function call
summary_query = summarize_netflix_original_genres()
netflix_original_genres = pd.read_sql(summary_query, conn)
display(netflix_original_genres)
The cell below will test your solution for summarize_netflix_original_genres (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars: Input variables for your solution.
- original_input_vars: Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars: Outputs returned by your solution.
- true_output_vars: The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(summarize_netflix_original_genres),
ex_name='summarize_netflix_original_genres',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to summarize_netflix_original_genres did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sql_executor(summarize_netflix_original_genres),
ex_name='summarize_netflix_original_genres',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to summarize_netflix_original_genres did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
calc_netflix_top10_scores
Your task: define calc_netflix_top10_scores as follows:
Define the function to return a SQL query which calculates the number of times each show appeared in the top 10, the show's score, and the show's average score when it was in the top 10. The concept of a 'score' is defined below.
Inputs: None
Return: query: a Python string, containing a SQLite query. It should query the database table nflx_top to obtain a table with the following attributes:
- title: the title of the show
- total_score: the sum of the show's scores.
- occurrence: the number of times a show appeared in the top 10.
- avg_score: the total_score divided by the occurrence value for a given show.
Requirements/steps:
- Calculate avg_score to produce a floating-point value.
- Order by total_score in descending order, then by title in descending order.
Hints:
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper summary of show score metrics:')
netflix_top_10_scores_df_CORRECT = utils.load_object_from_publicdata('netflix_top_10_scores_df.dill')
display(netflix_top_10_scores_df_CORRECT)
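The score aggregation (11 minus the rank, summed per title) can be checked on a toy in-memory table (the table name and values below are made up for illustration):

```python
import sqlite3

# Toy top-10 table: each appearance is worth (11 - rank) points
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute("CREATE TABLE nflx_top_demo (title TEXT, rank INTEGER)")
conn_demo.executemany("INSERT INTO nflx_top_demo VALUES (?, ?)",
                      [('X', 1), ('X', 3), ('Y', 10)])
rows = conn_demo.execute("""
    SELECT title,
           SUM(11 - rank) AS total_score,
           COUNT(title)   AS occurrence,
           1.0 * SUM(11 - rank) / COUNT(title) AS avg_score
    FROM nflx_top_demo
    GROUP BY title
    ORDER BY total_score DESC, title DESC
""").fetchall()
print(rows)  # [('X', 18, 2, 9.0), ('Y', 1, 1, 1.0)]
conn_demo.close()
```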
### Solution - Exercise 7
def calc_netflix_top10_scores() -> str:
### BEGIN SOLUTION
    # An approach using CTEs
ctequery = '''
WITH scores AS (
SELECT
sum(11 - rank) AS total_score,
count(title) AS occurrence,
title
FROM nflx_top
GROUP BY title
ORDER BY total_score DESC, title DESC)
SELECT
title,
total_score,
occurrence,
(total_score * 1.0 / occurrence) as avg_score
FROM scores
LIMIT 10;
'''
    # An approach without using CTEs
query='''
SELECT title,
sum(11 - rank) AS total_score,
count(title) AS occurrence,
1.0*sum(11 - rank)/count(title) avg_score
FROM nflx_top
GROUP BY title
ORDER BY total_score DESC, title DESC
LIMIT 10
'''
return ctequery
### END SOLUTION
### Demo function call
scores_query = calc_netflix_top10_scores()
netflix_top10_scores = pd.read_sql(scores_query, conn)
display(netflix_top10_scores)
The cell below will test your solution for calc_netflix_top10_scores (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars: Input variables for your solution.
- original_input_vars: Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars; otherwise, the inputs were modified by your solution.
- returned_output_vars: Outputs returned by your solution.
- true_output_vars: The expected output. This should "match" returned_output_vars based on the question requirements; otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(calc_netflix_top10_scores),
ex_name='calc_netflix_top10_scores',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calc_netflix_top10_scores did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sql_executor(calc_netflix_top10_scores),
ex_name='calc_netflix_top10_scores',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calc_netflix_top10_scores did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
longterm_avg_score
Your task: define longterm_avg_score as follows:
Define the function to return a SQL query which calculates the long-term score for each Netflix Original, as defined below.
Inputs: None
Return: query: a Python string, containing a SQLite query. It should query the database to obtain a table with the following attributes:
title - (string): the titles of the Netflix Originals
total_score - (int): the sum of all scores for each title given in nflx_originals
date_count - (int): the number of dates that a given title is available, as detailed below
longterm_avg_score - (float): the long-term average score, defined as: longterm_avg_score = total_score/date_count
Requirements/steps:
Use the nflx_originals and nflx_top tables.
Calculate a score for each title on a given date, where the score is defined as: score = 11 - rank.
For date_count, count the number of date values in nflx_top that exist for a title, where the date is greater than or equal to the premiere for that title.
<distinct_dates>: Identify all distinct date values in nflx_top.
<title_dates>: For each title in nflx_originals, identify any dates in <distinct_dates> which are greater than or equal to the title's premiere.
<title_date_score>: For each title/date pair in <title_dates>, identify any records from nflx_top which have the same title/date pair. Use the rank field from nflx_top to calculate a given score. Title/date pairs in <title_dates> that do not appear in nflx_top should not be dropped and should have a NULL score value in <title_date_score>.
From <title_date_score>, calculate:
total_score - compute the sum of all the score values sharing a common title.
date_count - compute the count of all date values sharing a common title.
longterm_avg_score - compute the quotient of the total_score over date_count.
Exclude any rows with a longterm_avg_score that is NULL.
Sort by longterm_avg_score in descending order, then by title in ascending order.
Hints:
An Example:
Let's take a deeper look specifically at the show Vikings: Valhalla.
Per the nflx_originals table, the show premiered on 2022-02-25.
longterm_avg_vikings_breakout_df_CORRECT below uses the results from the nflx_top table for the show titled Vikings: Valhalla.
In nflx_top for Vikings: Valhalla, the dates greater than or equal to the premiere date range from 2022-02-25 to 2022-03-11.
The show did not appear in the top 10 on its premiere date, so Vikings: Valhalla does not have a score for that date. Vikings: Valhalla received its first ranking on 2022-02-26. Therefore, its score for 2022-02-26 would be:
$$(11-rank)=(11-4)=7$$
longterm_avg_vikings_breakout_df_CORRECT below shows a table with the show title, premiere, date, rank, and score for all dates in the range from 2022-02-25 to 2022-03-11 for Vikings: Valhalla.
Computing total_score, date_count, and longterm_avg_score for the show Vikings: Valhalla:
$$\text{longterm_avg_score}=\text{total_score}/\text{date_count}=100/15=6.666667$$
longterm_avg_vikings_breakout_df_CORRECT = utils.load_object_from_publicdata('longterm_avg_vikings_breakout_df.dill')
display(longterm_avg_vikings_breakout_df_CORRECT)
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
longterm_avg_scores_df = utils.load_object_from_publicdata('longterm_avg_scores_df.dill')
display(longterm_avg_scores_df)
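To see the NULL-preserving LEFT JOIN logic in isolation, here is a small sketch against an in-memory SQLite database with made-up rows. The table and CTE names match the exercise, but the data is purely illustrative:

```python
import sqlite3

# Toy database: one Original ('Show A'), plus a top-10 row for a show
# ('Show B') that is not an Original but contributes a distinct date.
tmp = sqlite3.connect(':memory:')
tmp.executescript('''
    CREATE TABLE nflx_originals (title TEXT, premiere TEXT);
    CREATE TABLE nflx_top (title TEXT, date TEXT, rank INTEGER);
    INSERT INTO nflx_originals VALUES ('Show A', '2022-01-01');
    INSERT INTO nflx_top VALUES
        ('Show A', '2022-01-01', 4),
        ('Show B', '2022-01-02', 1);
''')
rows = tmp.execute('''
    WITH distinct_dates AS (SELECT DISTINCT date FROM nflx_top),
    title_dates AS (
        SELECT no.title, dd.date
        FROM nflx_originals no
        LEFT JOIN distinct_dates dd ON no.premiere <= dd.date),
    title_date_score AS (
        SELECT td.title, td.date, (11 - nt.rank) AS score
        FROM title_dates td
        LEFT JOIN nflx_top nt
          ON nt.title = td.title AND nt.date = td.date)
    SELECT title,
           SUM(score) AS total_score,
           COUNT(date) AS date_count,
           SUM(score) * 1.0 / COUNT(date) AS longterm_avg_score
    FROM title_date_score
    GROUP BY title
''').fetchall()
print(rows)
```

Here 'Show A' scores 7 on its premiere date and has a NULL score on 2022-01-02; SUM ignores the NULL while COUNT(date) still counts both dates, so the long-term average is 7/2 = 3.5. This is exactly why the unmatched title/date pairs must not be dropped.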
### Solution - Exercise 8
def longterm_avg_score() -> str:
### BEGIN SOLUTION
# An approach using CTEs
ctequery='''
WITH
distinct_dates as (
SELECT DISTINCT date FROM nflx_top),
title_dates as (
SELECT no.title, no.premiere, dd.date
FROM nflx_originals no
LEFT JOIN distinct_dates dd
ON no.premiere <= dd.date),
title_date_score as (
SELECT td.title, td.date,
(11 - nt.rank) score
FROM title_dates td
LEFT JOIN nflx_top nt
ON nt.title = td.title AND nt.date = td.date)
--
-- Main Query
--
SELECT title,
sum(score) total_score,
count(date) date_count,
sum(score) * 1.0 / count(date) longterm_avg_score
FROM title_date_score
GROUP BY title
HAVING longterm_avg_score is not null
ORDER BY longterm_avg_score DESC, title
LIMIT 10
'''
# An approach without using CTEs
query = '''
SELECT
n_orig.title,
SUM(11 - n_top.rank) total_score,
COUNT(n_orig.title) date_count,
1.0 * SUM(11 - n_top.rank) / COUNT(n_orig.title) AS longterm_avg_score
FROM nflx_originals n_orig
LEFT JOIN (
SELECT
DISTINCT date
FROM nflx_top
) all_dates
ON n_orig.premiere <= all_dates.date
LEFT JOIN nflx_top n_top
ON n_top.title = n_orig.title
AND n_top.date = all_dates.date
GROUP BY n_orig.title
HAVING longterm_avg_score is not null
ORDER BY
longterm_avg_score DESC,
n_orig.title
LIMIT 10
'''
return query
### END SOLUTION
### Demo function call
query = longterm_avg_score()
longterm_avg_scores = pd.read_sql(query, conn)
display(longterm_avg_scores)
The cell below will test your solution for longterm_avg_score (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(longterm_avg_score),
ex_name='longterm_avg_score',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to longterm_avg_score did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sql_executor(longterm_avg_score),
ex_name='longterm_avg_score',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to longterm_avg_score did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
nflx_scores_demo_df = utils.load_object_from_publicdata('nflx_scores_demo_df')
normalized_growths_df = utils.load_object_from_publicdata('normalized_growths_df')
construct_model_inputs
Your task: define construct_model_inputs as follows:
Define the model matrix and response for our regression model, as shown below.
Inputs:
normalized_growths_df: a Pandas DataFrame. It contains the results from exercise 5.
show_scores_df: a Pandas DataFrame. It contains the pre-computed scores for each show on a given date.
Return: model_matrix: a Pandas DataFrame containing the independent variables (show scores) and response variable (Relative_Growth)
Requirements/steps:
Pivot show_scores_df so that each show is represented as its own column. The index should reflect the dates. The values should be set to the show's score.
Merge the pivoted scores with normalized_growths_df to form model_matrix.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper model matrix for the demo:')
model_matrix_demo_df_CORRECT = utils.load_object_from_publicdata('model_matrix_demo_df.dill')
display(model_matrix_demo_df_CORRECT)
### Solution - Exercise 9
def construct_model_inputs(normalized_growths_df: pd.DataFrame, show_scores_df: pd.DataFrame) -> pd.DataFrame:
### BEGIN SOLUTION
show_scores_filled = show_scores_df.pivot(index='date', columns='title', values='score').fillna(0)
all_shows = show_scores_filled.columns
model_matrix = show_scores_filled.merge(normalized_growths_df, left_on='date', right_on='Date')
return model_matrix[['Relative_Growth', *all_shows]]
### END SOLUTION
### Demo function call
model_matrix = construct_model_inputs(normalized_growths_df, nflx_scores_demo_df)
display(model_matrix)
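The pivot-then-merge pattern can be sketched on made-up frames. The column names mirror the exercise, but the values below are illustrative, not the real inputs:

```python
import pandas as pd

# Hypothetical long-format show scores (one row per title/date pair)
toy_scores = pd.DataFrame({
    'date':  ['2022-01-01', '2022-01-01', '2022-01-02'],
    'title': ['Show A', 'Show B', 'Show A'],
    'score': [7, 10, 9],
})
# Hypothetical normalized growth data (one row per date)
toy_growth = pd.DataFrame({
    'Date': ['2022-01-01', '2022-01-02'],
    'Relative_Growth': [0.01, -0.02],
})

# One column per show, indexed by date; missing title/date pairs become 0
wide = toy_scores.pivot(index='date', columns='title', values='score').fillna(0)

# Attach the response; 'date' here is the pivoted frame's index level name
mm = wide.merge(toy_growth, left_on='date', right_on='Date')
mm = mm[['Relative_Growth', *wide.columns]]
print(mm)
```

Note that `merge` accepts an index level name for `left_on`, which is what lets the pivoted frame join on its date index without an explicit `reset_index()`.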
The cell below will test your solution for construct_model_inputs (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 9
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=construct_model_inputs,
ex_name='construct_model_inputs',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to construct_model_inputs did not pass the test.'
### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=construct_model_inputs,
ex_name='construct_model_inputs',
key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=',
n_iter=10,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to construct_model_inputs did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
If you have made it this far, congratulations! You are done. Don't forget to submit the assignment!
If you run the regression model below with all of the data collected, you would have obtained summary model results like the following:
##############################################
### The code below is provided for illustration and reproducibility but is commented out due to the additional processing time it would add to the auto-grader. Note, the model was run on all NFLX growth data and top-shows data.
##############################################
# Let's use step-wise regression to do feature selection
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
import statsmodels.api as sm
mm = model_matrix.copy(deep=True)
y = mm['Relative_Growth']
del mm['Relative_Growth']
X = mm
model = LinearRegression()
feature_selector = SequentialFeatureSelector(model, n_features_to_select=min(len(X.columns)-1, 10))
feature_selector.fit(X, y)
selected_X = X[feature_selector.get_feature_names_out()]
selected_X = sm.add_constant(selected_X)
print(selected_X.columns)
selected_model = sm.OLS(y, selected_X)
results = selected_model.fit()
results.summary()
So, how does our model do?
Step-wise regression is a greedy algorithm that iteratively selects the features that most improve the performance of our model. We tried to find the top-10 shows that were most predictive of the normalized stock values and found the following shows:
Note that these don't necessarily mean that they caused the stock value to improve. For example, The Royal Treatment was significantly associated with a negative trend in Netflix's stock growth.
A better question might be: "how reliable are these results?"
Ultimately, the performance of the model constructed from the input provided by exercise 9 is relatively poor. This model exhibits an adjusted $R^2$ value of only 0.042. Predicting stock prices is intrinsically hard, as there is an exceptionally high amount of variance in each observation and there are many reasons for fluctuations in market prices. Therefore, it is unsurprising to see that a lot of the variance remains.
We are also fighting "the curse of dimensionality," which means the excess data we have can be more of a hindrance than a help. However, the F-statistic for our model is quite good (with a p-value of 0.000684), which suggests that our model does a much better job of predicting stock value changes than a naive model.
You will learn more about various modeling techniques and how to use them during the rest of the course. Working to improve the regression results with the results of this exam is left as an exercise to the students; for high-quality regression outputs, you may find the statsmodels package particularly useful. Here's one approach you might find worth exploring: