Midterm 2, Fall 2024: Netflix and Bills¶

Version 1.0.3

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of General Python, Tabular Data, Pandas, SQL and SQLite, and NumPy. It has 10 exercises, numbered 0 to 9, worth 17 total points. However, to earn 100%, the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on the previous ones, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we do not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or a few rows at a time).

Exercise point breakdown:

  • Exercise 0: 1 point

  • Exercise 1: 1 point - FREE

  • Exercise 2: 3 points

  • Exercise 3: 2 points

  • Exercise 4: 2 points

  • Exercise 5: 0 points - FREE

  • Exercise 6: 1 point

  • Exercise 7: 2 points

  • Exercise 8: 3 points

  • Exercise 9: 2 points

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

The Problem: Analyzing Netflix's Stocks and Services¶

Background. Netflix is a firm in the entertainment industry that has grown to prominence in the streaming era of movies and television. The company distributes both acquired and original programming via its streaming service.

Your overall task. Your goal is to evaluate whether there is a relationship between the programming distributed by Netflix and the financial performance of the firm. You will do this in two main steps:

  1. First, you will analyze the general financial standing of the firm.
  2. Second, you will investigate the trends present in Netflix's program offerings.

Once you have obtained the results from both steps, you will use the results to build a model which will attempt to evaluate whether there is a relationship between the programs and the firm's financial standing.

The datasets. There are three sources of data which you will use to solve the following exercises.

  1. Financial data for several tech firms, sourced from Yahoo Finance.
  2. A daily ranking of Netflix's top-10 programs which spans 2 years, taken from The Numbers project by Nash Information Services.
  3. A collection of information taken from Wikipedia on Netflix original programming.

The financial data (source 1) is provided as a Python list of dictionaries. You will load this data into a Pandas DataFrame. The daily Netflix data (source 2) and original programming information (source 3) are contained in a SQLite database.

SQLite's syntax documentation can be found here. You may find other online resources useful for solving the SQL problems as well, but not all SQL dialects work the same way. Make sure your solution works with SQLite!

In [ ]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
In [ ]:
from pprint import pprint, pformat

import numpy as np
import pandas as pd
import sqlite3
import random
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Load starting data and define a database connection
stock_data = utils.load_object_from_publicdata('raw_stock_data_nflx.dill')

conn = sqlite3.connect('resource/asnlib/publicdata/NetflixDb.sqlite_db')

Exercise 0: (1 point)¶

load_data

Your task: define load_data as follows:

Load the data contained in records into a Pandas DataFrame. See the instructions below for the requirements.

Inputs:

  • records: A list of dictionaries. Each dictionary contains information for a single row of data.

Return: full_data: A Pandas DataFrame containing the data held in records. The DataFrame should contain the following columns:

  • Firm
    • This variable is contained in the dictionaries under the Stock key
  • Date
    • This variable's datatype should be set to datetime64[ns, America/New_York]
  • Open
  • Close
  • Volume

Requirements/steps:

  • Do not remove any records with NaN values!
  • The DataFrame rows may be ordered arbitrarily.

Hints

  • You might find this part of the Pandas Documentation useful if you're struggling to load the data into Pandas.
  • There are many ways to approach this problem. We recommend this general approach:
    1. Load the data into Pandas
    2. Convert the date to the proper datatype
      • We have defined a helper function, convert_timezone_types() to make this easier. convert_timezone_types() takes an input of a pd.Series with dtype "string" and then outputs a pd.Series with dtype "datetime64[ns, America/New_York]". See the demo code for more examples.
      • You do not have to use our helper function. If you wish, you could instead take a look at to_datetime, tz_convert, and this post on stackoverflow
    3. Rename the columns
    4. Return the results
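
If you'd rather not use the provided helper, the to_datetime / tz_convert route mentioned in the hints can be sketched as follows. This is a minimal sketch, not the official solution; the sample records here are made up, with keys matching the column spec above.

```python
import pandas as pd

# Toy records in the same shape as the exam data (values are assumptions)
records = [
    {'Stock': 'NFLX', 'Date': '2017-11-15 00:00:00-05:00',
     'Open': 196.0, 'Close': 197.0, 'Volume': 1000},
    {'Stock': 'NFLX', 'Date': '2021-01-15 00:00:00-05:00',
     'Open': 500.0, 'Close': 501.0, 'Volume': 2000},
]

full_data = pd.DataFrame(records).rename(columns={'Stock': 'Firm'})

# Parse the strings to UTC first, then convert to the New York timezone
full_data['Date'] = pd.to_datetime(full_data['Date'], utc=True) \
                      .dt.tz_convert('America/New_York')
```

After this, `full_data['Date'].dtype` is `datetime64[ns, America/New_York]`, matching the requirement.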

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
stock_subsample_df_demo_CORRECT = utils.load_object_from_publicdata('stock_subsample_df_demo.dill')
convert_timezone_types          = utils.load_object_from_publicdata('convert_timezone_types.dill')
display(stock_subsample_df_demo_CORRECT)
In [ ]:
### Solution - Exercise 0  
def load_data(records: list) -> pd.DataFrame:
    ### BEGIN SOLUTION
    
    # A four-line solution, for the TAs --------------------------------------------
    # full_data = pd.DataFrame(records)
    # full_data['Date'] = convert_timezone_types(full_data['Date'])
    # full_data = full_data.rename(columns={'Stock': 'Firm'})
    # return full_data
    # ------------------------------------------------------------------------------
    
    # Build the full DataFrame
    full_data = pd.DataFrame(records)
    
    # Convert the data type (Here are two suggested approaches)
    ### APPROACH 1: Pass in a series
    date_vector_1 = convert_timezone_types(full_data['Date'])
    ### APPROACH 2: Use .apply and timezone conversion ---------------------
    # date_vector_2 = full_data['Date'].apply(convert_timezone_types)
    ### They should be the same thing!
    # assert (date_vector_1 == date_vector_2).all() 
    full_data['Date'] = date_vector_1
    
    # Rename the columns
    full_data = full_data.rename(columns={'Stock': 'Firm'})

    # Return our dataframe
    return full_data
    ### END SOLUTION

### Demo function call
### `convert_timezone_types` demo ---------------------------------------------
# This helper function will help you change the data-type
# -- Input:  pd.Series, with dtype "string"
# -- Output: pd.Series, with dtype "datetime64[ns, America/New_York]"
type_conversion_demo = pd.DataFrame({
    'Date': ['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'],
    'DemoColumn': [1, 2]
})
# Convert the type!
type_conversion_demo['Date'] = convert_timezone_types(type_conversion_demo['Date'])
assert type_conversion_demo['Date'].dtype == 'datetime64[ns, America/New_York]' # It works!


### `load_data` demo ----------------------------------------------------------
# We'll sample the dataset to 20 records, just to make debugging easier.
random.seed(6040)
stock_data_subsample = random.sample(stock_data, k=10)
# Here's what your solution produces!
print('Here is what your loaded data looks like in Pandas:')
stock_subsample_df = load_data(stock_data_subsample)
display(stock_subsample_df)


The cell below will test your solution for load_data (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 0  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(load_data),
              ex_name='load_data',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to load_data did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(load_data),
              ex_name='load_data',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to load_data did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
In [ ]:
### Run Me!!!
stock_df = utils.load_object_from_publicdata('stock_df')

Exercise 1: (1 point) - FREE¶

preview_data

Example: we have defined preview_data as follows:

Preview the three data sources as outlined below.

Inputs:

  • records: A Pandas DataFrame containing the results of the full_data output generated by Exercise 0.
  • conn: A SQLite connection to our database.

Return: A tuple, containing the following elements:

  • records_preview: A Pandas DataFrame containing the first 10 entries of records.
  • nflx_originals_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_originals SQL database table.
  • nflx_top_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_top SQL database table.

Requirements/steps:

  • records_preview: If there are fewer than 10 entries in records, all entries should be returned.
  • nflx_originals_preview: The selected entries may be any 5 rows from the original table.
  • nflx_top_preview: The selected entries may be any 5 rows from the original table.

A solution is provided for you below. Run it and the test cell to earn your free point, then submit the assignment!

In [ ]:
### Solution - Exercise 1  
def preview_data(records: pd.DataFrame, conn: sqlite3.Connection) -> tuple:
    # Keep only the first 10 records
    records_preview = records.head(n=10)

    # Select everything from nflx_originals and randomly sample 5 rows
    nflx_originals_preview = pd\
        .read_sql('SELECT * FROM nflx_originals', conn)\
        .sample(5)

    # Select everything from nflx_top and randomly sample 5 rows
    nflx_top_preview = pd\
        .read_sql('SELECT * FROM nflx_top', conn)\
        .sample(5)

    return records_preview, nflx_originals_preview, nflx_top_preview

### Demo function call
records_preview, nflx_originals, nflx_top = preview_data(stock_df, conn)
print('Preview of Stock Data')
display(records_preview)
print('------------------------------------------------------------------------')
print('Preview of Netflix Originals Table')
display(nflx_originals)
print('------------------------------------------------------------------------')
print('Preview of Netflix Top Programs Table')
display(nflx_top)


The test cell below will always pass. Please submit to collect your free points for preview_data (exercise 1).

In [ ]:
### Test Cell - Exercise 1  


print('Passed! Please submit.')
In [ ]:
### Run Me!!!
calculate_daily_growth_demo_input = utils.load_object_from_publicdata('calculate_daily_growth_demo_input')

Exercise 2: (3 points)¶

calculate_daily_growth

Your task: define calculate_daily_growth as follows:

Calculate the daily growth for a given firm, as outlined below.

Inputs:

  • stock_records: a Pandas DataFrame. It contains the full stock data loaded in exercise 0.
  • firm: a string specifying a particular firm's stock.

Return: stock_growth: a Pandas DataFrame with the following columns:

  • Date
  • Firm
  • Open
  • Close
  • Volume
  • Day_Growth

Requirements/steps:

  1. Do not use SQLite.
  2. Create a new Pandas DataFrame with Day_Growth appended as a new column. The growth should be specific to the firm specified by the firm parameter.
  3. Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
    • You should use the next valid close price to calculate growth in these instances.
  4. Daily growth is defined as $$\text{growth}_0 = (\text{close}_0 / \text{close}_{-1}) - 1$$
    • In English: "Daily Growth is Today's Close divided by Yesterday's Close, minus one."
    • $\text{close}_x$ is the value of the Close, offset by $x$. So, $\text{close}_{-1}$ is the value of Close, offset by 1 position.
    • You may find the pd.DataFrame.shift() method, and the related GroupBy shift method helpful for offsetting the close value!
  5. To repeat: Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
    • You should use the next valid close price to calculate growth in these instances.
    • For example, if:
      1. The close price on Friday is 10
      2. There are no close prices for the next two days (Saturday and Sunday)
      3. The close price on Monday is 12 ... then the daily growth for Monday should be equal to (12/10)-1=0.2.
    • You can make this easier by filtering out all NaN values when you initially calculate the growth.
  6. Your final DataFrame (stock_growth) should contain the same number of records as stock_records for that firm.
  7. Finally, reset the indexes before returning the resulting dataframe.
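
The weekend-gap logic in the steps above can be sketched on a toy series. This is a minimal sketch under made-up prices; the NaN row stands in for a day with no recorded close.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Firm':  ['NFLX'] * 4,
    'Close': [10.0, np.nan, 12.0, 15.0],  # NaN plays the role of a weekend
})

# Drop the NaN rows first so shift(1) picks up the previous *valid* close
valid = toy.dropna(subset=['Close']).copy()
valid['Day_Growth'] = valid['Close'] / valid.groupby('Firm')['Close'].shift(1) - 1

# Join back on the index so the output keeps every original row, then
# fill the rows with no computable growth (first day, NaN days) with 0
toy = toy.join(valid['Day_Growth'])
toy['Day_Growth'] = toy['Day_Growth'].fillna(0)
```

Here the 12.0 close is compared against 10.0 (the last valid close before the gap), giving a growth of 0.2, just like the Friday-to-Monday example above.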

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
print("Demo output for Netflix:")

daily_growth_df_CORRECT = utils.load_object_from_publicdata('daily_growth_df.dill')
display(daily_growth_df_CORRECT)
In [ ]:
### Solution - Exercise 2  
def calculate_daily_growth(stock_records: pd.DataFrame, firm: str) -> pd.DataFrame:
    ### BEGIN SOLUTION

    # Filter the DataFrame for the stock
    stockdf = stock_records[stock_records['Firm']==firm].copy(deep=True)

    # Remove NA values in a copy.
    # We'll join everything back together at the end.
    stockdf2 = stockdf.copy(deep=True)
    stockdf2 = stockdf2.dropna()

    # Lag the close and calculate the daily growth
    stockdf2['Yest_Close'] = stockdf2.groupby('Firm')['Close'].shift(1)
    stockdf2['Day_Growth'] = (stockdf2['Close']/stockdf2['Yest_Close']) - 1

    # Merge everything back together and fill missing values with 0
    output = pd.merge(
        stockdf,
        stockdf2,
        how='outer',
        left_index=True,
        right_index=True,
        suffixes=(None, '_y'))[
            ['Date', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth']
        ]
    output['Day_Growth'] = output['Day_Growth'].fillna(0)

    # Return the result
    return output.reset_index(drop=True)
    ### END SOLUTION

### Demo function call
print("Demo daily growth for Netflix's stock:")
daily_growth_nflx_demo = calculate_daily_growth(
    stock_records=calculate_daily_growth_demo_input,
    firm='NFLX'
)
display(daily_growth_nflx_demo)


The cell below will test your solution for calculate_daily_growth (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 2  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_daily_growth),
              ex_name='calculate_daily_growth',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_daily_growth did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_daily_growth),
              ex_name='calculate_daily_growth',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_daily_growth did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
In [ ]:
### Run Me!!!
stock_daily_growths_df = utils.load_object_from_publicdata('stock_daily_growths_df')
stock_daily_growths_reduced_df = utils.load_object_from_publicdata('stock_daily_growths_reduced_df')

Exercise 3: (2 points)¶

calculate_annual_growth

Your task: define calculate_annual_growth as follows:

Calculate the annual growth for every firm and join the results to a copy of the original stock_records DataFrame.

Inputs:

  • stock_records: a Pandas DataFrame. It contains the results calculated by exercise 2.

Return: annual_growth: a Pandas DataFrame with the same columns as stock_records, along with columns for:

  • Year
  • Annual_Growth

Requirements/steps:

  1. Use Pandas. Do not use SQLite.
  2. Extract the year from the Date column of stock_records. Year is expected to be a string in the final output.
  3. The most common way to calculate Annual_Growth:
    • Take the product of (Day_Growth + 1), grouped by firm and year. Make sure to subtract 1 at the end.
  4. You may need to reset the indexes before returning the resulting dataframe.

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

  • Note: While not reflective of the output you are expected to produce, you may also find the annual_growth_grouped_df_demo information useful for checking your work. If you wish to use it, uncomment the code below to view it.

You should notice in this result that the same value for Annual_Growth appears in each row for a given Firm and Year. For example, rows where AAPL is the Firm and Year is '2018' all have an Annual_Growth of -0.053902, and NFLX rows for '2020' all have 0.671138.

The Annual_Growth is calculated by adding 1 to each Day_Growth observation, multiplying all of them together, and then subtracting 1 from the product.

  • The first few terms in the (AAPL, 2018) group would be $(1 + 0.0)(1 + 0.012904)(1 - 0.000174)(1 + 0.004645)(1 + 0.11385)$
  • The product of all terms for (AAPL, 2018) is $0.946098$.
  • Subtracting off 1 leaves $-0.053902$.

Mathematically, let $g^{(f,y)}_i$ represent the Day_Growth value for firm $f$, year $y$, and observation $i$. Then the Annual_Growth value for the same firm ($G^{(f, y)}$) is given by:

$$G^{(f, y)} = \left [ \prod_{\forall i}{(g^{(f,y)}_i + 1)} \right ] - 1$$
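
On a toy group, the product formula above works out like this. This is a minimal sketch; the growth values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Firm':       ['AAPL'] * 3,
    'Year':       ['2018'] * 3,
    'Day_Growth': [0.0, 0.10, -0.05],
})

annual = (
    toy.assign(Growth_Ratio=toy['Day_Growth'] + 1)   # g + 1 for each observation
       .groupby(['Firm', 'Year'])['Growth_Ratio']
       .prod()                                       # product over the group
       .sub(1)                                       # subtract 1 at the end
       .rename('Annual_Growth')
       .reset_index()
)
# (1.0)(1.10)(0.95) - 1 = 0.045
```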
In [ ]:
print("Demo annual growth:")

annual_growth_demo_df_CORRECT = utils.load_object_from_publicdata('annual_growth_demo_df.dill')
display(annual_growth_demo_df_CORRECT)

### Uncomment these lines for additional information!
# print("Annual growth, grouped together:")

# annual_growth_grouped_df_demo = utils.load_object_from_publicdata('annual_growth_grouped_df_demo.dill')
# display(annual_growth_grouped_df_demo)
In [ ]:
### Solution - Exercise 3  
def calculate_annual_growth(stock_records: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION

    # Create a copy
    stockdf = stock_records.copy(deep=True)

    # Extract the year
    stockdf['Year'] = stockdf.Date.dt.strftime('%Y')

    # Create the growth ratio and calculate the annual growth
    stockdf['Growth_Ratio'] = 1 + stockdf['Day_Growth']
    stockdf2 = stockdf[['Year', 'Growth_Ratio', 'Firm']].copy(deep=True)
    resultdf = stockdf2.groupby(['Year', 'Firm'])\
        .prod('Growth_Ratio') - 1
    
    # Merge everything back together
    resultdf = resultdf.reset_index()
    resultdf.columns = ['Year', 'Firm', 'Annual_Growth']
    output = pd.merge(stockdf, resultdf)[
        ['Date', 'Year', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth', 'Annual_Growth']
    ]

    # Return the output
    return output
    ### END SOLUTION

### Demo function call
annual_growth_demo_df = calculate_annual_growth(stock_records=stock_daily_growths_reduced_df)
display(annual_growth_demo_df)


The cell below will test your solution for calculate_annual_growth (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 3  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_annual_growth),
              ex_name='calculate_annual_growth',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_annual_growth did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_annual_growth),
              ex_name='calculate_annual_growth',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_annual_growth did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
In [ ]:
### Run Me!!!
annual_growth_df = utils.load_object_from_publicdata('annual_growth_df')

Exercise 4: (2 points)¶

golden_cross_strategy

Your task: define golden_cross_strategy as follows:

Implement a function which calculates a Golden Cross strategy, as detailed below.

Inputs:

  • stockdf: a Pandas DataFrame. It contains the results calculated by exercise 3.
  • firm: a string. It specifies which firm to calculate the golden cross results for.
  • short_average: an integer. It defines the window for the short moving average. The number represents the number of days to include in the window.
  • long_average: an integer. It defines the window for the long moving average. The number represents the number of days to include in the window.

Return: golden_cross_results: a Pandas DataFrame with all of the columns from stockdf, plus:

  • Short_Average
  • Long_Average
  • Golden_Cross

Requirements/steps:

  1. Use Pandas. Do not use SQLite.
  2. Filter the DataFrame to include records for the firm parameter.
  3. Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
    • You may find the pd.DataFrame.notna() method useful for filtering these values out!
  4. Add the columns Short_Average and Long_Average to the Pandas DataFrame.
    • These columns should calculate the moving average of the Close column.
    • The window of the moving average should be equal to the size specified by the short_average and long_average arguments.
    • You may find the pd.DataFrame.rolling() method useful for creating the windows!
  5. Add the column 'Golden_Cross', a boolean, to the Pandas DataFrame.
    • This column should indicate whether the short-window moving average is greater than the long-window moving average.
  6. You may assume that only one firm is provided for the firm argument.
  7. You may assume that stockdf is already ordered by date in ascending order.
  8. Finally, reset the indexes before returning the resulting dataframe.

Notes

The Golden Cross (https://www.investopedia.com/terms/g/goldencross.asp) is a technical analysis pattern that indicates a potential bullish trend reversal. It occurs when a faster, short-term moving average crosses above a slower, longer-term moving average. For example, a 50-day moving average (the faster MA) crossing above a 200-day moving average (the slower MA) creates a Golden Cross signal.
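
The crossover test itself is just two rolling means and a comparison. This is a minimal sketch on a made-up price series, using short windows of 3 and 7 days rather than the 50/200-day windows from the example above.

```python
import pandas as pd

close = pd.Series([10.0, 11, 12, 11, 10, 9, 10, 12, 14, 15])

short_avg = close.rolling(3).mean()   # fast moving average
long_avg  = close.rolling(7).mean()   # slow moving average

# True once the fast average rises above the slow one; comparisons against
# NaN (before a full window exists) evaluate to False
golden_cross = short_avg > long_avg
```

In this series the fast average overtakes the slow one only near the end, once the price has rallied.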

[Figure: a short-term moving average crossing above a long-term moving average, forming a Golden Cross]

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
print("Example Golden Cross Demo:")

golden_cross_demo_df_CORRECT = utils.load_object_from_publicdata('golden_cross_demo_df.dill')
display(golden_cross_demo_df_CORRECT.head(15))
In [ ]:
### Solution - Exercise 4  
def golden_cross_strategy(stockdf: pd.DataFrame, firm: str, short_average: int, long_average: int) -> pd.DataFrame:
    ### BEGIN SOLUTION
    
    # Copy the inputs and filter where necessary
    stockdf = stockdf.copy(deep=True)
    stockdf = stockdf[stockdf['Firm'] == firm]
    stockdf = stockdf[stockdf['Close'].notna()]

    # Calculate the moving averages
    stockdf['Short_Average'] = stockdf['Close'].rolling(short_average).mean()
    stockdf['Long_Average'] = stockdf['Close'].rolling(long_average).mean()

    # Calculate the boolean indicator
    stockdf['Golden_Cross'] = stockdf['Short_Average'] > stockdf['Long_Average']

    # Return the result
    return stockdf.reset_index(drop=True)
    ### END SOLUTION

### Demo function call
golden_cross_demo_df = golden_cross_strategy(
    stockdf=annual_growth_df,
    firm='NFLX',
    short_average=3,
    long_average=7)
# Let's only look at the first 15 days for our input
display(golden_cross_demo_df.head(15))


The cell below will test your solution for golden_cross_strategy (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 4  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(golden_cross_strategy),
              ex_name='golden_cross_strategy',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to golden_cross_strategy did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(golden_cross_strategy),
              ex_name='golden_cross_strategy',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to golden_cross_strategy did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
In [ ]:
### Run Me!!!
golden_crosses_df = utils.load_object_from_publicdata('golden_crosses_df')
golden_crosses_reduced_df = utils.load_object_from_publicdata('golden_crosses_reduced_df')

Exercise 5: (0 points) - FREE¶

normalize_stock_growths

Example: we have defined normalize_stock_growths as follows:

Normalize the daily growth of a given firm by comparing it to the growth of the other provided firms.

Inputs:

  • daily_growth_by_firm: a Pandas DataFrame. It contains many of the variables we have already calculated.
  • firm: a string. It specifies which firm to calculate the Relative_Growth for.

Return: normalized_results: a Pandas DataFrame with the following columns:

  • Date, which contains every unique date value in the input.
  • A column for each unique firm in daily_growth_by_firm, where the value is equal to the growth of that firm's stock for the relevant Date.
    • The value for the stock growth is contained in the Day_Growth column of daily_growth_by_firm.
  • Non_<FIRM>_Average_Growth (as defined below)

Requirements/steps:

  1. Use Pandas. Do not use SQLite.
  2. pandas.DataFrame.pivot may be helpful
  3. To calculate Non_<FIRM>_Average_Growth:
    1. For a given date in Date, take the mean of the growths for every firm which is not equal to firm.
      • For example, if our firms were AAPL, NFLX, and AMZN, and firm=NFLX, then Non_<FIRM>_Average_Growth would be the average growth of AAPL and AMZN for each day.
    2. You should replace the string "<FIRM>" in the column name with the value given by firm.
      • For example, if firm='AAPL', then your column name should be Non_AAPL_Average_Growth.
  4. You may need to reset the indexes before returning the resulting dataframe.
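
The pivot-then-average pattern from the hint looks like this on toy data. This is a minimal sketch; the firms, dates, and growth values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Date':       ['d1', 'd1', 'd1', 'd2', 'd2', 'd2'],
    'Firm':       ['AAPL', 'NFLX', 'AMZN'] * 2,
    'Day_Growth': [0.01, 0.02, 0.03, -0.01, 0.00, 0.05],
})

# One row per date, one column per firm
wide = toy.pivot(index='Date', columns='Firm', values='Day_Growth').reset_index()

# Average over every firm except the chosen one
firm = 'NFLX'
others = [f for f in toy['Firm'].unique() if f != firm]
wide[f'Non_{firm}_Average_Growth'] = wide[others].mean(axis=1)
```

For 'd1', the non-NFLX average is the mean of 0.01 (AAPL) and 0.03 (AMZN), i.e. 0.02.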

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
print('Properly Averaged Industry Growth:')

normalized_returns_demo_df_CORRECT = utils.load_object_from_publicdata('normalized_returns_demo_df.dill')
display(normalized_returns_demo_df_CORRECT)
In [ ]:
### Solution - Exercise 5  
def normalize_stock_growths(daily_growth_by_firm: pd.DataFrame, firm: str) -> pd.DataFrame:
    # Create a list of every firm, except for the one we specified in the inputs
    non_chosen_firms = set(daily_growth_by_firm['Firm']) - set([firm])
    non_chosen_firms = list(non_chosen_firms)

    # Pivot the data for easy manipulation
    pivoted_data = daily_growth_by_firm.pivot(
        index='Date', 
        columns='Firm',
        values='Day_Growth'
        ).reset_index()

    # Calculate the total growth for every firm that isn't our firm
    pivoted_data[f'Non_{firm}_Average_Growth'] = pivoted_data[non_chosen_firms]\
        .sum(axis=1) / len(non_chosen_firms)
    
    return pivoted_data

### Demo function call
normalized_returns_demo_df = normalize_stock_growths(golden_crosses_reduced_df, 'NFLX')
display(normalized_returns_demo_df)


The test cell below will always pass. Please submit to collect your free points for normalize_stock_growths (exercise 5).

In [ ]:
### Test Cell - Exercise 5  


print('Passed! Please submit.')

Exercise 6: (1 points)¶

summarize_netflix_original_genres

Your task: define summarize_netflix_original_genres as follows:

Define the function to return a SQL query which summarizes the trends of Netflix originals by year and genre, as detailed below.

Inputs: None

Return: query: a Python string, containing a SQLite query. It should query the database to obtain a table with three columns:

  • year
  • genre
  • genre_count

Requirements/steps:

  • Your query should:
    1. Count the number of Netflix original shows, grouped by year and by genre.
      • The year variable should be extracted from the premiere variable. Note that in this case, year will be a string. We suggest you use the STRFTIME SQLite function to extract year from the premiere variable.
    2. Order them by:
      1. genre_count (descending)
      2. year (ascending)
      3. genre (ascending)
    3. Return the top 10 rows
  • The database table you will need is named nflx_originals.
  • Make sure you rename your column variables so that they match the specifications above.
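As a quick illustration of the STRFTIME-based grouping (using a hypothetical in-memory table, not the exam database), note that the extracted year comes back as a string:

```python
import sqlite3

# Hypothetical in-memory table -- not the exam's nflx_originals
conn_demo = sqlite3.connect(':memory:')
conn_demo.executescript("""
CREATE TABLE shows (title TEXT, premiere TEXT, genre TEXT);
INSERT INTO shows VALUES
  ('A', '2020-01-10', 'Drama'),
  ('B', '2020-06-02', 'Drama'),
  ('C', '2021-03-15', 'Comedy');
""")

# STRFTIME('%Y', ...) extracts the year from an ISO date string
rows = conn_demo.execute("""
    SELECT STRFTIME('%Y', premiere) AS year, genre, COUNT(*) AS genre_count
    FROM shows
    GROUP BY year, genre
    ORDER BY genre_count DESC, year, genre
""").fetchall()
print(rows)  # year is a string, e.g. ('2020', 'Drama', 2)
```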

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
print('A proper summary of genres for the demo:')

netflix_original_genre_summary_df_CORRECT = utils.load_object_from_publicdata('netflix_original_genre_summary_df.dill')
display(netflix_original_genre_summary_df_CORRECT)
In [ ]:
### Solution - Exercise 6  
def summarize_netflix_original_genres()-> str:
    ### BEGIN SOLUTION
    query = '''
            SELECT
                STRFTIME("%Y", premiere) AS year,
                genre,
                COUNT(genre) AS genre_count
            FROM nflx_originals no2
            GROUP BY genre, year
            ORDER BY 
                genre_count DESC,
                year,
                genre
            LIMIT 10
            '''
    return query
    ### END SOLUTION

### Demo function call
summary_query = summarize_netflix_original_genres()
netflix_original_genres = pd.read_sql(summary_query, conn)
display(netflix_original_genres)


The cell below will test your solution for summarize_netflix_original_genres (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 6  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(summarize_netflix_original_genres),
              ex_name='summarize_netflix_original_genres',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to summarize_netflix_original_genres did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sql_executor(summarize_netflix_original_genres),
              ex_name='summarize_netflix_original_genres',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to summarize_netflix_original_genres did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 7: (2 points)¶

calc_netflix_top10_scores

Your task: define calc_netflix_top10_scores as follows:

Define the function to return a SQL query which calculates, for each show: the number of times it appeared in the top 10, its total score, and its average score while it was in the top 10. The concept of a 'score' is defined below.

Inputs: None

Return: query: a Python string, containing a SQLite query. It should query the database table nflx_top to obtain a table with the following attributes:

  • title: the title of the show
  • total_score: the sum of the show's score.
  • occurrence: the number of times a show appeared in the top 10.
  • avg_score: the total_score divided by the occurrence value for a given show.

Requirements/steps:

  1. You will need to calculate show scores to solve this problem:
    • A show ranked #1 on a day should earn 10 points towards its total score. A show ranked #2 should earn 9, etc. until a show ranked #10 should earn 1 point.
  2. When calculating avg_score, make sure the division produces a floating-point value (SQLite truncates when dividing two integers)
  3. Order the results by show:
    1. total_score in descending order
    2. title in descending order
  4. Return the top 10 rows

Hints:

  • You may need to use either a subquery or a CTE to solve this problem. You may use whichever you prefer.
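Two small sanity checks may help here: the rank-to-score rule, and SQLite's integer-division gotcha that requirement 2 warns about (the values below are illustrative, not from the exam data):

```python
import sqlite3

# Rank-to-score rule from the prompt: rank 1 -> 10 points, ..., rank 10 -> 1 point
assert [11 - r for r in range(1, 11)] == [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

# SQLite divides two integers with truncation; multiplying by 1.0 forces a float,
# which is how avg_score should be computed
conn_demo = sqlite3.connect(':memory:')
intdiv, floatdiv = conn_demo.execute('SELECT 7 / 2, 7 * 1.0 / 2').fetchone()
print(intdiv, floatdiv)  # 3 3.5
```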

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
print('A proper summary of show score metrics:')

netflix_top_10_scores_df_CORRECT = utils.load_object_from_publicdata('netflix_top_10_scores_df.dill')
display(netflix_top_10_scores_df_CORRECT)
In [ ]:
### Solution - Exercise 7  
def calc_netflix_top10_scores() -> str:
    ### BEGIN SOLUTION
    # An approach using CTEs
    ctequery = '''
    WITH scores AS (
        SELECT 
            sum(11 - rank) AS total_score,
            count(title) AS occurrence,
            title
        FROM nflx_top
        GROUP BY title
        ORDER BY total_score DESC, title DESC)
    SELECT
        title,
        total_score,
        occurrence,
        (total_score * 1.0 / occurrence) as avg_score
    FROM scores
    LIMIT 10;
    '''
    # An approach without using CTEs
    query='''
        SELECT title,
            sum(11 - rank) AS total_score,
            count(title) AS occurrence,
            1.0*sum(11 - rank)/count(title) avg_score
        FROM nflx_top
        GROUP BY title
        ORDER BY total_score DESC, title DESC
        LIMIT 10
    '''

    return ctequery
    ### END SOLUTION

### Demo function call
scores_query = calc_netflix_top10_scores()
netflix_top10_scores = pd.read_sql(scores_query, conn)
display(netflix_top10_scores)


The cell below will test your solution for calc_netflix_top10_scores (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 7  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(calc_netflix_top10_scores),
              ex_name='calc_netflix_top10_scores',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calc_netflix_top10_scores did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sql_executor(calc_netflix_top10_scores),
              ex_name='calc_netflix_top10_scores',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calc_netflix_top10_scores did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Exercise 8: (3 points)¶

longterm_avg_score

Your task: define longterm_avg_score as follows:

Define the function to return a SQL query which calculates the long-term score for each Netflix Original, as defined below.

Inputs: None

Return: query: a Python string, containing a SQLite query. It should query the database to obtain a table with the following attributes:

  • title - (string): the titles of the Netflix Originals
  • total_score - (int): the sum of all scores for each title given in nflx_originals
    • The definition of a "score" is detailed below
  • date_count - (int): the number of dates that a given title is available, as detailed below
  • longterm_avg_score - (float): the long-term average score, defined as:
    • longterm_avg_score = total_score/date_count

Requirements/steps:

  • The data you will need is in the following tables:
    • nflx_originals
    • nflx_top
  • Calculate the values above based on the following details:
    • A score is a value associated with a Netflix title on a given date, where the score is defined as:
      • $\text{score}_{d} = 11 - \text{rank}_{d}$, for a given date $d$
      • If no rank exists for a given show on a given date, you should assume the score is implicitly equal to 0.
    • To calculate date_count, count the number of date values in nflx_top that exist for a title, where the date is greater than or equal to the premiere for that title
  • We recommend the following approach. Intermediate steps are indicated by angle brackets (you may find using CTEs helpful!):
    1. Obtain <distinct_dates>: Identify all distinct date values in nflx_top.
    2. Obtain <title_dates>: For each title in nflx_originals, identify any dates in <distinct_dates> which are greater than or equal to the title's premiere.
    3. Obtain <title_date_score>: For each title/date pair in <title_dates>, identify any records from nflx_top which have the same title/date pair.
      • Use the rank field from nflx_top to calculate a given score.
      • Any title/date pairs from <title_dates> which have no matching record in nflx_top should not be dropped; they should instead have a NULL score value in <title_date_score>.
    4. Calculate the final result.
      1. For each title in <title_date_score>, calculate:
        1. total_score - compute the sum of all the score values sharing a common title.
        2. date_count - compute the count of all date values sharing a common title.
        3. longterm_avg_score - compute the quotient of the total_score over date_count.
      2. Filter out all records with longterm_avg_score that are NULL.
      3. Finally, the results should be ordered by:
        1. longterm_avg_score in descending order
        2. The show's title in ascending order
      4. Return the top 10 rows

Hints:

  • You may need to use either a subquery or a CTE to solve this problem. You may use whichever you prefer.
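The key mechanics of step 3 are that a LEFT JOIN keeps unmatched dates with a NULL score, and that SUM skips NULLs while COUNT over the date column counts every kept row. A minimal in-memory sketch (hypothetical tables; `rnk` stands in for nflx_top's rank column):

```python
import sqlite3

# Hypothetical mini tables mimicking <title_dates> and nflx_top for one title
conn_demo = sqlite3.connect(':memory:')
conn_demo.executescript("""
CREATE TABLE title_dates(d TEXT);
INSERT INTO title_dates VALUES ('2022-02-25'), ('2022-02-26');
CREATE TABLE rankings(d TEXT, rnk INTEGER);  -- rnk stands in for nflx_top.rank
INSERT INTO rankings VALUES ('2022-02-26', 4);
""")

# The LEFT JOIN keeps the unmatched date, giving it a NULL score
rows = conn_demo.execute("""
    SELECT title_dates.d, 11 - rankings.rnk AS score
    FROM title_dates LEFT JOIN rankings ON title_dates.d = rankings.d
""").fetchall()
print(rows)

# SUM ignores the NULL score, while COUNT(title_dates.d) counts both rows
total, n = conn_demo.execute("""
    SELECT SUM(11 - rankings.rnk), COUNT(title_dates.d)
    FROM title_dates LEFT JOIN rankings ON title_dates.d = rankings.d
""").fetchone()
print(total, n)  # 7 2
```

This is exactly why the implicit score of 0 falls out for free: unmatched dates contribute nothing to the SUM but still raise the date count.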

An Example:

Let's take a deeper look specifically at the show Vikings: Valhalla.

  • According to nflx_originals table, the show premiered on 2022-02-25.
  • The dataframe longterm_avg_vikings_breakout_df_CORRECT below uses the results from the nflx_top table for the show titled Vikings: Valhalla.
    • Since the show premiered 2022-02-25, within nflx_top for Vikings: Valhalla, the dates greater than or equal to the premiere date range from 2022-02-25 to 2022-03-11.
      • Therefore, the show could theoretically receive a score from 2022-02-25 to 2022-03-11, which is 15 days
      • However, there is no entry for 2022-02-25 indicating that Vikings: Valhalla does not have a score for that date.
      • Vikings: Valhalla received its first ranking on 2022-02-26. Therefore, its score for 2022-02-26 would be: $$(11-rank)=(11-4)=7$$
    • The dataframe longterm_avg_vikings_breakout_df_CORRECT below shows a table with the show title, premiere, date, rank, and score for all dates in the range from 2022-02-25 to 2022-03-11 for Vikings: Valhalla.
  • It is possible to then calculate the total_score, date_count, and longterm_avg_score for the show Vikings: Valhalla: $$\text{longterm_avg_score}=\text{total_score}/\text{date_count}=100/15=6.666667$$
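The arithmetic in the worked example above can be checked in a couple of lines (the totals are taken directly from the example):

```python
# Sanity-check of the Vikings: Valhalla arithmetic from the example above
first_day_score = 11 - 4            # rank 4 on 2022-02-26 -> score 7
total_score, date_count = 100, 15   # totals from the worked example
longterm = total_score / date_count
print(first_day_score, round(longterm, 6))  # 7 6.666667
```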
In [ ]:
longterm_avg_vikings_breakout_df_CORRECT = utils.load_object_from_publicdata('longterm_avg_vikings_breakout_df.dill')
display(longterm_avg_vikings_breakout_df_CORRECT)

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
longterm_avg_scores_df = utils.load_object_from_publicdata('longterm_avg_scores_df.dill')
display(longterm_avg_scores_df)
In [ ]:
### Solution - Exercise 8  
def longterm_avg_score() -> str:
    ### BEGIN SOLUTION
    # An approach using CTEs
    ctequery='''
            WITH 
            distinct_dates as (
                SELECT DISTINCT date FROM nflx_top), 
            title_dates as (
                SELECT no.title, no.premiere, dd.date
                FROM nflx_originals no
                LEFT JOIN distinct_dates dd
                ON no.premiere <= dd.date), 
            title_date_score as (
                SELECT td.title,  td.date, 
                    (11 - nt.rank) score
                FROM title_dates td
                LEFT JOIN nflx_top nt
                ON nt.title = td.title AND nt.date = td.date)
            -- 
            -- Main Query
            --
            SELECT title,
                sum(score) total_score,
                count(date) date_count,
                sum(score) * 1.0 / count(date) longterm_avg_score
            FROM title_date_score
            GROUP BY title
            HAVING longterm_avg_score is not null
            ORDER BY longterm_avg_score DESC, title
            LIMIT 10
            '''
    # An approach without using CTEs
    query = '''
            SELECT
                n_orig.title,
                SUM(11 - n_top.rank) total_score,
                COUNT(n_orig.title) date_count,
                1.0 * SUM(11 - n_top.rank) / COUNT(n_orig.title) AS longterm_avg_score
            FROM nflx_originals n_orig
            LEFT JOIN (
            SELECT
                DISTINCT date
            FROM nflx_top
            ) all_dates
                ON n_orig.premiere <= all_dates.date
            LEFT JOIN nflx_top n_top
                ON n_top.title = n_orig.title
                AND n_top.date = all_dates.date
            GROUP BY n_orig.title
            HAVING longterm_avg_score is not null
            ORDER BY
                longterm_avg_score DESC,
                n_orig.title
            LIMIT 10
            '''
    return query
    ### END SOLUTION

### Demo function call
query = longterm_avg_score()
longterm_avg_scores = pd.read_sql(query, conn)
display(longterm_avg_scores)


The cell below will test your solution for longterm_avg_score (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 8  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(longterm_avg_score),
              ex_name='longterm_avg_score',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to longterm_avg_score did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=plugins.sql_executor(longterm_avg_score),
              ex_name='longterm_avg_score',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to longterm_avg_score did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
In [ ]:
### Run Me!!!
nflx_scores_demo_df = utils.load_object_from_publicdata('nflx_scores_demo_df')
normalized_growths_df = utils.load_object_from_publicdata('normalized_growths_df')

Exercise 9: (2 points)¶

construct_model_inputs

Your task: define construct_model_inputs as follows:

Define the model matrix and response for our regression model, as shown below.

Inputs:

  • normalized_growths_df: a Pandas DataFrame. It contains the results from exercise 5.
  • show_scores_df: a Pandas DataFrame. It contains the pre-computed scores for each show on a given date.

Return: model_matrix: a Pandas DataFrame containing the independent variables (show scores) and response variable (Relative_Growth)

Requirements/steps:

  1. To obtain the independent variables, you will need to pivot the show_scores_df so that each show is represented as its own column. The index should reflect the dates. The values should be set to the show's score.
    • Fill any missing values with 0's.
    • See pandas.DataFrame.pivot for more information
    • You may also refer to Exercise 5 from above too
  2. You'll need to join the pivoted dataframe with the normalized_growths_df on date. We'll call this model_matrix
  3. Return the model_matrix but only with the columns for the shows and Relative_Growth
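The pivot / fillna / merge sequence described above can be sketched with tiny hypothetical inputs (show names and values are made up; the real inputs come from exercises 5 and 8):

```python
import pandas as pd

# Hypothetical mini inputs -- shapes mirror the exercise, not the real data
scores = pd.DataFrame({
    'date': ['2022-01-03', '2022-01-03', '2022-01-04'],
    'title': ['Show A', 'Show B', 'Show A'],
    'score': [10.0, 7.0, 9.0],
})
growths = pd.DataFrame({'Date': ['2022-01-03', '2022-01-04'],
                        'Relative_Growth': [0.01, -0.02]})

# Pivot so each show is a column; unseen show/date pairs become 0
wide = scores.pivot(index='date', columns='title', values='score').fillna(0)

# Join on the date; pandas accepts the pivot's index level name as a merge key
mm = wide.merge(growths, left_on='date', right_on='Date')
print(mm[['Show A', 'Show B', 'Relative_Growth']])
```

Note how 'Show B' gets a 0 on 2022-01-04, where it had no score row before the pivot.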

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
print('A proper model matrix for the demo:')

model_matrix_demo_df_CORRECT = utils.load_object_from_publicdata('model_matrix_demo_df.dill')
display(model_matrix_demo_df_CORRECT)
In [ ]:
### Solution - Exercise 9  
def construct_model_inputs(normalized_growths_df: pd.DataFrame, show_scores_df: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    show_scores_filled = show_scores_df.pivot(index='date', columns='title', values='score').fillna(0)
    all_shows = show_scores_filled.columns
    model_matrix = show_scores_filled.merge(normalized_growths_df, left_on='date', right_on='Date')
    return model_matrix[['Relative_Growth', *all_shows]]
    ### END SOLUTION

### Demo function call
model_matrix = construct_model_inputs(normalized_growths_df, nflx_scores_demo_df)
display(model_matrix)


The cell below will test your solution for construct_model_inputs (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 9  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=construct_model_inputs,
              ex_name='construct_model_inputs',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to construct_model_inputs did not pass the test.'

### BEGIN HIDDEN TESTS
passed, test_case_vars = execute_tests(func=construct_model_inputs,
              ex_name='construct_model_inputs',
              key=b'deVhBT3kO4xkGnbJLP5zPloS7UHLzunkF-JOvkYVvhI=', 
              n_iter=10,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to construct_model_inputs did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')

Fin¶

If you have made it this far, congratulations! You are done. Don't forget to submit the assignment!

Postscript¶

If you run the regression model below on all of the collected data, you will obtain summary model results like the following:

##############################################
### The code below is provided for illustration and reproducibility, but it is commented out because of the extra processing time it adds to the auto-grader. Note: the model was run on all NFLX growth data and top-shows data.
##############################################
# Let's use step-wise regression to do feature selection
mm=model_matrix.copy(deep=True)
y = mm['Relative_Growth']
del mm['Relative_Growth']
X = mm
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

model = LinearRegression()
feature_selector = SequentialFeatureSelector(model, n_features_to_select=min(len(X.columns)-1,10))
feature_selector.fit(X, y)
selected_X = X[feature_selector.get_feature_names_out()]
selected_X = sm.add_constant(selected_X)
print(selected_X.columns)
selected_model = sm.OLS(y, selected_X)
results = selected_model.fit()
results.summary()

(image: OLS regression summary from results.summary())

So, how does our model do?

Step-wise regression is a greedy algorithm that iteratively selects the features which most improve the model's performance. We tried to find the top-10 shows most predictive of the normalized stock values and found the following shows:

  1. Big Mouth
  2. Dirty John
  3. Emily in Paris
  4. I Am a Killer Released
  5. Lucifer
  6. Selling Sunset
  7. Squid Game
  8. The Last Kingdom
  9. The Royal Treatment
  10. Vivo

Note that these don't necessarily mean that they caused the stock value to improve. For example, The Royal Treatment was significantly associated with a negative trend in Netflix's stock growth.

A better question might be: "how reliable are these results?"

Ultimately, the performance of the model constructed from the input provided by exercise 9 is relatively poor. This model exhibits an adjusted $R^2$ value of only 0.042. Predicting stock prices is intrinsically hard, as there is an exceptionally high amount of variance in each observation and there are many reasons for fluctuations in market prices. Therefore, it is unsurprising to see that a lot of the variance remains.

We are also fighting "the curse of dimensionality," which means the excess data we have can be more of a hindrance than a help. However, the F-score for our model is quite good (with a p-value of 0.000684), which suggests that our model does a much better job at predicting stock value changes than a naive model.

You will learn more about various modeling techniques and how to use them during the rest of the course. Improving the regression using the results of this exam is left as an exercise for the student; for high-quality regression outputs, you may find the statsmodels package particularly useful. Here's one approach you might find worth exploring:

  • Calculate the logarithm of the golden cross ratio for the standardized stock prices. Use this as your response variable.
  • Calculate the logarithm of the golden cross ratio for the Netflix original show scores. Use these ratios as your predictors.
  • Use variable selection, as shown above, to find the strongest correlations. You may be surprised by the strength of the relationships!