Midterm 2, Fall 2024: Netflix and Bills

Version 1.0.3

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of ['General Python', 'Tabular Data', 'Pandas', 'SQL and SQLite', 'NumPy']. It has 10 exercises, numbered 0 to 9. There are 17 available points; however, to earn 100%, the threshold is 12 points. (Therefore, once you hit 12 points you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we do not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or a chunk of rows at a time).

Exercise point breakdown:

  • Exercise 0: 1 point(s)

  • Exercise 1: 1 point(s) - FREE

  • Exercise 2: 3 point(s)

  • Exercise 3: 2 point(s)

  • Exercise 4: 2 point(s)

  • Exercise 5: 0 point(s) - FREE

  • Exercise 6: 1 point(s)

  • Exercise 7: 2 point(s)

  • Exercise 8: 3 point(s)

  • Exercise 9: 2 point(s)

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

The Problem: Analyzing Netflix's Stocks and Services

Background. Netflix is a firm in the entertainment industry which has grown to prominence in the streaming era of movies and television. The company distributes both acquired and original programming via a streaming service.

Your overall task. Your goal is to attempt to evaluate whether there is a relationship between the programming that is distributed by Netflix and the financial performance of the firm. You will do this in two main steps:

  1. First, you will analyze the general financial standing of the firm.
  2. Second, you will investigate the trends present in Netflix's program offerings.

Once you have obtained the results from both steps, you will use the results to build a model which will attempt to evaluate whether there is a relationship between the programs and the firm's financial standing.

The datasets. There are three sources of data which you will use to solve the following exercises.

  1. Financial data for several tech firms, sourced from Yahoo Finance.
  2. A daily ranking of Netflix's top-10 programs which spans 2 years, taken from The Numbers project by Nash Information Services.
  3. A collection of information taken from Wikipedia on Netflix original programming.

The financial data (source 1) is provided as a Python list of dictionaries. You will load this data into a Pandas DataFrame. The daily Netflix data (source 2) and original programming information (source 3) are contained in a SQLite database.

SQLite's syntax documentation can be found here. You may find that other online resources are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!

In [1]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
In [2]:
from pprint import pprint, pformat

import numpy as np
import pandas as pd
import sqlite3
import random
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Load starting data and define a database connection
stock_data = utils.load_object_from_publicdata('raw_stock_data_nflx.dill')

conn = sqlite3.connect('resource/asnlib/publicdata/NetflixDb.sqlite_db')
Successfully loaded raw_stock_data_nflx.dill.

Exercise 0: (1 point)

load_data

Your task: define load_data as follows:

Load the data contained in records into a Pandas DataFrame. See the instructions below for the requirements.

Inputs:

  • records: A list of dictionaries. Each dictionary contains information for a single row of data.

Return: full_data: A Pandas DataFrame containing the data held in records. The DataFrame should contain the following columns:

  • Firm
    • This variable is contained in the dictionaries under the Stock key
  • Date
    • This variable's datatype should be set to datetime64[ns, America/New_York]
  • Open
  • Close
  • Volume

Requirements/steps:

  • Do not remove any records with NaN values!
  • The DataFrame rows may be ordered arbitrarily.

Hints

  • You might find this part of the Pandas Documentation useful if you're struggling to load the data into Pandas.
  • There are many ways to approach this problem. We recommend this general approach:
    1. Load the data into Pandas
    2. Convert the date to the proper datatype
      • We have defined a helper function, convert_timezone_types() to make this easier. convert_timezone_types() takes an input of a pd.Series with dtype "string" and then outputs a pd.Series with dtype "datetime64[ns, America/New_York]". See the demo code for more examples.
      • You do not have to use our helper function. If you wish, you could instead take a look at to_datetime, tz_convert, and this post on stackoverflow. (A minimal sketch of this alternative appears after this list.)
    3. Rename the columns
    4. Return the results
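If you choose the to_datetime/tz_convert route, a minimal sketch (on a throwaway frame, not the exam data; passing utc=True is one way to handle the mixed UTC offsets in this dataset) might look like:

import pandas as pd

demo = pd.DataFrame({'Date': ['2017-11-15 00:00:00-05:00', '2018-09-02 00:00:00-04:00']})
# Parse to UTC first, then convert to the target timezone
demo['Date'] = pd.to_datetime(demo['Date'], utc=True).dt.tz_convert('America/New_York')
print(demo['Date'].dtype)  # datetime64[ns, America/New_York]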

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [3]:
stock_subsample_df_demo_CORRECT = utils.load_object_from_publicdata('stock_subsample_df_demo.dill')
convert_timezone_types          = utils.load_object_from_publicdata('convert_timezone_types.dill')
display(stock_subsample_df_demo_CORRECT)
Successfully loaded stock_subsample_df_demo.dill.
Successfully loaded convert_timezone_types.dill.
Date Firm Open Close Volume
0 2018-09-02 00:00:00-04:00 GOOG NaN NaN NaN
1 2021-09-07 00:00:00-04:00 AMZN 173.899994 175.464493 54758000.0
2 2023-09-26 00:00:00-04:00 GOOG 130.914001 129.449997 20378800.0
3 2017-11-15 00:00:00-05:00 NFLX 194.539993 192.119995 4203100.0
4 2022-02-08 00:00:00-05:00 GOOG 138.991257 139.212997 34256000.0
5 2023-08-28 00:00:00-04:00 META 287.694748 289.952362 14239300.0
6 2023-12-09 00:00:00-05:00 NFLX NaN NaN NaN
7 2021-01-15 00:00:00-05:00 META 247.637243 251.093582 24942900.0
8 2018-11-26 00:00:00-05:00 NFLX 260.549988 261.429993 12498600.0
9 2018-03-13 00:00:00-04:00 AAPL 43.227056 42.606789 126774000.0
In [5]:
stock_data[:10]
Out[5]:
[{'Date': '2015-01-02 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': 24.86195818653811,
  'Close': 24.402172088623047,
  'Volume': 212818400.0},
 {'Date': '2015-01-03 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': nan,
  'Close': nan,
  'Volume': nan},
 {'Date': '2015-01-04 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': nan,
  'Close': nan,
  'Volume': nan},
 {'Date': '2015-01-05 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': 24.170046484250545,
  'Close': 23.714723587036133,
  'Volume': 257142000.0},
 {'Date': '2015-01-06 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': 23.77945591892561,
  'Close': 23.716960906982425,
  'Volume': 263188400.0},
 {'Date': '2015-01-07 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': 23.92675744290404,
  'Close': 24.049516677856445,
  'Volume': 160423600.0},
 {'Date': '2015-01-08 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': 24.379850468289227,
  'Close': 24.973554611206055,
  'Volume': 237458000.0},
 {'Date': '2015-01-09 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': 25.14765081488744,
  'Close': 25.000341415405273,
  'Volume': 214798000.0},
 {'Date': '2015-01-10 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': nan,
  'Close': nan,
  'Volume': nan},
 {'Date': '2015-01-11 00:00:00-05:00',
  'Stock': 'AAPL',
  'Open': nan,
  'Close': nan,
  'Volume': nan}]
In [8]:
### Solution - Exercise 0  
def load_data(records: list) -> pd.DataFrame:
    
    '''
    Steps:
    1. Examine the input (records) to get a sense of what the data looks like.
    2. Create a dataframe with the list of dictionaries.
    3. Convert the date column to the data type "datetime64[ns, America/New_York]".
    4. Rename the 'Stock' column to be 'Firm'.
    5. Return the dataframe.
    
    Note:
    If you want to not use the built in function, this is how you would convert the dates:
    my_df['Date'] = my_df['Date'].apply(lambda x: pd.to_datetime(x).tz_convert('America/New_York'))
    '''
    
    my_df = pd.DataFrame(records)
    my_df['Date'] = convert_timezone_types(my_df['Date'])
    my_df = my_df.rename(columns={'Stock': 'Firm'})
        
    return my_df

### Demo function call
### `convert_timezone_types` demo ---------------------------------------------
# This helper function will help you change the data-type
# -- Input:  pd.Series, with dtype "string"
# -- Output: pd.Series, with dtype "datetime64[ns, America/New_York]"
type_conversion_demo = pd.DataFrame({
    'Date': ['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'],
    'DemoColumn': [1, 2]
})
# Convert the type!
type_conversion_demo['Date'] = convert_timezone_types(type_conversion_demo['Date'])
assert type_conversion_demo['Date'].dtype == 'datetime64[ns, America/New_York]' # It works!


### `load_data` demo ----------------------------------------------------------
# We'll sample the dataset down to 10 records, just to make debugging easier.
random.seed(6040)
stock_data_subsample = random.sample(stock_data, k=10)
# Here's what your solution produces!
print('Here is what your loaded data looks like in Pandas:')
stock_subsample_df = load_data(stock_data_subsample)
display(stock_subsample_df)
Here is what your loaded data looks like in Pandas:
Date Firm Open Close Volume
0 2018-09-02 00:00:00-04:00 GOOG NaN NaN NaN
1 2021-09-07 00:00:00-04:00 AMZN 173.899994 175.464493 54758000.0
2 2023-09-26 00:00:00-04:00 GOOG 130.914001 129.449997 20378800.0
3 2017-11-15 00:00:00-05:00 NFLX 194.539993 192.119995 4203100.0
4 2022-02-08 00:00:00-05:00 GOOG 138.991257 139.212997 34256000.0
5 2023-08-28 00:00:00-04:00 META 287.694748 289.952362 14239300.0
6 2023-12-09 00:00:00-05:00 NFLX NaN NaN NaN
7 2021-01-15 00:00:00-05:00 META 247.637243 251.093582 24942900.0
8 2018-11-26 00:00:00-05:00 NFLX 260.549988 261.429993 12498600.0
9 2018-03-13 00:00:00-04:00 AAPL 43.227056 42.606789 126774000.0


The cell below will test your solution for load_data (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [9]:
### Test Cell - Exercise 0  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(load_data),
              ex_name='load_data',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to load_data did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
load_data test ran 100 iterations in 0.91 seconds
Passed! Please submit.
In [10]:
### Run Me!!!
stock_df = utils.load_object_from_publicdata('stock_df')
Successfully loaded stock_df.

Exercise 1: (1 point) - FREE

preview_data

Example: we have defined preview_data as follows:

Preview the three data sources as outlined below.

Inputs:

  • records: A Pandas DataFrame containing the results of the full_data output generated by Exercise 0.
  • conn: A SQLite connection to our database.

Return: A tuple, containing the following elements:

  • records_preview: A Pandas DataFrame containing the first 10 entries of records.
  • nflx_originals_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_originals SQL database table.
  • nflx_top_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_top SQL database table.

Requirements/steps:

  • records_preview: If there are fewer than 10 entries in records, all entries should be returned.
  • nflx_originals_preview: The selected entries may be any 5 rows from the original table.
  • nflx_top_preview: The selected entries may be any 5 rows from the original table.

A solution is provided for you below. Run it and the test cell to earn your free point, then submit the assignment!

In [11]:
### Solution - Exercise 1  
def preview_data(records: pd.DataFrame, conn: sqlite3.Connection) -> tuple:
    # Keep only the first 10 records
    records_preview = records.head(n=10)

    # Select everything from nflx_originals and randomly sample 5 rows
    nflx_originals_preview = pd\
        .read_sql('SELECT * FROM nflx_originals', conn)\
        .sample(5)

    # Select everything from nflx_top and randomly sample 5 rows
    nflx_top_preview = pd\
        .read_sql('SELECT * FROM nflx_top', conn)\
        .sample(5)

    return records_preview, nflx_originals_preview, nflx_top_preview

### Demo function call
records_preview, nflx_originals, nflx_top = preview_data(stock_df, conn)
print('Preview of Stock Data')
display(records_preview)
print('------------------------------------------------------------------------')
print('Preview of Netflix Originals Table')
display(nflx_originals)
print('------------------------------------------------------------------------')
print('Preview of Netflix Top Programs Table')
display(nflx_top)
Preview of Stock Data
Date Firm Open Close Volume
0 2015-01-02 00:00:00-05:00 AAPL 24.861958 24.402172 212818400.0
1 2015-01-03 00:00:00-05:00 AAPL NaN NaN NaN
2 2015-01-04 00:00:00-05:00 AAPL NaN NaN NaN
3 2015-01-05 00:00:00-05:00 AAPL 24.170046 23.714724 257142000.0
4 2015-01-06 00:00:00-05:00 AAPL 23.779456 23.716961 263188400.0
5 2015-01-07 00:00:00-05:00 AAPL 23.926757 24.049517 160423600.0
6 2015-01-08 00:00:00-05:00 AAPL 24.379850 24.973555 237458000.0
7 2015-01-09 00:00:00-05:00 AAPL 25.147651 25.000341 214798000.0
8 2015-01-10 00:00:00-05:00 AAPL NaN NaN NaN
9 2015-01-11 00:00:00-05:00 AAPL NaN NaN NaN
------------------------------------------------------------------------
Preview of Netflix Originals Table
title genre premiere seasons runtime status language previous_network netflix_exclusive_region partner_or_country
58 Pluto Science fiction 2023-10-26 1 season, 8 episodes 56–71 min Pending Japanese
129 Bloodhounds Action drama 2023-06-09 1 season, 8 episodes 54–74 min Renewed
242 Rhythm + Flow France Music competition 2022-06-09 2 seasons, 16 episodes 36–48 min Season 3 due to premiere on July 4, 2024 French
220 The Indrani Mukerjea Story: Buried Truth True crime 2024-02-29 1 season, 4 episodes 44–50 min Pending English
154 Alpha Males Comedy 2022-12-30 2 seasons, 20 episodes 30–40 min Renewed
------------------------------------------------------------------------
Preview of Netflix Top Programs Table
date rank rank_one_week_prior title content_type nflx_exclusive nflx_release_date
3850 2021-04-21 1 - The Baker and the Beauty TV Show 2021-04-13
5674 2021-10-20 5 9 Cocomelon TV Show 2020-06-01
6072 2021-11-29 3 - Selling Sunset TV Show Yes 2019-03-22
1195 2020-07-29 6 - Fear City: New York vs. The… TV Show Yes 2020-07-22
2727 2020-12-29 8 3 Tiny Pretty Things TV Show Yes 2020-12-14


The test cell below will always pass. Please submit to collect your free points for preview_data (exercise 1).

In [12]:
### Test Cell - Exercise 1  


print('Passed! Please submit.')
Passed! Please submit.
In [13]:
### Run Me!!!
calculate_daily_growth_demo_input = utils.load_object_from_publicdata('calculate_daily_growth_demo_input')
Successfully loaded calculate_daily_growth_demo_input.

Exercise 2: (3 points)

calculate_daily_growth

Your task: define calculate_daily_growth as follows:

Calculate the daily growth for a given firm, as outlined below.

Inputs:

  • stock_records: a Pandas DataFrame. It contains the full stock data loaded in exercise 0.
  • firm: a string specifying the firm whose daily growth should be calculated.

Return: stock_growth: a Pandas DataFrame with the following columns:

  • Date
  • Firm
  • Open
  • Close
  • Volume
  • Day_Growth

Requirements/steps:

  1. Do not use SQLite.
  2. Create a new Pandas DataFrame with Day_Growth appended as a new column. The growth should be specific to the firm specified by the firm parameter.
  3. Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
    • You should use the next valid close price to calculate growth in these instances.
  4. Daily growth is defined as
    $\text{growth}_0 = (\text{close}_0 / \text{close}_{-1}) - 1$
    • In English: "Daily Growth is Today's Close divided by Yesterday's Close, minus one."
    • $\text{close}_{-x}$ is the value of Close, offset by $x$ positions. So, $\text{close}_{-1}$ is the value of Close offset by 1 position, i.e., the previous valid close.
    • You may find the pd.DataFrame.shift() method, and the related GroupBy shift method, helpful for offsetting the close value! (A tiny illustration of shift() appears after this list.)
  5. To repeat: Some days have no recorded values for opening and closing prices. For example, stock prices are not recorded during the weekend.
    • You should use the next valid close price to calculate growth in these instances.
    • For example, if:
      1. The close price on Friday is 10
      2. There are no close prices for the next two days (Saturday and Sunday)
      3. The close price on Monday is 12 ... then the daily growth for Monday should be equal to (12/10)-1=0.2.
    • You can make this easier by filtering out all NaN values when you initially calculate the growth.
  6. Your final DataFrame (stock_growth) should contain the same number of records as stock_records for that firm.
  7. Finally, reset the indexes before returning the resulting dataframe.
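To illustrate the shift() hint from step 4, here is a tiny, self-contained sketch of how shift() exposes the previous valid close (made-up prices, with NaNs already filtered out as step 5 suggests):

import pandas as pd

close = pd.Series([10.0, 12.0, 11.4])   # e.g., Friday, Monday, Tuesday closes
growth = close / close.shift(1) - 1     # today's close / previous close, minus one
print(growth.tolist())                  # approximately [nan, 0.2, -0.05]; the leading NaN becomes 0.0 in the final output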

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [14]:
print("Demo output for Netflix:")

daily_growth_df_CORRECT = utils.load_object_from_publicdata('daily_growth_df.dill')
display(daily_growth_df_CORRECT)
Demo output for Netflix:
Successfully loaded daily_growth_df.dill.
Date Firm Open Close Volume Day_Growth
0 2019-05-01 00:00:00-04:00 NFLX 374.000000 378.809998 9257300.0 0.000000
1 2019-05-02 00:00:00-04:00 NFLX 378.000000 379.059998 5398200.0 0.000660
2 2019-05-03 00:00:00-04:00 NFLX 381.529999 385.029999 5130300.0 0.015749
3 2019-05-04 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
4 2019-05-05 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
5 2019-05-06 00:00:00-04:00 NFLX 377.690002 378.670013 5793100.0 -0.016518
6 2019-05-07 00:00:00-04:00 NFLX 377.000000 370.459991 6974900.0 -0.021681
7 2019-05-08 00:00:00-04:00 NFLX 367.920013 364.369995 6572000.0 -0.016439
8 2019-05-09 00:00:00-04:00 NFLX 360.899994 362.750000 5882600.0 -0.004446
9 2019-05-10 00:00:00-04:00 NFLX 361.619995 361.040009 5657100.0 -0.004714
10 2019-05-11 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
11 2019-05-12 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
12 2019-05-13 00:00:00-04:00 NFLX 352.290009 345.260010 8026700.0 -0.043707
13 2019-05-14 00:00:00-04:00 NFLX 348.709991 345.609985 5353000.0 0.001014
14 2019-05-15 00:00:00-04:00 NFLX 343.339996 354.989990 6340100.0 0.027140
In [15]:
calculate_daily_growth_demo_input.head()
Out[15]:
Date Firm Open Close Volume
1580 2019-05-01 00:00:00-04:00 AAPL 50.444390 50.598213 259309200.0
1581 2019-05-02 00:00:00-04:00 AAPL 50.434774 50.268932 127985200.0
1582 2019-05-03 00:00:00-04:00 AAPL 50.687145 50.893845 83569600.0
1583 2019-05-04 00:00:00-04:00 AAPL NaN NaN NaN
1584 2019-05-05 00:00:00-04:00 AAPL NaN NaN NaN
In [21]:
### Solution - Exercise 2  
def calculate_daily_growth(stock_records: pd.DataFrame, firm: str) -> pd.DataFrame:
    
    '''
    Steps:
    1. Examine the input dataframe, stock_records, to understand what the data looks like. For us, that is 
        calculate_daily_growth_demo_input.
    2. Create a new dataframe called df.
        a. Create a copy of the stock_records df, so we don't accidentally modify the original dataframe.
        b. Filter so we are only looking at the firm specified in the "firm" input string.
        c. Create a new column called Day_Growth. I decided to initialize it so every value was 0.0, since this will
            help with the days with NaN Open and Close values.
    3. Isolate the rows that have valid open / close prices.
        a. Create one dataframe that only contains rows with NaN values for Open. We want to be able to merge these
            back with our other dataframe in the end.
        b. Create another dataframe with NaN rows dropped (for the columns open and close).
                Note: for this, I looked at the Pandas documentation for dropna to remember the syntax for looking
                    at a subset.
    4. Utilize the shift() function to easily get the last closing value.
        a. Look at the linked documentation for shift and some examples.
        b. Use the shift function and print the results to confirm that it is doing what we want it to.
        c. Create a new column called "last_close" to hold these shifted values.
            Note: I kept getting the SettingWithCopyWarning when doing this. I ignored it until the end and came
                back to it. There were two ways that I found to fix it: either adding a .copy() at the end of the
                statement creating df_no_na or by using an "assign" function that I found when looking through
                StackOverflow. I have included both below.
    5. Calculate the Day_Growth column using the formula above: (close_0 / close_{-1}) - 1.
        a. Since we will want to use multiple columns, I automatically knew that we would want to use an apply
            function along the non-default axis. I always forget which axis to use, so I typically try both to
            determine which is correct. In this case, it would be axis=1.
        b. Use the apply function and then set the Day_Growth column equal to these calculated values.
    6. Combine the NaN rows with the rows that we just used (df_no_na).
        a. Delete the column "last_close" that we created for temporary storage.
        b. Use the concat function to add the na_rows to our dataframe, df_no_na.
        c. Sort the values by date to fix the order.
    7. Reset the index of stock_growth and return.
    
    When I followed these steps, I had an error. I compared my output with the expected output to realize that the
    first row of my df_no_na dataframe had a Day_Growth of NaN due to the nature of the shift function. So, I filled
    the NaN rows in df_no_na to fix this problem.
    '''
    
    # step 2
    df = stock_records.copy()
    df = df[df['Firm']==firm]
    df['Day_Growth'] = 0.000000
    
    # step 3
    na_mask = df['Open'].isna()
    na_rows = df[na_mask]
    # print(na_rows.head())
    
    df_no_na = df.dropna(subset=['Open', 'Close']).copy()
    
    # step 4
    shifted_vals = df_no_na['Close'].shift(periods=1)
    # print(shifted_vals)
    df_no_na = df_no_na.assign(last_close = shifted_vals.values)
    # if using .copy(), could just say df_no_na['last_close'] = list(shifted_vals.values)
    
    # step 5
    new_day_growth = df_no_na.apply(lambda x: (x['Close']/x['last_close'])-1, axis=1).copy()
    df_no_na = df_no_na.assign(Day_Growth = new_day_growth.values)
    # if using .copy(), could just say df_no_na['Day_Growth'] = list(new_day_growth.values)
    
    # step 6
    del df_no_na['last_close']
    df_no_na.fillna(0, inplace=True)        # fill the leading NaN Day_Growth created by shift (see the docstring note)

    stock_growth = pd.concat([na_rows, df_no_na])
    stock_growth = stock_growth.sort_values(by='Date')
    
    # step 7
    return stock_growth.reset_index(drop=True)
    
    

### Demo function call
print("Demo daily growth for Netflix's stock:")
daily_growth_nflx_demo = calculate_daily_growth(
    stock_records=calculate_daily_growth_demo_input,
    firm='NFLX'
)
display(daily_growth_nflx_demo)
Demo daily growth for Netflix's stock:
Date Firm Open Close Volume Day_Growth
0 2019-05-01 00:00:00-04:00 NFLX 374.000000 378.809998 9257300.0 0.000000
1 2019-05-02 00:00:00-04:00 NFLX 378.000000 379.059998 5398200.0 0.000660
2 2019-05-03 00:00:00-04:00 NFLX 381.529999 385.029999 5130300.0 0.015749
3 2019-05-04 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
4 2019-05-05 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
5 2019-05-06 00:00:00-04:00 NFLX 377.690002 378.670013 5793100.0 -0.016518
6 2019-05-07 00:00:00-04:00 NFLX 377.000000 370.459991 6974900.0 -0.021681
7 2019-05-08 00:00:00-04:00 NFLX 367.920013 364.369995 6572000.0 -0.016439
8 2019-05-09 00:00:00-04:00 NFLX 360.899994 362.750000 5882600.0 -0.004446
9 2019-05-10 00:00:00-04:00 NFLX 361.619995 361.040009 5657100.0 -0.004714
10 2019-05-11 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
11 2019-05-12 00:00:00-04:00 NFLX NaN NaN NaN 0.000000
12 2019-05-13 00:00:00-04:00 NFLX 352.290009 345.260010 8026700.0 -0.043707
13 2019-05-14 00:00:00-04:00 NFLX 348.709991 345.609985 5353000.0 0.001014
14 2019-05-15 00:00:00-04:00 NFLX 343.339996 354.989990 6340100.0 0.027140


The cell below will test your solution for calculate_daily_growth (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [22]:
### Test Cell - Exercise 2  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_daily_growth),
              ex_name='calculate_daily_growth',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_daily_growth did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calculate_daily_growth test ran 100 iterations in 1.42 seconds
Passed! Please submit.
In [23]:
### Run Me!!!
stock_daily_growths_df = utils.load_object_from_publicdata('stock_daily_growths_df')
stock_daily_growths_reduced_df = utils.load_object_from_publicdata('stock_daily_growths_reduced_df')
Successfully loaded stock_daily_growths_df.
Successfully loaded stock_daily_growths_reduced_df.

Exercise 3: (2 points)

calculate_annual_growth

Your task: define calculate_annual_growth as follows:

Calculate the annual growth for every firm and join the results to a copy of the original stock_records DataFrame.

Inputs:

  • stock_records: a Pandas DataFrame. It contains the results calculated by exercise 2.

Return: annual_growth: a Pandas DataFrame with the same columns as stock_records, along with columns for:

  • Year
  • Annual_Growth

Requirements/steps:

  1. Use Pandas. Do not use SQLite.
  2. Extract the year from the Date column of stock_records. Year is expected to be a string in the final output.
  3. The most common way to calculate Annual_Growth:
    • Take the product of (Day_Growth + 1), grouped by firm and year. Make sure to subtract 1 at the end.
  4. You may need to reset the indexes before returning the resulting dataframe.

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

  • Note: While not reflective of the output you are expected to produce, you may also find the annual_growth_grouped_df_demo information useful for checking your work. If you wish to use it, uncomment the code below to view it.

You should notice in this result that the same value for Annual_Growth appears in each row for a given Firm and Year. For example, rows where AAPL is the Firm and Year is '2018' all have an Annual_Growth of -0.053902, and NFLX rows for '2020' all have 0.671138.

The Annual_Growth is calculated by adding 1 to each Day_Growth observation, multiplying all of these terms together, and then subtracting 1 from the product.

  • The first few terms in the (AAPL, 2018) group would be (1+0.0)(1+0.017904)(1-0.000174)(1+0.004645)(1+0.011385)...
  • The product of all terms for (AAPL, 2018) is 0.946098.
  • Subtracting off 1 leaves -0.053902.

Mathematically, let $g_i(f,y)$ represent the Day_Growth value for firm $f$, year $y$, and observation $i$. Then the Annual_Growth value for the same firm and year, $G(f,y)$, is given by:

$$G(f,y) = \left[\prod_i \left(g_i(f,y) + 1\right)\right] - 1$$
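The compounding arithmetic is easy to sanity-check in pandas. Below is a small sketch: the first part reproduces the partial product of the (AAPL, 2018) terms listed above, and the second shows the grouped-product pattern on made-up data (the toy frame and its values are illustrative only):

import pandas as pd

# Partial product over the first few (AAPL, 2018) Day_Growth terms shown above
g = pd.Series([0.0, 0.017904, -0.000174, 0.004645, 0.011385])
print((g + 1).prod() - 1)  # compounded growth of just these five days

# Grouped-product pattern: one Annual_Growth value per (Firm, Year)
toy = pd.DataFrame({'Firm': ['A', 'A', 'B'],
                    'Year': ['2018', '2018', '2018'],
                    'Day_Growth': [0.10, -0.05, 0.20]})
toy['Annual_Growth'] = (toy['Day_Growth'] + 1).groupby([toy['Firm'], toy['Year']]).transform('prod') - 1
print(toy)  # 'A' rows get (1.1 * 0.95) - 1 = 0.045; the 'B' row gets 0.20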
In [24]:
print("Demo annual growth:")

annual_growth_demo_df_CORRECT = utils.load_object_from_publicdata('annual_growth_demo_df.dill')
display(annual_growth_demo_df_CORRECT)

### Uncomment these lines for additional information!
# print("Annual growth, grouped together:")

# annual_growth_grouped_df_demo = utils.load_object_from_publicdata('annual_growth_grouped_df_demo.dill')
# display(annual_growth_grouped_df_demo)
Demo annual growth:
Successfully loaded annual_growth_demo_df.dill.
Date Year Firm Open Close Volume Day_Growth Annual_Growth
0 2018-01-01 00:00:00-05:00 2018 AAPL NaN NaN NaN 0.000000 -0.053902
1 2018-01-02 00:00:00-05:00 2018 AAPL 40.120742 40.615883 102223600.0 0.017904 -0.053902
2 2018-01-03 00:00:00-05:00 2018 AAPL 40.679550 40.608814 118071600.0 -0.000174 -0.053902
3 2018-01-04 00:00:00-05:00 2018 AAPL 40.681893 40.797428 89738400.0 0.004645 -0.053902
4 2018-01-05 00:00:00-05:00 2018 AAPL 40.894105 41.261925 94640000.0 0.011385 -0.053902
... ... ... ... ... ... ... ... ...
5475 2020-12-27 00:00:00-05:00 2020 NFLX NaN NaN NaN 0.000000 0.671138
5476 2020-12-28 00:00:00-05:00 2020 NFLX 516.429993 519.119995 2891900.0 0.010020 0.671138
5477 2020-12-29 00:00:00-05:00 2020 NFLX 519.900024 530.869995 4022400.0 0.022634 0.671138
5478 2020-12-30 00:00:00-05:00 2020 NFLX 530.130005 524.590027 1876300.0 -0.011830 0.671138
5479 2020-12-31 00:00:00-05:00 2020 NFLX 525.530029 540.729980 5392300.0 0.030767 0.671138

5480 rows × 8 columns

In [34]:
### Solution - Exercise 3  
def calculate_annual_growth(stock_records: pd.DataFrame) -> pd.DataFrame:
    
    '''
    Steps:
    1. Examine the stock_records dataframe to remind us visually the difference between our input and expected
        output.
    2. Create a new dataframe called annual_growth with a new "Year" column.
        a. Create a copy of the input dataframe, stock_records.
        b. Extract the year from the Date column.
            Googled: "get year from datetime pandas"
            Referenced: https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-separately-from-pandas-datetime-column
        c. Convert Year to a string, as instructed above.
    3. Look at the instructions to get an understanding of how to calculate Annual Growth. The below instruction helps
        break it down into steps:
            "The Annual_Growth is calculated by adding 1 to each Day_Growth observation. Multiplying all of them 
             together. Then subtracting 1 from the product."
    4. Calculate Annual_Growth.
        a. Create a new column called "updated_day" that holds the value of Day_Growth + 1.
        b. Group by Firm and Year, multiplying the values of updated_day together.
            Googled: "pandas groupby multiply values from group set to new column"
            Referenced: https://stackoverflow.com/questions/69471371/pandas-multiply-row-value-by-groupby-of-another-column-as-a-new-column
        c. Delete the temporary updated_day column.
        d. Subtract 1 from the Annual_Growth.
    5. Return the final dataframe after reordering the columns and resetting the index.

    '''
    
    # step 2
    annual_growth = stock_records.copy()
    annual_growth['Year'] = annual_growth['Date'].dt.year
    annual_growth['Year'] = annual_growth['Year'].astype(str)
    
    # step 4
    annual_growth['updated_day'] = annual_growth.apply(lambda x: x['Day_Growth']+1, axis=1)
    annual_growth['Annual_Growth'] = annual_growth.groupby(by=['Firm', 'Year'])['updated_day'].transform('prod')
    del annual_growth['updated_day']
    annual_growth['Annual_Growth'] = annual_growth['Annual_Growth'].apply(lambda x: x-1)
    
    # step 5
    annual_growth = annual_growth[['Date', 'Year', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth', 'Annual_Growth']]
    return annual_growth.reset_index(drop=True)

### Demo function call
annual_growth_demo_df = calculate_annual_growth(stock_records=stock_daily_growths_reduced_df)
display(annual_growth_demo_df)
Date Year Firm Open Close Volume Day_Growth Annual_Growth
0 2018-01-01 00:00:00-05:00 2018 AAPL NaN NaN NaN 0.000000 -0.053902
1 2018-01-02 00:00:00-05:00 2018 AAPL 40.120742 40.615883 102223600.0 0.017904 -0.053902
2 2018-01-03 00:00:00-05:00 2018 AAPL 40.679550 40.608814 118071600.0 -0.000174 -0.053902
3 2018-01-04 00:00:00-05:00 2018 AAPL 40.681893 40.797428 89738400.0 0.004645 -0.053902
4 2018-01-05 00:00:00-05:00 2018 AAPL 40.894105 41.261925 94640000.0 0.011385 -0.053902
... ... ... ... ... ... ... ... ...
5475 2020-12-27 00:00:00-05:00 2020 NFLX NaN NaN NaN 0.000000 0.671138
5476 2020-12-28 00:00:00-05:00 2020 NFLX 516.429993 519.119995 2891900.0 0.010020 0.671138
5477 2020-12-29 00:00:00-05:00 2020 NFLX 519.900024 530.869995 4022400.0 0.022634 0.671138
5478 2020-12-30 00:00:00-05:00 2020 NFLX 530.130005 524.590027 1876300.0 -0.011830 0.671138
5479 2020-12-31 00:00:00-05:00 2020 NFLX 525.530029 540.729980 5392300.0 0.030767 0.671138

5480 rows × 8 columns


The cell below will test your solution for calculate_annual_growth (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [35]:
### Test Cell - Exercise 3  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_annual_growth),
              ex_name='calculate_annual_growth',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calculate_annual_growth did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calculate_annual_growth test ran 100 iterations in 21.78 seconds
Passed! Please submit.
In [36]:
### Run Me!!!
annual_growth_df = utils.load_object_from_publicdata('annual_growth_df')
Successfully loaded annual_growth_df.

Exercise 4: (2 points)

golden_cross_strategy

Your task: define golden_cross_strategy as follows:

Implement a function which calculates a Golden Cross strategy, as detailed below.

Inputs:

  • stockdf: a Pandas DataFrame. It contains the results calculated by exercise 3.
  • firm: a string. It specifies which firm to calculate the golden cross results for.
  • short_average: an integer. It defines the window for the short moving average. The number represents the number of days to include in the window.
  • long_average: an integer. It defines the window for the long moving average. The number represents the number of days to include in the window.

Return: golden_cross_results: a Pandas DataFrame with all of the columns from stockdf, plus:

  • Short_Average
  • Long_Average
  • Golden_Cross

Requirements/steps:

  1. Use Pandas. Do not use SQLite.
  2. Filter the DataFrame to include records for the firm parameter.
  3. Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
  4. Add the columns Short_Average and Long_Average to the Pandas DataFrame.
    • These columns should calculate the moving average of the Close column.
    • The window of the moving average should be equal to the size specified by the short_average and long_average arguments.
    • You may find the pd.DataFrame.rolling() method useful for creating the windows! (A small sketch appears after this list.)
  5. Add the column 'Golden_Cross', a boolean, to the Pandas DataFrame.
    • This column should indicate whether the short-window moving average is greater than the long-window moving average.
  6. You may assume that only one firm is provided for the firm argument.
  7. You may assume that stockdf is already ordered by date in ascending order.
  8. Finally, reset the indexes before returning the resulting dataframe.
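To see the rolling-window mechanics from step 4 in isolation, here is a minimal sketch on toy prices (window sizes 2 and 3 are illustrative; the real solution should use the short_average and long_average parameters):

import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
short = close.rolling(2).mean()   # NaN until the 2-day window fills
long_ = close.rolling(3).mean()   # NaN until the 3-day window fills
print((short > long_).tolist())   # [False, False, True, True, True]; comparisons with NaN evaluate to False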

Notes

The Golden Cross (https://www.investopedia.com/terms/g/goldencross.asp) is a technical analysis pattern that indicates a potential bullish trend reversal. It occurs when a faster, short-term moving average crosses above a slower, longer-term moving average. For example, a 50-day moving average (the faster MA) crossing above a 200-day moving average (the slower MA) creates a Golden Cross signal.

[Figure: a price chart illustrating a Golden Cross, with a shorter-term moving average crossing above a longer-term moving average.]

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [37]:
print("Example Golden Cross Demo:")

golden_cross_demo_df_CORRECT = utils.load_object_from_publicdata('golden_cross_demo_df.dill')
display(golden_cross_demo_df_CORRECT.head(15))
Example Golden Cross Demo:
Successfully loaded golden_cross_demo_df.dill.
Date Year Firm Open Close Volume Day_Growth Annual_Growth Short_Average Long_Average Golden_Cross
0 2015-01-02 00:00:00-05:00 2015 NFLX 49.151428 49.848572 13475000.0 0.000000 1.294549 NaN NaN False
1 2015-01-05 00:00:00-05:00 2015 NFLX 49.258572 47.311428 18165000.0 -0.050897 1.294549 NaN NaN False
2 2015-01-06 00:00:00-05:00 2015 NFLX 47.347141 46.501431 16037700.0 -0.017121 1.294549 47.887143 NaN False
3 2015-01-07 00:00:00-05:00 2015 NFLX 47.347141 46.742859 9849700.0 0.005192 1.294549 46.851906 NaN False
4 2015-01-08 00:00:00-05:00 2015 NFLX 47.119999 47.779999 9601900.0 0.022188 1.294549 47.008096 NaN False
5 2015-01-09 00:00:00-05:00 2015 NFLX 47.631428 47.041428 9578100.0 -0.015458 1.294549 47.188095 NaN False
6 2015-01-12 00:00:00-05:00 2015 NFLX 47.090000 45.547142 12944400.0 -0.031765 1.294549 46.789523 47.253265 False
7 2015-01-13 00:00:00-05:00 2015 NFLX 46.021427 46.255714 18721500.0 0.015557 1.294549 46.281428 46.740000 False
8 2015-01-14 00:00:00-05:00 2015 NFLX 47.071430 46.320000 19884900.0 0.001390 1.294549 46.040952 46.598367 False
9 2015-01-15 00:00:00-05:00 2015 NFLX 46.547142 46.251431 11669000.0 -0.001480 1.294549 46.275715 46.562653 False
10 2015-01-16 00:00:00-05:00 2015 NFLX 46.857143 48.191429 19696600.0 0.041945 1.294549 46.920953 46.769592 True
11 2015-01-20 00:00:00-05:00 2015 NFLX 48.571430 49.828571 40551000.0 0.033972 1.294549 48.090477 47.062245 True
12 2015-01-21 00:00:00-05:00 2015 NFLX 59.234287 58.468571 66902500.0 0.173394 1.294549 52.162857 48.694694 True
13 2015-01-22 00:00:00-05:00 2015 NFLX 58.571430 61.205715 33319300.0 0.046814 1.294549 56.500952 50.931633 True
14 2015-01-23 00:00:00-05:00 2015 NFLX 60.988571 62.494286 26784100.0 0.021053 1.294549 60.722857 53.251429 True
In [50]:
### Solution - Exercise 4  
def golden_cross_strategy(stockdf: pd.DataFrame, firm: str, short_average: int, long_average: int) -> pd.DataFrame:
    
    '''
    Steps:
    1. Filter the DataFrame to include records for the firm parameter.
        a. Create a copy of the stockdf dataframe.
        b. Only include rows where the 'Firm' is equal to the input string firm.
    2. Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
        a. Use dropna to filter out rows where Close values are NaN.
    3. Add the columns Short_Average and Long_Average to the Pandas DataFrame.
        a. Look at the linked documentation for how to use the rolling function and look at some examples.
        b. Compute the averages manually as the rolling sum divided by the window size (a Rolling object's .mean() method would work equally well).
        c. Use the assign method to avoid the SettingWithCopyWarning.
    4. Add the column 'Golden_Cross', a boolean, to the Pandas DataFrame.
        Method 1:
            a. Create a Series where values are True if the Short_Average is greater than the Long_Average.
            b. Create a new column called 'Golden_Cross' and assign these values to it.
        Method 2:
            Googled: "pandas new column based on condition of other columns"
            Referenced: https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition
            a. Use the example from StackOverflow to set the Golden_Cross column to be True if Short_Average >
                Long_Average and False if Short_Average <= Long_Average.
            b. When checking this, rows where Short_Average and/or Long_Average are NaN still end up with NaN in Golden_Cross.
            Googled: "fill na for one column pandas"
            Referenced: https://stackoverflow.com/questions/38134012/pandas-dataframe-fillna-only-some-columns-in-place
            c. Fill the NaN values in the Golden_Cross column with False.
    5. Finally, reset the indexes before returning the resulting dataframe.
    '''

    # step 1
    df = stockdf.copy()
    df = df[df['Firm']==firm]
    
    # step 2
    df_notna = df.dropna(subset=['Close'])
    
    # step 3
    short_window = (df_notna['Close'].rolling(short_average).sum())/short_average
    long_window = (df_notna['Close'].rolling(long_average).sum())/long_average
    
    df_notna = df_notna.assign(Short_Average = short_window.values)
    df_notna = df_notna.assign(Long_Average = long_window.values)

    
    # step 4, method 1
    golden_cross = df_notna['Short_Average']>df_notna['Long_Average']
    df_notna = df_notna.assign(Golden_Cross = golden_cross.values)
    
    # step 4, method 2
    # df_notna.loc[df_notna['Short_Average']>df_notna['Long_Average'], 'Golden_Cross'] = True
    # df_notna.loc[df_notna['Short_Average']<=df_notna['Long_Average'], 'Golden_Cross'] = False
    # df_notna = df_notna.fillna({'Golden_Cross': False})

    # step 5
    golden_cross_results = df_notna.copy()
    return golden_cross_results.reset_index(drop=True)



### Demo function call
golden_cross_demo_df = golden_cross_strategy(
    stockdf=annual_growth_df,
    firm='NFLX',
    short_average=3,
    long_average=7)
# Let's only look at the first 15 days for our input
display(golden_cross_demo_df.head(15))
Date Year Firm Open Close Volume Day_Growth Annual_Growth Short_Average Long_Average Golden_Cross
0 2015-01-02 00:00:00-05:00 2015 NFLX 49.151428 49.848572 13475000.0 0.000000 1.294549 NaN NaN False
1 2015-01-05 00:00:00-05:00 2015 NFLX 49.258572 47.311428 18165000.0 -0.050897 1.294549 NaN NaN False
2 2015-01-06 00:00:00-05:00 2015 NFLX 47.347141 46.501431 16037700.0 -0.017121 1.294549 47.887143 NaN False
3 2015-01-07 00:00:00-05:00 2015 NFLX 47.347141 46.742859 9849700.0 0.005192 1.294549 46.851906 NaN False
4 2015-01-08 00:00:00-05:00 2015 NFLX 47.119999 47.779999 9601900.0 0.022188 1.294549 47.008096 NaN False
5 2015-01-09 00:00:00-05:00 2015 NFLX 47.631428 47.041428 9578100.0 -0.015458 1.294549 47.188095 NaN False
6 2015-01-12 00:00:00-05:00 2015 NFLX 47.090000 45.547142 12944400.0 -0.031765 1.294549 46.789523 47.253265 False
7 2015-01-13 00:00:00-05:00 2015 NFLX 46.021427 46.255714 18721500.0 0.015557 1.294549 46.281428 46.740000 False
8 2015-01-14 00:00:00-05:00 2015 NFLX 47.071430 46.320000 19884900.0 0.001390 1.294549 46.040952 46.598367 False
9 2015-01-15 00:00:00-05:00 2015 NFLX 46.547142 46.251431 11669000.0 -0.001480 1.294549 46.275715 46.562653 False
10 2015-01-16 00:00:00-05:00 2015 NFLX 46.857143 48.191429 19696600.0 0.041945 1.294549 46.920953 46.769592 True
11 2015-01-20 00:00:00-05:00 2015 NFLX 48.571430 49.828571 40551000.0 0.033972 1.294549 48.090477 47.062245 True
12 2015-01-21 00:00:00-05:00 2015 NFLX 59.234287 58.468571 66902500.0 0.173394 1.294549 52.162857 48.694694 True
13 2015-01-22 00:00:00-05:00 2015 NFLX 58.571430 61.205715 33319300.0 0.046814 1.294549 56.500952 50.931633 True
14 2015-01-23 00:00:00-05:00 2015 NFLX 60.988571 62.494286 26784100.0 0.021053 1.294549 60.722857 53.251429 True


The cell below will test your solution for golden_cross_strategy (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [51]:
### Test Cell - Exercise 4  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(golden_cross_strategy),
              ex_name='golden_cross_strategy',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to golden_cross_strategy did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
golden_cross_strategy test ran 100 iterations in 19.99 seconds
Passed! Please submit.
In [52]:
### Run Me!!!
golden_crosses_df = utils.load_object_from_publicdata('golden_crosses_df')
golden_crosses_reduced_df = utils.load_object_from_publicdata('golden_crosses_reduced_df')
Successfully loaded golden_crosses_df.
Successfully loaded golden_crosses_reduced_df.

Exercise 5: (0 points) - FREE

normalize_stock_growths

Example: we have defined normalize_stock_growths as follows:

Normalize the daily growth of a given firm by comparing it to the growth of the other provided firms.

Inputs:

  • daily_growth_by_firm: a Pandas DataFrame. It contains many of the variables we have already calculated.
  • firm: a string. It specifies which firm to calculate the Relative_Growth for.

Return: normalized_results: a Pandas DataFrame with the following columns:

  • Date, which contains every unique date value in the input.
  • A column for each unique firm in daily_growth_by_firm, where the value is equal to the growth of that firm's stock for the relevant Date.
    • The value for the stock growth is contained in the Day_Growth column of daily_growth_by_firm.
  • Non_<FIRM>_Average_Growth (as defined below)

Requirements/steps:

  1. Use Pandas. Do not use SQLite.
  2. pandas.DataFrame.pivot may be helpful (a toy example appears after this list)
  3. To calculate Non_<FIRM>_Average_Growth:
    1. For a given date in Date, take the mean of the growths for every firm which is not equal to firm.
      • For example, if our firms were AAPL, NFLX, and AMZN, and firm=NFLX, then Non_<FIRM>_Average_Growth would be the average growth of AAPL and AMZN for each day.
    2. You should replace the string "<FIRM>" in the column name with the value given by firm.
      • For example, if firm='AAPL', then your column name should be Non_AAPL_Average_Growth.
  4. You may need to reset the indexes before returning the resulting dataframe.
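To illustrate the pivot hint, here is a toy long-to-wide example (dates, firms, and growth values are made up):

import pandas as pd

long_df = pd.DataFrame({'Date': ['d1', 'd1', 'd2', 'd2'],
                        'Firm': ['AAPL', 'NFLX', 'AAPL', 'NFLX'],
                        'Day_Growth': [0.01, 0.02, -0.01, 0.03]})
wide = long_df.pivot(index='Date', columns='Firm', values='Day_Growth').reset_index()
print(wide)  # one row per Date, one Day_Growth column per Firm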

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [53]:
print('Properly Averaged Industry Growth:')

normalized_returns_demo_df_CORRECT = utils.load_object_from_publicdata('normalized_returns_demo_df.dill')
display(normalized_returns_demo_df_CORRECT)
Properly Averaged Industry Growth:
Successfully loaded normalized_returns_demo_df.dill.
Firm Date AAPL AMZN GOOG META NFLX Non_NFLX_Average_Growth
0 2020-01-02 00:00:00-05:00 0.022816 0.027151 0.022700 0.022071 0.019285 0.023684
1 2020-01-03 00:00:00-05:00 -0.009722 -0.012139 -0.004907 -0.005291 -0.011855 -0.008015
2 2020-01-06 00:00:00-05:00 0.007968 0.014886 0.024657 0.018834 0.030469 0.016586
3 2020-01-07 00:00:00-05:00 -0.004703 0.002092 -0.000624 0.002164 -0.015127 -0.000268
4 2020-01-08 00:00:00-05:00 0.016086 -0.007809 0.007880 0.010138 0.025729 0.006574
5 2020-01-09 00:00:00-05:00 0.021241 0.004799 0.011044 0.014311 -0.010611 0.012849
6 2020-01-10 00:00:00-05:00 0.002261 -0.009411 0.006973 -0.001099 -0.019693 -0.000319
7 2020-01-13 00:00:00-05:00 0.021364 0.004323 0.006645 0.017656 0.029996 0.012497
8 2020-01-14 00:00:00-05:00 -0.013503 -0.011558 -0.005802 -0.012843 -0.000679 -0.010927
9 2020-01-15 00:00:00-05:00 -0.004286 -0.003969 0.005815 0.009541 0.001122 0.001775
In [54]:
### Solution - Exercise 5  
def normalize_stock_growths(daily_growth_by_firm: pd.DataFrame, firm: str) -> pd.DataFrame:
    # Create a list of every firm, except for the one we specified in the inputs
    non_chosen_firms = set(daily_growth_by_firm['Firm']) - set([firm])
    non_chosen_firms = list(non_chosen_firms)

    # Pivot the data for easy manipulation
    pivoted_data = daily_growth_by_firm.pivot(
        index='Date', 
        columns='Firm',
        values='Day_Growth'
        ).reset_index()

    # Calculate the total growth for every firm that isn't our firm
    pivoted_data[f'Non_{firm}_Average_Growth'] = pivoted_data[non_chosen_firms]\
        .sum(axis=1) / len(non_chosen_firms)
    
    return pivoted_data

### Demo function call
normalized_returns_demo_df = normalize_stock_growths(golden_crosses_reduced_df, 'NFLX')
display(normalized_returns_demo_df)
Firm Date AAPL AMZN GOOG META NFLX Non_NFLX_Average_Growth
0 2020-01-02 00:00:00-05:00 0.022816 0.027151 0.022700 0.022071 0.019285 0.023684
1 2020-01-03 00:00:00-05:00 -0.009722 -0.012139 -0.004907 -0.005291 -0.011855 -0.008015
2 2020-01-06 00:00:00-05:00 0.007968 0.014886 0.024657 0.018834 0.030469 0.016586
3 2020-01-07 00:00:00-05:00 -0.004703 0.002092 -0.000624 0.002164 -0.015127 -0.000268
4 2020-01-08 00:00:00-05:00 0.016086 -0.007809 0.007880 0.010138 0.025729 0.006574
5 2020-01-09 00:00:00-05:00 0.021241 0.004799 0.011044 0.014311 -0.010611 0.012849
6 2020-01-10 00:00:00-05:00 0.002261 -0.009411 0.006973 -0.001099 -0.019693 -0.000319
7 2020-01-13 00:00:00-05:00 0.021364 0.004323 0.006645 0.017656 0.029996 0.012497
8 2020-01-14 00:00:00-05:00 -0.013503 -0.011558 -0.005802 -0.012843 -0.000679 -0.010927
9 2020-01-15 00:00:00-05:00 -0.004286 -0.003969 0.005815 0.009541 0.001122 0.001775


The test cell below will always pass. Please submit to collect your free points for normalize_stock_growths (exercise 5).

In [55]:
### Test Cell - Exercise 5  


print('Passed! Please submit.')
Passed! Please submit.

Exercise 6: (1 point)

summarize_netflix_original_genres

Your task: define summarize_netflix_original_genres as follows:

Define the function to return a SQL query which summarizes the trends of Netflix originals by year and genre, as detailed below.

Inputs: None

Return: query: a Python string, containing a SQLite query. It should query the database to obtain a table with three columns:

  • year
  • genre
  • genre_count

Requirements/steps:

  • Your query should:
    1. Count the number of Netflix original shows, grouped by year and by genre.
      • The year variable should be extracted from the premiere variable. Note that in this case, year will be a string. We suggest you use the STRFTIME SQLite function to extract year from the premiere variable. (A one-line check appears after this list.)
    2. Order them by:
      1. genre_count (descending)
      2. year (ascending)
      3. genre (ascending)
    3. Return the top 10 rows
  • The database table you will need is named nflx_originals.
  • Make sure you rename your column variables so that they match the specifications above.
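As a quick, self-contained check of the STRFTIME suggestion (using a throwaway in-memory database rather than the exam's connection):

import sqlite3

tmp = sqlite3.connect(':memory:')
print(tmp.execute("SELECT STRFTIME('%Y', '2023-10-26')").fetchone())  # ('2023',) -- the year, as a string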

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [56]:
print('A proper summary of genres for the demo:')

netflix_original_genre_summary_df_CORRECT = utils.load_object_from_publicdata('netflix_original_genre_summary_df.dill')
display(netflix_original_genre_summary_df_CORRECT)
A proper summary of genres for the demo:
Successfully loaded netflix_original_genre_summary_df.dill.
year genre genre_count
0 2023 Kids and family 9
1 2023 Comedy 8
2 2023 Crime drama 6
3 2023 Drama 6
4 2024 True crime 6
5 2022 Reality 5
6 2023 True crime 5
7 2024 Drama 5
8 2022 Comedy 4
9 2022 Drama 4
In [61]:
### Solution - Exercise 6  
def summarize_netflix_original_genres()-> str:
    
    '''
    Steps:
    1. Look at the database table in question: 'nflx_originals'. This helps us understand the structure of the
        table that we are working with.
    2. Look at the linked documentation for STRFTIME and test a small query with it to make sure it works.
    3. Add in the group by and count() statements.
        Count the number of Netflix original shows, grouped by year and by genre.
    4. Add in the order by statements.
        Order them by:
            genre_count (descending)
            year (ascending)
            genre (ascending)
    '''
    
    query_1 = 'SELECT * FROM nflx_originals LIMIT 10'
    
    query_2 = f'''SELECT premiere, STRFTIME('%Y', premiere) AS year FROM nflx_originals LIMIT 10'''
    
    query_3 = f'''
        SELECT STRFTIME('%Y', premiere) AS year, genre, COUNT(*) AS genre_count
        FROM nflx_originals
        GROUP BY year, genre
        LIMIT 10
    '''
    
    query_final = f'''
        SELECT STRFTIME('%Y', premiere) AS year, genre, COUNT(*) AS genre_count
        FROM nflx_originals
        GROUP BY year, genre
        ORDER BY genre_count DESC, year, genre
        LIMIT 10
    '''
    
    return query_final

### Demo function call
summary_query = summarize_netflix_original_genres()
netflix_original_genres = pd.read_sql(summary_query, conn)
display(netflix_original_genres)
year genre genre_count
0 2023 Kids and family 9
1 2023 Comedy 8
2 2023 Crime drama 6
3 2023 Drama 6
4 2024 True crime 6
5 2022 Reality 5
6 2023 True crime 5
7 2024 Drama 5
8 2022 Comedy 4
9 2022 Drama 4


The cell below will test your solution for summarize_netflix_original_genres (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [62]:
### Test Cell - Exercise 6  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(summarize_netflix_original_genres),
              ex_name='summarize_netflix_original_genres',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to summarize_netflix_original_genres did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
summarize_netflix_original_genres test ran 100 iterations in 0.71 seconds
Passed! Please submit.

Exercise 7: (2 points)

calc_netflix_top10_scores

Your task: define calc_netflix_top10_scores as follows:

Define the function to return a SQL query which calculates, for each show, its total score, the number of times it appeared in the top 10, and its average score while it was in the top 10. The concept of a 'score' is defined below.

Inputs: None

Return: query: a Python string, containing a SQLite query. It should query the database table nflx_top to obtain a table with the following attributes:

  • title: the title of the show
  • total_score: the sum of the show's daily scores.
  • occurrence: the number of times a show appeared in the top 10.
  • avg_score: the total_score divided by the occurrence value for a given show.

Requirements/steps:

  1. You will need to calculate show scores to solve this problem:
    • A show ranked #1 on a day should earn 10 points towards its total score. A show ranked #2 should earn 9, etc. until a show ranked #10 should earn 1 point.
  2. After counting the occurrences, compute avg_score as a floating-point value.
  3. Order the results by:
    1. total_score in descending order
    2. title in descending order
  4. Return the top 10 rows

Hints:

  • You may need to use either a subquery or a CTE to solve this problem. You may use whichever you prefer; a minimal sketch of the CTE form follows.
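
For instance, a minimal sketch of the CTE approach using the compact (11 - rank) scoring noted in the solution's docstring below (equivalent to the rank-to-points rule above, since every row of nflx_top is a top-10 appearance):

sketch_query = '''
    WITH scored AS (
        SELECT title, (11 - rank) AS score
        FROM nflx_top
    )
    SELECT title,
           SUM(score)   AS total_score,
           COUNT(score) AS occurrence,
           AVG(score)   AS avg_score
    FROM scored
    GROUP BY title
    ORDER BY total_score DESC, title DESC
    LIMIT 10
'''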

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [ ]:
print('A proper summary of show score metrics:')

netflix_top_10_scores_df_CORRECT = utils.load_object_from_publicdata('netflix_top_10_scores_df.dill')
display(netflix_top_10_scores_df_CORRECT)
In [75]:
### Solution - Exercise 7  
def calc_netflix_top10_scores() -> str:
    
    '''
    Steps:
    1. Examine the nflx_top table.
    2. Calculate scores for each title.
        Note: I did this very inefficiently. The better way to do this (without a CASE WHEN statement) is:
            (11 - rank) AS scores
    3. Get the sum and average of all of the scores by title.
    4. Get the number of times a show appeared in the top 10.
        Note: Strictly speaking, COUNT(scores>0) counts every row where the expression is non-NULL (all of
            them here), not just the rows where scores is positive; it works because every row of nflx_top is
            a top-10 appearance. With the 11-rank approach, a plain COUNT(scores) does the same job.
    5. Add in the order by statements:
        total_score in descending order
        title in descending order
        
    Note: I include LIMIT 10 from the beginning to avoid excess printing, since the final query will be limited to 10.
    '''
    
    
    query_1 = 'SELECT * FROM nflx_top LIMIT 10'
    
    query_2 = f'''
        SELECT
            title,
            CASE 
                WHEN rank=1 THEN 10
                WHEN rank=2 THEN 9
                WHEN rank=3 THEN 8
                WHEN rank=4 THEN 7
                WHEN rank=5 THEN 6
                WHEN rank=6 THEN 5
                WHEN rank=7 THEN 4
                WHEN rank=8 THEN 3
                WHEN rank=9 THEN 2
                WHEN rank=10 THEN 1
                ELSE 0
            END AS scores

        FROM nflx_top
    '''
    
    query_3 = f'''
        SELECT title, SUM(scores) AS total_score, AVG(scores) AS avg_score
        FROM (
            SELECT
                title,
                CASE 
                    WHEN rank=1 THEN 10
                    WHEN rank=2 THEN 9
                    WHEN rank=3 THEN 8
                    WHEN rank=4 THEN 7
                    WHEN rank=5 THEN 6
                    WHEN rank=6 THEN 5
                    WHEN rank=7 THEN 4
                    WHEN rank=8 THEN 3
                    WHEN rank=9 THEN 2
                    WHEN rank=10 THEN 1
                    ELSE 0
                END AS scores

            FROM nflx_top
        )
        GROUP BY title
        LIMIT 10
    '''
    
    query_4 = f'''
        SELECT title, SUM(scores) AS total_score, COUNT(scores>0) AS occurrence, AVG(scores) AS avg_score
        FROM (
            SELECT
                title,
                CASE 
                    WHEN rank=1 THEN 10
                    WHEN rank=2 THEN 9
                    WHEN rank=3 THEN 8
                    WHEN rank=4 THEN 7
                    WHEN rank=5 THEN 6
                    WHEN rank=6 THEN 5
                    WHEN rank=7 THEN 4
                    WHEN rank=8 THEN 3
                    WHEN rank=9 THEN 2
                    WHEN rank=10 THEN 1
                    ELSE 0
                END AS scores

            FROM nflx_top
        )
        GROUP BY title
        LIMIT 10
    '''
    
    
    
    query_final = f'''
        
        SELECT title, SUM(scores) AS total_score, COUNT(scores>0) AS occurrence, AVG(scores) AS avg_score
        FROM (
            SELECT
                title,
                CASE 
                    WHEN rank=1 THEN 10
                    WHEN rank=2 THEN 9
                    WHEN rank=3 THEN 8
                    WHEN rank=4 THEN 7
                    WHEN rank=5 THEN 6
                    WHEN rank=6 THEN 5
                    WHEN rank=7 THEN 4
                    WHEN rank=8 THEN 3
                    WHEN rank=9 THEN 2
                    WHEN rank=10 THEN 1
                    ELSE 0
                END AS scores

            FROM nflx_top
        )
        GROUP BY title
        ORDER BY total_score DESC, title DESC
        LIMIT 10
    '''
    
    return query_final

### Demo function call
scores_query = calc_netflix_top10_scores()
netflix_top10_scores = pd.read_sql(scores_query, conn)
display(netflix_top10_scores)
title total_score occurrence avg_score
0 Cocomelon 1474 428 3.443925
1 Manifest 590 80 7.375000
2 Cobra Kai 582 81 7.185185
3 Outer Banks 534 72 7.416667
4 Ozark 500 85 5.882353
5 Squid Game 495 66 7.500000
6 The Queen’s Gambit 446 73 6.109589
7 Bridgerton 432 58 7.448276
8 Lucifer 405 56 7.232143
9 Virgin River 391 55 7.109091


The cell below will test your solution for calc_netflix_top10_scores (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [76]:
### Test Cell - Exercise 7  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(calc_netflix_top10_scores),
              ex_name='calc_netflix_top10_scores',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to calc_netflix_top10_scores did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calc_netflix_top10_scores test ran 100 iterations in 0.79 seconds
Passed! Please submit.

Exercise 8: (3 points)

longterm_avg_score

Your task: define longterm_avg_score as follows:

Define the function to return a SQL query which calculates the long-term score for each Netflix Original, as defined below.

Inputs: None

Return: query: a Python string, containing a SQLite query. It should query the database to obtain a table with the following attributes:

  • title - (string): the titles of the Netflix Originals
  • total_score - (int): the sum of all scores for each title given in nflx_originals
    • The definition of a "score" is detailed below
  • date_count - (int): the number of dates that a given title is available, as detailed below
  • longterm_avg_score - (float): the long-term average score, defined as:
    • longterm_avg_score = total_score/date_count

Requirements/steps:

  • The data you will need is in the following tables:
    • nflx_originals
    • nflx_top
  • Calculate the values above based on the following details:
    • A score is a value associated with a Netflix title on a given date, where the score is defined as:
      • score_d = 11 - rank_d, for a given date d
      • If no rank exists for a given show on a given date, you should assume the score is implicitly equal to 0.
    • To calculate date_count, count the number of date values in nflx_top that exist for a title, where the date is greater than or equal to the premiere for that title
  • We recommend the following approach. Intermediate steps are indicated by angular brackets (you may find using CTEs helpful!):
    1. Obtain <distinct_dates>: Identify all distinct date values in nflx_top.
    2. Obtain <title_dates>: For each title in nflx_originals, identify any dates in <distinct_dates> which are greater than or equal to the title's premiere.
    3. Obtain <title_date_score>: For each title/date pair in <title_dates>, identify any records from nflx_top which have the same title/date pair.
      • Use the rank field from nflx_top to calculate a given score.
      • Any title/date pairs which are in <title_dates> but not in nflx_top should not be dropped; they should have a NULL score value in <title_date_score> (see the toy demonstration after this list).
    4. Calculate the final result.
      1. For each title in <title_date_score>, calculate:
        1. total_score - compute the sum of all the score values sharing a common title.
        2. date_count - compute the count of all date values sharing a common title.
        3. longterm_avg_score - compute the quotient of the total_score over date_count.
      2. Filter out all records with longterm_avg_score that are NULL.
      3. Finally, the results should be ordered by:
        1. longterm_avg_score in descending order
        2. The show's title in ascending order
      4. Return the top 10 rows
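
The NULL handling in step 3 is what makes the average "long-term": unranked days still count in the denominator. Here is a toy, self-contained demonstration of that behavior (hypothetical mini-tables, not the exam schema):

import sqlite3

# One title with 3 eligible dates, only one of which was ranked (rank 4 => score 7).
demo = sqlite3.connect(':memory:')
demo.executescript('''
    CREATE TABLE td  (title TEXT, date TEXT);            -- stands in for <title_dates>
    CREATE TABLE top (title TEXT, date TEXT, rank INT);  -- ranked days only
    INSERT INTO td  VALUES ('A', 'd1'), ('A', 'd2'), ('A', 'd3');
    INSERT INTO top VALUES ('A', 'd2', 4);
''')
rows = demo.execute('''
    SELECT td.title, SUM(11 - top.rank) AS total_score, COUNT(td.date) AS date_count
    FROM td LEFT JOIN top ON top.title = td.title AND top.date = td.date
    GROUP BY td.title
''').fetchall()
print(rows)  # [('A', 7, 3)] -- SUM skips the NULL scores; COUNT(td.date) keeps all 3 dates
demo.close()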

Hints:

  • You may need to use either a subquery or a CTE to solve this problem. You may use whichever you prefer.

An Example:

Let's take a deeper look specifically at the show Vikings: Valhalla.

  • According to nflx_originals table, the show premiered on 2022-02-25.
  • The dataframe longterm_avg_vikings_breakout_df_CORRECT below uses the results from the nflx_top table for the show titled Vikings: Valhalla.
    • Since the show premiered 2022-02-25, within nflx_top for Vikings: Valhalla, the dates greater than or equal to the premiere date range from 2022-02-25 to 2022-03-11.
      • Therefore, the show could theoretically receive a score from 2022-02-25 to 2022-03-11, which is 15 days
      • However, there is no entry for 2022-02-25, indicating that Vikings: Valhalla has no score for that date.
      • Vikings: Valhalla received its first ranking on 2022-02-26. Therefore, its score for 2022-02-26 would be:
        (11 - rank) = (11 - 4) = 7
    • The dataframe longterm_avg_vikings_breakout_df_CORRECT below shows a table with the show title, premiere, date, rank, and score for all dates in the range from 2022-02-25 to 2022-03-11 for Vikings: Valhalla.
  • It is possible to then calculate the total_score, date_count, and longterm_avg_score for the show Vikings: Valhalla:
    longterm_avg_score = total_score / date_count = 100 / 15 = 6.666667
In [77]:
longterm_avg_vikings_breakout_df_CORRECT = utils.load_object_from_publicdata('longterm_avg_vikings_breakout_df.dill')
display(longterm_avg_vikings_breakout_df_CORRECT)
Successfully loaded longterm_avg_vikings_breakout_df.dill.
title premiere date rank score
0 Vikings: Valhalla 2022-02-25 2022-02-25 NaN NaN
1 Vikings: Valhalla 2022-02-25 2022-02-26 4.0 7.0
2 Vikings: Valhalla 2022-02-25 2022-02-27 2.0 9.0
3 Vikings: Valhalla 2022-02-25 2022-02-28 1.0 10.0
4 Vikings: Valhalla 2022-02-25 2022-03-01 1.0 10.0
5 Vikings: Valhalla 2022-02-25 2022-03-02 1.0 10.0
6 Vikings: Valhalla 2022-02-25 2022-03-03 2.0 9.0
7 Vikings: Valhalla 2022-02-25 2022-03-04 2.0 9.0
8 Vikings: Valhalla 2022-02-25 2022-03-05 5.0 6.0
9 Vikings: Valhalla 2022-02-25 2022-03-06 5.0 6.0
10 Vikings: Valhalla 2022-02-25 2022-03-07 5.0 6.0
11 Vikings: Valhalla 2022-02-25 2022-03-08 6.0 5.0
12 Vikings: Valhalla 2022-02-25 2022-03-09 6.0 5.0
13 Vikings: Valhalla 2022-02-25 2022-03-10 7.0 4.0
14 Vikings: Valhalla 2022-02-25 2022-03-11 7.0 4.0
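
A quick arithmetic check of this breakout, using the dataframe just displayed:

vik = longterm_avg_vikings_breakout_df_CORRECT
print(vik['score'].sum())             # 100.0 -- pandas sum() skips the NaN row
print(len(vik))                       # 15
print(vik['score'].sum() / len(vik))  # 6.666666666666667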

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [78]:
longterm_avg_scores_df = utils.load_object_from_publicdata('longterm_avg_scores_df.dill')
display(longterm_avg_scores_df)
Successfully loaded longterm_avg_scores_df.dill.
title total_score date_count longterm_avg_score
0 Vikings: Valhalla 100 15 6.666667
1 All of Us Are Dead 189 43 4.395349
2 Squid Game 495 176 2.812500
3 Bridgerton 432 442 0.977376
4 Ginny & Georgia 335 381 0.879265
5 Outer Banks 534 696 0.767241
6 Sweet Tooth 177 281 0.629893
7 Virgin River 391 710 0.550704
8 Emily in Paris 287 526 0.545627
9 Sweet Magnolias 344 662 0.519637
In [134]:
# NOTE: This is my original query before the recommended approach was added.

query_final = f'''
    WITH total_score_calc AS (
        SELECT nflx_originals.title, SUM(scores) AS total_score, nflx_release_date
        FROM (
            SELECT
                title,
                (11 - rank) AS scores,
                nflx_release_date
            FROM nflx_top
        ) AS top_scores
        JOIN nflx_originals ON nflx_originals.title = top_scores.title
        GROUP BY nflx_originals.title, nflx_release_date
    ),
    poss_days AS (
        SELECT 
            premiere, 
            COUNT(DISTINCT date) AS num_poss
        FROM nflx_top
        CROSS JOIN nflx_originals
        WHERE date >= premiere
        GROUP BY premiere
        ORDER BY premiere DESC
    )
    SELECT title, total_score, num_poss AS date_count, 
        CAST(total_score AS float) / CAST(num_poss AS float) AS longterm_avg_score
    FROM total_score_calc 
    JOIN poss_days ON premiere = nflx_release_date
    ORDER BY longterm_avg_score DESC, title ASC

    LIMIT 10
'''
In [138]:
### Solution - Exercise 8  
def longterm_avg_score() -> str:
    
    '''
    Steps:
    1. Look at the two tables that we will be querying: nflx_originals and nflx_top.
    2. Obtain <distinct_dates>: Identify all distinct date values in nflx_top.
    3. Obtain <title_dates>: For each title in nflx_originals, identify any dates in <distinct_dates> which are 
        greater than or equal to the title's premiere.
    4. Obtain <title_date_score>: For each title/date pair in <title_dates>, identify any records from nflx_top 
        which have the same title/date pair.
        a. Use the rank field from nflx_top to calculate a given score.
        b. Any title/date pairs which are in <title_dates> but not in nflx_top should not be dropped; they 
            should have a NULL score value in <title_date_score>.
    5. Calculate the final result.
        a. For each title in <title_date_score>, calculate:
            total_score - compute the sum of all the score values sharing a common title.
            date_count - compute the count of all date values sharing a common title.
            longterm_avg_score - compute the quotient of the total_score over date_count.
        b. Filter out all records with longterm_avg_score that are NULL.
        c. Finally, the results should be ordered by:
            longterm_avg_score in descending order
            The show's title in ascending order
        d. Return the top 10 rows
    '''
    
    query_1a = 'SELECT * FROM nflx_top LIMIT 10'
    query_1b = 'SELECT * FROM nflx_originals LIMIT 10'
    
    query_2 = f''' SELECT DISTINCT date FROM nflx_top '''
    
    query_3 = f'''
        WITH distinct_dates AS (
            SELECT DISTINCT date FROM nflx_top
        )
        SELECT
            title,
            premiere,
            date
        FROM nflx_originals
        CROSS JOIN distinct_dates
        WHERE premiere <= date
        LIMIT 10
    '''

    query_4 = f'''
        WITH title_dates AS (
            SELECT
                title,
                date
            FROM nflx_originals
            CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
            WHERE premiere <= date
        )
        SELECT 
            title_dates.title,
            title_dates.date,
            (11 - rank) AS scores
        FROM title_dates
        LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
    '''
    
    query_5a = f'''
        WITH title_date_score AS (
            SELECT 
                title_dates.title,
                title_dates.date,
                (11 - rank) AS scores
            FROM (
                SELECT
                    title,
                    date
                FROM nflx_originals
                CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
                WHERE premiere <= date
            ) title_dates
            LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
        )
        
        SELECT
            title,
            SUM(scores) AS total_score,
            COUNT(date) AS date_count,
            CAST(SUM(scores) AS DECIMAL)/COUNT(date) AS longterm_avg_score
        FROM title_date_score
        GROUP BY title
    '''
    
    query_5b = f'''
        WITH title_date_score AS (
            SELECT 
                title_dates.title,
                title_dates.date,
                (11 - rank) AS scores
            FROM (
                SELECT
                    title,
                    date
                FROM nflx_originals
                CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
                WHERE premiere <= date
            ) title_dates
            LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
        )
        
        SELECT
            title,
            SUM(scores) AS total_score,
            COUNT(date) AS date_count,
            CAST(SUM(scores) AS DECIMAL)/COUNT(date) AS longterm_avg_score
        FROM title_date_score
        GROUP BY title
        HAVING longterm_avg_score IS NOT NULL
    '''
    
    
    query_5c = f'''
        WITH title_date_score AS (
            SELECT 
                title_dates.title,
                title_dates.date,
                (11 - rank) AS scores
            FROM (
                SELECT
                    title,
                    date
                FROM nflx_originals
                CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
                WHERE premiere <= date
            ) title_dates
            LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
        )
        
        SELECT
            title,
            SUM(scores) AS total_score,
            COUNT(date) AS date_count,
            CAST(SUM(scores) AS float)/COUNT(date) AS longterm_avg_score
        FROM title_date_score
        GROUP BY title
        HAVING longterm_avg_score IS NOT NULL
        ORDER BY longterm_avg_score DESC, title ASC
        LIMIT 10
    '''
    
    return query_5c

### Demo function call
query = longterm_avg_score()
longterm_avg_scores = pd.read_sql(query, conn)
display(longterm_avg_scores)
title total_score date_count longterm_avg_score
0 Vikings: Valhalla 100 15 6.666667
1 All of Us Are Dead 189 43 4.395349
2 Squid Game 495 176 2.812500
3 Bridgerton 432 442 0.977376
4 Ginny & Georgia 335 381 0.879265
5 Outer Banks 534 696 0.767241
6 Sweet Tooth 177 281 0.629893
7 Virgin River 391 710 0.550704
8 Emily in Paris 287 526 0.545627
9 Sweet Magnolias 344 662 0.519637


The cell below will test your solution for longterm_avg_score (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [139]:
### Test Cell - Exercise 8  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(longterm_avg_score),
              ex_name='longterm_avg_score',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to longterm_avg_score did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
longterm_avg_score test ran 100 iterations in 0.98 seconds
Passed! Please submit.
In [140]:
### Run Me!!!
nflx_scores_demo_df = utils.load_object_from_publicdata('nflx_scores_demo_df')
normalized_growths_df = utils.load_object_from_publicdata('normalized_growths_df')
Successfully loaded nflx_scores_demo_df.
Successfully loaded normalized_growths_df.

Exercise 9: (2 points)

construct_model_inputs

Your task: define construct_model_inputs as follows:

Define the model matrix and response for our regression model, as shown below.

Inputs:

  • normalized_growths_df: a Pandas DataFrame. It contains the results from exercise 5.
  • show_scores_df: a Pandas DataFrame. It contains the pre-computed scores for each show on a given date.

Return: model_matrix: a Pandas DataFrame containing the independent variables (show scores) and response variable (Relative_Growth)

Requirements/steps:

  1. To obtain the independent variables, you will need to pivot the show_scores_df so that each show is represented as its own column. The index should reflect the dates. The values should be set to the show's score.
    • Fill any missing values with 0's.
    • See pandas.DataFrame.pivot for more information
    • You may also refer to Exercise 5 above; a toy pivot sketch follows this list
  2. You'll need to join the pivoted dataframe with the normalized_growths_df on date. We'll call this model_matrix
  3. Return the model_matrix but only with the columns for the shows and Relative_Growth
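
To make the pivot step concrete, here is a toy sketch on a hypothetical mini-dataset (not the exam data):

import pandas as pd

toy = pd.DataFrame({
    'date':  ['2022-01-01', '2022-01-01', '2022-01-02'],
    'title': ['Show A', 'Show B', 'Show A'],
    'score': [10, 9, 8],
})
wide = toy.pivot(index='date', columns='title', values='score').fillna(0)
print(wide)
# title       Show A  Show B
# date
# 2022-01-01    10.0     9.0
# 2022-01-02     8.0     0.0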

Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.

In [141]:
print('A proper model matrix for the demo:')

model_matrix_demo_df_CORRECT = utils.load_object_from_publicdata('model_matrix_demo_df.dill')
display(model_matrix_demo_df_CORRECT)
A proper model matrix for the demo:
Successfully loaded model_matrix_demo_df.dill.
Relative_Growth All American Badland Blood Father Car Masters: Rust to Riches How to Fix a Drug Scandal Love is Blind Nailed It! Ozark Salt The Platform The Players Club The Roommate Tiger King: Murder, Mayhem … Uncorked Unorthodox
0 0.011216 8.0 2.0 7.0 5.0 0.0 3.0 0.0 9.0 0.0 6.0 0.0 0.0 10.0 1.0 4.0
1 0.009617 8.0 0.0 5.0 1.0 6.0 0.0 7.0 9.0 0.0 2.0 0.0 4.0 10.0 0.0 3.0
2 -0.005836 8.0 0.0 1.0 0.0 7.0 0.0 6.0 9.0 2.0 0.0 3.0 5.0 10.0 0.0 4.0
In [144]:
### Solution - Exercise 9  
def construct_model_inputs(normalized_growths_df: pd.DataFrame, show_scores_df: pd.DataFrame) -> pd.DataFrame:
    
    '''
    Steps:
    1. Examine the two input dataframes and read the linked documentation for pivot.
    2. Pivot the show_scores_df as described above.
        a. Each show is represented as its own column. The index should reflect the dates. The values should be set 
            to the show's score.
                columns: shows (title)
                index: dates
                values: score
        b. Fill in the NaN values with 0s.
        c. Reset the index of the pivoted dataframe keeping the index as a column.
    3. Create the dataframe "model_matrix".
        a. Keep only the Date and Relative_Growth columns from the normalized_growths_df dataframe.
        b. Join the pivoted dataframe with normalized_growths_df on date.
        c. Delete the now-redundant date and Date columns.
    4. Return the final dataframe.    
    '''
    
    # step 2
    pivoted_df = show_scores_df.pivot(index='date', columns='title', values='score')
    pivoted_df.fillna(0, inplace=True)
    pivoted_df = pivoted_df.reset_index(drop=False)
    
    # step 3    
    normalized_small_df = normalized_growths_df[['Date', 'Relative_Growth']]
    model_matrix = pd.merge(normalized_small_df, pivoted_df, left_on = 'Date', right_on='date')
    
    del model_matrix['date']
    del model_matrix['Date']
    
    # step 4
    return model_matrix

### Demo function call
model_matrix = construct_model_inputs(normalized_growths_df, nflx_scores_demo_df)
display(model_matrix)
Relative_Growth All American Badland Blood Father Car Masters: Rust to Riches How to Fix a Drug Scandal Love is Blind Nailed It! Ozark Salt The Platform The Players Club The Roommate Tiger King: Murder, Mayhem … Uncorked Unorthodox
0 0.011216 8.0 2.0 7.0 5.0 0.0 3.0 0.0 9.0 0.0 6.0 0.0 0.0 10.0 1.0 4.0
1 0.009617 8.0 0.0 5.0 1.0 6.0 0.0 7.0 9.0 0.0 2.0 0.0 4.0 10.0 0.0 3.0
2 -0.005836 8.0 0.0 1.0 0.0 7.0 0.0 6.0 9.0 2.0 0.0 3.0 5.0 10.0 0.0 4.0


The cell below will test your solution for construct_model_inputs (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [145]:
### Test Cell - Exercise 9  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=construct_model_inputs,
              ex_name='construct_model_inputs',
              key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to construct_model_inputs did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
construct_model_inputs test ran 100 iterations in 1.47 seconds
Passed! Please submit.

Fin

If you have made it this far, congratulations! You are done. Don't forget to submit the assignment!

Postscript

If you ran the regression model below with all of the data collected, you would obtain summary results like the following:

##############################################
### The code below is provided for illustration and reproducibility, but it is commented out due to the additional processing time it would add to the auto-grader. Note: the model was run on all NFLX growth data and top-shows data.
##############################################
# Let's use step-wise regression to do feature selection
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
import statsmodels.api as sm

mm = model_matrix.copy(deep=True)
y = mm['Relative_Growth']
del mm['Relative_Growth']
X = mm
model = LinearRegression()
feature_selector = SequentialFeatureSelector(model, n_features_to_select=min(len(X.columns)-1, 10))
feature_selector.fit(X, y)
selected_X = X[feature_selector.get_feature_names_out()]
selected_X = sm.add_constant(selected_X)
print(selected_X.columns)
selected_model = sm.OLS(y, selected_X)
results = selected_model.fit()
results.summary()

(Figure: statsmodels OLS regression summary for the selected features)

So, how does our model do?

Step-wise regression is a greedy algorithm that iteratively selects the features that most improve the model's performance. We used it to find the 10 shows most predictive of the normalized stock values:

  1. Big Mouth
  2. Dirty John
  3. Emily in Paris
  4. I Am a Killer Released
  5. Lucifer
  6. Selling Sunset
  7. Squid Game
  8. The Last Kingdom
  9. The Royal Treatment
  10. Vivo

Note that appearing on this list doesn't necessarily mean a show caused the stock value to improve. For example, The Royal Treatment was significantly associated with a negative trend in Netflix's stock growth.

A better question might be: "how reliable are these results?"

Ultimately, the performance of the model constructed from the inputs produced in exercise 9 is relatively poor: it exhibits an adjusted R² value of only 0.042. Predicting stock prices is intrinsically hard, as each observation carries an exceptionally high amount of variance and market prices fluctuate for many reasons, so it is unsurprising that most of the variance remains unexplained.

We are also fighting "the curse of dimensionality," which means the excess data we have can be more of a hindrance than a help. However, the F-statistic for our model is quite good (with a p-value of 0.000684), which suggests that our model predicts stock value changes much better than a naive model.

You will learn more about various modeling techniques and how to use them during the rest of the course. Improving the regression results using the outputs of this exam is left as an exercise for the student; for high-quality regression outputs, you may find the statsmodels package particularly useful. Here's one approach you might find worth exploring (a small sketch follows the list):

  • Calculate the logarithm of the golden cross ratio for the standardized stock prices. Use this as your response variable.
  • Calculate the logarithm of the golden cross ratio for the Netflix original show scores. Use these ratios as your predictors.
  • Use variable selection, as shown above, to find the strongest correlations. You may be surprised by the strength of the relationships!
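
For the curious, a minimal sketch of the first two bullets, assuming "golden cross ratio" here means the ratio of a short-window to a long-window rolling mean (the 50/200-day windows and the series names below are assumptions for illustration, not exam specifications):

import numpy as np
import pandas as pd

def log_golden_cross_ratio(series: pd.Series, short: int = 50, long: int = 200) -> pd.Series:
    # log(short-window rolling mean / long-window rolling mean) of a series.
    return np.log(series.rolling(short).mean() / series.rolling(long).mean())

# Hypothetical usage: response from standardized prices, predictors from show scores.
# response   = log_golden_cross_ratio(standardized_prices)
# predictors = show_scores_wide.apply(log_golden_cross_ratio)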