Midterm 2, Fall 2024: Netflix and Bills
Version 1.0.3
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of General Python, Tabular Data, Pandas, SQL and SQLite, and NumPy. It has 10 exercises, numbered 0 to 9. There are 17 available points; however, the threshold to earn 100% is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we do not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or a chunk of rows at a time).
Exercise point breakdown:
Exercise 0 - load_data: 1 point(s)
Exercise 1 - preview_data: 1 point(s) - FREE
Exercise 2 - calculate_daily_growth: 3 point(s)
Exercise 3 - calculate_annual_growth: 2 point(s)
Exercise 4 - golden_cross_strategy: 2 point(s)
Exercise 5 - normalize_stock_growths: 0 point(s) - FREE
Exercise 6 - summarize_netflix_original_genres: 1 point(s)
Exercise 7 - calc_netflix_top10_scores: 2 point(s)
Exercise 8 - longterm_avg_score: 3 point(s)
Exercise 9 - construct_model_inputs: 2 point(s)
Final reminders:
Background. Netflix is a firm in the entertainment industry which has grown to prominence in the streaming era of movies and television. The company distributes both acquired and original programming via a streaming service.
Your overall task. Your goal is to attempt to evaluate whether there is a relationship between the programming that is distributed by Netflix and the financial performance of the firm. You will do this in two main steps: first, you will process the firms' stock data to compute growth metrics (exercises 0-5); second, you will query the database of Netflix programming to score its shows (exercises 6-8).
Once you have obtained the results from both steps, you will use the results to build a model which will attempt to evaluate whether there is a relationship between the programs and the firm's financial standing.
The datasets. There are three sources of data which you will use to solve the following exercises.
The financial data (source 1) is provided as a Python list of dictionaries. You will load this data into a Pandas DataFrame. The daily Netflix data (source 2) and original programming information (source 3) are contained in a SQLite database.
SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
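For example, here is a small, standalone illustration of a dialect difference (run against a throwaway in-memory database; nothing here touches the exam data):
### Optional aside: SQLite dialect check (standalone)
import sqlite3
demo = sqlite3.connect(':memory:')
# String concatenation is '||' in SQLite; T-SQL's '+' operator would not work here
print(demo.execute("SELECT 'net' || 'flix'").fetchone())  # ('netflix',)
demo.close()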
### Global imports
import dill
from cse6040_devkit import plugins, utils
from pprint import pprint, pformat
import numpy as np
import pandas as pd
import sqlite3
import random
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
# Load starting data and define a database connection
stock_data = utils.load_object_from_publicdata('raw_stock_data_nflx.dill')
conn = sqlite3.connect('resource/asnlib/publicdata/NetflixDb.sqlite_db')
load_data
Your task: define load_data as follows:
Load the data contained in records into a Pandas DataFrame. See the instructions below for the requirements.
Inputs:
records: A list of dictionaries. Each dictionary contains information for a single row of data.
Return: full_data: A Pandas DataFrame containing the data held in records. The DataFrame should contain the following columns:
Firm (renamed from the Stock key in records)
Date (converted to dtype datetime64[ns, America/New_York])
Open
Close
Volume
Requirements/steps:
Do not drop any NaN values!
Hints:
We have provided the helper function convert_timezone_types() to make the date conversion easier. convert_timezone_types() takes an input of a pd.Series with dtype "string" and then outputs a pd.Series with dtype "datetime64[ns, America/New_York]". See the demo code for more examples.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
stock_subsample_df_demo_CORRECT = utils.load_object_from_publicdata('stock_subsample_df_demo.dill')
convert_timezone_types = utils.load_object_from_publicdata('convert_timezone_types.dill')
display(stock_subsample_df_demo_CORRECT)
stock_data[:10]
### Solution - Exercise 0
def load_data(records: list) -> pd.DataFrame:
'''
Steps:
1. Examine the input (records) to get a sense of what the data looks like.
2. Create a dataframe with the list of dictionaries.
3. Convert the date column to the data type "datetime64[ns, America/New_York]".
4. Rename the 'Stock' column to be 'Firm'.
5. Return the dataframe.
Note:
If you want to not use the built in function, this is how you would convert the dates:
my_df['Date'] = my_df['Date'].apply(lambda x: pd.to_datetime(x).tz_convert('America/New_York'))
'''
my_df = pd.DataFrame(records)
my_df['Date'] = convert_timezone_types(my_df['Date'])
my_df = my_df.rename(columns={'Stock': 'Firm'})
return my_df
### Demo function call
### `convert_timezone_types` demo ---------------------------------------------
# This helper function will help you change the data-type
# -- Input: pd.Series, with dtype "string"
# -- Output: pd.Series, with dtype "datetime64[ns, America/New_York]"
type_conversion_demo = pd.DataFrame({
'Date': ['2017-11-15 00:00:00-05:00', '2021-01-15 00:00:00-05:00'],
'DemoColumn': [1, 2]
})
# Convert the type!
type_conversion_demo['Date'] = convert_timezone_types(type_conversion_demo['Date'])
assert type_conversion_demo['Date'].dtype == 'datetime64[ns, America/New_York]' # It works!
### `load_data` demo ----------------------------------------------------------
# We'll sample the dataset down to 10 records, just to make debugging easier.
random.seed(6040)
stock_data_subsample = random.sample(stock_data, k=10)
# Here's what your solution produces!
print('Here is what your loaded data looks like in Pandas:')
stock_subsample_df = load_data(stock_data_subsample)
display(stock_subsample_df)
The cell below will test your solution for load_data (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 0
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(load_data),
ex_name='load_data',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to load_data did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
stock_df = utils.load_object_from_publicdata('stock_df')
preview_data
Example: we have defined preview_data as follows:
Preview the three data sources as outlined below.
Inputs:
records: A Pandas DataFrame containing the results of the full_data output generated by Exercise 0.
conn: A SQLite connection to our database.
Return: A tuple, containing the following elements:
records_preview: A Pandas DataFrame containing 10 entries.
nflx_originals_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_originals SQL database table.
nflx_top_preview: A Pandas DataFrame containing 5 different, random entries from the nflx_top SQL database table.
Requirements/steps:
records_preview: If there are fewer than 10 entries in records, all entries should be returned.
nflx_originals_preview: The selected entries may be any 5 rows from the original table.
nflx_top_preview: The selected entries may be any 5 rows from the original table.
A solution is provided for you below. Run it and the test cell to earn your free point, then submit the assignment!
### Solution - Exercise 1
def preview_data(records: list, conn: sqlite3.Connection) -> tuple:
# Keep only the first 10 records
records_preview = records.head(n=10)
# Select everything from nflx_originals and randomly sample 5 rows
nflx_originals_preview = pd\
.read_sql('SELECT * FROM nflx_originals', conn)\
.sample(5)
# Select everything from nflx_top and randomly sample 5 rows
nflx_top_preview = pd\
.read_sql('SELECT * FROM nflx_top', conn)\
.sample(5)
return records_preview, nflx_originals_preview, nflx_top_preview
### Demo function call
records_preview, nflx_originals, nflx_top = preview_data(stock_df, conn)
print('Preview of Stock Data')
display(records_preview)
print('------------------------------------------------------------------------')
print('Preview of Netflix Originals Table')
display(nflx_originals)
print('------------------------------------------------------------------------')
print('Preview of Netflix Top Programs Table')
display(nflx_top)
The test cell below will always pass. Please submit to collect your free points for preview_data (exercise 1).
### Test Cell - Exercise 1
print('Passed! Please submit.')
### Run Me!!!
calculate_daily_growth_demo_input = utils.load_object_from_publicdata('calculate_daily_growth_demo_input')
calculate_daily_growth
Your task: define calculate_daily_growth as follows:
Calculate the daily growth for a given firm, as outlined below.
Inputs:
stock_records: a Pandas DataFrame. It contains the full stock data loaded in exercise 0.
firm: a string specifying a particular firm's stock.
Return: stock_growth: a Pandas DataFrame with the following columns:
Date
Firm
Open
Close
Volume
Day_Growth
Requirements/steps:
Your result should have Day_Growth appended as a new column. The growth should be specific to the firm specified by the firm parameter.
The growth for day $t$ is defined as $(Close_t / Close_{t-1}) - 1$, where $Close_{t-1}$ is the value of Close, offset by 1 position. (A short sketch of this calculation appears after the demo input below.)
Watch out for NaN values when you initially calculate the growth.
Your output (stock_growth) should contain the same number of records as stock_records for that firm.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print("Demo output for Netflix:")
daily_growth_df_CORRECT = utils.load_object_from_publicdata('daily_growth_df.dill')
display(daily_growth_df_CORRECT)
calculate_daily_growth_demo_input.head()
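Before tackling the solution, note that the growth formula above is exactly what pandas' pct_change computes. Here is a minimal sketch on made-up prices (a hint about the formula, not the required approach):
### Hint sketch: daily growth is a percent change (toy prices, not exam data)
toy_close = pd.Series([100.0, 102.0, 99.0])
manual = (toy_close / toy_close.shift(1)) - 1   # the formula above
print(manual.equals(toy_close.pct_change()))    # True; the first entry is NaN in both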
### Solution - Exercise 2
def calculate_daily_growth(stock_records: pd.DataFrame, firm: str) -> pd.DataFrame:
'''
Steps:
1. Examine the input dataframe, stock_records, to understand what the data looks like. For us, that is
calculate_daily_growth_demo_input.
2. Create a new dataframe called df.
a. Create a copy of the stock_records df, so we don't accidentally modify the original dataframe.
b. Filter so we are only looking at the firm specified in the "firm" input string.
c. Create a new column called Day_Growth. I decided to initialize it so every value was 0.0, since this will
help with the days with NaN Open and Close values.
3. Isolate the rows that have valid open / close prices.
a. Create one dataframe that only contains rows with NaN values for Open. We want to be able to merge these
back with our other dataframe in the end.
b. Create another dataframe with NaN rows dropped (for the columns open and close).
Note: for this, I looked at the Pandas documentation for dropna to remember the syntax for looking
at a subset.
4. Utilize the shift() function to easily get the last closing value.
a. Look at the linked documentation for shift and some examples.
b. Use the shift function and print the results to confirm that it is doing what we want it to.
c. Create a new column called "last_close" to hold these shifted values.
Note: I kept getting the SettingWithCopyWarning when doing this. I ignored it until the end and came
back to it. There were two ways that I found to fix it: either adding a .copy() at the end of the
statement creating df_no_na or by using an "assign" function that I found when looking through
StackOverflow. I have included both below.
5. Calculate the Day_Growth column using the formula above: (Close_t / Close_{t-1}) - 1.
a. Since we will want to use multiple columns, I automatically knew that we would want to use an apply
function along the non-default axis. I always forget which axis to use, so I typically try both to
determine which is correct. In this case, it would be axis=1.
b. Use the apply function and then set the Day_Growth column equal to these calculated values.
6. Combine the NaN rows with the rows that we just used (df_no_na).
a. Delete the column "last_close" that we created for temporary storage.
b. Use the concat function to add the na_rows to our dataframe, df_no_na.
c. Sort the values by date to fix the order.
7. Reset the index of stock_growth and return.
When I followed these steps, I had an error. I compared my output with the expected output to realize that the
first row of my df_no_na dataframe had a Day_Growth of NaN due to the nature of the shift function. So, I filled
the NaN rows in df_no_na to fix this problem.
'''
# step 2
df = stock_records.copy()
df = df[df['Firm']==firm]
df['Day_Growth'] = 0.000000
# step 3
na_mask = df['Open'].isna()
na_rows = df[na_mask]
# print(na_rows.head())
df_no_na = df.dropna(subset=['Open', 'Close']).copy()
# step 4
shifted_vals = df_no_na['Close'].shift(periods=1)
# print(shifted_vals)
df_no_na = df_no_na.assign(last_close = shifted_vals.values)
# if using .copy(), could just say df_no_na['last_close'] = list(shifted_vals.values)
# step 5
new_day_growth = df_no_na.apply(lambda x: (x['Close']/x['last_close'])-1, axis=1).copy()
df_no_na = df_no_na.assign(Day_Growth = new_day_growth.values)
# if using .copy(), could just say df_no_na['Day_Growth'] = list(new_day_growth.values)
# step 6
del df_no_na['last_close']
df_no_na.fillna(0, inplace=True) # fill the NaN introduced by shift (see the note above)
stock_growth = pd.concat([na_rows, df_no_na])
stock_growth = stock_growth.sort_values(by='Date')
# step 7
return stock_growth.reset_index(drop=True)
### Demo function call
print("Demo daily growth for Netflix's stock:")
daily_growth_nflx_demo = calculate_daily_growth(
stock_records=calculate_daily_growth_demo_input,
firm='NFLX'
)
display(daily_growth_nflx_demo)
The cell below will test your solution for calculate_daily_growth (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 2
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_daily_growth),
ex_name='calculate_daily_growth',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_daily_growth did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
stock_daily_growths_df = utils.load_object_from_publicdata('stock_daily_growths_df')
stock_daily_growths_reduced_df = utils.load_object_from_publicdata('stock_daily_growths_reduced_df')
calculate_annual_growth
Your task: define calculate_annual_growth as follows:
Calculate the annual growth for every firm and join the results to a copy of the original stock_records DataFrame.
Inputs:
stock_records: a Pandas DataFrame. It contains the results calculated by exercise 2.
Return: annual_growth: a Pandas DataFrame with the same columns as stock_records, along with columns for:
Year
Annual_Growth
Requirements/steps:
Extract Year from the Date column of stock_records. Year is expected to be a string in the final output.
Annual_Growth is the product of each (Day_Growth + 1), grouped by firm and year. Make sure to subtract 1 at the end.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
We have also provided annual_growth_grouped_df_demo, which contains information useful for checking your work. If you wish to use it, uncomment the code below to view it.
You should notice in this result that the same value for Annual_Growth appears in each row for a given Firm and Year. For example, rows where AAPL is the Firm and Year is '2018' all have an Annual_Growth of -0.053902, and NFLX rows for '2020' have 0.671138.
The Annual_Growth is calculated by adding 1 to each Day_Growth observation, multiplying all of these together, and then subtracting 1 from the product.
Mathematically, let $g_{f,y,i}$ represent the Day_Growth value for firm $f$, year $y$, and observation $i$. Then the Annual_Growth value for the same firm and year, $A_{f,y}$, is given by:
$$A_{f,y} = \left( \prod_{i} \left( 1 + g_{f,y,i} \right) \right) - 1$$
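As a quick numeric check of this formula on made-up growth values (not exam data):
### Optional numeric check (toy values, not exam data)
g = np.array([0.01, -0.02, 0.03])   # three hypothetical Day_Growth observations
print(np.prod(1 + g) - 1)           # 1.01 * 0.98 * 1.03 - 1 ≈ 0.019494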
print("Demo annual growth:")
annual_growth_demo_df_CORRECT = utils.load_object_from_publicdata('annual_growth_demo_df.dill')
display(annual_growth_demo_df_CORRECT)
### Uncomment these lines for additional information!
# print("Annual growth, grouped together:")
# annual_growth_grouped_df_demo = utils.load_object_from_publicdata('annual_growth_grouped_df_demo.dill')
# display(annual_growth_grouped_df_demo)
### Solution - Exercise 3
def calculate_annual_growth(stock_records: pd.DataFrame) -> pd.DataFrame:
'''
Steps:
1. Examine the stock_records dataframe to remind us visually the difference between our input and expected
output.
2. Create a new dataframe called annual_growth with a new "Year" column.
a. Create a copy of the input dataframe, stock_records.
b. Extract the year from the Date column.
Googled: "get year from datetime pandas"
Referenced: https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-separately-from-pandas-datetime-column
c. Convert Year to a string, as instructed above.
3. Look at the instructions to get an understanding of how to calculate Annual Growth. The below instruction helps
break it down into steps:
"The Annual_Growth is calculated by adding 1 to each Day_Growth observation. Multiplying all of them
together. Then subtracting 1 from the product."
4. Calculate Annual_Growth.
a. Create a new column called "updated_day" that holds the value of Day_Growth + 1.
b. Group by Firm and Year, multiplying the values of updated_day together.
Googled: "pandas groupby multiply values from group set to new column"
Referenced: https://stackoverflow.com/questions/69471371/pandas-multiply-row-value-by-groupby-of-another-column-as-a-new-column
c. Delete the temporary updated_day column.
d. Subtract 1 from the Annual_Growth.
5. Return the final dataframe after reordering the columns and resetting the index.
'''
# step 2
annual_growth = stock_records.copy()
annual_growth['Year'] = annual_growth['Date'].dt.year
annual_growth['Year'] = annual_growth['Year'].astype(str)
# step 4
annual_growth['updated_day'] = annual_growth.apply(lambda x: x['Day_Growth']+1, axis=1)
annual_growth['Annual_Growth'] = annual_growth.groupby(by=['Firm', 'Year'])['updated_day'].transform('prod')
del annual_growth['updated_day']
annual_growth['Annual_Growth'] = annual_growth['Annual_Growth'].apply(lambda x: x-1)
# step 5
annual_growth = annual_growth[['Date', 'Year', 'Firm', 'Open', 'Close', 'Volume', 'Day_Growth', 'Annual_Growth']]
return annual_growth.reset_index(drop=True)
### Demo function call
annual_growth_demo_df = calculate_annual_growth(stock_records=stock_daily_growths_reduced_df)
display(annual_growth_demo_df)
The cell below will test your solution for calculate_annual_growth (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(calculate_annual_growth),
ex_name='calculate_annual_growth',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calculate_annual_growth did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
annual_growth_df = utils.load_object_from_publicdata('annual_growth_df')
golden_cross_strategy
Your task: define golden_cross_strategy as follows:
Implement a function which calculates a Golden Cross strategy, as detailed below.
Inputs:
stockdf: a Pandas DataFrame. It contains the results calculated by exercise 3.
firm: a string. It specifies which firm to calculate the golden cross results for.
short_average: an integer. It defines the window for the short moving average. The number represents the number of days to include in the window.
long_average: an integer. It defines the window for the long moving average. The number represents the number of days to include in the window.
Return: golden_cross_results: a Pandas DataFrame with all of the columns from stockdf, plus:
Short_Average
Long_Average
Golden_Cross
Requirements/steps:
Filter the DataFrame to include only records for the firm given by the firm argument.
Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
Add the columns Short_Average and Long_Average to the Pandas DataFrame. These are rolling averages of the Close column, with windows defined by the short_average and long_average arguments.
Add the boolean column Golden_Cross, which is True wherever Short_Average is strictly greater than Long_Average and False otherwise.
You may assume stockdf is already ordered by date in ascending order.
Notes
The Golden Cross (https://www.investopedia.com/terms/g/goldencross.asp) is a technical analysis pattern that indicates a potential bullish trend reversal. It occurs when a faster, short-term moving average crosses above a slower, longer-term moving average. For example, a 50-day moving average (the faster MA) crossing above a 200-day moving average (the slower MA) creates a Golden Cross signal.
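To make the window mechanics concrete, here is a minimal sketch on made-up prices; note that rolling(n).mean() is equivalent to the rolling(n).sum() / n used in the solution below:
### Mechanics sketch: rolling averages on toy prices (not exam data)
toy_close = pd.Series([10.0, 11.0, 12.0, 11.5, 13.0])
short_ma = toy_close.rolling(2).mean()
long_ma = toy_close.rolling(3).mean()
# NaN comparisons evaluate to False, so the leading window-1 rows are never crosses
print((short_ma > long_ma).tolist())  # [False, False, True, True, True]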
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print("Example Golden Cross Demo:")
golden_cross_demo_df_CORRECT = utils.load_object_from_publicdata('golden_cross_demo_df.dill')
display(golden_cross_demo_df_CORRECT.head(15))
### Solution - Exercise 4
def golden_cross_strategy(stockdf: pd.DataFrame, firm: str, short_average: int, long_average: int) -> pd.DataFrame:
'''
Steps:
1. Filter the DataFrame to include records for the firm parameter.
a. Create a copy of the stockdf dataframe.
b. Only include rows where the 'Firm' is equal to the input string firm.
2. Some Close values are recorded as NaN. These should be filtered out before calculating the moving averages.
a. Use dropna to filter out rows where Close values are NaN.
3. Add the columns Short_Average and Long_Average to the Pandas DataFrame.
a. Look at the linked documentation for how to use the rolling function and look at some examples.
b. Since a Rolling object has no average attribute that I found, manually calculate the average values.
c. Use the assign method to avoid the SettingWithCopyWarning.
4. Add the column 'Golden_Cross', a boolean, to the Pandas DataFrame.
Method 1:
a. Create a Series where values are True if the Short_Average is greater than the Long_Average.
b. Create a new column called 'Golden_Cross' and assign these values to it.
Method 2:
Googled: "pandas new column based on condition of other columns"
Referenced: https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition
a. Use the example from StackOverflow to set the Golden_Cross column to be True if Short_Average >
Long_Average and False if Short_Average <= Long_Average.
b. When checking this, rows where Short_Average and/or Long_Average still have NaN.
Googled: "fill na for one column pandas"
Referenced: https://stackoverflow.com/questions/38134012/pandas-dataframe-fillna-only-some-columns-in-place
c. Fill the NaN values in the Golden_Cross column with False.
5. Finally, reset the indexes before returning the resulting dataframe.
'''
# step 1
df = stockdf.copy()
df = df[df['Firm']==firm]
# step 2
df_notna = df.dropna(subset=['Close'])
# step 3
short_window = (df_notna['Close'].rolling(short_average).sum())/short_average
long_window = (df_notna['Close'].rolling(long_average).sum())/long_average
df_notna = df_notna.assign(Short_Average = short_window.values)
df_notna = df_notna.assign(Long_Average = long_window.values)
# step 4, method 1
golden_cross = df_notna['Short_Average']>df_notna['Long_Average']
df_notna = df_notna.assign(Golden_Cross = golden_cross.values)
# step 4, method 2
# df_notna.loc[df_notna['Short_Average']>df_notna['Long_Average'], 'Golden_Cross'] = True
# df_notna.loc[df_notna['Short_Average']<=df_notna['Long_Average'], 'Golden_Cross'] = False
# df_notna = df_notna.fillna({'Golden_Cross': False})
# step 5
golden_cross_results = df_notna.copy()
return golden_cross_results.reset_index(drop=True)
### Demo function call
golden_cross_demo_df = golden_cross_strategy(
stockdf=annual_growth_df,
firm='NFLX',
short_average=3,
long_average=7)
# Let's only look at the first 15 days for our input
display(golden_cross_demo_df.head(15))
The cell below will test your solution for golden_cross_strategy (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(golden_cross_strategy),
ex_name='golden_cross_strategy',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to golden_cross_strategy did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
golden_crosses_df = utils.load_object_from_publicdata('golden_crosses_df')
golden_crosses_reduced_df = utils.load_object_from_publicdata('golden_crosses_reduced_df')
normalize_stock_growths
Example: we have defined normalize_stock_growths as follows:
Normalize the daily growth of a given firm by comparing it to the growth of the other provided firms.
Inputs:
daily_growth_by_firm: a Pandas DataFrame. It contains many of the variables we have already calculated.
firm: a string. It specifies which firm to calculate the Relative_Growth for.
Return: normalized_results: a Pandas DataFrame with the following columns:
Date, which contains every unique date value in the input.
One column per firm in daily_growth_by_firm, where the value is equal to the growth of that firm's stock for the relevant Date. These values come from the Day_Growth column of daily_growth_by_firm.
Non_<FIRM>_Average_Growth (as defined below)
Requirements/steps:
For Non_<FIRM>_Average_Growth:
For each Date, take the mean of the growths for every firm which is not equal to firm.
For example, if the firms are AAPL, NFLX, and AMZN, and firm=NFLX, then Non_<FIRM>_Average_Growth would be the average growth of AAPL and AMZN for each day.
Replace "<FIRM>" in the column name with the value given by firm. For example, if firm='AAPL', then your column name should be Non_AAPL_Average_Growth.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it. A toy sketch of the pivot-and-average idea follows it.
print('Properly Averaged Industry Growth:')
normalized_returns_demo_df_CORRECT = utils.load_object_from_publicdata('normalized_returns_demo_df.dill')
display(normalized_returns_demo_df_CORRECT)
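To see the pivot-and-average idea on a toy input before reading the provided solution (made-up numbers, not the exam data):
### Mechanics sketch: pivot by firm, then average the other firms (toy data)
toy = pd.DataFrame({
    'Date': ['d1', 'd1', 'd1', 'd2', 'd2', 'd2'],
    'Firm': ['AAPL', 'NFLX', 'AMZN'] * 2,
    'Day_Growth': [0.01, 0.02, -0.01, 0.00, 0.03, 0.01],
})
wide = toy.pivot(index='Date', columns='Firm', values='Day_Growth').reset_index()
wide['Non_NFLX_Average_Growth'] = wide[['AAPL', 'AMZN']].mean(axis=1)
print(wide)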
### Solution - Exercise 5
def normalize_stock_growths(daily_growth_by_firm: pd.DataFrame, firm: str) -> pd.DataFrame:
# Create a list of every firm, except for the one we specified in the inputs
non_chosen_firms = set(daily_growth_by_firm['Firm']) - set([firm])
non_chosen_firms = list(non_chosen_firms)
# Pivot the data for easy manipulation
pivoted_data = daily_growth_by_firm.pivot(
index='Date',
columns='Firm',
values='Day_Growth'
).reset_index()
# Calculate the average growth across every firm that isn't our firm
pivoted_data[f'Non_{firm}_Average_Growth'] = pivoted_data[non_chosen_firms]\
.sum(axis=1) / len(non_chosen_firms)
return pivoted_data
### Demo function call
normalized_returns_demo_df = normalize_stock_growths(golden_crosses_reduced_df, 'NFLX')
display(normalized_returns_demo_df)
The test cell below will always pass. Please submit to collect your free points for normalize_stock_growths (exercise 5).
### Test Cell - Exercise 5
print('Passed! Please submit.')
summarize_netflix_original_genres
Your task: define summarize_netflix_original_genres as follows:
Define the function to return a SQL query which summarizes the trends of Netflix originals by year and genre, as detailed below.
Inputs: None
Return: query
: a Python string, containing a SQLite query. It should query the database to obtain a table with three columns:
year
genre
genre_count
Requirements/steps:
year
variable should be extracted from the premiere
variable. Note that in this case, year will be a string. We suggest you use the STRFTIME SQLite function to extract year
from the premiere
variable.genre_count
(descending)year
(ascending)genre
(ascending)nflx_originals
.Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper summary of genres for the demo:')
netflix_original_genre_summary_df_CORRECT = utils.load_object_from_publicdata('netflix_original_genre_summary_df.dill')
display(netflix_original_genre_summary_df_CORRECT)
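If STRFTIME is unfamiliar, here is a one-line illustration against a throwaway in-memory database (it does not touch the exam's connection):
### STRFTIME illustration (standalone)
demo = sqlite3.connect(':memory:')
print(demo.execute("SELECT STRFTIME('%Y', '2021-03-05')").fetchone()[0])  # '2021', a string
demo.close()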
### Solution - Exercise 6
def summarize_netflix_original_genres()-> str:
'''
Steps:
1. Look at the database table in question: 'nflx_originals'. This helps us understand the structure of the
table that we are working with.
2. Look at the linked documentation for STRFTIME and test a small query with it to make sure it works.
3. Add in the group by and count() statements.
Count the number of Netflix original shows, grouped by year and by genre.
4. Add in the order by statements.
Order them by:
genre_count (descending)
year (ascending)
genre (ascending)
'''
query_1 = 'SELECT * FROM nflx_originals LIMIT 10'
query_2 = f'''SELECT premiere, STRFTIME('%Y', premiere) AS year FROM nflx_originals LIMIT 10'''
query_3 = f'''
SELECT STRFTIME('%Y', premiere) AS year, genre, COUNT(*) AS genre_count
FROM nflx_originals
GROUP BY year, genre
LIMIT 10
'''
query_final = f'''
SELECT STRFTIME('%Y', premiere) AS year, genre, COUNT(*) AS genre_count
FROM nflx_originals
GROUP BY year, genre
ORDER BY genre_count DESC, year, genre
LIMIT 10
'''
return query_final
### Demo function call
summary_query = summarize_netflix_original_genres()
netflix_original_genres = pd.read_sql(summary_query, conn)
display(netflix_original_genres)
The cell below will test your solution for summarize_netflix_original_genres (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(summarize_netflix_original_genres),
ex_name='summarize_netflix_original_genres',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to summarize_netflix_original_genres did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
calc_netflix_top10_scores
Your task: define calc_netflix_top10_scores as follows:
Define the function to return a SQL query which calculates the number of times each show appeared in the top 10, the show's score, and the show's average score when it was in the top 10. The concept of a 'score' is defined below.
Inputs: None
Return: query
: a Python string, containing a SQLite query. It should query the database table nflx_top
to obtain a table with the following attributes:
title
: the title of the showtotal_score
: the sum of the show's score.occurrence
: the number of times a show appeared in the top 10.avg_score
: the total_score
divided by the occurrence
value for a given show.Requirements/steps:
avg_score
to produce a floating-point valuetotal_score
in descending ordertitle
in descending orderHints:
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper summary of show score metrics:')
netflix_top_10_scores_df_CORRECT = utils.load_object_from_publicdata('netflix_top_10_scores_df.dill')
display(netflix_top_10_scores_df_CORRECT)
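Why the floating-point cast matters: with two integer operands, SQLite performs integer division. A standalone sketch:
### Integer division sketch (standalone)
demo = sqlite3.connect(':memory:')
print(demo.execute('SELECT 7 / 2, CAST(7 AS float) / 2').fetchone())  # (3, 3.5)
demo.close()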
### Solution - Exercise 7
def calc_netflix_top10_scores() -> str:
'''
Steps:
1. Examine the nflx_top table.
2. Calculate scores for each title.
Note: I did this very inefficiently. The better way to do this (without a CASE WHEN statement) is:
(11 - rank) AS scores
3. Get the sum and average of all of the scores by title.
4. Get the number of times a show appeared in the top 10.
Note: COUNT(expr) counts the rows where the expression is non-NULL. Since the CASE WHEN assigns every row a score,
COUNT(scores>0) here simply counts every top-10 appearance. With the 11-rank approach, COUNT(scores) does the same.
5. Add in the order by statements:
total_score in descending order
title in descending order
Note: I include LIMIT 10 from the beginning to avoid excess printing, since the final query will be limited to 10.
'''
query_1 = 'SELECT * FROM nflx_top LIMIT 10'
query_2 = f'''
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
'''
query_3 = f'''
SELECT title, SUM(scores) AS total_score, AVG(scores) AS avg_score
FROM (
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
)
GROUP BY title
LIMIT 10
'''
query_4 = f'''
SELECT title, SUM(scores) AS total_score, COUNT(scores>0) AS occurrence, AVG(scores) AS avg_score
FROM (
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
)
GROUP BY title
LIMIT 10
'''
query_final = f'''
SELECT title, SUM(scores) AS total_score, COUNT(scores>0) AS occurrence, AVG(scores) AS avg_score
FROM (
SELECT
title,
CASE
WHEN rank=1 THEN 10
WHEN rank=2 THEN 9
WHEN rank=3 THEN 8
WHEN rank=4 THEN 7
WHEN rank=5 THEN 6
WHEN rank=6 THEN 5
WHEN rank=7 THEN 4
WHEN rank=8 THEN 3
WHEN rank=9 THEN 2
WHEN rank=10 THEN 1
ELSE 0
END AS scores
FROM nflx_top
)
GROUP BY title
ORDER BY total_score DESC, title DESC
LIMIT 10
'''
return query_final
### Demo function call
scores_query = calc_netflix_top10_scores()
netflix_top10_scores = pd.read_sql(scores_query, conn)
display(netflix_top10_scores)
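As an optional sanity check (a sketch, not part of the exam; it assumes nflx_top has the title and rank columns used above), you can recompute the same metrics in pandas and compare against the SQL result:
### Optional cross-check in pandas (assumes conn is open)
top_df = pd.read_sql('SELECT * FROM nflx_top', conn)
top_df['score'] = 11 - top_df['rank']
check = (top_df.groupby('title')['score']
               .agg(total_score='sum', occurrence='count', avg_score='mean')
               .reset_index()
               .sort_values(['total_score', 'title'], ascending=[False, False])
               .head(10))
display(check)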
The cell below will test your solution for calc_netflix_top10_scores (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(calc_netflix_top10_scores),
ex_name='calc_netflix_top10_scores',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to calc_netflix_top10_scores did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
longterm_avg_score
Your task: define longterm_avg_score as follows:
Define the function to return a SQL query which calculates the long-term score for each Netflix Original, as defined below.
Inputs: None
Return: query
: a Python string, containing a SQLite query. It should query the database to obtain a table with the following attributes:
title
- (string): the titles of the Netflix Originalstotal_score
- (int): the sum of all scores for each title given in nflx_originals
date_count
- (int): the number of dates that a given title is available, as detailed belowlongterm_avg_score
- (float): the long-term average score, defined as:longterm_avg_score = total_score/date_count
Requirements/steps:
nflx_originals
nflx_top
title
on a given date, where the score is defined as:date_count
, count the number of date
values in nflx_top
that exist for a title
, where the date
is greater than or equal to the premiere
for that title
<distinct_dates>
: Identify all distinct date values in nflx_top
.<title_dates>
: For each title in nflx_originals
, identify any dates in <distinct_dates>
which are greater than or equal to the title's premiere.<title_date_score>
: For each title/date pair in <title_dates>
, identify any records from nflx_top
which have the same title/date pair.rank
field from nflx_top
to calculate a given score.<title_date_score>
but not in nflx_top
should not be dropped and should have a NULL (score) value in <title_date_score>
.<title_date_score>
, calculate:total_score
- compute the sum of all the score values sharing a common title.date_count
- compute the count of all date values sharing a common title.longterm_avg_score
- compute the quotient of the total_score over date_count.longterm_avg_score
that are NULL.longterm_avg_score
in descending ordertitle
in ascending orderHints:
An Example:
Let's take a deeper look specifically at the show Vikings: Valhalla.
According to the nflx_originals table, the show premiered on 2022-02-25.
The DataFrame longterm_avg_vikings_breakout_df_CORRECT below uses the results from the nflx_top table for the show titled Vikings: Valhalla.
In nflx_top for Vikings: Valhalla, the dates greater than or equal to the premiere date range from 2022-02-25 to 2022-03-11.
The show was not ranked on its premiere date, so Vikings: Valhalla does not have a score for that date. Vikings: Valhalla received its first ranking on 2022-02-26. Therefore, its score for 2022-02-26 would be 11 minus its rank on that date.
longterm_avg_vikings_breakout_df_CORRECT below shows a table with the show title, premiere, date, rank, and score for all dates in the range from 2022-02-25 to 2022-03-11 for Vikings: Valhalla.
We can then calculate total_score, date_count, and longterm_avg_score for the show Vikings: Valhalla:
longterm_avg_vikings_breakout_df_CORRECT = utils.load_object_from_publicdata('longterm_avg_vikings_breakout_df.dill')
display(longterm_avg_vikings_breakout_df_CORRECT)
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
longterm_avg_scores_df = utils.load_object_from_publicdata('longterm_avg_scores_df.dill')
display(longterm_avg_scores_df)
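If it helps to reason about the recommended approach outside SQL, here is a rough pandas translation (an illustrative sketch only; it assumes pandas >= 1.2 for how='cross' and ISO-formatted date/premiere strings, the same assumption the SQL comparison makes):
### Optional: rough pandas analogue of the recommended approach (sketch only)
originals = pd.read_sql('SELECT title, premiere FROM nflx_originals', conn)
top = pd.read_sql('SELECT title, date, rank FROM nflx_top', conn)
distinct_dates = top[['date']].drop_duplicates()                           # <distinct_dates>
title_dates = originals.merge(distinct_dates, how='cross')
title_dates = title_dates[title_dates['date'] >= title_dates['premiere']]  # <title_dates>
tds = title_dates.merge(top, on=['title', 'date'], how='left')             # <title_date_score>
tds['score'] = 11 - tds['rank']                                            # NaN where never ranked
result = (tds.groupby('title')
             .agg(total_score=('score', lambda s: s.sum(min_count=1)),  # all-NaN -> NaN, like SQL SUM
                  date_count=('date', 'count'))
             .reset_index())
result['longterm_avg_score'] = result['total_score'] / result['date_count']
result = (result.dropna(subset=['longterm_avg_score'])
                .sort_values(['longterm_avg_score', 'title'], ascending=[False, True])
                .head(10))
display(result)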
# NOTE: This is my original query before the recommended approach was added.
query_final = f'''
WITH total_score_calc AS (
SELECT nflx_originals.title, SUM(scores) AS total_score, nflx_release_date
FROM (
SELECT
title,
(11 - rank) AS scores,
nflx_release_date
FROM nflx_top
) AS top_scores
JOIN nflx_originals ON nflx_originals.title = top_scores.title
GROUP BY nflx_originals.title, nflx_release_date
),
poss_days AS (
SELECT
premiere,
COUNT(DISTINCT date) AS num_poss
FROM nflx_top
CROSS JOIN nflx_originals
WHERE date >= premiere
GROUP BY premiere
ORDER BY premiere DESC
)
SELECT title, total_score, num_poss AS date_count,
CAST(total_score AS float) / CAST(num_poss AS float) AS longterm_avg_score
FROM total_score_calc
JOIN poss_days ON premiere = nflx_release_date
ORDER BY longterm_avg_score DESC, title ASC
LIMIT 10
'''
### Solution - Exercise 8
def longterm_avg_score():
'''
Steps:
1. Look at the two tables that we will be querying: nflx_originals and nflx_top.
2. Obtain <distinct_dates>: Identify all distinct date values in nflx_top.
3. Obtain <title_dates>: For each title in nflx_originals, identify any dates in <distinct_dates> which are
greater than or equal to the title's premiere.
4. Obtain <title_date_score>: For each title/date pair in <title_dates>, identify any records from nflx_top
which have the same title/date pair.
a. Use the rank field from nflx_top to calculate a given score.
b. Any title/date pairs which are in <title_date_score> but not in nflx_top should not be dropped and should
have a NULL (score) value in <title_date_score>.
5. Calculate the final result.
a. For each title in <title_date_score>, calculate:
total_score - compute the sum of all the score values sharing a common title.
date_count - compute the count of all date values sharing a common title.
longterm_avg_score - compute the quotient of the total_score over date_count.
b. Filter out all records with longterm_avg_score that are NULL.
c. Finally, the results should be ordered by:
longterm_avg_score in descending order
The show's title in ascending order
d. Return the top 10 rows
'''
query_1a = 'SELECT * FROM nflx_top LIMIT 10'
query_1b = 'SELECT * FROM nflx_originals LIMIT 10'
query_2 = f''' SELECT DISTINCT date FROM nflx_top '''
query_3 = f'''
WITH distinct_dates AS (
SELECT DISTINCT date FROM nflx_top
)
SELECT
title,
premiere,
date
FROM nflx_originals
CROSS JOIN distinct_dates
WHERE premiere <= date
LIMIT 10
'''
query_4 = f'''
WITH title_dates AS (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
)
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
'''
query_5a = f'''
WITH title_date_score AS (
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
) title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
)
SELECT
title,
SUM(scores) AS total_score,
COUNT(date) AS date_count,
CAST(SUM(scores) AS DECIMAL)/COUNT(date) AS longterm_avg_score
FROM title_date_score
GROUP BY title
'''
query_5b = f'''
WITH title_date_score AS (
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
) title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
)
SELECT
title,
SUM(scores) AS total_score,
COUNT(date) AS date_count,
CAST(SUM(scores) AS DECIMAL)/COUNT(date) AS longterm_avg_score
FROM title_date_score
GROUP BY title
HAVING longterm_avg_score IS NOT NULL
'''
query_5c = f'''
WITH title_date_score AS (
SELECT
title_dates.title,
title_dates.date,
(11 - rank) AS scores
FROM (
SELECT
title,
date
FROM nflx_originals
CROSS JOIN (SELECT DISTINCT date FROM nflx_top)
WHERE premiere <= date
) title_dates
LEFT JOIN nflx_top ON title_dates.title = nflx_top.title AND title_dates.date = nflx_top.date
)
SELECT
title,
SUM(scores) AS total_score,
COUNT(date) AS date_count,
CAST(SUM(scores) AS float)/COUNT(date) AS longterm_avg_score
FROM title_date_score
GROUP BY title
HAVING longterm_avg_score IS NOT NULL
ORDER BY longterm_avg_score DESC, title ASC
LIMIT 10
'''
return query_5c
### Demo function call
query = longterm_avg_score()
longterm_avg_scores = pd.read_sql(query, conn)
display(longterm_avg_scores)
The cell below will test your solution for longterm_avg_score (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(longterm_avg_score),
ex_name='longterm_avg_score',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to longterm_avg_score did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
nflx_scores_demo_df = utils.load_object_from_publicdata('nflx_scores_demo_df')
normalized_growths_df = utils.load_object_from_publicdata('normalized_growths_df')
construct_model_inputs
Your task: define construct_model_inputs as follows:
Define the model matrix and response for our regression model, as shown below.
Inputs:
normalized_growths_df: a Pandas DataFrame. It contains the results from exercise 5.
show_scores_df: a Pandas DataFrame. It contains the pre-computed scores for each show on a given date.
Return: model_matrix: a Pandas DataFrame containing the independent variables (show scores) and response variable (Relative_Growth).
Requirements/steps:
Pivot show_scores_df so that each show is represented as its own column. The index should reflect the dates. The values should be set to the show's score. Fill any resulting NaN values with 0. (A toy pivot sketch appears after the example output below.)
Join the pivoted scores with the Relative_Growth values from normalized_growths_df by date; neither date column should appear in model_matrix.
Example. A correct implementation should produce, for the demo, the following output. Run the cell below to view it.
print('A proper model matrix for the demo:')
model_matrix_demo_df_CORRECT = utils.load_object_from_publicdata('model_matrix_demo_df.dill')
display(model_matrix_demo_df_CORRECT)
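The pivot is the crux of this exercise; here is a tiny sketch on made-up scores (hypothetical titles, not exam data):
### Mechanics sketch: pivoting toy show scores
toy_scores = pd.DataFrame({
    'date': ['d1', 'd1', 'd2'],
    'title': ['Show A', 'Show B', 'Show A'],
    'score': [10, 9, 8],
})
# Missing (date, title) pairs become NaN after the pivot; fillna(0) zeroes them
print(toy_scores.pivot(index='date', columns='title', values='score').fillna(0))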
### Solution - Exercise 9
def construct_model_inputs(normalized_growths_df: pd.DataFrame, show_scores_df: pd.DataFrame) -> pd.DataFrame:
'''
Steps:
1. Examine the two input dataframes and read the linked documentation for pivot.
2. Pivot the show_scores_df as described above.
a. Each show is represented as its own column. The index should reflect the dates. The values should be set
to the show's score.
columns: shows (title)
index: dates
values: score
b. Fill in the NaN values with 0s.
c. Reset the index of the pivoted dataframe keeping the index as a column.
3. Create the dataframe "model_matrix".
a. Keep only the Date and Relative Growth values from the normalized_growths_df dataframe.
b. Join the pivoted dataframe with normalized_growths_df on date.
c. Delete the unnecessary columns 'date' and 'Date'.
4. Return the final dataframe.
'''
# step 2
pivoted_df = show_scores_df.pivot(index='date', columns='title', values='score')
pivoted_df.fillna(0, inplace=True)
pivoted_df = pivoted_df.reset_index(drop=False)
# step 3
normalized_small_df = normalized_growths_df[['Date', 'Relative_Growth']]
model_matrix = pd.merge(normalized_small_df, pivoted_df, left_on = 'Date', right_on='date')
del model_matrix['date']
del model_matrix['Date']
# step 4
return model_matrix
### Demo function call
model_matrix = construct_model_inputs(normalized_growths_df, nflx_scores_demo_df)
display(model_matrix)
The cell below will test your solution for construct_model_inputs (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 9
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
execute_tests = dill.load(f)
# Execute test
passed, test_case_vars = execute_tests(func=construct_model_inputs,
ex_name='construct_model_inputs',
key=b'a9pptsbNJew33Kjoo3e2RkFk0FT2N1lXclu3U9zq8Aw=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
assert passed, 'The solution to construct_model_inputs did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
If you have made it this far, congratulations! You are done. Don't forget to submit the assignment!
If you run the regression model below with all of the data collected, you would obtain summary model results like the following:
##############################################
### The code below is provided for illustration and reproducibility but is commented out due to the additional processing time it adds to the auto-grader. Note: the model was run on all NFLX growth data and top shows data.
##############################################
# Let's use step-wise regression to do feature selection
mm=model_matrix.copy(deep=True)
y = mm['Relative_Growth']
del mm['Relative_Growth']
X = mm
import statsmodels.api as sm
model = LinearRegression()
feature_selector = SequentialFeatureSelector(model, n_features_to_select=min(len(X.columns)-1,10))
feature_selector.fit(X, y)
selected_X = X[feature_selector.get_feature_names_out()]
selected_X = sm.add_constant(selected_X)
print(selected_X.columns)
selected_model = sm.OLS(y, selected_X)
results = selected_model.fit()
results.summary()
So, how does our model do?
Step-wise regression is a greedy algorithm which iteratively selects the best possible features for improving the performance of our model. We tried to find the top-10 shows which were most predictive of the normalized stock values and found the following shows:
Note that these associations don't necessarily mean that the shows caused the stock value to change. For example, The Royal Treatment was significantly associated with a negative trend in Netflix's stock growth.
A better question might be: "how reliable are these results?"
Ultimately, the performance of the model constructed from the input provided by exercise 9 is relatively poor. This model exhibits an adjusted R² value of only 0.042. Predicting stock prices is intrinsically hard, as there is an exceptionally high amount of variance in each observation and there are many reasons for fluctuations in market prices. Therefore, it is unsurprising to see that a lot of the variance remains.
We are also fighting "the curse of dimensionality," which means the excess data we have can be more of a hindrance than a help. However, the F-statistic for our model is quite good (with a p-value of 0.000684), which suggests that our model does a much better job at predicting stock value changes than a naive model.
You will learn more about various modeling techniques and how to use them during the rest of the course. Working to improve the regression results with the results of this exam is left as an exercise to the students; for high-quality regression outputs, you may find the statsmodels package particularly useful. Here's one approach you might find worth exploring:
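The original example is not reproduced here, but one hedged sketch of such an approach is L1 (lasso) regularization, which shrinks most show coefficients to zero and directly targets the dimensionality problem noted above. The alpha value below is purely illustrative and would need tuning:
### One possible follow-up (a sketch, not the official solution)
import statsmodels.api as sm
y = model_matrix['Relative_Growth']
X = sm.add_constant(model_matrix.drop(columns=['Relative_Growth']))
# L1-regularized least squares; alpha is a hypothetical penalty strength to tune
lasso = sm.OLS(y, X).fit_regularized(method='elastic_net', alpha=0.001, L1_wt=1.0)
coef = pd.Series(np.asarray(lasso.params), index=X.columns)
print(coef[coef != 0].sort_values(ascending=False))  # the shows that survive the penalty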