Fall 2025 MT2: A Better Airbnb

Version 1.0.2

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of data cleaning, filtering, and analysis in Pandas, SQL, and Numpy. It has 11 exercises, numbered 0 to 10, worth 21 points in total. However, to earn 100%, you only need 13 points. (Therefore, once you hit 13 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise explore them in Python, but we do not print them in the starter code.

Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head, or chunks of rows at a time).

Exercise point breakdown:

  • Exercise 0: 1 point

  • Exercise 1: 1 point

  • Exercise 2: 1 point

  • Exercise 3: 2 points

  • Exercise 4: 2 points

  • Exercise 5: 3 points

  • Exercise 6: 2 points

  • Exercise 7: 3 points

  • Exercise 8: 3 points

  • Exercise 9: 1 point

  • Exercise 10: 2 points

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

The Problem: Planning a Vacation in Los Angeles, California

Background: Airbnb is an online marketplace that allows property owners to rent their space to prospective short- and long-term tenants. The app lets renters filter potential living spaces by destination, price, stay type, stay dates, and more. Users also have access to review data from previous guests in order to make informed decisions. Even so, the process of narrowing down a good place to stay can be overwhelming, especially if you are unfamiliar with the area.

Your overall task: Your goal is to use Airbnb data to help a prospective renter narrow down relevant stay locations, and then determine the best options based on reviews and area information. You will do this in two main steps:

  1. First, you will clean, filter, and rank the Airbnb locations using review and listing data.

  2. Second, you will analyze crime data for the various neighborhoods within LA to ensure your chosen location is safe.

The datasets: There are two sources of data which you will use to solve the following exercises.

  1. Information about Airbnb listings in Los Angeles, California from 2024-2025, sourced from InsideAirbnb.

  2. Crime data from 2020-2025 for Los Angeles, sourced from the City of Los Angeles.

The listing data (source 1) is provided as three separate Pandas data frames containing information about listings, reviews, and bookings (the calendar). The crime data (source 2) is contained in a SQLite database.

SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
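
As a quick sanity check of the dialect, the throwaway snippet below (an in-memory database with illustrative values, not part of the exam data) confirms two SQLite-specific string functions that other dialects spell differently:

    import sqlite3

    # SQLite provides substr() and printf(); other dialects may use
    # LEFT()/FORMAT() or similar instead.
    dialect_check = sqlite3.connect(':memory:')
    print(dialect_check.execute("select substr('09/15/2024', 1, 2), printf('%02d', 7)").fetchone())
    # expected: ('09', '07')
    dialect_check.close()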

In [ ]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
import re 
import pandas as pd
In [ ]:
### Run Me!!!
import pandas as pd
import numpy as np
from pprint import pprint
import sqlite3
import re
from datetime import datetime, timedelta, date
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer


listings=utils.load_object_from_publicdata('listings.dill')
reviews=utils.load_object_from_publicdata('reviews.dill')
calendar=utils.load_object_from_publicdata('calendar.dill')
conn=sqlite3.connect("resource/asnlib/publicdata/crimes_and_population.db")

Exercise 0: (1 point)

explore_airbnb_data__FREE

Example: we have defined explore_airbnb_data__FREE as follows:

This is a free exercise! Please run the test cell below to collect your FREE point! We encourage you to review the samples of both listings and reviews in the cell below:

  • listings A DataFrame containing descriptions, location, and host information for active Airbnbs in Los Angeles, California, as of September 2024.
  • reviews A DataFrame containing reviews of Airbnbs, linked to each location by listing ID.
  • calendar A DataFrame containing information about listing availability and pricing for 2024-2025.
In [ ]:
### Solution - Exercise 0  
def explore_airbnb_data__FREE(listings: pd.DataFrame, reviews: pd.DataFrame, calendar: pd.DataFrame, headsize:int=10) -> tuple:
    listings_preview = listings.head(n=headsize)
    reviews_preview = reviews.head(n=headsize)
    calendar_preview = calendar.head(n=headsize)
    return (listings_preview, reviews_preview, calendar_preview)

### Demo function call
listings_preview,reviews_preview,calendar_preview = explore_airbnb_data__FREE(listings, reviews, calendar)
display(listings.head(5))
display(reviews.head(5))
display(calendar.head(5))


The test cell below will always pass. Please submit to collect your free points for explore_airbnb_data__FREE (exercise 0).

In [ ]:
### Test Cell - Exercise 0  


print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_cleanse_listings_TRUE = utils.load_object_from_publicdata('demo_result_cleanse_listings_TRUE')

Exercise 1: (1 point)

cleanse_listings

Your task: define cleanse_listings as follows:

Input: license: A single value from the license column of listings. If it is not null, license will be a string. However, our function should be able to handle null values, which will be represented as np.nan.

Return: clean_license: A string containing either 'licensed', 'exempt', or 'unlicensed'

Requirements:

To begin, we need to clean up listings by simplifying the license column. We're not actually interested in what specific type of license each Airbnb has - just whether they are licensed, unlicensed, or exempt. To help with this, write a function that can clean a string given the following criteria:

  1. If the word 'exempt' is present anywhere in the string, return the string 'exempt'. This check is not case-sensitive.

  2. If the string appears to be a license, return the string 'licensed'. A correct license format is as follows:

    • The string begins with one or more letters and/or digits (not case-sensitive),
    • Followed by a dash,
    • Ending with one or more digits
    • Examples of passing licenses:
      • HSR19-000881
      • H-9
      • 0-9
      • hsr-0183
    • Examples of failing licenses:
      • ABCD123-4567890$
      • HSR-09R
      • License HSR-091
  3. If neither of these criteria is met, or if the value passed in is null, return 'unlicensed'. The provided starter code handles null (np.nan) inputs, but you will still need to account for strings which do not meet criteria 1 or 2.
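
As an illustration only (not the required solution), one way to encode the license format above is a regular expression anchored to the whole string; the sample strings are taken from the examples above:

    import re

    # one or more letters/digits, then a dash, then one or more digits
    license_pattern = re.compile(r'[A-Za-z0-9]+-[0-9]+')

    for s in ['HSR19-000881', 'hsr-0183', 'HSR-09R', 'License HSR-091']:
        print(s, '->', bool(license_pattern.fullmatch(s)))
    # HSR19-000881 -> True, hsr-0183 -> True,
    # HSR-09R -> False, License HSR-091 -> False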
In [ ]:
### Solution - Exercise 1  
def cleanse_listings(license: str) -> str:
    
    ##handle nan values
    if not isinstance(license, str):
        return 'unlicensed'

    ### BEGIN SOLUTION
  
    #1. 
    if 'exempt' in license.lower():
        return 'exempt'
    #2. 
    elif re.match(r'(^[a-zA-Z0-9]+-[0-9]+$)',license):
        return 'licensed'
    #3. 
    else:
        return 'unlicensed'
 

    ### END SOLUTION

### Demo function call

cleanse_listings_demo = utils.load_object_from_publicdata('cleanse_listings_demo.dill')

for i in cleanse_listings_demo:
    print(f'\n {i} -> {cleanse_listings(i)}')

The demo should display this printed output.

Exempt -> exempt

HSR19-000881 -> licensed

nan -> unlicensed

Exempt - This listing is a transient occupancy residential structure -> exempt

43158 -> unlicensed


The cell below will test your solution for cleanse_listings (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 1  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(cleanse_listings),
              ex_name='cleanse_listings',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=200)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_listings did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(cleanse_listings),
              ex_name='cleanse_listings',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=200,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_listings did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_find_superusers_TRUE = utils.load_object_from_publicdata('demo_result_find_superusers_TRUE')

Exercise 2: (1 point)

find_superusers

Your task: define find_superusers as follows:

Some Airbnb users have left more reviews than others, and their testimony may be more helpful. Google Maps, for example, calls such superusers 'Local Guides' and gives some deference to their reviews. We'd like to implement a similar approach.

Input:

  • reviews: A Pandas DataFrame containing reviews of Airbnbs, identified by listing_id.

Return: superusers: A Pandas DataFrame containing information about how many reviews each Airbnb user has left, and when their most recent review was posted. superusers should have the following columns:

  • reviewer_id: integer uniquely identifying a reviewer
  • reviewer_name: string containing the reviewer's name
  • user_review_count: integer representing the total number of reviews submitted by a reviewer
  • most_recent: datetime representing the most recent review submitted by a reviewer

Requirements:

  1. Drop rows with null values in any column
  2. Group and aggregate the data:
    1. Group reviews by reviewer_id
    2. Aggregate as follows:
      • Retain the first value of reviewer_name column
      • Store the count of the id column for a given user in a column named user_review_count
      • Store the most recent (maximum) value of date for a reviewer in a column named most_recent
  3. Sort by user_review_count in descending order and reviewer_id in ascending order
  4. Reset the index to ensure it starts from 0
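
If the dict-style agg-plus-rename feels clunky, pandas' named aggregation is an equivalent idiom. The toy sketch below (hypothetical data, mirroring the column names in reviews) performs step 2's grouping and aggregation in one call:

    import pandas as pd

    toy = pd.DataFrame({'reviewer_id': [1, 1, 2],
                        'reviewer_name': ['Ann', 'Ann', 'Bob'],
                        'id': [10, 11, 12],
                        'date': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-02-01'])})
    out = (toy.groupby('reviewer_id', as_index=False)
              .agg(reviewer_name=('reviewer_name', 'first'),
                   user_review_count=('id', 'count'),
                   most_recent=('date', 'max')))
    print(out)  # one row per reviewer, already using the required column names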
In [ ]:
### Solution - Exercise 2  
def find_superusers(reviews: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    

    r2 = reviews.copy()

    
    #1. 
    r2.dropna(inplace = True)
    
    #2. 
    user_review_counts = r2[['reviewer_id', 'reviewer_name','id','date']].groupby('reviewer_id', as_index = False                                                                          
        ).agg({'reviewer_name':'first', 'id':'count','date':'max'}
        ).rename(columns = {'id':'user_review_count', 'date':'most_recent'}
        #3. 
        ).sort_values(by = ['user_review_count','reviewer_id'], ascending = [False, True]
        ).reset_index(drop=True)

    return user_review_counts
    ### END SOLUTION

### Demo function call
superusers_demo = utils.load_object_from_publicdata('superusers_demo.dill')
demo_result_superusers_df = find_superusers(superusers_demo)
demo_result_superusers_df

The demo should display this output.

reviewer_id reviewer_name user_review_count most_recent
0 62195527 Don 3 2024-03-29
1 13775813 Julie 2 2024-07-26
2 440450367 Griffin 2 2024-04-17
3 286210 Carl 1 2024-08-19
4 3148220 Jill 1 2024-04-20
5 4954534 Jeremie 1 2024-07-20
6 80239899 Alejandro 1 2024-04-13
7 121588501 Trevor 1 2024-05-20
8 141406268 Sean 1 2024-05-01
9 156557523 Donnie 1 2024-05-11


The cell below will test your solution for find_superusers (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 2  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_superusers),
              ex_name='find_superusers',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=25)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_superusers did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_superusers),
              ex_name='find_superusers',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=25,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_superusers did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_haversine_TRUE = utils.load_object_from_publicdata('demo_result_haversine_TRUE')

Exercise 3: (2 points)

haversine

Your task: define haversine as follows:

Airbnb locations are only identified via latitude and longitude. For travel purposes, that's not very helpful, so we'll need to determine how far away they are manually. Assuming a target location in LA that we'd like to stay at, we need to be able to calculate how far away each Airbnb is in miles.

To do this, we need to use the Haversine Distance to measure the "as the crow flies" distance between two lat/long points.

The Haversine formula is as follows:

$$hav(\theta) = hav(\Delta\varphi) + \cos(\varphi_1)\cos(\varphi_2)hav(\Delta\lambda)$$

where

  • $hav( \theta) = \sin^2( \theta/2)$
  • $\Delta \varphi = \varphi_2 - \varphi_1$
  • $\Delta\lambda = \lambda_2 - \lambda_1$
  • ($\varphi_1$, $\lambda_1$) is the latitude and longitude of point 1
  • ($\varphi_2$, $\lambda_2$) is the latitude and longitude of point 2
  • $\theta$ is the central angle between two points.

We have all the necessary information to determine the value of $hav(\theta)$. After solving the right-hand side of the equation, we can use the value of $hav(\theta)$ to solve for $\theta$:

  • $x = hav(\theta) = \sin^2(\theta/2)$

Solving algebraically for $\theta$,

$$\theta = 2\arcsin(\sqrt{x})$$

Now that we have the central angle $\theta$, we can find the distance $d$ using our computed value of $\theta$ and the radius of the Earth $r$, using the following relation:

$$d = r\theta$$

Note: Numpy's trigonometric functions expect angle measurements in radians, but the measurements in the dataframe are given in degrees. Before passing the values to the trigonometric functions, you should use np.radians() to convert them to radians.
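
As a quick check of the conversion (illustrative values only):

    import numpy as np

    print(np.radians(180.0))         # 3.141592653589793, i.e. pi
    print(np.sin(np.radians(90.0)))  # 1.0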

Input:

  • point_1: A list containing one latitude-longitude pair, ordered [lat, long]
  • point_2: A list containing a second latitude-longitude pair, ordered [lat, long]
  • r: Earth's radius, in miles (we're assuming here that the Earth is perfectly round, which is not exactly true)

Return: d: a float value representing the Haversine Distance in miles between point_1 and point_2

Requirements:

  1. Convert the longitudes and latitudes to radians using np.radians()
  2. Solve for $d$ using Numpy and the given formulas:

    $$hav(\theta) = hav(\Delta\varphi) + \cos(\varphi_1)\cos(\varphi_2)hav(\Delta\lambda)$$

    $$\theta = 2\arcsin(\sqrt{hav(\theta)})$$

    $$d = r\theta$$

In [ ]:
### Solution - Exercise 3  
def haversine(point_1, point_2, r = 3959) -> float:
    ### BEGIN SOLUTION
    
    #1. 
    phi_1 = np.radians(point_1[0])
    phi_2 = np.radians(point_2[0])
    delta_phi = np.radians((point_2[0] - point_1[0]))
    delta_lambda = np.radians((point_2[1] - point_1[1]))

    #2. 
    x =  (np.sin(delta_phi/2)**2 + np.cos(phi_1)*np.cos(phi_2)*(np.sin(delta_lambda/2))**2)
    theta = 2 * np.arcsin(np.sqrt(x))
    
    
    ##multiply by radius of earth to get difference
    return theta*r

    ### END SOLUTION

### Demo function call
point1 = [38.898, -77.037]
point2 = [48.858, -2.294]
result = haversine(point1,point2)
print(result)

The demo should display this printed output.

3633.265427951256


The cell below will test your solution for haversine (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 3  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=haversine,
              ex_name='haversine',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to haversine did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=haversine,
              ex_name='haversine',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=20,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to haversine did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_filter_stay_TRUE = utils.load_object_from_publicdata('demo_result_filter_stay_TRUE')

Exercise 4: (2 points)

filter_stay

Your task: define filter_stay as follows:

Now that we have a method of calculating distance, we want to filter the listings to just ones that are within range of our desired location, and accommodate enough people.

In the event that you did not solve Exercise 3, you may still complete this exercise independently. Instead of the Haversine distance, this exercise will test you using a provided helper function, which calculates the Euclidean distance. It converts the polar latitude/longitude pairs to Cartesian coordinates and then computes the linear distance between them. This will serve as a rough estimate.

Input:

  • criteria - A dictionary with the following keys:
    • destination: A list containing one latitude-longitude pair, ordered [lat, long], representing our desired stay location
    • guests: An integer representing how many people we will be travelling with. An Airbnb is acceptable if accommodates $\ge$ guests
    • dist_range: A float value representing the maximum distance of an Airbnb from the desired stay location. An Airbnb is acceptable if distance $\le$ dist_range
  • clean_listings: The result of Exercise 1 - A Pandas DataFrame containing Airbnb listing information with corrected license information
  • euclid: A function which can be called to calculate distance using longitude and latitude

Return: options: a Pandas DataFrame of filtered listings which could be suitable for our vacation

Requirements:

  1. Filter clean_listings to rows where accommodates $\ge$ guests
  2. Compute the distance between each Airbnb and the desired stay destination using the provided helper function and the latitude and longitude columns in clean_listings
    • Store this column as distance
  3. Filter clean_listings so that distance $\le$ dist_range
  4. Sort results by distance in ascending order, then by listing_id in ascending order
  5. As with Exercise 2, make sure you reset your index to ensure it starts at 0

Note: It's very possible that your criteria will return no viable options! This is okay - if you have 10 guests and you want to stay within one mile of your destination, it's reasonable to assume you may not find anything suitable (and may need to relax your criteria).

  • In this case, just return an empty dataframe.
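
For concreteness, a criteria dictionary might look like the sketch below. The specific values are hypothetical, not taken from the test cases:

    criteria = {
        'destination': [34.0522, -118.2437],  # hypothetical [lat, long] near downtown LA
        'guests': 4,                          # must accommodate at least 4 people
        'dist_range': 5.0,                    # stay within 5 miles of the destination
    }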
In [ ]:
### Helper Function
def euclid(point_1, point_2, r = 3959):
    
    ##convert to radians
    lat1 = point_1[0]*np.pi/180
    long1 = point_1[1]*np.pi/180
    lat2 = point_2[0]*np.pi/180
    long2 = point_2[1]*np.pi/180
    
    ##convert from polar to cartesian coordinates
    x1 = r * np.cos(lat1) * np.sin(long1)
    y1 = r * np.sin(lat1)
    z1 = r * np.cos(lat1) * np.cos(long1)


    x2 = r * np.cos(lat2) * np.sin(long2)
    y2 = r * np.sin(lat2)
    z2 = r * np.cos(lat2) * np.cos(long2)

    ##calculate distance
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2 + (z2 - z1)**2)

### Solution - Exercise 4  
def filter_stay(criteria: dict, clean_listings: pd.DataFrame, euclid) -> pd.DataFrame:
    ### BEGIN SOLUTION
    
    #1. 
    df  = clean_listings.copy()
    df=df[df['accommodates'] >= criteria['guests']]

    #2.
    def get_distance(x):
        return euclid(criteria['destination'], [x['latitude'],x['longitude']])
    df['distance'] = df[['latitude', 'longitude']].apply(get_distance, axis = 1)
    
    #3. 
    options = df[df['distance'] <= criteria['dist_range']]
    
    #4. 
    return options.sort_values(by = ['distance', 'listing_id'], ignore_index =True)
    ### END SOLUTION

### Demo function call
demo_options, demo_criteria = utils.load_object_from_publicdata('filter_stay_demo.dill')
demo_result_options_df = filter_stay(demo_criteria, demo_options, euclid)
demo_result_options_df

The demo should display this output.

listing_id host_id host_since latitude longitude room_type accommodates license distance
0 1099417246276636063 245392128 2019-02-25 34.091298 -118.351877 Entire home/apt 5 exempt 6.813108
1 908697518167341476 409985549 2021-06-29 34.088171 -118.365521 Entire home/apt 8 licensed 7.461785
2 51254530 61324086 2016-03-02 34.092360 -118.363940 Private room 2 exempt 7.478845


The cell below will test your solution for filter_stay (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 4  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(filter_stay),
              ex_name='filter_stay',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to filter_stay did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(filter_stay),
              ex_name='filter_stay',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to filter_stay did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_find_availability_TRUE = utils.load_object_from_publicdata('demo_result_find_availability_TRUE')

Exercise 5: (3 points)

find_availability

Your task: define find_availability as follows:

Now that we have filtered clean_listings down to some viable options, we need to determine whether they're actually available on the dates we want to vacation.

Input:

  • stay_length: An integer representing the total number of nights you intend to stay.
  • first_night: A datetime object representing the first night of your stay.
  • budget: An integer representing the maximum amount of money you'd like to spend per night on lodging.
  • options: The result of Exercise 4, a Pandas DataFrame of filtered listings which could be suitable for our vacation
  • calendar A DataFrame containing information about listing availability and pricing for 2024-2025 with the following relevant columns:
    • listing_id: integer that uniquely identifies a particular Airbnb
    • date: a datetime that represents a particular calendar day in 2024-2025
    • available: a boolean that represents whether a listing is available for rent on a particular date
    • price: a float representing the price of a listing on a particular date

Return: available: A list of listing_ids that suit our criteria and are available during our dates.

Requirements:

  1. Merge calendar and options on listing_id using an inner join
  2. Filter the resulting DataFrame to listings that have minimum_nights $\le$ stay_length $\le$ maximum_nights
    • Step 1 and 2 are provided in the starter code.
  3. Filter to dates of interest - assuming that first_night counts as the first day of your stay length, filter the DataFrame to only include the dates of your stay.
    • For example, if stay_length == 3, and the first night of your stay is 09-01-2025, your dataframe should contain the following dates:
      • 09-01-2025
      • 09-02-2025
      • 09-03-2025
    • (This corresponds to three nights: Check in on 09-01, stay that night, the night of 09-02, and the night of 09-03. You'll check out on the morning of 09-04.)
    • Hint: timedelta() might be helpful (see the sketch after this list).
  4. Check whether listings are available for every date within your stay.
  5. We want a sorted list of listing_ids which meet these criteria. For each listing_id:
    • available = True for all nights of the stay
    • The mean of the price column $\le$ budget
    • Every night of the stay is represented in the calendar
      • In the above example, a listing would need to have calendar data for 09-01, 09-02, and 09-03 to be viable. If a listing is available and in-budget for 09-01 and 09-03, but no data is available for 09-02, the listing should not be considered.
  6. These listing_ids should be sorted in ascending order
    • If there are no available listings for your desired dates, your function should return an empty list
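
Here is a minimal sketch of building the date window with timedelta, using the hypothetical example values from requirement 3:

    from datetime import datetime, timedelta

    first_night = datetime(2025, 9, 1)  # hypothetical first night
    stay_length = 3
    stay_dates = [first_night + timedelta(days=i) for i in range(stay_length)]
    print(stay_dates)  # the nights of 09-01, 09-02, and 09-03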
In [ ]:
### Solution - Exercise 5  
def find_availability( stay_length: int, first_night: datetime, budget:int, options: pd.DataFrame, calendar: pd.DataFrame) -> list:
    from datetime import timedelta
    
    #1. 
    df = calendar.merge(options['listing_id'], on = 'listing_id')
    
    #2.
    df = df[df['minimum_nights']<= stay_length]
    df = df[df['maximum_nights']>= stay_length]
    
    
    ### BEGIN SOLUTION
    
    #3. 
    df = df.loc[df['date']< first_night + timedelta(stay_length)]
    df = df.loc[df['date']>=first_night]

    #4. check price and availability
    def filter_avail(x):
        if len(x) != stay_length:
            return False
        else:
            avail = x['available'].all()
            in_budget = x['price'].mean() <= budget
            return avail & in_budget
        
    available = df.groupby(by = 'listing_id', group_keys = True).apply(filter_avail)
    
    #5.
    avail_list = available[available == True].index.to_list()

    return sorted(avail_list)
    ### END SOLUTION

### Demo function call
demo_options, demo_stay, demo_first, demo_budget = utils.load_object_from_publicdata('find_availability_demo.dill')
demo_result_avail_df = find_availability(demo_stay, demo_first, demo_budget, demo_options, calendar)
print(demo_result_avail_df)
demo_result_avail_df

The demo should display this printed output.

[588426797893466377, 623184032121344750, 786386342683651712]


The cell below will test your solution for find_availability (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 5  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_availability),
              ex_name='find_availability',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_availability did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_availability),
              ex_name='find_availability',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_availability did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_get_sentiment_TRUE = utils.load_object_from_publicdata('demo_result_get_sentiment_TRUE')
In [ ]:
superusers = utils.load_object_from_publicdata('superusers.dill')

Aside: NLTK's SentimentIntensityAnalyzer()

The next exercise makes use of nltk's SentimentIntensityAnalyzer(). Given a string representing a short blurb of text, the model returns a dictionary representing the sentiment of the text. There are scores representing 'positive', 'neutral', and 'negative' sentiment, as well as a more complex 'compound' score. This is what we'll be using to score our Airbnb reviews.

Notably, the model is only trained on English text, and therefore treats any non-English text as neutral.

The example below demonstrates how to instantiate and use the model, and displays results for the same sentence in both English and Spanish. Note that the Spanish sentence has a compound score of 0.0.

In [ ]:
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores('CSE6040 is my favorite class. I love Python and I love data analysis.'))
print(sia.polarity_scores('CSE6040 es mi asignatura favorita. Me encanta Python y el análisis de datos.'))
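
The returned dictionary has the keys 'neg', 'neu', 'pos', and 'compound'. Pulling out just the compound score (as the next exercise will require) is a single lookup; the review text below is made up for illustration:

    sia = SentimentIntensityAnalyzer()
    compound = sia.polarity_scores('Great stay, would book again!')['compound']
    print(compound)  # a float in [-1, 1]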

Exercise 6: (2 points)

get_sentiment

Your task: define get_sentiment as follows:

We now have a list of suitable Airbnbs that are available on our desired list of dates. In order to choose between them, it will be useful to consult the reviews. This dataset does not contain quantitative review data (e.g. 4 out of 5 stars) for individual reviews. It does, however, have text for each review, which we can analyze for overall sentiment.

Input:

  • reviews - A DataFrame containing reviews of Airbnbs, linked to each location by listing_id.
  • available_listings - The result of Exercise 5, a list containing listings that match all our distance, date, and budget criteria

  • superusers- The result of Exercise 2, a Pandas Dataframe consisting of reviewer IDs, the total number of reviews for each user, and their most recent review

Return: comments: A DataFrame consisting of reviews with sentiment analysis added, merged with superuser information

Requirements:

  1. Filter reviews to include only listing_ids that appear in available_listings
  2. Remove rows with NaN values in the comments column
  3. Merge this DataFrame with superusers on reviewer_id using an inner join
    • Keep all columns from reviews and the user_review_count column from superusers
  4. Use nltk's SentimentIntensityAnalyzer() to assign a score to each review's comments using polarity_scores()
    • Store the compound sentiment value for each review in a new column labelled sentiment
  5. Drop any rows with sentiment == 0.0 (we're assuming that reviews with 'compound' scores of 0.0 are either not useful or in a language that the model cannot parse).
In [ ]:
### Solution - Exercise 6  
def get_sentiment(reviews: pd.DataFrame, available_listings: list, superusers: pd.DataFrame) -> pd.DataFrame:
    
    sia = SentimentIntensityAnalyzer()
    ### BEGIN SOLUTION

    #1.
    comments = reviews.loc[reviews['listing_id'].isin(available_listings)]

    #2.
    comments = comments.loc[~comments['comments'].isna()]
    
    #3.
    comments = comments.merge(superusers[['user_review_count', 'reviewer_id']], on = 'reviewer_id')

    #4.
    comments['sentiment'] = comments['comments'].apply(lambda x: sia.polarity_scores(x)['compound'])

    #5. 
    comments = comments.loc[comments['sentiment']!=0]
    
    return comments
    ### END SOLUTION

### Demo function call
demo_avail = [44542527, 42572989]
demo_sentiment = get_sentiment(reviews, demo_avail, superusers)
demo_sentiment

The demo should display this output.

listing_id id date reviewer_id reviewer_name comments user_review_count sentiment
0 42572989 621367488 2020-03-30 276894729 Sara You will be more then happy with this place! Feels like home! 2 0.7942
1 44542527 881251541600796134 2023-04-30 92951767 Steve Scott’s apartment is in the perfect location in Silverlake. Very close to all the local hotspots on Sunset Blvd, and uniquely sits very close to sunset Junction, meaning it’s also walkable to all the best in Los Feliz and Virgil Village, too. We enjoyed a different neighborhood walk every morning we were here. the views are incredible and reach to the hills of Hollywood (you can see the iconic sign right from the living room), Griffith Park, and enjoy sunsets west over the Hollywood hills and beyond. The neighborhood is private and quaint, and the apartment is tucked far from the street making it quiet, private, and very peaceful. The neighbors in the building are wonderful and very friendly. And the walls are thick - we didn’t hear a thing from any other apartment the entire 3 months we lived here. the apartment is modern and clean. Highly recommend. 2 0.9844
2 44542527 816836676950366108 2023-01-31 11611847 Debbie Had a wonderful stay at Scott’s home. His place is large, with good light, and a great location, conveniently located next to restaurants, coffee shops, and bars. Our host Joe was responsive, friendly, and accommodating. 1 0.9468
3 44542527 458673161248982460 2021-09-24 676224 Alexis Great location, clean and modern, very responsive host. Centrally located to great shops and restaurants, a great base for exploring the East Side and LA 2 0.9571
4 44542527 745649760 2021-04-04 15466686 Sofia Scott’s place is sparkling clean with awesome mid century style throughout the apartment. The location is amazing (right near Erewhon which is the heaven of super markets). The apartment has large sunny windows with an amazing view, we loved the hummingbirds that visited our window every morning. Everything is brand new and perfect for stay at home working with all that you need. The spa like shower also so lovely after a long day. Scott is a very responsive and accommodating host, Thank you, Scott! 1 0.9932


The cell below will test your solution for get_sentiment (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 6  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(get_sentiment),
              ex_name='get_sentiment',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to get_sentiment did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(get_sentiment),
              ex_name='get_sentiment',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to get_sentiment did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_rank_top_listings_TRUE = utils.load_object_from_publicdata('demo_result_rank_top_listings_TRUE')

Exercise 7: (3 points)

rank_top_listings

Your task: define rank_top_listings as follows:

Now that the relevant comments have been associated with sentiment scores, we can use them to find our best options. The simplest thing to do would be to sort by sentiment and call it a day, but some reviews are more informative than others. For example, recent reviews are usually much more insightful and pertinent to an upcoming stay, and we may be more inclined to trust frequent reviewers or superusers (this is why Google has the Local Guide feature, for example). Let's implement a weighted average based on this information, so that we can find our best options.

As a reminder, a weighted average is computed as follows:

$$\bar{x} = \frac{\sum_{i=1}^{n}{w_ix_i}}{\sum_{i=1}^{n}{w_i}}$$

which expands to

$$\bar{x} = \frac{w_1x_1+w_2x_2+... + w_nx_n}{w_1+w_2+...+w_n}$$

Where $w_i$ is the weight assigned to point $x_i$, and $n$ is the total number of points.
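
As a quick numeric check of the formula (toy values), note that NumPy's np.average computes exactly this quantity:

    import numpy as np

    x = np.array([0.9, 0.2, 0.5])    # toy sentiment scores
    w = np.array([1.0, 0.5, 0.8])    # toy weights
    print((w * x).sum() / w.sum())   # manual weighted average
    print(np.average(x, weights=w))  # identical result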

Input:

  • comments - The result of Exercise 6; A DataFrame consisting of reviews with sentiment analysis added, merged with superuser information

Return: top_listings - A Pandas dataframe consisting of the top listing_ids and their weighted scores

Requirements:

  1. Compute the weights column:
    • If a review has date $\ge$:
      • 2024-01-01: assign a weight of 1
      • 2021-01-01: assign a weight of 0.8
      • 2018-01-01: assign a weight of 0.5
    • For dates before 2018-01-01, assign a weight of 0.1
    • To these weights, add 0.01 for every user_review_count greater than 1 (e.g. 0.01*(user_review_count-1))
    • For example, if a review was left on 2023-02-02 and the reviewer's user_review_count is 2, the weight assigned to that review should be 0.81
    • Hint: You may find strptime() useful.
  2. Group by listing_id and aggregate sentiments using the weighted average formula above. Use the weights column you computed as $w_i$ and sentiment as $x_i$. Store the resulting column as weighted_avg.
  3. Sort by weighted_avg in descending order and listing_id in ascending order
In [ ]:
### Solution - Exercise 7  
def rank_top_listings(comments: pd.DataFrame):
    ### BEGIN SOLUTION
    
    comments_copy = comments.copy()
    
    #1. 
    def apply_weights(x):
        if x['date'] >= datetime.strptime('2024-01-01', '%Y-%m-%d'):
            weight = 1
        elif datetime.strptime('2024-01-01', '%Y-%m-%d') > x['date'] >= datetime.strptime('2021-01-01', '%Y-%m-%d'):
            weight = .8
        elif datetime.strptime('2021-01-01', '%Y-%m-%d') > x['date'] >= datetime.strptime('2018-01-01', '%Y-%m-%d'):
            weight = .5
        else:
            weight = .1
        ##rate superuser reviews slightly more heavily - add .01 weight for every review > 1
        weight += .01*(x['user_review_count']-1)
        return weight

    
    
    comments_copy['weights'] = comments_copy[['date','user_review_count']].apply(apply_weights, axis = 1)
    
    
    #2. 
    def compute_avg(x):
        return (x['weights']*x['sentiment']).sum()/x['weights'].sum()

    weighted = comments_copy.groupby('listing_id', as_index = False).apply(compute_avg).rename(columns = {None:'weighted_avg'})

    #3. 
    top_listings = weighted.sort_values(by = ['weighted_avg', 'listing_id'], ascending = [False,True], ignore_index = True)

    return top_listings
    ### END SOLUTION

### Demo function call
demo_comments = utils.load_object_from_publicdata('rank_top_listings_demo.dill')
demo_top_listings = rank_top_listings(demo_comments)
demo_top_listings

The demo should display this output.

listing_id weighted_avg
0 899909 0.836561
1 828540 0.795065


The cell below will test your solution for rank_top_listings (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 7  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rank_top_listings),
              ex_name='rank_top_listings',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_top_listings did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rank_top_listings),
              ex_name='rank_top_listings',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_top_listings did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Part 2: Evaluating LA Crime Data

Now that we have found the top Airbnb listings available for our desired dates, let's enhance our search by evaluating how safe each neighborhood in LA is. We will use crime data provided by the City of Los Angeles.

Our database includes two tables: crime and population

The crime table contains a record of each crime committed, including information such as the date, lat/long, area, and crime description.

The population table contains only three columns: the area, its population, and its size in square miles, for each of the 21 areas in LA.

Run the queries below to view samples of the data.

In [ ]:
# Sample of crime table:
sample_crime_data = pd.read_sql('select * from crime limit 3', conn)
display(sample_crime_data)
In [ ]:
# Sample of population table:
sample_population_data = pd.read_sql('select * from population limit 3', conn)
display(sample_population_data)
In [ ]:
### Run Me!!!
demo_result_prime_crime_time_TRUE = utils.load_object_from_publicdata('demo_result_prime_crime_time_TRUE')

Exercise 8: (3 points)

prime_crime_time

Your task: define prime_crime_time_query as follows:

The crime table contains entries for each crime committed in LA. Every crime record contains a crime description (crime_desc), as well as the date on which the crime occurred (date_occ). Let's first determine the top crimes committed each month.

Input: None

Return: A SQLite query that returns the top two crimes for each month.

Requirements:

  • Your solution must contain a SQLite query
  • Your solution must contain the following columns: month, description, crime_count, and crime_rank
  • Your result should extract the month from the date_occ column and create a string containing the two-digit month value along with the three-letter month abbreviation, separated by a dash. For example, January would be 01-JAN, September would be 09-SEP, December would be 12-DEC, etc.
  • Only return the top 2 crimes per month, along with their count and monthly ranking (either 1 or 2). In the event of ties, return all tied values
  • Column order and row order do not matter
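
Regarding the tie requirement: SQLite's rank() window function (available in SQLite 3.25 and later) keeps ties at the same rank, unlike row_number(). The throwaway in-memory check below (hypothetical values) shows the behavior:

    import sqlite3

    tie_check = sqlite3.connect(':memory:')
    tie_check.executescript("create table t(v); insert into t values (5),(5),(3);")
    print(tie_check.execute("select v, rank() over (order by v desc) from t").fetchall())
    # expected: [(5, 1), (5, 1), (3, 3)] -- both 5s share rank 1, and rank 2 is skipped
    tie_check.close()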
In [ ]:
### Solution - Exercise 8  
prime_crime_time_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
prime_crime_time_query = '''
with count_by_month as (
    select
        CASE substr(date_occ, 1, 2)
          WHEN '01' THEN '01-JAN'
          WHEN '02' THEN '02-FEB'
          WHEN '03' THEN '03-MAR'
          WHEN '04' THEN '04-APR'
          WHEN '05' THEN '05-MAY'
          WHEN '06' THEN '06-JUN'
          WHEN '07' THEN '07-JUL'
          WHEN '08' THEN '08-AUG'
          WHEN '09' THEN '09-SEP'
          WHEN '10' THEN '10-OCT'
          WHEN '11' THEN '11-NOV'
          WHEN '12' THEN '12-DEC'
          ELSE NULL
        END AS month,
        crime_desc description,
        count(crime_desc) crime_count
    from
        crime
    group by 1, 2
)
select * from (
select
    month,
    description,
    crime_count,
    rank() OVER (partition by month order by crime_count desc) AS crime_rank
from
    count_by_month
)
where crime_rank <= 2
'''
### END SOLUTION


### Demo function call
demo_prime_crime_time_result = pd.read_sql(prime_crime_time_query,conn)
demo_prime_crime_time_result

The demo should display this output.

month description crime_count crime_rank
0 01-JAN VEHICLE - STOLEN 4097 1
1 01-JAN BATTERY - SIMPLE ASSAULT 2891 2
2 02-FEB VEHICLE - STOLEN 3666 1
3 02-FEB THEFT OF IDENTITY 2954 2
4 03-MAR VEHICLE - STOLEN 3942 1
5 03-MAR BATTERY - SIMPLE ASSAULT 2695 2
6 04-APR VEHICLE - STOLEN 3611 1
7 04-APR BATTERY - SIMPLE ASSAULT 2417 2
8 05-MAY VEHICLE - STOLEN 3847 1
9 05-MAY BATTERY - SIMPLE ASSAULT 1973 2
10 06-JUN VEHICLE - STOLEN 3795 1
11 06-JUN BURGLARY FROM VEHICLE 1694 2
12 07-JUL VEHICLE - STOLEN 4171 1
13 07-JUL VANDALISM - FELONY (400 and OVER, ALL CHURCH VANDALISMS) 1777 2
14 08-AUG VEHICLE - STOLEN 4074 1
15 08-AUG BURGLARY FROM VEHICLE 1806 2
16 09-SEP VEHICLE - STOLEN 3904 1
17 09-SEP BURGLARY FROM VEHICLE 1825 2
18 10-OCT VEHICLE - STOLEN 4109 1
19 10-OCT BURGLARY FROM VEHICLE 2021 2
20 11-NOV VEHICLE - STOLEN 3867 1
21 11-NOV BURGLARY FROM VEHICLE 2039 2
22 12-DEC VEHICLE - STOLEN 2898 1
23 12-DEC BURGLARY FROM VEHICLE 1918 2


The cell below will test your solution for prime_crime_time (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 8  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(prime_crime_time_query),
              ex_name='prime_crime_time',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to prime_crime_time did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(prime_crime_time_query),
              ex_name='prime_crime_time',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to prime_crime_time did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_crime_center_TRUE = utils.load_object_from_publicdata('demo_result_crime_center_TRUE')

Exercise 9: (1 point)

crime_center

Your task: define crime_center_query as follows:

The crime table contains entries categorized into the areas served by each police department within LA. Each crime entry contains the coordinates at which the crime occurred. Let us now determine the crime center, i.e., the average crime location, for each police department's area.

Input: None

Return: A SQLite query that returns the average coordinates for each area.

Requirements:

  • Your solution must contain a SQLite query
  • The columns lat and long in the crime table contain the coordinates at which the crime occurred
  • Return the average lat and long for each area, and name these columns center_lat and center_long (a pandas sketch of this aggregation appears after this list)
  • Your solution must contain the following columns: area, center_lat, and center_long
  • Order your results in ascending alphabetical order by area
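
For intuition, the same aggregation in pandas (a minimal sketch, assuming you pull a data frame with area, lat, and long columns from the database; your submitted answer must still be a SQLite query) would look like:

    # Average the crime coordinates within each area, then sort by area name.
    crime_df = pd.read_sql('select area, lat, long from crime', conn)
    centers = (crime_df.groupby('area')[['lat', 'long']]
                       .mean()
                       .rename(columns={'lat': 'center_lat', 'long': 'center_long'})
                       .reset_index()
                       .sort_values('area'))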
In [ ]:
### Solution - Exercise 9  
crime_center_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
crime_center_query = '''
select
    area,
    avg(lat) as center_lat,   -- mean latitude of the area's crimes
    avg(long) as center_long  -- mean longitude of the area's crimes
from crime
group by area
order by 1                    -- ascending alphabetical order by area
'''
### END SOLUTION


### Demo function call
demo_crime_center_result = pd.read_sql(crime_center_query,conn)
demo_crime_center_result

The demo should display this output.

    area         center_lat  center_long
0   77th Street   33.977699  -118.297673
1   Central       34.048074  -118.250774
2   Devonshire    34.250923  -118.537942
3   Foothill      34.245504  -118.370130
4   Harbor        33.770143  -118.285061
5   Hollenbeck    34.053705  -118.203808
6   Hollywood     34.099204  -118.328856
7   Mission       34.254761  -118.450299
8   N Hollywood   34.172124  -118.384572
9   Newton        34.010694  -118.260135
10  Northeast     34.104364  -118.238708
11  Olympic       34.060322  -118.300204
12  Pacific       33.981253  -118.419988
13  Rampart       34.062123  -118.267145
14  Southeast     33.939051  -118.266904
15  Southwest     34.020676  -118.315023
16  Topanga       34.191668  -118.601957
17  Van Nuys      34.176803  -118.439639
18  West LA       34.051339  -118.439908
19  West Valley   34.186766  -118.519701
20  Wilshire      34.062164  -118.351886
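
As a quick sanity check (a sketch using the demo result above), every center should land inside a rough bounding box for LA, which the columns themselves confirm:

    # All centers should sit within LA's approximate coordinate range.
    df = demo_crime_center_result
    assert df['center_lat'].between(33.5, 34.5).all()
    assert df['center_long'].between(-119.0, -118.0).all()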


The cell below will test your solution for crime_center (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 9  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(crime_center_query),
              ex_name='crime_center',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to crime_center did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(crime_center_query),
              ex_name='crime_center',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to crime_center did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_relative_crime_rate_TRUE = utils.load_object_from_publicdata('demo_result_relative_crime_rate_TRUE')

Exercise 10: (2 points)

relative_crime_rate

Your task: define relative_crime_rate_query as follows:

The crime table contains entries categorized into the areas served by each police department within LA. The population table contains the population and area in square miles for each of LA's police district areas.

We would now like to determine how dangerous each area is relative to the others. We can do this by computing:

crimes_per_100k: the number of crimes committed in a given area, expressed as a rate per 100,000 residents. Multiply the area's crime count by the floating-point number 100,000.0, then divide by the area's population.

The formula is: crimes_per_100k = 100000.0 * number of crimes / population for each area

crimes_per_sq_mile: the number of crimes committed in a given area divided by the area's size in square miles.

The formula is: crimes_per_sq_mile = number of crimes / square miles for each area
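
As a quick check on the arithmetic (hypothetical numbers, purely to illustrate the two formulas):

    # A made-up area with 5,000 crimes, 250,000 residents, and 20 square miles.
    n_crimes, population, sq_miles = 5000, 250000, 20
    crimes_per_100k = 100000.0 * n_crimes / population  # -> 2000.0
    crimes_per_sq_mile = n_crimes / sq_miles            # -> 250.0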

Input: None

Return: A SQLite query that returns the relative crime rate for each area.

Requirements:

  • Your solution must contain a SQLite query
  • Your query should return the following columns: area, crimes_per_100k, and crimes_per_sq_mile
  • You should round your results to the nearest whole number. (Note that SQLite's round() returns a floating-point value, which is why the demo output below shows values such as 67945.0.)
  • You should order your results by crimes_per_100k in descending order, crimes_per_sq_mile in descending order, and finally by area in ascending alphabetical order
In [ ]:
### Solution - Exercise 10  
relative_crime_rate_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
relative_crime_rate_query = '''
with counts as (
    -- Number of crimes recorded in each area.
    select area, count(case_num) as n_crimes
    from crime
    group by area
)
select
    c.area,
    -- The 100000.0 literal forces floating-point arithmetic before rounding.
    round(c.n_crimes * 100000.0 / p.population) as crimes_per_100k,
    round(c.n_crimes / p.sq_miles) as crimes_per_sq_mile
from counts c
inner join population p on c.area = p.area
order by 2 desc, 3 desc, 1
'''
### END SOLUTION


### Demo function call
demo_relative_crime_rate_result = pd.read_sql(relative_crime_rate_query,conn)
demo_relative_crime_rate_result

The demo should display this output.

    area         crimes_per_100k  crimes_per_sq_mile
0   Central              67945.0              6040.0
1   Southwest            12981.0              1634.0
2   77th Street          11845.0              1742.0
3   Newton               11546.0              1924.0
4   Southeast            11265.0              1657.0
5   Pacific              10977.0               853.0
6   Rampart              10384.0              3092.0
7   Olympic               8763.0              2827.0
8   N Hollywood           8548.0               752.0
9   Harbor                8446.0               535.0
10  West Valley           7732.0               454.0
11  Topanga               7459.0               821.0
12  Wilshire              7178.0              1290.0
13  Devonshire            7144.0               324.0
14  West LA               7017.0               246.0
15  Hollenbeck            6362.0               837.0
16  Mission               6350.0               571.0
17  Foothill              6096.0               241.0
18  Northeast             5986.0               516.0
19  Hollywood             5867.0              1023.0
20  Van Nuys              4831.0               523.0


The cell below will test your solution for relative_crime_rate (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 10  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(relative_crime_rate_query),
              ex_name='relative_crime_rate',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to relative_crime_rate did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(relative_crime_rate_query),
              ex_name='relative_crime_rate',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to relative_crime_rate did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Fin.

If you have made it this far, congratulations! Remember to submit the exam to ensure you receive all the points you have earned!