Fall 2025 MT2: A Better Airbnb

Version 1.0.2

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of data cleaning, filtering, and analysis in Pandas, SQL, and Numpy. It has 11 exercises, numbered 0 to 10, worth 21 points in total. However, to earn 100%, you only need 13 points. (Therefore, once you hit 13 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise explore them in Python, but we do not print them in the starter code.

Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head, or chunks of rows at a time).

Exercise point breakdown:

  • Exercise 0: 1 point

  • Exercise 1: 1 point

  • Exercise 2: 1 point

  • Exercise 3: 2 points

  • Exercise 4: 2 points

  • Exercise 5: 3 points

  • Exercise 6: 2 points

  • Exercise 7: 3 points

  • Exercise 8: 3 points

  • Exercise 9: 1 point

  • Exercise 10: 2 points

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure

The Problem: Planning a Vacation in Los Angeles, California

Background: Airbnb is an online marketplace that allows property owners to rent their space to prospective short- and long-term tenants. The app lets renters filter potential living spaces by destination, price, stay type, stay dates, and more. Users also have access to review data from previous guests in order to make informed decisions. Even so, the process of narrowing down a good place to stay can be overwhelming, especially if you are unfamiliar with the area.

Your overall task: Your goal is to use Airbnb data to help a prospective renter narrow down relevant stay locations, and then determine the best options based on reviews and area information. You will do this in two main steps:

  1. First, you will clean, filter, and rank the Airbnb locations using review and listing data.

  2. Second, you will analyze crime data for the various neighborhoods within LA to ensure your chosen location is safe.

The datasets: There are two sources of data which you will use to solve the following exercises.

  1. Information about Airbnb listings in Los Angeles, California from 2024-2025, sourced from InsideAirbnb.

  2. Crime data from 2020-2025 for Los Angeles, sourced from the City of Los Angeles.

The listing data (source 1) is provided as three separate Pandas data frames containing information about listings, reviews, and bookings (the calendar). The crime data (source 2) is contained in a SQLite database.

SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
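
As a quick sanity check of the dialect, the throwaway snippet below (an in-memory database with illustrative values, not part of the exam data) confirms two SQLite-specific string functions that other dialects spell differently:

    import sqlite3

    # SQLite provides substr() and printf(); other dialects may use
    # LEFT()/FORMAT() or similar instead.
    dialect_check = sqlite3.connect(':memory:')
    print(dialect_check.execute("select substr('09/15/2024', 1, 2), printf('%02d', 7)").fetchone())
    # expected: ('09', '07')
    dialect_check.close()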

In [ ]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
import re 
import pandas as pd
In [ ]:
### Run Me!!!
import pandas as pd
import numpy as np
from pprint import pprint
import sqlite3
import re
from datetime import datetime, timedelta, date
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer


listings=utils.load_object_from_publicdata('listings.dill')
reviews=utils.load_object_from_publicdata('reviews.dill')
calendar=utils.load_object_from_publicdata('calendar.dill')
conn=sqlite3.connect("resource/asnlib/publicdata/crimes_and_population.db")

Exercise 0: (1 point)

explore_airbnb_data__FREE

Example: we have defined explore_airbnb_data__FREE as follows:

This is a free exercise! Please run the test cell below to collect your FREE point! We encourage you to review the samples of both listings and reviews in the cell below:

  • listings A DataFrame containing descriptions, location, and host information for active Airbnbs in Los Angeles, California, as of September 2024.
  • reviews A DataFrame containing reviews of Airbnbs, linked to each location by listing ID.
  • calendar A DataFrame containing information about listing availability and pricing for 2024-2025.
In [ ]:
### Solution - Exercise 0  
def explore_airbnb_data__FREE(listings: pd.DataFrame, reviews: pd.DataFrame, calendar: pd.DataFrame, headsize:int=10) -> tuple:
    listings_preview = listings.head(n=headsize)
    reviews_preview = reviews.head(n=headsize)
    calendar_preview = calendar.head(n=headsize)
    return (listings_preview, reviews_preview, calendar_preview)

### Demo function call
listings_preview,reviews_preview,calendar_preview = explore_airbnb_data__FREE(listings, reviews, calendar)
display(listings.head(5))
display(reviews.head(5))
display(calendar.head(5))


The test cell below will always pass. Please submit to collect your free points for explore_airbnb_data__FREE (exercise 0).

In [ ]:
### Test Cell - Exercise 0  


print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_cleanse_listings_TRUE = utils.load_object_from_publicdata('demo_result_cleanse_listings_TRUE')

Exercise 1: (1 point)

cleanse_listings

Your task: define cleanse_listings as follows:

Input: license: A single value from the license column of listings. If it is not null, license will be a string. However, our function should be able to handle null values, which will be represented as np.nan.

Return: clean_license: A string containing either 'licensed', 'exempt', or 'unlicensed'

Requirements:

To begin, we need to clean up listings by simplifying the license column. We're not actually interested in what specific type of license each Airbnb has - just whether they are licensed, unlicensed, or exempt. To help with this, write a function that can clean a string given the following criteria:

  1. If the word 'exempt' is present anywhere in the string, return the string 'exempt'. This check is not case-sensitive.

  2. If the string appears to be a license, return the string 'licensed'. A correct license format is as follows:

    • The string begins with one or more letters and/or digits (not case-sensitive),
    • Followed by a dash,
    • Ending with one or more digits
    • Examples of passing licenses:
      • HSR19-000881
      • H-9
      • 0-9
      • hsr-0183
    • Examples of failing licenses:
      • ABCD123-4567890$
      • HSR-09R
      • License HSR-091
  3. If neither of these criteria is met, or if the value passed in is null, return 'unlicensed'. The provided starter code handles null (np.nan) inputs, but you will still need to account for strings which do not meet criteria 1 or 2.
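
As an illustration only (not the required solution), one way to encode the license format above is a regular expression anchored to the whole string; the sample strings are taken from the examples above:

    import re

    # one or more letters/digits, then a dash, then one or more digits
    license_pattern = re.compile(r'[A-Za-z0-9]+-[0-9]+')

    for s in ['HSR19-000881', 'hsr-0183', 'HSR-09R', 'License HSR-091']:
        print(s, '->', bool(license_pattern.fullmatch(s)))
    # HSR19-000881 -> True, hsr-0183 -> True,
    # HSR-09R -> False, License HSR-091 -> False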
In [ ]:
### Solution - Exercise 1  
def cleanse_listings(license: str) -> str:
    
    ##handle nan values
    if not isinstance(license, str):
        return 'unlicensed'

    ### BEGIN SOLUTION
  
    #1. 
    if 'exempt' in license.lower():
        return 'exempt'
    #2. 
    elif re.match(r'(^[a-zA-Z0-9]+-[0-9]+$)',license):
        return 'licensed'
    #3. 
    else:
        return 'unlicensed'
 

    ### END SOLUTION

### Demo function call

cleanse_listings_demo = utils.load_object_from_publicdata('cleanse_listings_demo.dill')

for i in cleanse_listings_demo:
    print(f'\n {i} -> {cleanse_listings(i)}')

The demo should display this printed output.

Exempt -> exempt

HSR19-000881 -> licensed

nan -> unlicensed

Exempt - This listing is a transient occupancy residential structure -> exempt

43158 -> unlicensed


The cell below will test your solution for cleanse_listings (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 1  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(cleanse_listings),
              ex_name='cleanse_listings',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=200)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_listings did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(cleanse_listings),
              ex_name='cleanse_listings',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=200,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_listings did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_find_superusers_TRUE = utils.load_object_from_publicdata('demo_result_find_superusers_TRUE')

Exercise 2: (1 point)

find_superusers

Your task: define find_superusers as follows:

Some Airbnb users have left more reviews than others, and their testimony may be more helpful. Google Maps, for example, calls such superusers 'Local Guides' and gives some deference to their reviews. We'd like to implement a similar approach.

Input:

  • reviews: A Pandas DataFrame containing reviews of Airbnbs, identified by listing_id.

Return: superusers: A Pandas DataFrame containing information about how many reviews each Airbnb user has left, and when their most recent review was posted. superusers should have the following columns:

  • reviewer_id: integer uniquely identifying a reviewer
  • reviewer_name: string containing the reviewer's name
  • user_review_count: integer representing the total number of reviews submitted by a reviewer
  • most_recent: datetime representing the most recent review submitted by a reviewer

Requirements:

  1. Drop rows with null values in any column
  2. Group and aggregate the data:
    1. Group reviews by reviewer_id
    2. Aggregate as follows:
      • Retain the first value of reviewer_name column
      • Store the count of the id column for a given user in a column named user_review_count
      • Store the most recent (maximum) value of date for a reviewer in a column named most_recent
  3. Sort by user_review_count in descending order and reviewer_id in ascending order
  4. Reset the index to ensure it starts from 0
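
If the dict-style agg-plus-rename feels clunky, pandas' named aggregation is an equivalent idiom. The toy sketch below (hypothetical data, mirroring the column names in reviews) performs step 2's grouping and aggregation in one call:

    import pandas as pd

    toy = pd.DataFrame({'reviewer_id': [1, 1, 2],
                        'reviewer_name': ['Ann', 'Ann', 'Bob'],
                        'id': [10, 11, 12],
                        'date': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-02-01'])})
    out = (toy.groupby('reviewer_id', as_index=False)
              .agg(reviewer_name=('reviewer_name', 'first'),
                   user_review_count=('id', 'count'),
                   most_recent=('date', 'max')))
    print(out)  # one row per reviewer, already using the required column names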
In [ ]:
### Solution - Exercise 2  
def find_superusers(reviews: pd.DataFrame) -> pd.DataFrame:
    ### BEGIN SOLUTION
    

    r2 = reviews.copy()

    
    #1. 
    r2.dropna(inplace = True)
    
    #2. 
    user_review_counts = r2[['reviewer_id', 'reviewer_name','id','date']].groupby('reviewer_id', as_index = False                                                                          
        ).agg({'reviewer_name':'first', 'id':'count','date':'max'}
        ).rename(columns = {'id':'user_review_count', 'date':'most_recent'}
        #3. 
        ).sort_values(by = ['user_review_count','reviewer_id'], ascending = [False, True]
        ).reset_index(drop=True)

    return user_review_counts
    ### END SOLUTION

### Demo function call
superusers_demo = utils.load_object_from_publicdata('superusers_demo.dill')
demo_result_superusers_df = find_superusers(superusers_demo)
demo_result_superusers_df

The demo should display this output.

reviewer_id reviewer_name user_review_count most_recent
0 62195527 Don 3 2024-03-29
1 13775813 Julie 2 2024-07-26
2 440450367 Griffin 2 2024-04-17
3 286210 Carl 1 2024-08-19
4 3148220 Jill 1 2024-04-20
5 4954534 Jeremie 1 2024-07-20
6 80239899 Alejandro 1 2024-04-13
7 121588501 Trevor 1 2024-05-20
8 141406268 Sean 1 2024-05-01
9 156557523 Donnie 1 2024-05-11


The cell below will test your solution for find_superusers (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 2  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_superusers),
              ex_name='find_superusers',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=25)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_superusers did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_superusers),
              ex_name='find_superusers',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=25,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_superusers did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_haversine_TRUE = utils.load_object_from_publicdata('demo_result_haversine_TRUE')

Exercise 3: (2 points)

haversine

Your task: define haversine as follows:

Airbnb locations are only identified via latitude and longitude. For travel purposes, that's not very helpful, so we'll need to determine how far away they are manually. Assuming a target location in LA that we'd like to stay at, we need to be able to calculate how far away each Airbnb is in miles.

To do this, we need to use the Haversine Distance to measure the "as the crow flies" distance between two lat/long points.

The Haversine formula is as follows:

$$hav(\theta) = hav(\Delta\varphi) + \cos(\varphi_1)\cos(\varphi_2)hav(\Delta\lambda)$$

where

  • $hav( \theta) = \sin^2( \theta/2)$
  • $\Delta \varphi = \varphi_2 - \varphi_1$
  • $\Delta\lambda = \lambda_2 - \lambda_1$
  • ($\varphi_1$, $\lambda_1$) is the latitude and longitude of point 1
  • ($\varphi_2$, $\lambda_2$) is the latitude and longitude of point 2
  • $\theta$ is the central angle between two points.

We have all the necessary information to determine the value of $hav(\theta)$. After solving the right-hand side of the equation, we can use the value of $hav(\theta)$ to solve for $\theta$:

  • $x = hav(\theta) = \sin^2(\theta/2)$

Solving algebraically for $\theta$,

$$\theta = 2\arcsin(\sqrt{x})$$

Now that we have the central angle $\theta$, we can find the distance $d$ using our computed value of $\theta$ and the radius of the Earth $r$, using the following relation:

$$d = r\theta$$

Note: Numpy's trigonometric functions expect angle measurements in radians, but the measurements in the dataframe are given in degrees. Before passing the values to the trigonometric functions, you should use np.radians() to convert them to radians.
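
As a quick check of the conversion (illustrative values only):

    import numpy as np

    print(np.radians(180.0))         # 3.141592653589793, i.e. pi
    print(np.sin(np.radians(90.0)))  # 1.0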

Input:

  • point_1: A list containing one latitude-longitude pair, ordered [lat, long]
  • point_2: A list containing a second latitude-longitude pair, ordered [lat, long]
  • r: Earth's radius, in miles (we're assuming here that the Earth is perfectly round, which is not exactly true)

Return: d: a float value representing the Haversine Distance in miles between point_1 and point_2

Requirements:

  1. Convert the longitudes and latitudes to radians using np.radians()
  2. Solve for $d$ using Numpy and the given formulas:

    $$hav(\theta) = hav(\Delta\varphi) + \cos(\varphi_1)\cos(\varphi_2)hav(\Delta\lambda)$$

    $$\theta = 2\arcsin(\sqrt{hav(\theta)})$$

    $$d = r\theta$$

In [ ]:
### Solution - Exercise 3  
def haversine(point_1, point_2, r = 3959) -> float:
    ### BEGIN SOLUTION
    
    #1. 
    phi_1 = np.radians(point_1[0])
    phi_2 = np.radians(point_2[0])
    delta_phi = np.radians((point_2[0] - point_1[0]))
    delta_lambda = np.radians((point_2[1] - point_1[1]))

    #2. 
    x =  (np.sin(delta_phi/2)**2 + np.cos(phi_1)*np.cos(phi_2)*(np.sin(delta_lambda/2))**2)
    theta = 2 * np.arcsin(np.sqrt(x))
    
    
    ##multiply by radius of earth to get difference
    return theta*r

    ### END SOLUTION

### Demo function call
point1 = [38.898, -77.037]
point2 = [48.858, -2.294]
result = haversine(point1,point2)
print(result)

The demo should display this printed output.

3633.265427951256


The cell below will test your solution for haversine (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 3  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=haversine,
              ex_name='haversine',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to haversine did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=haversine,
              ex_name='haversine',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=20,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to haversine did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_filter_stay_TRUE = utils.load_object_from_publicdata('demo_result_filter_stay_TRUE')

Exercise 4: (2 points)

filter_stay

Your task: define filter_stay as follows:

Now that we have a method of calculating distance, we want to filter the listings to just ones that are within range of our desired location, and accommodate enough people.

In the event that you did not solve Exercise 3, you may still complete this exercise independently. Instead of the Haversine distance, this exercise will test you using a provided helper function, which calculates the Euclidean distance. It converts the polar latitude/longitude pairs to Cartesian coordinates and then computes the linear distance between them. This will serve as a rough estimate.

Input:

  • criteria - A dictionary with the following keys:
    • destination: A list containing one latitude-longitude pair, ordered [lat, long], representing our desired stay location
    • guests: An integer representing how many people we will be travelling with. An Airbnb is acceptable if accommodates $\ge$ guests
    • dist_range: A float value representing the maximum distance of an Airbnb from the desired stay location. An Airbnb is acceptable if distance $\le$ dist_range
  • clean_listings: The result of Exercise 1 - A Pandas DataFrame containing Airbnb listing information with corrected license information
  • euclid: A function which can be called to calculate distance using longitude and latitude

Return: options: a Pandas DataFrame of filtered listings which could be suitable for our vacation

Requirements:

  1. Filter clean_listings to rows where accommodates $\ge$ guests
  2. Compute the distance between each Airbnb and the desired stay destination using the provided helper function and the latitude and longitude columns in clean_listings
    • Store this column as distance
  3. Filter clean_listings so that distance $\le$ dist_range
  4. Sort results by distance in ascending order, then by listing_id in ascending order
  5. As with Exercise 2, make sure you reset your index to ensure it starts at 0

Note: It's very possible that your criteria will return no viable options! This is okay - if you have 10 guests and you want to stay within one mile of your destination, it's reasonable to assume you may not find anything suitable (and may need to relax your criteria).

  • In this case, just return an empty dataframe.
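
For concreteness, a criteria dictionary might look like the sketch below. The specific values are hypothetical, not taken from the test cases:

    criteria = {
        'destination': [34.0522, -118.2437],  # hypothetical [lat, long] near downtown LA
        'guests': 4,                          # must accommodate at least 4 people
        'dist_range': 5.0,                    # stay within 5 miles of the destination
    }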
In [ ]:
### Helper Function
def euclid(point_1, point_2, r = 3959):
    
    ##convert to radians
    lat1 = point_1[0]*np.pi/180
    long1 = point_1[1]*np.pi/180
    lat2 = point_2[0]*np.pi/180
    long2 = point_2[1]*np.pi/180
    
    ##convert from polar to cartesian coordinates
    x1 = r * np.cos(lat1) * np.sin(long1)
    y1 = r * np.sin(lat1)
    z1 = r * np.cos(lat1) * np.cos(long1)


    x2 = r * np.cos(lat2) * np.sin(long2)
    y2 = r * np.sin(lat2)
    z2 = r * np.cos(lat2) * np.cos(long2)

    ##calculate distance
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2 + (z2 - z1)**2)

### Solution - Exercise 4  
def filter_stay(criteria: dict, clean_listings: pd.DataFrame, euclid) -> pd.DataFrame:
    ### BEGIN SOLUTION
    
    #1. 
    df  = clean_listings.copy()
    df=df[df['accommodates'] >= criteria['guests']]

    #2.
    def get_distance(x):
        return euclid(criteria['destination'], [x['latitude'],x['longitude']])
    df['distance'] = df[['latitude', 'longitude']].apply(get_distance, axis = 1)
    
    #3. 
    options = df[df['distance'] <= criteria['dist_range']]
    
    #4. 
    return options.sort_values(by = ['distance', 'listing_id'], ignore_index =True)
    ### END SOLUTION

### Demo function call
demo_options, demo_criteria = utils.load_object_from_publicdata('filter_stay_demo.dill')
demo_result_options_df = filter_stay(demo_criteria, demo_options, euclid)
demo_result_options_df

The demo should display this output.

listing_id host_id host_since latitude longitude room_type accommodates license distance
0 1099417246276636063 245392128 2019-02-25 34.091298 -118.351877 Entire home/apt 5 exempt 6.813108
1 908697518167341476 409985549 2021-06-29 34.088171 -118.365521 Entire home/apt 8 licensed 7.461785
2 51254530 61324086 2016-03-02 34.092360 -118.363940 Private room 2 exempt 7.478845


The cell below will test your solution for filter_stay (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 4  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(filter_stay),
              ex_name='filter_stay',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to filter_stay did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(filter_stay),
              ex_name='filter_stay',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to filter_stay did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_find_availability_TRUE = utils.load_object_from_publicdata('demo_result_find_availability_TRUE')

Exercise 5: (3 points)

find_availability

Your task: define find_availability as follows:

Now that we have filtered clean_listings down to some viable options, we need to determine whether they're actually available on the dates we want to vacation.

Input:

  • stay_length: An integer representing the total number of nights you intend to stay.
  • first_night: A datetime object representing the first night of your stay.
  • budget: An integer representing the maximum amount of money you'd like to spend per night on lodging.
  • options: The result of Exercise 4, a Pandas DataFrame of filtered listings which could be suitable for our vacation
  • calendar A DataFrame containing information about listing availability and pricing for 2024-2025 with the following relevant columns:
    • listing_id: integer that uniquely identifies a particular Airbnb
    • date: a datetime that represents a particular calendar day in 2024-2025
    • available: a boolean that represents whether a listing is available for rent on a particular date
    • price: a float representing the price of a listing on a particular date

Return: available: A list of listing_ids that suit our criteria and are available during our dates.

Requirements:

  1. Merge calendar and options on listing_id using an inner join
  2. Filter the resulting DataFrame to listings that have minimum_nights $\le$ stay_length $\le$ maximum_nights
    • Step 1 and 2 are provided in the starter code.
  3. Filter to dates of interest - assuming that first_night counts as the first day of your stay length, filter the DataFrame to only include the dates of your stay.
    • For example, if stay_length == 3, and the first night of your stay is 09-01-2025, your dataframe should contain the following dates:
      • 09-01-2025
      • 09-02-2025
      • 09-03-2025
    • (This corresponds to three nights: Check in on 09-01, stay that night, the night of 09-02, and the night of 09-03. You'll check out on the morning of 09-04.)
    • Hint: timedelta() might be helpful (see the sketch after this list).
  4. Check whether listings are available for every date within your stay.
  5. We want a sorted list of listing_ids which meet these criteria. For each listing_id:
    • available = True for all nights of the stay
    • The mean of the price column $\le$ budget
    • Every night of the stay is represented in the calendar
      • In the above example, a listing would need to have calendar data for 09-01, 09-02, and 09-03 to be viable. If a listing is available and in-budget for 09-01 and 09-03, but no data is available for 09-02, the listing should not be considered.
  6. These listing_ids should be sorted in ascending order
    • If there are no available listings for your desired dates, your function should return an empty list
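
Here is a minimal sketch of building the date window with timedelta, using the hypothetical example values from requirement 3:

    from datetime import datetime, timedelta

    first_night = datetime(2025, 9, 1)  # hypothetical first night
    stay_length = 3
    stay_dates = [first_night + timedelta(days=i) for i in range(stay_length)]
    print(stay_dates)  # the nights of 09-01, 09-02, and 09-03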
In [ ]:
### Solution - Exercise 5  
def find_availability( stay_length: int, first_night: datetime, budget:int, options: pd.DataFrame, calendar: pd.DataFrame) -> list:
    from datetime import timedelta
    
    #1. 
    df = calendar.merge(options['listing_id'], on = 'listing_id')
    
    #2.
    df = df[df['minimum_nights']<= stay_length]
    df = df[df['maximum_nights']>= stay_length]
    
    
    ### BEGIN SOLUTION
    
    #3. 
    df = df.loc[df['date']< first_night + timedelta(stay_length)]
    df = df.loc[df['date']>=first_night]

    #4. check price and availability
    def filter_avail(x):
        if len(x) != stay_length:
            return False
        else:
            avail = x['available'].all()
            in_budget = x['price'].mean() <= budget
            return avail & in_budget
        
    available = df.groupby(by = 'listing_id', group_keys = True).apply(filter_avail)
    
    #5.
    avail_list = available[available == True].index.to_list()

    return sorted(avail_list)
    ### END SOLUTION

### Demo function call
demo_options, demo_stay, demo_first, demo_budget = utils.load_object_from_publicdata('find_availability_demo.dill')
demo_result_avail_df = find_availability(demo_stay, demo_first, demo_budget, demo_options, calendar)
print(demo_result_avail_df)
demo_result_avail_df

The demo should display this printed output.

[588426797893466377, 623184032121344750, 786386342683651712]


The cell below will test your solution for find_availability (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 5  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_availability),
              ex_name='find_availability',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_availability did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_availability),
              ex_name='find_availability',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=100,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_availability did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_get_sentiment_TRUE = utils.load_object_from_publicdata('demo_result_get_sentiment_TRUE')
In [ ]:
superusers = utils.load_object_from_publicdata('superusers.dill')

Aside: NLTK's SentimentIntensityAnalyzer()

The next exercise makes use of nltk's SentimentIntensityAnalyzer(). Given a string representing a short blurb of text, the model returns a dictionary representing the sentiment of the text. There are scores representing 'positive', 'neutral', and 'negative' sentiment, as well as a more complex 'compound' score. This is what we'll be using to score our Airbnb reviews.

Notably, the model is only trained on English text, and therefore treats any non-English text as neutral.

The example below demonstrates how to instantiate and use the model, and displays results for the same sentence in both English and Spanish. Note that the Spanish sentence has a compound score of 0.0.

In [ ]:
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores('CSE6040 is my favorite class. I love Python and I love data analysis.'))
print(sia.polarity_scores('CSE6040 es mi asignatura favorita. Me encanta Python y el análisis de datos.'))
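
The returned dictionary has the keys 'neg', 'neu', 'pos', and 'compound'. Pulling out just the compound score (as the next exercise will require) is a single lookup; the review text below is made up for illustration:

    sia = SentimentIntensityAnalyzer()
    compound = sia.polarity_scores('Great stay, would book again!')['compound']
    print(compound)  # a float in [-1, 1]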

Exercise 6: (2 points)

get_sentiment

Your task: define get_sentiment as follows:

We now have a list of suitable Airbnbs that are available on our desired list of dates. In order to choose between them, it will be useful to consult the reviews. This dataset does not contain quantitative review data (e.g. 4 out of 5 stars) for individual reviews. It does, however, have text for each review, which we can analyze for overall sentiment.

Input:

  • reviews - A DataFrame containing reviews of Airbnbs, linked to each location by listing_id.
  • available_listings - The result of Exercise 5, a list containing listings that match all our distance, date, and budget criteria

  • superusers- The result of Exercise 2, a Pandas Dataframe consisting of reviewer IDs, the total number of reviews for each user, and their most recent review

Return: comments: A DataFrame consisting of reviews with sentiment analysis added, merged with superuser information

Requirements:

  1. Filter reviews to include only listing_ids that appear in available_listings
  2. Remove rows with NaN values in the comments column
  3. Merge this DataFrame with superusers on reviewer_id using an inner join
    • Keep all columns from reviews and the user_review_count column from superusers
  4. Use nltk's SentimentIntensityAnalyzer() to assign a score to each review's comments using polarity_scores()
    • Store the compound sentiment value for each review in a new column labelled sentiment
  5. Drop any rows with sentiment == 0.0 (we're assuming that reviews with 'compound' scores of 0.0 are either not useful or in a language that the model cannot parse).
In [ ]:
### Solution - Exercise 6  
def get_sentiment(reviews: pd.DataFrame, available_listings: list, superusers: pd.DataFrame) -> pd.DataFrame:
    
    sia = SentimentIntensityAnalyzer()
    ### BEGIN SOLUTION

    #1.
    comments = reviews.loc[reviews['listing_id'].isin(available_listings)]

    #2.
    comments = comments.loc[~comments['comments'].isna()]
    
    #3.
    comments = comments.merge(superusers[['user_review_count', 'reviewer_id']], on = 'reviewer_id')

    #4.
    comments['sentiment'] = comments['comments'].apply(lambda x: sia.polarity_scores(x)['compound'])

    #5. 
    comments = comments.loc[comments['sentiment']!=0]
    
    return comments
    ### END SOLUTION

### Demo function call
demo_avail = [44542527, 42572989]
demo_sentiment = get_sentiment(reviews, demo_avail, superusers)
demo_sentiment

The demo should display this output.

listing_id id date reviewer_id reviewer_name comments user_review_count sentiment
0 42572989 621367488 2020-03-30 276894729 Sara You will be more then happy with this place! Feels like home! 2 0.7942
1 44542527 881251541600796134 2023-04-30 92951767 Steve Scott’s apartment is in the perfect location in Silverlake. Very close to all the local hotspots on Sunset Blvd, and uniquely sits very close to sunset Junction, meaning it’s also walkable to all the best in Los Feliz and Virgil Village, too. We enjoyed a different neighborhood walk every morning we were here. the views are incredible and reach to the hills of Hollywood (you can see the iconic sign right from the living room), Griffith Park, and enjoy sunsets west over the Hollywood hills and beyond. The neighborhood is private and quaint, and the apartment is tucked far from the street making it quiet, private, and very peaceful. The neighbors in the building are wonderful and very friendly. And the walls are thick - we didn’t hear a thing from any other apartment the entire 3 months we lived here. the apartment is modern and clean. Highly recommend. 2 0.9844
2 44542527 816836676950366108 2023-01-31 11611847 Debbie Had a wonderful stay at Scott’s home. His place is large, with good light, and a great location, conveniently located next to restaurants, coffee shops, and bars. Our host Joe was responsive, friendly, and accommodating. 1 0.9468
3 44542527 458673161248982460 2021-09-24 676224 Alexis Great location, clean and modern, very responsive host. Centrally located to great shops and restaurants, a great base for exploring the East Side and LA 2 0.9571
4 44542527 745649760 2021-04-04 15466686 Sofia Scott’s place is sparkling clean with awesome mid century style throughout the apartment. The location is amazing (right near Erewhon which is the heaven of super markets). The apartment has large sunny windows with an amazing view, we loved the hummingbirds that visited our window every morning. Everything is brand new and perfect for stay at home working with all that you need. The spa like shower also so lovely after a long day. Scott is a very responsive and accommodating host, Thank you, Scott! 1 0.9932


The cell below will test your solution for get_sentiment (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 6  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(get_sentiment),
              ex_name='get_sentiment',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to get_sentiment did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(get_sentiment),
              ex_name='get_sentiment',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to get_sentiment did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_rank_top_listings_TRUE = utils.load_object_from_publicdata('demo_result_rank_top_listings_TRUE')

Exercise 7: (3 points)

rank_top_listings

Your task: define rank_top_listings as follows:

Now that the relevant comments have been associated with sentiment scores, we can use them to find our best options. The simplest thing to do would be to sort by sentiment and call it a day, but some reviews are more informative than others. For example, recent reviews are usually much more insightful and pertinent to an upcoming stay, and we may be more inclined to trust frequent reviewers or superusers (this is why Google has the Local Guide feature, for example). Let's implement a weighted average based on this information, so that we can find our best options.

As a reminder, a weighted average is computed as follows:

$$\bar{x} = \frac{\sum_{i=1}^{n}{w_ix_i}}{\sum_{i=1}^{n}{w_i}}$$

which expands to

$$\bar{x} = \frac{w_1x_1+w_2x_2+... + w_nx_n}{w_1+w_2+...+w_n}$$

Where $w_i$ is the weight assigned to point $x_i$, and $n$ is the total number of points.
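
As a quick numeric check of the formula (toy values), note that NumPy's np.average computes exactly this quantity:

    import numpy as np

    x = np.array([0.9, 0.2, 0.5])    # toy sentiment scores
    w = np.array([1.0, 0.5, 0.8])    # toy weights
    print((w * x).sum() / w.sum())   # manual weighted average
    print(np.average(x, weights=w))  # identical result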

Input:

  • comments - The result of Exercise 6; A DataFrame consisting of reviews with sentiment analysis added, merged with superuser information

Return: top_listings - A Pandas dataframe consisting of the top listing_ids and their weighted scores

Requirements:

  1. Compute the weights column:
    • If a review has date $\ge$:
      • 2024-01-01: assign a weight of 1
      • 2021-01-01: assign a weight of 0.8
      • 2018-01-01: assign a weight of 0.5
    • For dates before 2018-01-01, assign a weight of 0.1
    • To these weights, add 0.01 for every user_review_count greater than 1 (e.g. 0.01*(user_review_count-1))
    • For example, if a review was left on 2023-02-02 and the reviewer's user_review_count is 2, the weight assigned to that review should be 0.81
    • Hint: You may find strptime() useful.
  2. Group by listing_id and aggregate sentiments using the weighted average formula above. Use the weights column you computed as $w_i$ and sentiment as $x_i$. Store the resulting column as weighted_avg.
  3. Sort by weighted_avg in descending order and listing_id in ascending order
In [ ]:
### Solution - Exercise 7  
def rank_top_listings(comments: pd.DataFrame):
    ### BEGIN SOLUTION
    
    comments_copy = comments.copy()
    
    #1. 
    def apply_weights(x):
        if x['date'] >= datetime.strptime('2024-01-01', '%Y-%m-%d'):
            weight = 1
        elif datetime.strptime('2024-01-01', '%Y-%m-%d') > x['date'] >= datetime.strptime('2021-01-01', '%Y-%m-%d'):
            weight = .8
        elif datetime.strptime('2021-01-01', '%Y-%m-%d') > x['date'] >= datetime.strptime('2018-01-01', '%Y-%m-%d'):
            weight = .5
        else:
            weight = .1
        ##rate superuser reviews slightly more heavily - add .01 weight for every review > 1
        weight += .01*(x['user_review_count']-1)
        return weight

    
    
    comments_copy['weights'] = comments_copy[['date','user_review_count']].apply(apply_weights, axis = 1)
    
    
    #2. 
    def compute_avg(x):
        return (x['weights']*x['sentiment']).sum()/x['weights'].sum()

    weighted = comments_copy.groupby('listing_id', as_index = False).apply(compute_avg).rename(columns = {None:'weighted_avg'})

    #3. 
    top_listings = weighted.sort_values(by = ['weighted_avg', 'listing_id'], ascending = [False,True], ignore_index = True)

    return top_listings
    ### END SOLUTION

### Demo function call
demo_comments = utils.load_object_from_publicdata('rank_top_listings_demo.dill')
demo_top_listings = rank_top_listings(demo_comments)
demo_top_listings

The demo should display this output.

listing_id weighted_avg
0 899909 0.836561
1 828540 0.795065


The cell below will test your solution for rank_top_listings (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 7  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rank_top_listings),
              ex_name='rank_top_listings',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_top_listings did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rank_top_listings),
              ex_name='rank_top_listings',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_top_listings did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Part 2: Evaluating LA Crime Data

Now that we have found the top Airbnb listings available for our desired dates, let's enhance our search by evaluating how safe each neighborhood in LA is. We will use crime data provided by the City of Los Angeles.

Our database includes two tables: crime and population

The crime table contains a record of each crime committed, including information such as the date, lat/long, area, and crime description.

The population table contains only three columns: the area, its population, and its size in square miles, for each of the 21 areas in LA.

Run the queries below to view samples of the data.

In [ ]:
# Sample of crime table:
sample_crime_data = pd.read_sql('select * from crime limit 3', conn)
display(sample_crime_data)
In [ ]:
# Sample of population table:
sample_population_data = pd.read_sql('select * from population limit 3', conn)
display(sample_population_data)
In [ ]:
### Run Me!!!
demo_result_prime_crime_time_TRUE = utils.load_object_from_publicdata('demo_result_prime_crime_time_TRUE')

Exercise 8: (3 points)

prime_crime_time

Your task: define prime_crime_time_query as follows:

The crime table contains entries for each crime committed in LA. Every crime record contains a crime description (crime_desc), as well as the date on which the crime occurred (date_occ). Let's first determine the top crimes committed each month.

Input: None

Return: A SQLite query that returns the top two crimes for each month.

Requirements:

  • Your solution must contain a SQLite query
  • Your solution must contain the following columns: month, description, crime_count, and crime_rank
  • Your result should extract the month from the date_occ column and create a string containing the two-digit month value along with the three-letter month abbreviation, separated by a dash. For example, January would be 01-JAN, September would be 09-SEP, December would be 12-DEC, etc.
  • Only return the top 2 crimes per month, along with their count and monthly ranking (either 1 or 2). In the event of ties, return all tied values
  • Column order and row order do not matter
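
Regarding the tie requirement: SQLite's rank() window function (available in SQLite 3.25 and later) keeps ties at the same rank, unlike row_number(). The throwaway in-memory check below (hypothetical values) shows the behavior:

    import sqlite3

    tie_check = sqlite3.connect(':memory:')
    tie_check.executescript("create table t(v); insert into t values (5),(5),(3);")
    print(tie_check.execute("select v, rank() over (order by v desc) from t").fetchall())
    # expected: [(5, 1), (5, 1), (3, 3)] -- both 5s share rank 1, and rank 2 is skipped
    tie_check.close()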
In [ ]:
### Solution - Exercise 8  
prime_crime_time_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
prime_crime_time_query = '''
with count_by_month as (
    select
        CASE substr(date_occ, 1, 2)
          WHEN '01' THEN '01-JAN'
          WHEN '02' THEN '02-FEB'
          WHEN '03' THEN '03-MAR'
          WHEN '04' THEN '04-APR'
          WHEN '05' THEN '05-MAY'
          WHEN '06' THEN '06-JUN'
          WHEN '07' THEN '07-JUL'
          WHEN '08' THEN '08-AUG'
          WHEN '09' THEN '09-SEP'
          WHEN '10' THEN '10-OCT'
          WHEN '11' THEN '11-NOV'
          WHEN '12' THEN '12-DEC'
          ELSE NULL
        END AS month,
        crime_desc description,
        count(crime_desc) crime_count
    from
        crime
    group by 1, 2
)
select * from (
select
    month,
    description,
    crime_count,
    rank() OVER (partition by month order by crime_count desc) AS crime_rank
from
    count_by_month
)
where crime_rank <= 2
'''
### END SOLUTION


### Demo function call
demo_prime_crime_time_result = pd.read_sql(prime_crime_time_query,conn)
demo_prime_crime_time_result

The demo should display this output.

month description crime_count crime_rank
0 01-JAN VEHICLE - STOLEN 4097 1
1 01-JAN BATTERY - SIMPLE ASSAULT 2891 2
2 02-FEB VEHICLE - STOLEN 3666 1
3 02-FEB THEFT OF IDENTITY 2954 2
4 03-MAR VEHICLE - STOLEN 3942 1
5 03-MAR BATTERY - SIMPLE ASSAULT 2695 2
6 04-APR VEHICLE - STOLEN 3611 1
7 04-APR BATTERY - SIMPLE ASSAULT 2417 2
8 05-MAY VEHICLE - STOLEN 3847 1
9 05-MAY BATTERY - SIMPLE ASSAULT 1973 2
10 06-JUN VEHICLE - STOLEN 3795 1
11 06-JUN BURGLARY FROM VEHICLE 1694 2
12 07-JUL VEHICLE - STOLEN 4171 1
13 07-JUL VANDALISM - FELONY (400 and OVER, ALL CHURCH VANDALISMS) 1777 2
14 08-AUG VEHICLE - STOLEN 4074 1
15 08-AUG BURGLARY FROM VEHICLE 1806 2
16 09-SEP VEHICLE - STOLEN 3904 1
17 09-SEP BURGLARY FROM VEHICLE 1825 2
18 10-OCT VEHICLE - STOLEN 4109 1
19 10-OCT BURGLARY FROM VEHICLE 2021 2
20 11-NOV VEHICLE - STOLEN 3867 1
21 11-NOV BURGLARY FROM VEHICLE 2039 2
22 12-DEC VEHICLE - STOLEN 2898 1
23 12-DEC BURGLARY FROM VEHICLE 1918 2


The cell below will test your solution for prime_crime_time (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 8  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(prime_crime_time_query),
              ex_name='prime_crime_time',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to prime_crime_time did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(prime_crime_time_query),
              ex_name='prime_crime_time',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to prime_crime_time did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_crime_center_TRUE = utils.load_object_from_publicdata('demo_result_crime_center_TRUE')

Exercise 9: (1 point)

crime_center

Your task: define crime_center_query as follows:

The crime table contains entries categorized into the areas served by each police department within LA. Each crime entry contains the coordinates at which the crime occurred. Let us now determine the crime center, i.e., the average crime location, for each police department's area.

Input: None

Return: A SQLite query that returns the average coordinates for each area.

Requirements:

  • Your solution must contain a SQLite query
  • The columns lat and long in the crime table contain the coordinates at which the crime occurred
  • Return the average lat and long for each area, and name these columns center_lat and center_long (a pandas sketch of this aggregation appears after this list)
  • Your solution must contain the following columns: area, center_lat, and center_long
  • Order your results in ascending alphabetical order by area
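
For intuition, the same aggregation in pandas (a minimal sketch, assuming you pull a data frame with area, lat, and long columns from the database; your submitted answer must still be a SQLite query) would look like:

    # Average the crime coordinates within each area, then sort by area name.
    crime_df = pd.read_sql('select area, lat, long from crime', conn)
    centers = (crime_df.groupby('area')[['lat', 'long']]
                       .mean()
                       .rename(columns={'lat': 'center_lat', 'long': 'center_long'})
                       .reset_index()
                       .sort_values('area'))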
In [ ]:
### Solution - Exercise 9  
crime_center_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
crime_center_query = '''
select
    area,
    avg(lat) as center_lat,   -- mean latitude of the area's crimes
    avg(long) as center_long  -- mean longitude of the area's crimes
from crime
group by area
order by 1                    -- ascending alphabetical order by area
'''
### END SOLUTION


### Demo function call
demo_crime_center_result = pd.read_sql(crime_center_query,conn)
demo_crime_center_result

The demo should display this output.

    area         center_lat  center_long
0   77th Street   33.977699  -118.297673
1   Central       34.048074  -118.250774
2   Devonshire    34.250923  -118.537942
3   Foothill      34.245504  -118.370130
4   Harbor        33.770143  -118.285061
5   Hollenbeck    34.053705  -118.203808
6   Hollywood     34.099204  -118.328856
7   Mission       34.254761  -118.450299
8   N Hollywood   34.172124  -118.384572
9   Newton        34.010694  -118.260135
10  Northeast     34.104364  -118.238708
11  Olympic       34.060322  -118.300204
12  Pacific       33.981253  -118.419988
13  Rampart       34.062123  -118.267145
14  Southeast     33.939051  -118.266904
15  Southwest     34.020676  -118.315023
16  Topanga       34.191668  -118.601957
17  Van Nuys      34.176803  -118.439639
18  West LA       34.051339  -118.439908
19  West Valley   34.186766  -118.519701
20  Wilshire      34.062164  -118.351886
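
As a quick sanity check (a sketch using the demo result above), every center should land inside a rough bounding box for LA, which the columns themselves confirm:

    # All centers should sit within LA's approximate coordinate range.
    df = demo_crime_center_result
    assert df['center_lat'].between(33.5, 34.5).all()
    assert df['center_long'].between(-119.0, -118.0).all()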


The cell below will test your solution for crime_center (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 9  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(crime_center_query),
              ex_name='crime_center',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to crime_center did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(crime_center_query),
              ex_name='crime_center',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to crime_center did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')
In [ ]:
### Run Me!!!
demo_result_relative_crime_rate_TRUE = utils.load_object_from_publicdata('demo_result_relative_crime_rate_TRUE')

Exercise 10: (2 points)

relative_crime_rate

Your task: define relative_crime_rate_query as follows:

The crime table contains entries categorized into the areas served by each police department within LA. The population table contains the population and area in square miles for each of LA's police district areas.

We would now like to determine how dangerous each area is relative to the others. We can do this by computing:

crimes_per_100k: the number of crimes committed in a given area, expressed as a rate per 100,000 residents. Multiply the area's crime count by the floating-point number 100,000.0, then divide by the area's population.

The formula is: crimes_per_100k = 100000.0 * number of crimes / population for each area

crimes_per_sq_mile: the number of crimes committed in a given area divided by the area's size in square miles.

The formula is: crimes_per_sq_mile = number of crimes / square miles for each area
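
As a quick check on the arithmetic (hypothetical numbers, purely to illustrate the two formulas):

    # A made-up area with 5,000 crimes, 250,000 residents, and 20 square miles.
    n_crimes, population, sq_miles = 5000, 250000, 20
    crimes_per_100k = 100000.0 * n_crimes / population  # -> 2000.0
    crimes_per_sq_mile = n_crimes / sq_miles            # -> 250.0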

Input: None

Return: A SQLite query that returns the relative crime rate for each area.

Requirements:

  • Your solution must contain a SQLite query
  • Your query should return the following columns: area, crimes_per_100k, and crimes_per_sq_mile
  • You should round your results to the nearest whole number. (Note that SQLite's round() returns a floating-point value, which is why the demo output below shows values such as 67945.0.)
  • You should order your results by crimes_per_100k in descending order, crimes_per_sq_mile in descending order, and finally by area in ascending alphabetical order
In [ ]:
### Solution - Exercise 10  
relative_crime_rate_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
relative_crime_rate_query = '''
with counts as (
    -- Number of crimes recorded in each area.
    select area, count(case_num) as n_crimes
    from crime
    group by area
)
select
    c.area,
    -- The 100000.0 literal forces floating-point arithmetic before rounding.
    round(c.n_crimes * 100000.0 / p.population) as crimes_per_100k,
    round(c.n_crimes / p.sq_miles) as crimes_per_sq_mile
from counts c
inner join population p on c.area = p.area
order by 2 desc, 3 desc, 1
'''
### END SOLUTION


### Demo function call
demo_relative_crime_rate_result = pd.read_sql(relative_crime_rate_query,conn)
demo_relative_crime_rate_result

The demo should display this output.

    area         crimes_per_100k  crimes_per_sq_mile
0   Central              67945.0              6040.0
1   Southwest            12981.0              1634.0
2   77th Street          11845.0              1742.0
3   Newton               11546.0              1924.0
4   Southeast            11265.0              1657.0
5   Pacific              10977.0               853.0
6   Rampart              10384.0              3092.0
7   Olympic               8763.0              2827.0
8   N Hollywood           8548.0               752.0
9   Harbor                8446.0               535.0
10  West Valley           7732.0               454.0
11  Topanga               7459.0               821.0
12  Wilshire              7178.0              1290.0
13  Devonshire            7144.0               324.0
14  West LA               7017.0               246.0
15  Hollenbeck            6362.0               837.0
16  Mission               6350.0               571.0
17  Foothill              6096.0               241.0
18  Northeast             5986.0               516.0
19  Hollywood             5867.0              1023.0
20  Van Nuys              4831.0               523.0


The cell below will test your solution for relative_crime_rate (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [ ]:
### Test Cell - Exercise 10  


tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(relative_crime_rate_query),
              ex_name='relative_crime_rate',
              key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=', 
              n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to relative_crime_rate did not pass the test.'

### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(relative_crime_rate_query),
              ex_name='relative_crime_rate',
              key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=', 
              n_iter=50,
              hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to relative_crime_rate did not pass the test.'
### END HIDDEN TESTS

print('Passed! Please submit.')

Fin.

If you have made it this far, congratulations! Remember to submit the exam to ensure you receive all the points you have earned!