Fall 2025 MT2: A Better Airbnb¶Version 1.0.2
All of the header information is important. Please read it..
Topics number of exercises: This problem builds on your knowledge of data cleaning, filtering, and analysis in Pandas, SQL, and Numpy. It has 11 exercises numbered 0 to 10. There are 21 available points. However to earn 100% the threshold is 13 points. (Therefore once you hit 13 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises but you may solve them in any order. That is if you can't solve an exercise you can still move on and try the next one. Use this to your advantage as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (careful when printing large objects you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Exercise 0 - : 1 point(s)
Exercise 1 - : 1 point(s)
Exercise 2 - : 1 point(s)
Exercise 3 - : 2 point(s)
Exercise 4 - : 2 point(s)
Exercise 5 - : 3 point(s)
Exercise 6 - : 2 point(s)
Exercise 7 - : 3 point(s)
Exercise 8 - : 3 point(s)
Exercise 9 - : 1 point(s)
Exercise 10 - : 2 point(s)
Final reminders:
Background: Airbnb is an online marketplace that allows property owners to rent their space to prospective short and long-term tenants. The app enables renters to filter potential living spaces by destination, price, stay type, stay dates, and more. Users also have access to review data from previous guests in order to make informed decisions. Even still, the process of narrowing down a good place to stay can be overwhelming, especially if you are unfamiliar with the area.
Your overall task: Your goal is to use Airbnb data to help a prospective renter narrow down relevant stay locations, and then determine the best options based on reviews and area information. You will do this in two main steps:
First, you will clean, filter, and rank the Airbnb locations using review and listing data.
Second, you will analyze crime data for the various neighborhoods within LA to ensure your chosen location is safe.
The datasets: There are two sources of data which you will use to solve the following exercises.
Information about Airbnb listings in Los Angeles, California from 2024-2025, sourced from InsideAirbnb.
Crime data from 2020-2025 for Los Angeles, sourced from the City of Los Angeles.
The listing data (source 1) is provided as three seperate Pandas data frames containing information about listings, reviews, and booking. The crime data (source 2) is contained in a SQLite Database.
SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
import re
import pandas as pd
### Run Me!!!
import pandas as pd
import numpy as np
from pprint import pprint
import sqlite3
import re
from datetime import datetime, timedelta, date
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
listings=utils.load_object_from_publicdata('listings.dill')
reviews=utils.load_object_from_publicdata('reviews.dill')
calendar=utils.load_object_from_publicdata('calendar.dill')
conn=sqlite3.connect("resource/asnlib/publicdata/crimes_and_population.db")
explore_airbnb_data__FREE
Example: we have defined explore_airbnb_data__FREE as follows:
This is a free exercise!
Please run the test cell below to collect your FREE point!
We encourage you to review the samples of both listings and reviews in the cell below:
listings A DataFrame containing descriptions, location, and host information for active Airbnbs in Los Angeles, California, as of September 2024. reviews A DataFrame containing reviews of Airbnbs, linked to each location by listing ID.calendar A DataFrame containing information about listing availability and pricing for 2024-2025. ### Solution - Exercise 0
def explore_airbnb_data__FREE(listings: pd.DataFrame, reviews: pd.DataFrame, calendar: pd.DataFrame, headsize:int=10) -> tuple:
listings_preview = listings.head(n=headsize)
reviews_preview = reviews.head(n=headsize)
calendar_preview = calendar.head(n=headsize)
return (listings_preview, reviews_preview, calendar_preview)
### Demo function call
listings_preview,reviews_preview,calendar_preview = explore_airbnb_data__FREE(listings, reviews, calendar)
display(listings.head(5))
display(reviews.head(5))
display(calendar.head(5))
The test cell below will always pass. Please submit to collect your free points for explore_airbnb_data__FREE (exercise 0).
### Test Cell - Exercise 0
print('Passed! Please submit.')
### Run Me!!!
demo_result_cleanse_listings_TRUE = utils.load_object_from_publicdata('demo_result_cleanse_listings_TRUE')
cleanse_listings
Your task: define cleanse_listings as follows:
Input: license: A single value from the license column of listings. If it is not null, license will be a string. However, our function should be able to handle null values, which will be represented as np.nan.
Return: clean_license: A string containing either 'licensed', 'exempt', or 'unlicensed'
Requirements:
To begin, we need to clean up listings by simplifying the license column. We're not actually interested in what specific type of license each Airbnb has - just whether they are licensed, unlicensed, or exempt.
To help with this, write a function that can clean a string given the following criteria:
If the word 'exempt' is present anywhere in the string, return the string 'exempt'. This will not be case-sensitive.
If the string appears to be a license, return the string 'licensed'. A correct license format is as follows:
np.nan) inputs, but you will still be need to account for strings which do not meet criteria 1 or 2. ### Solution - Exercise 1
def cleanse_listings(license: str) -> str:
##handle nan values
if not isinstance(license, str):
return 'unlicensed'
### BEGIN SOLUTION
#1.
if 'exempt' in license.lower():
return 'exempt'
#2.
elif re.match(r'(^[a-zA-Z0-9]+-[0-9]+$)',license):
return 'licensed'
#3.
else:
return 'unlicensed'
### END SOLUTION
### Demo function call
cleanse_listings_demo = utils.load_object_from_publicdata('cleanse_listings_demo.dill')
for i in cleanse_listings_demo:
print(f'\n {i} -> {cleanse_listings(i)}')
The demo should display this printed output.
Exempt -> exempt
HSR19-000881 -> licensed
nan -> unlicensed
Exempt - This listing is a transient occupancy residential structure -> exempt
43158 -> unlicensed
The cell below will test your solution for cleanse_listings (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 1
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(cleanse_listings),
ex_name='cleanse_listings',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=200)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_listings did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(cleanse_listings),
ex_name='cleanse_listings',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=200,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_listings did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_find_superusers_TRUE = utils.load_object_from_publicdata('demo_result_find_superusers_TRUE')
find_superusers
Your task: define find_superusers as follows:
Some Airbnb users have left more reviews, and their testimony may be more helpful. For example, for Google maps, Google calls such superusers 'local guides' and gives some deference to their reviews. We'd like to implement a similar approach.
Input:
reviews: A Pandas DataFrame containing reviews of Airbnb's, identified by listing_id.Return: superusers: A Pandas DataFrame containing information about how many reviews each Airbnb user has left, and when their most recent review was posted. superusers should have the following columns:
reviewer_id: integer uniquely identifying a reviewerreviewer_name: string containing the reviewer's nameuser_review_count: integer representing the total number of reviews submitted by a reviewermost_recent: datetime representing the most recent review submitted by a reviewerRequirements:
reviewer_idreviewer_name columnid column for a given user in a column named user_review_countdate for a reviewer in a column named most_recentuser_review_count in descending order and reviewer_id in ascending order### Solution - Exercise 2
def find_superusers(reviews: pd.DataFrame) -> pd.DataFrame:
### BEGIN SOLUTION
r2 = reviews.copy()
#1.
r2.dropna(inplace = True)
#2.
user_review_counts = r2[['reviewer_id', 'reviewer_name','id','date']].groupby('reviewer_id', as_index = False
).agg({'reviewer_name':'first', 'id':'count','date':'max'}
).rename(columns = {'id':'user_review_count', 'date':'most_recent'}
#3.
).sort_values(by = ['user_review_count','reviewer_id'], ascending = [False, True]
).reset_index(drop=True)
return user_review_counts
### END SOLUTION
### Demo function call
superusers_demo = utils.load_object_from_publicdata('superusers_demo.dill')
demo_result_superusers_df = find_superusers(superusers_demo)
demo_result_superusers_df
The demo should display this output.
| reviewer_id | reviewer_name | user_review_count | most_recent | |
|---|---|---|---|---|
| 0 | 62195527 | Don | 3 | 2024-03-29 |
| 1 | 13775813 | Julie | 2 | 2024-07-26 |
| 2 | 440450367 | Griffin | 2 | 2024-04-17 |
| 3 | 286210 | Carl | 1 | 2024-08-19 |
| 4 | 3148220 | Jill | 1 | 2024-04-20 |
| 5 | 4954534 | Jeremie | 1 | 2024-07-20 |
| 6 | 80239899 | Alejandro | 1 | 2024-04-13 |
| 7 | 121588501 | Trevor | 1 | 2024-05-20 |
| 8 | 141406268 | Sean | 1 | 2024-05-01 |
| 9 | 156557523 | Donnie | 1 | 2024-05-11 |
The cell below will test your solution for find_superusers (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 2
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_superusers),
ex_name='find_superusers',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=25)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_superusers did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_superusers),
ex_name='find_superusers',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=25,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_superusers did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_haversine_TRUE = utils.load_object_from_publicdata('demo_result_haversine_TRUE')
haversine
Your task: define haversine as follows:
Airbnb locations are only identified via latitude and longitude. For travel purposes, that's not very helpful, so we'll need to determine how far away they are manually. Assuming a target location in LA that we'd like to stay at, we need to be able to calculate how far away each Airbnb is in miles.
To do this, we need to use the Haversine Distance to measure the "as the crow flies" distance between two lat/long points.
The Haversine formula is as follows:
$$hav(\theta) = hav(\Delta\varphi) + \cos(\varphi_1)\cos(\varphi_2)hav(\Delta\lambda)$$
where
We have all the necessary information to determine the value of $hav(\theta)$. After solving the right-hand side of the equation, we can use the value of $hav(\theta)$ to solve for $\theta$:
Solving algebraically for $\theta$,
$$ \theta = 2\arcsin(\sqrt(x)) $$Now that we have the central angle $\theta$, we can find the distance $d$ using our computed value of $\theta$ and the radius of the Earth $r$, using the following relation:
$$d = r\theta$$Note: Numpy's trigonometric functions expect angle measurements in radians. The measurements in the dataframe are given in degrees. Before passing the values to trigonometric function, you should use np.radians() to convert to radians.
Input:
point_1: A list containing one latitude-longitude pair, ordered [lat, long]point_2: A list containing a second latitude-longitude pair, ordered[lat, long]r: Earth's radius, in miles (we're assuming here that the Earth is perfectly round, which is not exactly true)Return: d: a float value representing the Haversine Distance in miles between point_1 and point_2
Requirements:
Solve for $d$ using the Numpy and the given formulas:
$$hav(\theta) = hav(\Delta\varphi) + \cos(\varphi_1)\cos(\varphi_2)hav(\Delta\lambda)$$
$$\theta = 2\arcsin(\sqrt(hav( \theta)))$$
$$d = r\theta$$
### Solution - Exercise 3
def haversine(point_1, point_2, r = 3959) -> float:
### BEGIN SOLUTION
#1.
phi_1 = np.radians(point_1[0])
phi_2 = np.radians(point_2[0])
delta_phi = np.radians((point_2[0] - point_1[0]))
delta_lambda = np.radians((point_2[1] - point_1[1]))
#2.
x = (np.sin(delta_phi/2)**2 + np.cos(phi_1)*np.cos(phi_2)*(np.sin(delta_lambda/2))**2)
theta = 2 * np.arcsin(np.sqrt(x))
##multiply by radius of earth to get difference
return theta*r
### END SOLUTION
### Demo function call
point1 = [38.898, -77.037]
point2 = [48.858, -2.294]
result = haversine(point1,point2)
print(result)
The demo should display this printed output.
3633.265427951256
The cell below will test your solution for haversine (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 3
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=haversine,
ex_name='haversine',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to haversine did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=haversine,
ex_name='haversine',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=20,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to haversine did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_filter_stay_TRUE = utils.load_object_from_publicdata('demo_result_filter_stay_TRUE')
filter_stay
Your task: define filter_stay as follows:
Now that we have a method of calculating distance, we want to filter the listings to just ones that are within range of our desired location, and accommodate enough people.
In the event that you did not solve Exercise 3, you may still complete this exercise independently. Instead of the Haversine distance, this exercise will test you using a provided helper function, which calculates the Euclidean distance. It converts the polar latitude/longitude pairs to cartesian coordinates, and then computes the linear distance between them. This will serve as a rough estimate.
Input:
criteria - A dictionary with the following keys:destination: A list containing one latitude-longitude pair, ordered [lat, long], representing our desired stay locationguests: An integer representing how many people we will be travelling with. An Airbnb is acceptable if accommodates $\ge$ guestsdist_range: A float value representing the maximum distance of an Airbnb from the desired stay location. An Airbnb is acceptable if distance $\le$ dist_range clean_listings: The result of Exercise 1 - A Pandas DataFrame containing Airbnb listing information with corrected license informationeuclid: A function which can be called to calculate distance using longitude and latitudeReturn: options: a Pandas DataFrame of filtered listings which could be suitable for our vacation
Requirements:
clean_listings to rows where accommodates $\ge$ guestslatitude and longitude columns in clean_listingsdistanceclean_listings so that distance $\le$ dist_rangedistance in ascending order, then by listing_id in ascending orderNote: It's very possible that your criteria will return no viable options! This is okay - if you have 10 guests and you want to stay within one mile of your destination, it's reasonable to assume you may not find anything suitable (and may need to relax your criteria.)
### Helper Function
def euclid(point_1, point_2, r = 3959):
##convert to radians
lat1 = point_1[0]*np.pi/180
long1 = point_1[1]*np.pi/180
lat2 = point_2[0]*np.pi/180
long2 = point_2[1]*np.pi/180
##convert from polar to cartesian coordinates
x1 = r * np.cos(lat1) * np.sin(long1)
y1 = r * np.sin(lat1)
z1 = r * np.cos(lat1) * np.cos(long1)
x2 = r * np.cos(lat2) * np.sin(long2)
y2 = r * np.sin(lat2)
z2 = r * np.cos(lat2) * np.cos(long2)
##calculate distance
return np.sqrt((x2 - x1)**2 + (y2 - y1)**2 + (z2 - z1)**2)
### Solution - Exercise 4
def filter_stay(criteria: dict, clean_listings: pd.DataFrame, euclid) -> pd.DataFrame:
### BEGIN SOLUTION
#1.
df = clean_listings.copy()
df=df[df['accommodates'] >= criteria['guests']]
#2.
def get_distance(x):
return euclid(criteria['destination'], [x['latitude'],x['longitude']])
df['distance'] = df[['latitude', 'longitude']].apply(get_distance, axis = 1)
#3.
options = df[df['distance'] <= criteria['dist_range']]
#4.
return options.sort_values(by = ['distance', 'listing_id'], ignore_index =True)
### END SOLUTION
### Demo function call
demo_options, demo_criteria = utils.load_object_from_publicdata('filter_stay_demo.dill')
demo_result_options_df = filter_stay(demo_criteria, demo_options, euclid)
demo_result_options_df
The demo should display this output.
| listing_id | host_id | host_since | latitude | longitude | room_type | accommodates | license | distance | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1099417246276636063 | 245392128 | 2019-02-25 | 34.091298 | -118.351877 | Entire home/apt | 5 | exempt | 6.813108 |
| 1 | 908697518167341476 | 409985549 | 2021-06-29 | 34.088171 | -118.365521 | Entire home/apt | 8 | licensed | 7.461785 |
| 2 | 51254530 | 61324086 | 2016-03-02 | 34.092360 | -118.363940 | Private room | 2 | exempt | 7.478845 |
The cell below will test your solution for filter_stay (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 4
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(filter_stay),
ex_name='filter_stay',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to filter_stay did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(filter_stay),
ex_name='filter_stay',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=50,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to filter_stay did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_find_availability_TRUE = utils.load_object_from_publicdata('demo_result_find_availability_TRUE')
find_availability
Your task: define find_availability as follows:
Now that we have filtered clean_listings down to some viable options, we need to determine whether they're actually available on the dates we want to vacation.
Input:
stay_length: An integer representing the total number of nights you intend to stay.first_night: A datetime object representing the first night of your stay.budget: An integer representing the maximum amount of money you'd like to spend per night on lodging.options: The result of Exercise 4, a Pandas DataFrame of filtered listings which could be suitable for our vacationcalendar A DataFrame containing information about listing availability and pricing for 2024-2025 with the following relevant columns:listing_id: integer that uniquely identifies a particular Airbnbdate: a datetime that represents a particular calendar day in 2024-2025available: a boolean that represents whether a listing is available for rent on a particular dateprice: a float representing the price of a listing on a particular dateReturn: available: A list of listing_ids that suit our criteria and are available during our dates.
Requirements:
calendar and options on listing_id using an inner joinminimum_nights $\le$ stay_length $\le$ maximum_nights first_night counts as the first day of your stay length, filter the DataFrame to only include the dates of your stay. stay_length == 3, and the first night of your stay is 09-01-2025, your dataframe should contain the following dates:listing_ids which meet these criteria. For each listing_id:available = True for all nights of the stayprice column $\le$ budget listing_id's should be sorted in ascending order### Solution - Exercise 5
def find_availability( stay_length: int, first_night: datetime, budget:int, options: pd.DataFrame, calendar: pd.DataFrame) -> list:
from datetime import timedelta
#1.
df = calendar.merge(options['listing_id'], on = 'listing_id')
#2.
df = df[df['minimum_nights']<= stay_length]
df = df[df['maximum_nights']>= stay_length]
### BEGIN SOLUTION
#3.
df = df.loc[df['date']< first_night + timedelta(stay_length)]
df = df.loc[df['date']>=first_night]
#4. check price and availability
def filter_avail(x):
if len(x) != stay_length:
return False
else:
avail = x['available'].all()
in_budget = x['price'].mean() <= budget
return avail & in_budget
available = df.groupby(by = 'listing_id', group_keys = True).apply(filter_avail)
#5.
avail_list = available[available == True].index.to_list()
return sorted(avail_list)
### END SOLUTION
### Demo function call
demo_options, demo_stay, demo_first, demo_budget = utils.load_object_from_publicdata('find_availability_demo.dill')
demo_result_avail_df = find_availability(demo_stay, demo_first, demo_budget, demo_options, calendar)
print(demo_result_avail_df)
demo_result_avail_df
The demo should display this printed output.
[588426797893466377, 623184032121344750, 786386342683651712]
The cell below will test your solution for find_availability (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 5
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_availability),
ex_name='find_availability',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_availability did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_availability),
ex_name='find_availability',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=100,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_availability did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_get_sentiment_TRUE = utils.load_object_from_publicdata('demo_result_get_sentiment_TRUE')
superusers = utils.load_object_from_publicdata('superusers.dill')
The next exercise makes use of nltk's SentimentIntensityAnalyzer(). Given a string representing a short blurb of text, the model returns a dictionary representing the sentiment of the text. There are scores representing 'positive', 'neutral', and 'negative' sentiment, as well as a more complex 'compound' score. This is what we'll be using to score our Airbnb reviews.
Notably, the model is only trained on English text, and therefore treats any non-English text as neutral.
The example below demonstrates how to instantiate and use the model, and displays results for the same sentence in both English and Spanish. Note that the Spanish sentence has a compound score of 0.0.
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores('CSE6040 is my favorite class. I love Python and I love data analysis.'))
print(sia.polarity_scores('CSE6040 es mi asignatura favorita. Me encanta Python y el análisis de datos.'))
get_sentiment
Your task: define get_sentiment as follows:
We now have a list of suitable Airbnbs that are available on our desired list of dates. In order to choose between them, it will be useful to consult the reviews. This dataset does not contain quantitative review data (e.g. 4 out of 5 stars) for individual reviews. It does, however, have text for each review, which we can analyze for overall sentiment.
Input:
reviews - A DataFrame containing reviews of Airbnbs, linked to each location by listing_id.available_listings - The result of Exercise 5, a list containing listings that match all our distance, date, and budget criteria
superusers- The result of Exercise 2, a Pandas Dataframe consisting of reviewer IDs, the total number of reviews for each user, and their most recent review
Return: comments: A DataFrame consisting of reviews with sentiment analysis added, merged with superuser information
Requirements:
reviews to include only listing_ids that appear in available_listingscomments columnsuperusers on reviewer_id using an inner join reviews and the user_review_count column from superuserspolarity_scores()compound sentiment value for each review in a new column labelled sentimentsentiment == 0.0 (we're assuming that reviews with 'compound' scores of 0.0 are either unuseful or in a language that the model cannot parse.)### Solution - Exercise 6
def get_sentiment(reviews: pd.DataFrame, available_listings: list, superusers: pd.DataFrame) -> pd.DataFrame:
sia = SentimentIntensityAnalyzer()
### BEGIN SOLUTION
#1.
comments = reviews.loc[reviews['listing_id'].isin(available_listings)]
#2.
comments = comments.loc[~comments['comments'].isna()]
#3.
comments = comments.merge(superusers[['user_review_count', 'reviewer_id']], on = 'reviewer_id')
#4.
comments['sentiment'] = comments['comments'].apply(lambda x: sia.polarity_scores(x)['compound'])
#5.
comments = comments.loc[comments['sentiment']!=0]
return comments
### END SOLUTION
### Demo function call
demo_avail = [44542527, 42572989]
demo_sentiment = get_sentiment(reviews, demo_avail, superusers)
demo_sentiment
The demo should display this output.
| listing_id | id | date | reviewer_id | reviewer_name | comments | user_review_count | sentiment | |
|---|---|---|---|---|---|---|---|---|
| 0 | 42572989 | 621367488 | 2020-03-30 | 276894729 | Sara | You will be more then happy with this place! Feels like home! | 2 | 0.7942 |
| 1 | 44542527 | 881251541600796134 | 2023-04-30 | 92951767 | Steve | Scott’s apartment is in the perfect location in Silverlake. Very close to all the local hotspots on Sunset Blvd, and uniquely sits very close to sunset Junction, meaning it’s also walkable to all the best in Los Feliz and Virgil Village, too. We enjoyed a different neighborhood walk every morning we were here. the views are incredible and reach to the hills of Hollywood (you can see the iconic sign right from the living room), Griffith Park, and enjoy sunsets west over the Hollywood hills and beyond. The neighborhood is private and quaint, and the apartment is tucked far from the street making it quiet, private, and very peaceful. The neighbors in the building are wonderful and very friendly. And the walls are thick - we didn’t hear a thing from any other apartment the entire 3 months we lived here. the apartment is modern and clean. Highly recommend. | 2 | 0.9844 |
| 2 | 44542527 | 816836676950366108 | 2023-01-31 | 11611847 | Debbie | Had a wonderful stay at Scott’s home. His place is large, with good light, and a great location, conveniently located next to restaurants, coffee shops, and bars. Our host Joe was responsive, friendly, and accommodating. | 1 | 0.9468 |
| 3 | 44542527 | 458673161248982460 | 2021-09-24 | 676224 | Alexis | Great location, clean and modern, very responsive host. Centrally located to great shops and restaurants, a great base for exploring the East Side and LA | 2 | 0.9571 |
| 4 | 44542527 | 745649760 | 2021-04-04 | 15466686 | Sofia | Scott’s place is sparkling clean with awesome mid century style throughout the apartment. The location is amazing (right near Erewhon which is the heaven of super markets). The apartment has large sunny windows with an amazing view, we loved the hummingbirds that visited our window every morning. Everything is brand new and perfect for stay at home working with all that you need. The spa like shower also so lovely after a long day. Scott is a very responsive and accommodating host, Thank you, Scott! | 1 | 0.9932 |
The cell below will test your solution for get_sentiment (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 6
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(get_sentiment),
ex_name='get_sentiment',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to get_sentiment did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(get_sentiment),
ex_name='get_sentiment',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=50,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to get_sentiment did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_rank_top_listings_TRUE = utils.load_object_from_publicdata('demo_result_rank_top_listings_TRUE')
rank_top_listings
Your task: define rank_top_listings as follows:
Now that the relevant comments have been associated with sentiment scores, we can use them to find our best options. The simplest thing to do would be to sort by sentiment and call it day, but some reviews are more informative than others. For example, recent reviews are usually much more insightful and pertinent to an upcoming stay, and we may be more inclined to trust frequent reviewers or superusers (this is why Google has the Local Guide feature, for example). Let's implement a weighted average based on this information, so that we can find our best options.
As a reminder, a weighted average is computed as follows:
$$\bar{x} = \frac{\sum_{i=1}^{n}{w_ix_i}}{\sum_{i=1}^{n}{w_i}}$$which expands to
$$\bar{x} = \frac{w_1x_1+w_2x_2+... + w_nx_n}{w_1+w_2+...+w_n}$$Where $w_i$ is the weight assigned to point $x_i$, and $n$ is the total number of points.
Input:
comments - The result of Exercise 6; A DataFrame consisting of reviews with sentiment analysis added, merged with superuser informationReturn: top_listings - A Pandas dataframe consisting of the top listing_ids and their weighted scores
Requirements:
weights column:date $\ge$:user_review_count greater than 1 (e.g. 0.01*(user_review_count-1))user_review_count is 2, the weight assigned to that review should be .81
-Hint: You may find strptime() useful. listing_id and aggregate sentiments using the weighted average forumula above. Use the computed column weights you computed as $w_i$ and sentiment as $x_1$.
Store the resulting column as weighted_avg.weighted_avg in descending order and listing_id in ascending order### Solution - Exercise 7
def rank_top_listings(comments: pd.DataFrame):
### BEGIN SOLUTION
comments_copy = comments.copy()
#1.
def apply_weights(x):
if x['date'] >= datetime.strptime('2024-01-01', '%Y-%m-%d'):
weight = 1
elif datetime.strptime('2024-01-01', '%Y-%m-%d') > x['date'] >= datetime.strptime('2021-01-01', '%Y-%m-%d'):
weight = .8
elif datetime.strptime('2021-01-01', '%Y-%m-%d') > x['date'] >= datetime.strptime('2018-01-01', '%Y-%m-%d'):
weight = .5
else:
weight = .1
##rate superuser reviews slightly more heavily - add .01 weight for every review > 1
weight += .01*(x['user_review_count']-1)
return weight
comments_copy['weights'] = comments_copy[['date','user_review_count']].apply(apply_weights, axis = 1)
#2.
def compute_avg(x):
return (x['weights']*x['sentiment']).sum()/x['weights'].sum()
weighted = comments_copy.groupby('listing_id', as_index = False).apply(compute_avg).rename(columns = {None:'weighted_avg'})
3.
top_listings = weighted.sort_values(by = ['weighted_avg', 'listing_id'], ascending = [False,True], ignore_index = True)
return top_listings
### END SOLUTION
### Demo function call
demo_comments = utils.load_object_from_publicdata('rank_top_listings_demo.dill')
demo_top_listings = rank_top_listings(demo_comments)
demo_top_listings
The demo should display this output.
| listing_id | weighted_avg | |
|---|---|---|
| 0 | 899909 | 0.836561 |
| 1 | 828540 | 0.795065 |
The cell below will test your solution for rank_top_listings (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 7
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rank_top_listings),
ex_name='rank_top_listings',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_top_listings did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(rank_top_listings),
ex_name='rank_top_listings',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=50,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to rank_top_listings did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
Now that we have found the top Airbnb listings available for our desired dates, let's enhance our search by evaluating how safe each neighborhood in LA is. We will use crime data provided by the City of Los Angeles.
Our database includes two tables: crime and population
The crime table contains a record of each crime committed, including information such as the date, lat/long, area, and crime description.
The population table only contains three columns - the population and number of square miles contained in each of the 21 areas in LA.
Run the queries below to view samples of the data.
# Sample of crime table:
sample_crime_data = pd.read_sql('select * from crime limit 3', conn)
display(sample_crime_data)
# Sample of population table:
sample_population_data = pd.read_sql('select * from population limit 3', conn)
display(sample_population_data)
### Run Me!!!
demo_result_prime_crime_time_TRUE = utils.load_object_from_publicdata('demo_result_prime_crime_time_TRUE')
prime_crime_time
Your task: define prime_crime_time_query as follows:
The crime table contains entries for each crime committed in LA. Every crime record contains a crime description crime_desc, as well as the date on which the crime occurred date_occ. Let us first determine what the top crimes committed each month are.
Input: None
Return: A SQLite query that returns the top two crimes for each month.
Requirements:
month, description, crime_count, and crime_rankdate_occ column, and create a string containing the numeric month value along with the 3 letter month abbreviation. For example, January would be 01-JAN, September would be 09-SEP, December would be 12-DEC, etc.### Solution - Exercise 8
prime_crime_time_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
prime_crime_time_query = '''
with count_by_month as (
select
CASE substr(date_occ, 1, 2)
WHEN '01' THEN '01-JAN'
WHEN '02' THEN '02-FEB'
WHEN '03' THEN '03-MAR'
WHEN '04' THEN '04-APR'
WHEN '05' THEN '05-MAY'
WHEN '06' THEN '06-JUN'
WHEN '07' THEN '07-JUL'
WHEN '08' THEN '08-AUG'
WHEN '09' THEN '09-SEP'
WHEN '10' THEN '10-OCT'
WHEN '11' THEN '11-NOV'
WHEN '12' THEN '12-DEC'
ELSE NULL
END AS month,
crime_desc description,
count(crime_desc) crime_count
from
crime
group by 1, 2
)
select * from (
select
month,
description,
crime_count,
rank() OVER (partition by month order by crime_count desc) AS crime_rank
from
count_by_month
)
where crime_rank <= 2
'''
### END SOLUTION
### Demo function call
demo_prime_crime_time_result = pd.read_sql(prime_crime_time_query,conn)
demo_prime_crime_time_result
The demo should display this output.
| month | description | crime_count | crime_rank | |
|---|---|---|---|---|
| 0 | 01-JAN | VEHICLE - STOLEN | 4097 | 1 |
| 1 | 01-JAN | BATTERY - SIMPLE ASSAULT | 2891 | 2 |
| 2 | 02-FEB | VEHICLE - STOLEN | 3666 | 1 |
| 3 | 02-FEB | THEFT OF IDENTITY | 2954 | 2 |
| 4 | 03-MAR | VEHICLE - STOLEN | 3942 | 1 |
| 5 | 03-MAR | BATTERY - SIMPLE ASSAULT | 2695 | 2 |
| 6 | 04-APR | VEHICLE - STOLEN | 3611 | 1 |
| 7 | 04-APR | BATTERY - SIMPLE ASSAULT | 2417 | 2 |
| 8 | 05-MAY | VEHICLE - STOLEN | 3847 | 1 |
| 9 | 05-MAY | BATTERY - SIMPLE ASSAULT | 1973 | 2 |
| 10 | 06-JUN | VEHICLE - STOLEN | 3795 | 1 |
| 11 | 06-JUN | BURGLARY FROM VEHICLE | 1694 | 2 |
| 12 | 07-JUL | VEHICLE - STOLEN | 4171 | 1 |
| 13 | 07-JUL | VANDALISM - FELONY (400 and OVER, ALL CHURCH VANDALISMS) | 1777 | 2 |
| 14 | 08-AUG | VEHICLE - STOLEN | 4074 | 1 |
| 15 | 08-AUG | BURGLARY FROM VEHICLE | 1806 | 2 |
| 16 | 09-SEP | VEHICLE - STOLEN | 3904 | 1 |
| 17 | 09-SEP | BURGLARY FROM VEHICLE | 1825 | 2 |
| 18 | 10-OCT | VEHICLE - STOLEN | 4109 | 1 |
| 19 | 10-OCT | BURGLARY FROM VEHICLE | 2021 | 2 |
| 20 | 11-NOV | VEHICLE - STOLEN | 3867 | 1 |
| 21 | 11-NOV | BURGLARY FROM VEHICLE | 2039 | 2 |
| 22 | 12-DEC | VEHICLE - STOLEN | 2898 | 1 |
| 23 | 12-DEC | BURGLARY FROM VEHICLE | 1918 | 2 |
The cell below will test your solution for prime_crime_time (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 8
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(prime_crime_time_query),
ex_name='prime_crime_time',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to prime_crime_time did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(prime_crime_time_query),
ex_name='prime_crime_time',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=50,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to prime_crime_time did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_crime_center_TRUE = utils.load_object_from_publicdata('demo_result_crime_center_TRUE')
crime_center
Your task: define crime_center_query as follows:
The crime table contains entries categorized into the areas served by each police department within LA. Each crime entry contains the coordinates at which the crime occurred. Let us now determine the crime center for each police department's area.
Input: None
Return: A SQLite query that returns the average coordinates for each area.
Requirements:
lat and long in the crime table contain the coordinates at which the crime occurredarea. Name these columns center_lat and center_longarea, center_lat, and center_longarea### Solution - Exercise 9
crime_center_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
crime_center_query = '''select area, avg(lat) center_lat, avg(long) center_long from crime group by area order by 1'''
### END SOLUTION
### Demo function call
demo_crime_center_result = pd.read_sql(crime_center_query,conn)
demo_crime_center_result
The demo should display this output.
| area | center_lat | center_long | |
|---|---|---|---|
| 0 | 77th Street | 33.977699 | -118.297673 |
| 1 | Central | 34.048074 | -118.250774 |
| 2 | Devonshire | 34.250923 | -118.537942 |
| 3 | Foothill | 34.245504 | -118.370130 |
| 4 | Harbor | 33.770143 | -118.285061 |
| 5 | Hollenbeck | 34.053705 | -118.203808 |
| 6 | Hollywood | 34.099204 | -118.328856 |
| 7 | Mission | 34.254761 | -118.450299 |
| 8 | N Hollywood | 34.172124 | -118.384572 |
| 9 | Newton | 34.010694 | -118.260135 |
| 10 | Northeast | 34.104364 | -118.238708 |
| 11 | Olympic | 34.060322 | -118.300204 |
| 12 | Pacific | 33.981253 | -118.419988 |
| 13 | Rampart | 34.062123 | -118.267145 |
| 14 | Southeast | 33.939051 | -118.266904 |
| 15 | Southwest | 34.020676 | -118.315023 |
| 16 | Topanga | 34.191668 | -118.601957 |
| 17 | Van Nuys | 34.176803 | -118.439639 |
| 18 | West LA | 34.051339 | -118.439908 |
| 19 | West Valley | 34.186766 | -118.519701 |
| 20 | Wilshire | 34.062164 | -118.351886 |
The cell below will test your solution for crime_center (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 9
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(crime_center_query),
ex_name='crime_center',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to crime_center did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(crime_center_query),
ex_name='crime_center',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=50,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to crime_center did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
### Run Me!!!
demo_result_relative_crime_rate_TRUE = utils.load_object_from_publicdata('demo_result_relative_crime_rate_TRUE')
relative_crime_rate
Your task: define relative_crime_rate_query as follows:
The crime table contains entries categorized into the areas served by each police department within LA. The population table contains the population and area in square miles for each of LA's police district areas.
We would now like to determine how dangerous each area is relative to the others. We can do this by computing:
crimes_per_100k: This is the floating-point number 100,000.0 multiplied by the number of crimes committed in a given area. This is then divided by the population for that area.
The formula is: crimes_per_100k = 100000.0 * number of crimes / population for each area
crimes_per_sq_mile: This is the ratio of the number of crimes committed in a given area divided by the number of square miles in that area.
The formula is: crimes_per_sq_mile = number of crimes / square miles for each area
Input: None
Return: A SQLite query that returns the relative crime rate for each area.
Requirements:
area, crimes_per_100k, and crimes_per_sq_milecrimes_per_100k in descending order, crimes_per_sq_mile in descending order, and finally by area in ascending alphabetical order### Solution - Exercise 10
relative_crime_rate_query = '''YOUR QUERY HERE'''
### BEGIN SOLUTION
relative_crime_rate_query = '''
with scores as (
select area, count(case_num) total_score from crime
group by area)
select s.area, round(s.total_score * 100000.0 / p.population) crimes_per_100k, round(s.total_score / p.sq_miles) crimes_per_sq_mile
from scores s inner join population p on s.area = p.area
order by 2 desc, 3 desc, 1
'''
### END SOLUTION
### Demo function call
demo_relative_crime_rate_result = pd.read_sql(relative_crime_rate_query,conn)
demo_relative_crime_rate_result
The demo should display this output.
| area | crimes_per_100k | crimes_per_sq_mile | |
|---|---|---|---|
| 0 | Central | 67945.0 | 6040.0 |
| 1 | Southwest | 12981.0 | 1634.0 |
| 2 | 77th Street | 11845.0 | 1742.0 |
| 3 | Newton | 11546.0 | 1924.0 |
| 4 | Southeast | 11265.0 | 1657.0 |
| 5 | Pacific | 10977.0 | 853.0 |
| 6 | Rampart | 10384.0 | 3092.0 |
| 7 | Olympic | 8763.0 | 2827.0 |
| 8 | N Hollywood | 8548.0 | 752.0 |
| 9 | Harbor | 8446.0 | 535.0 |
| 10 | West Valley | 7732.0 | 454.0 |
| 11 | Topanga | 7459.0 | 821.0 |
| 12 | Wilshire | 7178.0 | 1290.0 |
| 13 | Devonshire | 7144.0 | 324.0 |
| 14 | West LA | 7017.0 | 246.0 |
| 15 | Hollenbeck | 6362.0 | 837.0 |
| 16 | Mission | 6350.0 | 571.0 |
| 17 | Foothill | 6096.0 | 241.0 |
| 18 | Northeast | 5986.0 | 516.0 |
| 19 | Hollywood | 5867.0 | 1023.0 |
| 20 | Van Nuys | 4831.0 | 523.0 |
The cell below will test your solution for relative_crime_rate (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution. original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution. returned_output_vars - Outputs returned by your solution. true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 10
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(relative_crime_rate_query),
ex_name='relative_crime_rate',
key=b'EXzjI9Bl-HSKe3WU8FpYqOWmhdkEXDqidvCHvUb29Vs=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to relative_crime_rate did not pass the test.'
### BEGIN HIDDEN TESTS
start_time = time()
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(relative_crime_rate_query),
ex_name='relative_crime_rate',
key=b'tbHWq1vIdW4eCnmnM2RbmIeX398-hATUSjUUhdoaGes=',
n_iter=50,
hidden=True)
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to relative_crime_rate did not pass the test.'
### END HIDDEN TESTS
print('Passed! Please submit.')
If you have made it this far, congratulations! Remember to submit the exam to ensure you receive all the points you have earned!