Midterm 2
: Spring 2024
¶Version 0.0.2
All of the header information is important. Please read it..
Topics number of exercises: This problem builds on your knowledge of ['Working with relational data (SQL/Pandas)', 'Working with Numpy Arrays']
. It has 7 exercises numbered 0 to 6. There are 13 available points. However to earn 100% the threshold is 11 points. (Therefore once you hit 11 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises but you may solve them in any order. That is if you can't solve an exercise you can still move on and try the next one. Use this to your advantage as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs
load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them but we may not print them in the starter code.
Debugging you code: Right before each exercise test cell there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (careful when printing large objects you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Exercise 0 - : 1 point(s)
Exercise 1 - : 1 point(s)
Exercise 2 - : 2 point(s)
Exercise 3 - : 2 point(s)
Exercise 4 - : 2 point(s)
Exercise 5 - : 2 point(s)
Exercise 6 - : 3 point(s)
Final reminders:
import re
import pandas as pd
import numpy as np
import sqlite3
conn = sqlite3.connect('file:resource/asnlib/publicdata/traffic.db?mode=ro', uri=True)
Given a SQLite database connection and a table name determine the column names for that table. Return your result as a Python list sorted in alphabetical order.
This solution most likely requires running a dynamically generated command against the database connection, however SQLite does not allow parameters in a FROM clause. For the sake of this exercise, the security team has magically given approval for self-validation. To meet the standard get_table_cols
must verify that the table_name
parameter only contains letters, numbers, or underscores. If that is not the case a ValueError must be raised.
### Solution - Exercise 0
def get_table_cols(table_name, conn):
### BEGIN SOLUTION
if re.search('\W', table_name) is not None:
raise ValueError
return sorted(pd.read_sql(f'''SELECT * FROM {table_name} LIMIT 1''', conn).columns)
### END SOLUTION
### Demo function call
print(get_table_cols(table_name='vehicles', conn=conn))
The demo above should display the following output:
['COLLISION_ID', 'CONTRIBUTING_FACTOR_1', 'CONTRIBUTING_FACTOR_2', 'CRASH_DATE', 'CRASH_TIME', 'DRIVER_LICENSE_JURISDICTION', 'DRIVER_LICENSE_STATUS', 'DRIVER_SEX', 'POINT_OF_IMPACT', 'PRE_CRASH', 'PUBLIC_PROPERTY_DAMAGE', 'PUBLIC_PROPERTY_DAMAGE_TYPE', 'STATE_REGISTRATION', 'TRAVEL_DIRECTION', 'UNIQUE_ID', 'VEHICLE_DAMAGE', 'VEHICLE_DAMAGE_1', 'VEHICLE_DAMAGE_2', 'VEHICLE_DAMAGE_3', 'VEHICLE_ID', 'VEHICLE_MAKE', 'VEHICLE_MODEL', 'VEHICLE_OCCUPANTS', 'VEHICLE_TYPE', 'VEHICLE_YEAR']
The cell below will test your solution for get_table_cols (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. def get_table_cols_wrapper(table_name, conn):
out = None
err = False
try:
out = get_table_cols(table_name, conn)
except ValueError:
err = True
finally:
return err, out
### Test Cell - Exercise 0
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['get_table_cols']['config']
ex_conf['func'] = get_table_cols_wrapper
tester = Tester(ex_conf, key=b'fnyLpMOE1kzPU139jnXiRzK8TQaOmdpujOXBA7ere7M=', path='resource/asnlib/publicdata/')
for _ in range(100):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'RLFek7Ust8SoBuK86FBG2y2iMsquCfU-Jc3fPrcwTZw=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(1):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Given a list (lol
) containing sub-lists of table column names, identify all column names which are present in at least num
sublists.
Return your result as a Python set.
### Solution - Exercise 1
def intersection_of_cols(lol: list, num=None) -> set:
assert len(lol) > 1, '`lol` must have at least 2 items'
if num is None:
num = len(lol)
### BEGIN SOLUTION
from collections import Counter
c = Counter([li for l in lol for li in l])
return {col for col, count in c.items() if count >= num}
### END SOLUTION
### Demo function call
for num in [2,3]:
demo_intersection = intersection_of_cols(
lol=[['ringo', 'phyllis', 'angela', 'paul'],
['kevin', 'paul', 'oscar', 'kelly', 'phyllis'],
['phyllis', 'oscar', 'ryan', 'john', 'toby']],
num=num)
print(f'num={num}; intersection={demo_intersection}')
The demo above should display the following output:
num=2; intersection={'oscar', 'paul', 'phyllis'}
num=3; intersection={'phyllis'}
The cell below will test your solution for intersection_of_cols (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['intersection_of_cols']['config']
ex_conf['func'] = intersection_of_cols
tester = Tester(ex_conf, key=b'raZA_h68J-9hTejHs20uHCC1YESrg7_SN5rqzwaaOHk=', path='resource/asnlib/publicdata/')
for _ in range(100):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'jMB3qDF8LreLKeFXC41uV57PB8Ak0DhiVjw-ceCVdvM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(1):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Given a SQLite database connection which contains the PERSON table provide a summary with these columns:
Return your result as a Pandas DataFrame.
### Solution - Exercise 2
def summarize_person(conn):
### BEGIN SOLUTION
return pd.read_sql('''
SELECT
SUBSTR(CRASH_DATE, 7) YEAR,
SUM(PERSON_INJURY='Injured') INJURY_COUNT,
SUM(PERSON_INJURY='Killed') DEATHS_COUNT,
SUM(PERSON_INJURY='Unspecified') UNSPECIFIED_COUNT
FROM
PERSON
GROUP BY
1
''', conn)
### END SOLUTION
### Demo function call
demo_person_summary_df = summarize_person(conn)
display(demo_person_summary_df)
The demo above should display the following output:
YEAR | INJURY_COUNT | DEATHS_COUNT | UNSPECIFIED_COUNT | |
---|---|---|---|---|
0 | 2012 | 27447 | 137 | 87 |
1 | 2013 | 55127 | 297 | 191 |
2 | 2014 | 51212 | 262 | 379 |
3 | 2015 | 51357 | 243 | 571 |
4 | 2016 | 60076 | 239 | 740038 |
5 | 2017 | 60655 | 261 | 900859 |
6 | 2018 | 61918 | 231 | 884054 |
7 | 2019 | 61388 | 244 | 792637 |
8 | 2020 | 44614 | 269 | 368310 |
9 | 2021 | 51782 | 296 | 333997 |
10 | 2022 | 51931 | 289 | 310335 |
11 | 2023 | 54230 | 273 | 284491 |
12 | 2024 | 8037 | 41 | 41617 |
The cell below will test your solution for summarize_person (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 2
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['summarize_person']['config']
ex_conf['func'] = summarize_person
tester = Tester(ex_conf, key=b'Dlr-TXAq3EljYXsnF0vtXoIhJ6PhRDCqW9-8sdTW5bo=', path='resource/asnlib/publicdata/')
for _ in range(5):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'dzuPNunpJsE7peUKg1xT8ASJaxOuTd3OaT2JNFQHQuM=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(1):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Given a SQLite database connection which contains the CRASHES table provide a summary with these columns:
Return your result as a Pandas DataFrame
### Solution - Exercise 3
def summarize_crashes(conn):
### BEGIN SOLUTION
return pd.read_sql('''
select
substr("CRASH DATE",7) YEAR,
COUNT(*) CRASH_COUNT,
SUM("NUMBER OF PERSONS KILLED") DEATHS,
SUM("NUMBER OF PEDESTRIANS KILLED") PEDESTRIAN_DEATHS,
SUM("NUMBER OF PEDESTRIANS KILLED")/SUM("NUMBER OF PERSONS KILLED") PEDESTRIAN_DEATH_SHARE,
SUM("NUMBER OF PERSONS KILLED") / COUNT(*) DEATHS_PER_CRASH
FROM CRASHES
GROUP BY 1
ORDER BY 1
''', conn)
### END SOLUTION
### Demo function call
demo_crashes_summary = summarize_crashes(conn)
display(demo_crashes_summary)
The demo above should display the following output:
YEAR | CRASH_COUNT | DEATHS | PEDESTRIAN_DEATHS | PEDESTRIAN_DEATH_SHARE | DEATHS_PER_CRASH | |
---|---|---|---|---|---|---|
0 | 2012 | 100545 | 137 | 72 | 0.525547 | 0.00136257 |
1 | 2013 | 203742 | 297 | 176 | 0.592593 | 0.00145773 |
2 | 2014 | 206033 | 262 | 133 | 0.507634 | 0.00127164 |
3 | 2015 | 217694 | 243 | 133 | 0.547325 | 0.00111625 |
4 | 2016 | 229831 | 246 | 149 | 0.605691 | 0.00107035 |
5 | 2017 | 231007 | 256 | 127 | 0.496094 | 0.00110819 |
6 | 2018 | 231564 | 231 | 123 | 0.532468 | 0.000997564 |
7 | 2019 | 211486 | 244 | 131 | 0.536885 | 0.00115374 |
8 | 2020 | 112915 | 269 | 101 | 0.375465 | 0.00238232 |
9 | 2021 | 110549 | 296 | 131 | 0.442568 | 0.00267755 |
10 | 2022 | 103881 | 289 | 134 | 0.463668 | 0.00278203 |
11 | 2023 | 96563 | 273 | 104 | 0.380952 | 0.00282717 |
12 | 2024 | 14259 | 41 | 17 | 0.414634 | 0.00287538 |
The cell below will test your solution for summarize_crashes (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['summarize_crashes']['config']
ex_conf['func'] = summarize_crashes
tester = Tester(ex_conf, key=b'y1po0pBiQV1sEYfgf-WpnO58_KnZX_Pob5d_nj8Zsz0=', path='resource/asnlib/publicdata/')
for _ in range(100):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'FcFiSYXsdMMsRBVlJhjx1zqVB8P_SRw5IcH4SUQ6Nes=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(1):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Given a SQLite database connection which contains the VEHICLES and CRASHES tables provide a summary with these columns:
Additionally,
Return your result as a Pandas DataFrame
### Solution - Exercise 4
def summarize_vehicles(conn):
### BEGIN SOLUTION
return pd.read_sql('''
SELECT
UPPER(VEHICLES.VEHICLE_TYPE) VEHICLE_TYPE,
COUNT(*) COUNT,
SUM(UPPER(STATE_REGISTRATION)<>'NY') OUT_OF_STATE,
SUM(CRASHES."NUMBER OF PERSONS KILLED") DEATHS
FROM
VEHICLES
INNER JOIN
CRASHES
ON
VEHICLES.COLLISION_ID = CRASHES.COLLISION_ID
GROUP BY
1
ORDER BY
2 DESC, 1 ASC
LIMIT
30
''',conn)
### END SOLUTION
### Demo function call
demo_vehicle_summary = summarize_vehicles(conn)
display(demo_vehicle_summary)
The demo above should display the following output:
VEHICLE_TYPE | COUNT | OUT_OF_STATE | DEATHS | |
---|---|---|---|---|
0 | SEDAN | 1035884 | 191291 | 1131 |
1 | STATION WAGON/SPORT UTILITY VEHICLE | 825981 | 126382 | 1115 |
2 | PASSENGER VEHICLE | 770753 | 110002 | 623 |
3 | SPORT UTILITY / STATION WAGON | 337927 | 38312 | 383 |
4 | None | 233393 | 10884 | 253 |
5 | TAXI | 152218 | 2326 | 97 |
6 | UNKNOWN | 105571 | 12853 | 79 |
7 | PICK-UP TRUCK | 91798 | 24261 | 118 |
8 | 4 DR SEDAN | 73573 | 11448 | 58 |
9 | VAN | 68465 | 13626 | 72 |
10 | BUS | 67173 | 6527 | 147 |
11 | BOX TRUCK | 53364 | 23557 | 81 |
12 | BIKE | 46765 | 619 | 173 |
13 | OTHER | 45966 | 12336 | 40 |
14 | LARGE COM VEH(6 OR MORE TIRES) | 28604 | 12494 | 66 |
15 | SMALL COM VEH(4 TIRES) | 26683 | 7717 | 12 |
16 | MOTORCYCLE | 23321 | 2394 | 401 |
17 | TRACTOR TRUCK DIESEL | 20887 | 15208 | 77 |
18 | LIVERY VEHICLE | 19441 | 359 | 10 |
19 | BICYCLE | 19333 | 922 | 59 |
20 | AMBULANCE | 10991 | 141 | 7 |
21 | E-BIKE | 9055 | 198 | 73 |
22 | DUMP | 8122 | 1288 | 47 |
23 | CONVERTIBLE | 6470 | 1171 | 9 |
24 | E-SCOOTER | 5779 | 126 | 25 |
25 | FLAT BED | 5099 | 1728 | 26 |
26 | MOPED | 4842 | 208 | 17 |
27 | PK | 4838 | 1414 | 11 |
28 | 2 DR SEDAN | 4833 | 940 | 9 |
29 | GARBAGE OR REFUSE | 4494 | 689 | 21 |
The cell below will test your solution for summarize_vehicles (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['summarize_vehicles']['config']
ex_conf['func'] = summarize_vehicles
tester = Tester(ex_conf, key=b'TGmAkxJCRpTcpTrGvsTDQUEK0II1XnNSOlzAHhB1Nmg=', path='resource/asnlib/publicdata/')
for _ in range(5):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'o7I92u1vWLA2Fet5G2I2Bj_Uj4lknZbbKz-aCOt02Qk=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(1):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Given a SQL query, a connection, and a list of coordinate parameters your solution should:
### Solution - Exercise 5
def geo_filter_crashes(qry, conn, params):
### BEGIN SOLUTION
df = pd.read_sql(qry, conn, params=params)
df['CRASH_DATE'] = pd.to_datetime(df['CRASH_DATE'])
return df
### END SOLUTION
### Demo function call
demo_qry = '''
select
"CRASH DATE" AS CRASH_DATE,
LATITUDE,
LONGITUDE,
"NUMBER OF PERSONS KILLED" AS PERSONS_KILLED
from
crashes
where
latitude is not null and latitude between ? and ?
and
longitude is not null and longitude between ? and ?
and
PERSONS_KILLED is not null and PERSONS_KILLED > 0
'''
demo_df = geo_filter_crashes(demo_qry, conn, [40.5, 40.95, -74.1, -73.65])
display(demo_df.head())
The demo above should display the following output:
CRASH_DATE | LATITUDE | LONGITUDE | PERSONS_KILLED | |
---|---|---|---|---|
0 | 2021-07-09 00:00:00 | 40.7205 | -73.8889 | 1 |
1 | 2021-12-12 00:00:00 | 40.8404 | -73.9181 | 1 |
2 | 2021-04-15 00:00:00 | 40.6205 | -74.0293 | 1 |
3 | 2021-04-17 00:00:00 | 40.7825 | -73.9788 | 1 |
4 | 2021-07-08 00:00:00 | 40.7215 | -73.9838 | 1 |
The cell below will test your solution for geo_filter_crashes (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 5
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['geo_filter_crashes']['config']
ex_conf['func'] = geo_filter_crashes
tester = Tester(ex_conf, key=b'o0dQd3ONb1-wrgcZ3gkzdjZEVx2I_lGP6Zjj8268qlY=', path='resource/asnlib/publicdata/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'KYvrXaqhmbN_JFJTRoqDXkaGuLIcqf5wlqdI_530pk4=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(1):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We used a solution for the prior exercise to query the CRASHES table to get the latitude, longitude, and number of fatalities for all crashes in our dataset. Then we fed that result into a KDE model. KDE is a non-parametric technique for estimating the PDF of a distribution given some samples drawn from it. The high level is that for every observation we place a Gaussian distribution centered at the coordinates. Once all the distributions are placed, the sum is proportional to the PDF estimate. For our use case the model estimates the density of fatal crashes over a geographic area - i.e. given that there was a fatal crash higher estimates for the PDF indicate a higher probability of the crash having occurred at those coordinates.
crash_density = stats.gaussian_kde(df[['LONGITUDE', 'LATITUDE']].values.T,
0.02,
weights=df['PERSONS_KILLED'].values)
The model outputs a function which takes a Numpy array as an input and returns a Numpy array of density estimates corresponding to the inputs. The input's shape is (# dimensions, # data points) and the output's shape in (# data points, ).
Given a bounding box of coordinates, a grid size, and an estimator.
x
) of grid_size
evenly spaced numbers between and including the min and max x coordinates.np.linspace
function is useful here.y
) for the y coordinates.x
and one from y
use the estimator
to compute the PDF for the pair.np.meshgrid
, np.vstack
, and np.flatten
to compute a suitable input for the estimator
.estimator
expects an array with shape (2, # data points). The first row is the x
values and the second row is the y
values.(grid_size, grid_size)
.y
value. x
value.y
and x
values increase from the origin at the top-left. Note: The bounds are given as a list [xmin, xmax, ymin, ymax]
.
### Solution - Exercise 6
def density_grid(bounds, grid_size, estimator):
### BEGIN SOLUTION
x = np.linspace(*bounds[:2], grid_size)
y = np.linspace(*bounds[2:], grid_size)
xx, yy = np.meshgrid(x, y)
positions = np.vstack((xx.flatten(), yy.flatten()))
Z = estimator(positions).reshape((grid_size, grid_size))
return Z
### END SOLUTION
### Demo function call
demo_bounds = [0, 9, 0, 90]
demo_grid_size = 10
# estimator adds the x and y values
demo_estimator = lambda a: a[0] + a[1]
demo_grid = density_grid(demo_bounds, demo_grid_size, demo_estimator)
print(demo_grid)
The demo above should display the following output:
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
[20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
[30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]
[40. 41. 42. 43. 44. 45. 46. 47. 48. 49.]
[50. 51. 52. 53. 54. 55. 56. 57. 58. 59.]
[60. 61. 62. 63. 64. 65. 66. 67. 68. 69.]
[70. 71. 72. 73. 74. 75. 76. 77. 78. 79.]
[80. 81. 82. 83. 84. 85. 86. 87. 88. 89.]
[90. 91. 92. 93. 94. 95. 96. 97. 98. 99.]]
The cell below will test your solution for density_grid (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars
- Input variables for your solution. original_input_vars
- Copy of input variables from prior to running your solution. Any key:value
pair in original_input_vars
should also exist in input_vars
- otherwise the inputs were modified by your solution. returned_output_vars
- Outputs returned by your solution. true_output_vars
- The expected output. This should "match" returned_output_vars
based on the question requirements - otherwise, your solution is not returning the correct output. ### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['density_grid']['config']
ex_conf['func'] = density_grid
tester = Tester(ex_conf, key=b'9Ca2cAl9o2HgD7GW4J_JZ6VJUOWHVyxxcPbfrflws2g=', path='resource/asnlib/publicdata/')
for _ in range(20):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'Mgy5D2Kk_dTBhv8-_3oFhTc0jjUEbm2B-5bLq50Aedk=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(1):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
If you have reached this point, congratulations! Don't forget to submit your work!
The remaining content is informational only.
We have filled in the gaps with the parameter values and used some reference map data to render the density of traffic collision fatalities over a road map of NYC. It takes a bit to render, so we're just providing the code for reference and showing the rendered image.
import geopandas
from matplotlib import pyplot as plt
from scipy import stats
###
### This query identifies the date and coordinates of all traffic collision fatalities available.
###
qry = '''
select
"CRASH DATE" AS CRASH_DATE,
LATITUDE,
LONGITUDE,
"NUMBER OF PERSONS KILLED" AS PERSONS_KILLED
from
crashes
where
latitude is not null and latitude between ? and ?
and
longitude is not null and longitude between ? and ?
and
PERSONS_KILLED is not null and PERSONS_KILLED > 0
'''
###
### We feed the query into our geo_filter_crashes along with a bounding box for NYC and use
### the result to define a KDE model.
###
df = geo_filter_crashes(qry, conn, [40.5, 40.95, -74.1, -73.65])
crash_density = stats.gaussian_kde(df[['LONGITUDE', 'LATITUDE']].values.T,
0.02,
weights=df['PERSONS_KILLED'].values)
###
### Set a grid size and bounding box then build our density grid
###
grid_size = 800
bounds = [df['LONGITUDE'].min(),
df['LONGITUDE'].max(),
df['LATITUDE'].min(),
df['LATITUDE'].max()]
Z = density_grid(bounds, grid_size, crash_density)
###
### Set up some stuff with the plotting library, read the roadmap from a file and plot it.
###
plt.rcParams['figure.figsize'] = [40, 40]
gdf=geopandas.read_file('resource/asnlib/publicdata/svwp-sbcd.geojson')
gdf.plot(ax=ax, zorder=0)
###
### Plot the density over the map.
###
ax.imshow(Z, extent=bounds, cmap='plasma', alpha=.7*(Z > 20).astype(float), origin='lower')
plt.show()