Spring 2026 FX: Professor Vuduc Forages for Mushrooms¶Version 1.0.3
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of filtering and analysis in Pandas and SQL, math-to-code translation, and clustering. It has 11 exercises, numbered 0 to 10, worth 20 total points. However, the threshold to earn 100% is 15 points. (Therefore, once you hit 15 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed. (Be careful when printing large objects; you may want to print the head or a few rows at a time.)
Exercise point breakdown:
Exercise 0: 1 point
Exercise 1: 2 points
Exercise 2: 0 points
Exercise 3: 3 points
Exercise 4: 1 point
Exercise 5: 2 points
Exercise 6: 1 point
Exercise 7: 3 points
Exercise 8: 2 points
Exercise 9: 2 points
Exercise 10: 3 points
Final reminders:
Background: iNaturalist is an online community that helps users identify plants and animals. Users upload photos of an organism along with their location coordinates, and the app uses machine learning to suggest some possible species matches. After reviewing the information, the user selects the best match. Other users on the platform can weigh in, which increases the confidence of the identification. For casual foragers, this can help determine whether a plant is safe to eat. For scientists, high-quality observations with a strong species consensus can become 'research-grade' and be used to aid population genetics efforts.
We're interested in the first use-case, specifically focused around mushrooms. Mushroom foraging in particular can be dangerous, as several prized edible varieties look similar to poisonous ones.
Your overall task: Your goal is to use iNaturalist data to find some safe areas in the continental United States to search for edible mushrooms. You will do this in two main steps:
First, you will clean, filter, and classify iNaturalist observations using data from Wikipedia.
Second, you will perform K-Means clustering to find population centers and determine the safety of foraging locations.
The datasets: There are three sources of data which you will use to solve the following exercises.
Information about mushroom observations in the continental United States, sourced from iNaturalist, represented as a SQLite database. (Note: This tool does require a login.)
A Data Frame containing information about poisonous mushrooms, scraped from the Wikipedia pages on poisonous and deadly mushroom species.
A Data Frame containing information about edible mushrooms, scraped from Wikipedia.
SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
import re
import pandas as pd
import numpy as np
from datetime import datetime, timedelta, date
import sqlite3
from pprint import pprint
utils.add_from_file('compute_cartesian', utils)
utils.add_from_file('compute_euclid', utils)
### Run Me!!!
poisonous_df = utils.load_object_from_publicdata('poisonous_df')
edible_df = utils.load_object_from_publicdata('edible_df')
explore_data__FREE
Example: we have defined explore_data__FREE as follows:
This is a free exercise!
Please run the test cell below to collect your FREE point!
We encourage you to review the samples of observations, poisonous_df and edible_df in the cell below:
- observations: A SQLite database containing species, location, and date information for mushroom observations from iNaturalist.
- poisonous_df: A DataFrame containing information about known poisonous mushrooms, from Wikipedia.
- edible_df: A DataFrame containing information about known edible mushrooms, from Wikipedia.
### Solution - Exercise 0
def explore_data__FREE(conn: sqlite3.Connection, poisonous_df: pd.DataFrame, edible_df: pd.DataFrame, headsize:int=10) -> tuple:
    observations_preview = pd.read_sql('select * from observations limit ?', conn, params=[headsize])
    poisonous_preview = poisonous_df.head(n=headsize)
    edible_preview = edible_df.head(n=headsize)
    return (observations_preview, poisonous_preview, edible_preview)
### Demo function call
with sqlite3.connect('resource/asnlib/publicdata/observations.sqlite_db') as conn:
    (
        observations_preview,
        poisonous_preview,
        edible_preview
    ) = explore_data__FREE(conn, poisonous_df, edible_df)
display(observations_preview.head(5))
display(poisonous_preview.head(5))
display(edible_preview.head(5))
The test cell below will always pass. Please submit to collect your free points for explore_data__FREE (exercise 0).
### Test Cell - Exercise 0
print('Passed! Please submit.')
### Run Me!!!
demo_result_cleanse_observations_TRUE = utils.load_object_from_publicdata('demo_result_cleanse_observations_TRUE')
cleanse_observations
Your task: define cleanse_observations_query as follows:
Input: None
Return: A SQLite3 query (Python string)
Requirements:
- Return these columns: [gbifID, class, order, family, genus, species, state, latitude, longitude, coordinate_uncertainty, day, month, year].
- Only keep rows where coordinate_uncertainty is less than or equal to 25.
- Remove rows where stateProvince is 'Hawaii' or 'Alaska'.
- Remove rows with null values in any of [class, order, family, genus, species, state].
- Only keep species with a minimum sample size (since later analysis requires one): species that appear in observations 25 or more times.
- Rename these columns: stateProvince to state, coordinateUncertaintyInMeters to coordinate_uncertainty, decimalLatitude to latitude, decimalLongitude to longitude.
- Sort by gbifID in ascending order.

Hint: Remember that certain words are reserved in SQL. This may be helpful when determining how to reference a column name that is also a reserved SQL keyword.
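To illustrate the reserved-word hint above, here is a minimal sketch using a throwaway in-memory table (the table and values are hypothetical, not part of the assignment data). SQLite accepts backticks, double quotes, or square brackets for quoting an identifier that collides with a keyword such as `order`:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# `order` is a reserved SQL keyword, so the column name must be quoted.
conn.execute('create table demo (`order` text, family text)')
conn.execute('insert into demo values (?, ?)', ('Agaricales', 'Agaricaceae'))
# Backticks (MySQL-style), double quotes (standard SQL), and square
# brackets (MS-style) all work as identifier quotes in SQLite.
rows = conn.execute('select "order", [family] from demo').fetchall()
print(rows)  # -> [('Agaricales', 'Agaricaceae')]
```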
### Solution - Exercise 1
cleanse_observations_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
cleanse_observations_query = '''
with filtered_obs as (
    select gbifID, class, `order`, family, genus, species,
           stateProvince as state,
           decimalLatitude as latitude,
           decimalLongitude as longitude,
           coordinateUncertaintyInMeters as coordinate_uncertainty,
           day, month, year
    from observations
    where coordinateUncertaintyInMeters <= 25
      and stateProvince <> 'Hawaii'
      and stateProvince <> 'Alaska'
      and stateProvince is not null
      and class is not null
      and `order` is not null
      and family is not null
      and genus is not null
      and species is not null
)
select b.*
from filtered_obs as b
inner join (
    select species
    from filtered_obs
    group by species
    having count(*) >= 25
) as a on a.species = b.species
order by gbifID
'''
### END SOLUTION
### Demo function call
with sqlite3.connect('resource/asnlib/publicdata/observations_demo.sqlite_db') as conn:
    demo_cleanse_observations_result = pd.read_sql(cleanse_observations_query, conn)
demo_cleanse_observations_result
The demo should display this output.
| gbifID | class | order | family | genus | species | state | latitude | longitude | coordinate_uncertainty | day | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4015220986 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 40.042309 | -124.071195 | 15.0 | 2 | 1 | 2023 |
| 1 | 4021956935 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 40.042178 | -124.067756 | 15.0 | 20 | 1 | 2023 |
| 2 | 4022214055 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.427368 | -123.086479 | 4.0 | 19 | 1 | 2023 |
| 3 | 4022261682 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.271950 | -122.293687 | 4.0 | 19 | 1 | 2023 |
| 4 | 4029067610 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 36.599720 | -121.922417 | 5.0 | 25 | 1 | 2023 |
| 5 | 4034748083 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.807517 | -123.042917 | 4.0 | 11 | 2 | 2023 |
| 6 | 4054968728 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.031575 | -122.062600 | 10.0 | 1 | 3 | 2023 |
| 7 | 4063063690 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.506000 | -122.499947 | 4.0 | 7 | 3 | 2023 |
| 8 | 4075608178 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.536480 | -123.006722 | 4.0 | 20 | 3 | 2023 |
| 9 | 4076362192 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.755494 | -122.127525 | 4.0 | 12 | 3 | 2023 |
| 10 | 4096405558 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.681580 | -122.415697 | 19.0 | 27 | 1 | 2023 |
| 11 | 4111684497 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 45.072656 | -123.042174 | 9.0 | 23 | 4 | 2023 |
| 12 | 4121159705 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.751163 | -122.100044 | 4.0 | 20 | 5 | 2023 |
| 13 | 4121274046 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.807613 | -122.167092 | 5.0 | 14 | 5 | 2023 |
| 14 | 4420752950 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | New York | 40.891514 | -73.896952 | 8.0 | 24 | 9 | 2023 |
| 15 | 4453906562 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 40.030081 | -124.061796 | 8.0 | 15 | 11 | 2023 |
| 16 | 4458533792 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.416645 | -122.954750 | 24.0 | 7 | 3 | 2023 |
| 17 | 4500840555 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.965780 | -122.537045 | 4.0 | 15 | 12 | 2023 |
| 18 | 4507625044 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 45.471585 | -122.774923 | 10.0 | 28 | 12 | 2023 |
| 19 | 4507936641 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 44.991702 | -122.790710 | 6.0 | 30 | 12 | 2023 |
| 20 | 4508007722 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.892892 | -123.621430 | 4.0 | 30 | 12 | 2023 |
| 21 | 4516320520 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 45.660188 | -122.841837 | 4.0 | 11 | 12 | 2023 |
| 22 | 4606881312 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Georgia | 33.574117 | -84.068314 | 4.0 | 27 | 5 | 2023 |
| 23 | 4855265811 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Florida | 29.623963 | -82.320450 | 6.0 | 17 | 8 | 2023 |
| 24 | 4863374167 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Washington | 47.522389 | -122.135715 | 15.0 | 10 | 12 | 2023 |
The cell below will test your solution for cleanse_observations (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(cleanse_observations_query),
ex_name='cleanse_observations',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_observations did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_is_it_poisonous_TRUE = utils.load_object_from_publicdata('demo_result_is_it_poisonous_TRUE')
is_it_poisonous
Example: we have defined is_it_poisonous as follows:
THIS IS AN EXAMPLE
There is nothing to solve here!
Input:
- mushroom_df: A DataFrame with information about mushrooms. It has a species column, which contains the scientific name of the mushroom species.
- poisonous_df: A DataFrame with information about poisonous mushrooms. It has a Scientific name column, which contains the scientific name of the poisonous mushroom species.
- edible_df: A DataFrame with information about edible mushrooms. It has a Scientific name column, which contains the scientific name of the edible mushroom species.

Return: mushroom_poison_df: A copy of the DataFrame with two new columns:
- poisonous: An integer flag, 1 if the value of species is found in poisonous_df and 0 if it is not.
- edible: An integer flag, 1 if the value of species is found in edible_df and 0 if it is not.
### Solution - Exercise 2
def is_it_poisonous(mushroom_df: pd.DataFrame, poisonous_df: pd.DataFrame, edible_df: pd.DataFrame) -> pd.DataFrame:
    mushroom_poison_df = mushroom_df.copy()
    poison_mushroom_list = list(poisonous_df['Scientific name'].unique())
    edibles = list(edible_df['Scientific name'].unique())
    mushroom_poison_df['poisonous'] = mushroom_poison_df['species'].isin(poison_mushroom_list).astype(int)
    mushroom_poison_df['edible'] = mushroom_poison_df['species'].isin(edibles).astype(int)
    return mushroom_poison_df
### Demo function call
poison_demo = utils.load_object_from_publicdata('is_it_poisonous_demo_input.dill')
demo_result_poison_df = is_it_poisonous(poison_demo, poisonous_df, edible_df)
demo_result_poison_df
The demo should display this output.
| gbifID | class | order | family | genus | species | state | latitude | longitude | coordinate_uncertainty | day | month | year | poisonous | edible | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4011490202 | Agaricomycetes | Phallales | Phallaceae | Phallus | Phallus impudicus | California | 40.87 | -124.16 | 4.00 | 1 | 1 | 2023 | 0 | 0 |
| 1 | 4011490221 | Tremellomycetes | Tremellales | Naemateliaceae | Naematelia | Naematelia aurantia | California | 35.13 | -120.59 | 3.00 | 1 | 1 | 2023 | 0 | 0 |
| 2 | 4011547279 | Agaricomycetes | Agaricales | Omphalotaceae | Gymnopus | Gymnopus dryophilus | California | 34.06 | -118.50 | 16.00 | 1 | 1 | 2023 | 0 | 0 |
| 3 | 4011588270 | Agaricomycetes | Agaricales | Mycenaceae | Panellus | Panellus stipticus | Ohio | 39.43 | -82.55 | 4.00 | 1 | 1 | 2023 | 0 | 0 |
The test cell below will always pass. Please submit to collect your free points for is_it_poisonous (exercise 2).
### Test Cell - Exercise 2
print('Passed! Please submit.')
### Run Me!!!
demo_result_DEBUG_find_similar_TRUE = utils.load_object_from_publicdata('demo_result_DEBUG_find_similar_TRUE')
DEBUG_find_similar
Your task: define DEBUG_find_similar as follows:
Create a Python set which contains the scientific names of all edible mushrooms that have poisonous lookalikes.
Input:
- poisonous_df: A dataframe containing information about poisonous mushrooms. It is expected to have this column:
  - Similar edible species: A string containing the scientific names of edible mushrooms that look similar to the poisonous mushroom.
- edible_df: A dataframe containing information about known edible mushrooms. It is expected to have this column:
  - Scientific name: The scientific name of the edible mushroom.

Return: updated_sim: A set of edible mushrooms that have poisonous lookalikes.
Requirements:
- Extract the scientific names from the Similar edible species column of poisonous_df.
- Include the extracted scientific names that do not contain "species" or "spp".
- Include the names from edible_df that have the same genus as any of the extracted scientific names that do contain "species" or "spp".
### Solution - Exercise 3
def DEBUG_find_similar(poisonous_df: pd.DataFrame, edible_df: pd.DataFrame) -> set:
    import re
    import itertools
    edibles = list(edible_df['Scientific name'])
    s = poisonous_df['Similar edible species'].apply(lambda x: re.findall(r'^[a-z]+\s[a-z]+$', str(x)))
    similar_edibles = list(itertools.chain(s.values()))
    display(similar_edibles)
    updated_sim = {}
    for i in edibles:
        if 'species' in i:
            genus = i.split()[1]
            updated_sim.add([name for name in edibles if 'genus' in name])
        else:
            updated_sim.append(s)
    return updated_sim
## correct solution
def DEBUG_find_similar(poisonous_df: pd.DataFrame, edible_df: pd.DataFrame) -> set:
    import re
    import itertools
    edibles = list(edible_df['Scientific name'])
    ## mistake in regex in the debug version
    similar_edibles = poisonous_df['Similar edible species'].apply(lambda x: re.findall(r'[A-Z][a-z]+\s[a-z]+', str(x)))
    ## .values() vs .values
    similar_edibles = set(itertools.chain.from_iterable(similar_edibles.values))
    updated_sim = set()
    for species in similar_edibles:
        ## left off 'spp' in the debug version
        if 'species' in species or 'spp' in species:
            genus = species.split()[0]
            ## debug version used add here instead of update
            updated_sim.update([name for name in edibles if genus in name])
        else:
            ## debug version used append here instead of add
            updated_sim.add(species)
    return updated_sim
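As a minimal illustration of the extraction step above, consider the regex plus itertools.chain.from_iterable pattern on a few made-up cell values (the strings below are hypothetical, not from the real data):

```python
import re
import itertools

# Hypothetical 'Similar edible species' cell values (one may be missing).
cells = ['Agaricus campestris, Agaricus arvensis', 'Cantharellus species', None]
# A scientific name is a capitalized genus followed by a lowercase epithet.
matches = [re.findall(r'[A-Z][a-z]+\s[a-z]+', str(x)) for x in cells]
# Flatten the per-cell lists into one set of names; str(None) yields no match.
flat = set(itertools.chain.from_iterable(matches))
print(flat)
```

Note how the corrected pattern anchors on a capital letter, whereas the buggy `^[a-z]+\s[a-z]+$` version would match nothing here.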
### Demo function call
similar_demo = utils.load_object_from_publicdata('similar_demo.dill')
demo_result_similar = DEBUG_find_similar(similar_demo, edible_df)
print(demo_result_similar)
The demo should display this printed output.
{'Cantharellus cibarius', 'Infundibulicybe gibba', 'Agaricus campestris', 'Infundibulicybe geotropa', 'Agaricus arvensis', 'Agaricus bisporus', 'Agaricus silvaticus'}
The cell below will test your solution for DEBUG_find_similar (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_find_similar,
ex_name='DEBUG_find_similar',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_find_similar did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
mushroom_poison_df = utils.load_object_from_publicdata('mushroom_poison_df')
edible_dupes = utils.load_object_from_publicdata('edible_dupes')
poisonous_df = utils.load_object_from_publicdata('poisonous_df')
demo_result_determine_severity_TRUE = utils.load_object_from_publicdata('demo_result_determine_severity_TRUE')
determine_severity
Your task: define determine_severity as follows:
Add severe and dupe columns to mushroom_poison_df based on the results of Exercise 3.
Input:
- edible_dupes: Set of the scientific names of edible mushrooms with known poisonous lookalikes. (The result of Exercise 3.)
- poisonous_df: A DataFrame containing information about poisonous mushrooms.
  - All mushrooms in the Scientific name column of poisonous_df are poisonous.
  - The Severity column contains information about the severity of the toxin, which is either 'deadly' or 'poisonous'.
- mushroom_poison_df: Dataframe containing the cleansed observations.
  - The poisonous column contains 1 if the mushroom is poisonous and 0 otherwise.
  - The edible column contains 1 if the mushroom is edible and 0 otherwise.
  - The species column contains the scientific name of the mushroom.

Return: mushroom_severity_df: A copy of mushroom_poison_df with these columns added:
- severe (int): 1 if the mushroom is poisonous (according to mushroom_poison_df) and has 'deadly' severity (according to poisonous_df), 0 otherwise.
- dupe (int): 1 if the mushroom is edible (according to mushroom_poison_df) and has a known poisonous lookalike (according to edible_dupes), 0 otherwise.
### Solution - Exercise 4
def determine_severity(edible_dupes: set, poisonous_df: pd.DataFrame, mushroom_poison_df: pd.DataFrame) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    ### BEGIN SOLUTION
    def func(row):
        if row['poisonous']:
            severe = int(poisonous_df.loc[poisonous_df['Scientific name'] == row['species'], 'Severity'].item() == 'deadly')
        else:
            severe = 0
        if row['edible']:
            dupe = int(row['species'] in edible_dupes)
        else:
            dupe = 0
        return dupe, severe
    ## label mushroom severity
    mushroom_severity_df = mushroom_poison_df.copy()
    mushroom_severity_df[['dupe', 'severe']] = mushroom_severity_df.apply(func, axis=1, result_type='expand')
    return mushroom_severity_df
    ### END SOLUTION
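The row-wise `apply(..., result_type='expand')` pattern used in the solution can be sketched on a toy frame (the frame and column names below are made up for demonstration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

def two_flags(row):
    # Returning a tuple with result_type='expand' spreads it across
    # the two target columns assigned on the left-hand side.
    return int(row['a'] > 1), int(row['a'] % 2 == 0)

df[['big', 'even']] = df.apply(two_flags, axis=1, result_type='expand')
print(df)
```

Without `result_type='expand'`, the tuples would land in a single column of objects rather than two integer columns.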
### Demo function call
severe_demo = utils.load_object_from_publicdata('demo_mushroom_poison_df')
demo_result_severe = determine_severity(edible_dupes, poisonous_df, severe_demo)
demo_result_severe
The demo should display this output.
| gbifID | class | order | family | genus | species | state | latitude | longitude | coordinate_uncertainty | day | month | year | poisonous | edible | dupe | severe | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4465656595 | Agaricomycetes | Agaricales | Hymenogastraceae | Galerina | Galerina marginata | Ohio | 39.08 | -84.42 | 4.00 | 5 | 12 | 2023 | 1 | 0 | 0 | 1 |
| 1 | 4438740626 | Agaricomycetes | Agaricales | Agaricaceae | Coprinus | Coprinus comatus | California | 33.73 | -117.94 | 11.00 | 5 | 11 | 2023 | 0 | 1 | 1 | 0 |
| 2 | 4901009967 | Agaricomycetes | Gomphales | Gomphaceae | Turbinellus | Turbinellus floccosus | Tennessee | 35.70 | -83.53 | 4.00 | 11 | 9 | 2023 | 1 | 0 | 0 | 0 |
| 3 | 4507720568 | Agaricomycetes | Agaricales | Physalacriaceae | Flammulina | Flammulina velutipes | Ohio | 39.81 | -83.89 | 3.00 | 30 | 12 | 2023 | 0 | 1 | 0 | 0 |
| 4 | 4600035867 | Agaricomycetes | Russulales | Stereaceae | Xylobolus | Xylobolus frustulatus | Pennsylvania | 41.24 | -78.32 | 3.00 | 6 | 3 | 2023 | 0 | 0 | 0 | 0 |
| 5 | 4018160871 | Agaricomycetes | Agaricales | Bolbitiaceae | Bolbitius | Bolbitius titubans | California | 36.62 | -121.94 | 9.00 | 11 | 1 | 2023 | 0 | 0 | 0 | 0 |
| 6 | 4436384325 | Agaricomycetes | Agaricales | Lycoperdaceae | Apioperdon | Apioperdon pyriforme | New York | 42.53 | -76.31 | 13.00 | 20 | 9 | 2023 | 0 | 0 | 0 | 0 |
The cell below will test your solution for determine_severity (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=determine_severity,
ex_name='determine_severity',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to determine_severity did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
mushroom_severity_df = utils.load_object_from_publicdata('mushroom_severity_df')
demo_result_DEBUG_peak_months_TRUE = utils.load_object_from_publicdata('demo_result_DEBUG_peak_months_TRUE')
DEBUG_peak_months
Your task: define DEBUG_peak_months as follows:
Input:
- mushroom_severity_df: DataFrame with information about mushroom observations and their edibility.
  - The month column contains the month of the observation (1-12).
  - The genus column contains the genus of the mushroom.
  - The species column contains the species of the mushroom.
- genus: A string containing the desired genus we want to find the peak months for.

Return: monthly_df: A pivoted DataFrame with months as the indices and species names as the columns.
Requirements:
- The index of monthly_df should be month.
- The columns of monthly_df should be the species names.
- The values should be observation counts with dtype int64.
### Solution - Exercise 5
def DEBUG_peak_months(mushroom_severity_df: pd.DataFrame, genus: str) -> pd.DataFrame:
    mushroom_severity_df['is_genus'] = mushroom_severity_df['genus'] == genus
    df = mushroom_severity_df[mushroom_severity_df['is_genus']]
    df = df[['month', 'genus', 'species']].groupby(by=['month', 'genus']).count().rename(columns={'species': 'counts'})
    monthly_df = df.pivot_table(index='species', columns='month', values='counts').reset_index()
    return monthly_df
## correct solution
def DEBUG_peak_months(mushroom_severity_df: pd.DataFrame, genus: str) -> pd.DataFrame:
    df = mushroom_severity_df.copy()
    df = df[df['genus'] == genus]
    ## misspelling in the debug version
    df = df[['month', 'genus', 'species']].groupby(by=['month', 'species'],
                                                   as_index=False).count().rename(columns={'genus': 'count'})
    ## bad dtype in the debug version
    monthly_df = df.pivot_table(index='month', columns='species', values='count', fill_value=0)
    ## need this segment to guarantee all 12 months appear
    for i in range(1, 13):
        if i not in monthly_df.index:
            ## doing this as `monthly_df.loc[i, :] = 0` casts everything to float
            monthly_df.loc[i] = 0
    ## didn't sort in the debug version
    return monthly_df.sort_index()
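The pivot step above can be sketched on a tiny, made-up frame. Here `fill_value=0` fills missing (month, species) cells with 0 instead of NaN, and the explicit `aggfunc='sum'` is an assumption for this toy example (the solution relies on one row per month/species pair):

```python
import pandas as pd

df = pd.DataFrame({
    'month':   [1, 1, 2],
    'species': ['A', 'B', 'A'],
    'count':   [5, 3, 7],
})
# Species 'B' has no observations in month 2; fill_value=0 fills that
# cell with 0 rather than leaving a NaN.
monthly = df.pivot_table(index='month', columns='species',
                         values='count', aggfunc='sum', fill_value=0)
print(monthly)
```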
### Demo function call
demo_peak_months = DEBUG_peak_months(mushroom_severity_df, 'Mycena')
demo_peak_months
The demo should display this output.
| species | Mycena acicula | Mycena crocea | Mycena epipterygia | Mycena galericulata | Mycena haematopus | Mycena inclinata | Mycena leaiana | Mycena leptocephala | Mycena pura | Mycena purpureofusca |
|---|---|---|---|---|---|---|---|---|---|---|
| month | ||||||||||
| 1 | 16 | 0 | 2 | 5 | 45 | 2 | 0 | 5 | 7 | 11 |
| 2 | 3 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 12 | 0 |
| 3 | 12 | 0 | 0 | 0 | 8 | 1 | 0 | 3 | 3 | 0 |
| 4 | 8 | 1 | 0 | 1 | 11 | 0 | 0 | 1 | 3 | 0 |
| 5 | 12 | 0 | 0 | 1 | 11 | 4 | 58 | 3 | 2 | 0 |
| 6 | 7 | 0 | 0 | 1 | 5 | 1 | 75 | 0 | 3 | 3 |
| 7 | 1 | 0 | 0 | 0 | 9 | 0 | 55 | 0 | 2 | 0 |
| 8 | 1 | 1 | 1 | 1 | 11 | 0 | 136 | 0 | 5 | 0 |
| 9 | 0 | 31 | 1 | 6 | 50 | 12 | 131 | 1 | 6 | 4 |
| 10 | 6 | 37 | 30 | 37 | 41 | 23 | 34 | 14 | 30 | 12 |
| 11 | 3 | 7 | 17 | 6 | 12 | 3 | 2 | 1 | 5 | 6 |
| 12 | 7 | 1 | 11 | 15 | 11 | 3 | 0 | 15 | 7 | 9 |
The cell below will test your solution for DEBUG_peak_months (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 5
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(DEBUG_peak_months),
ex_name='DEBUG_peak_months',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_peak_months did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_top_states_TRUE = utils.load_object_from_publicdata('demo_result_top_states_TRUE')
top_states
Your task: define top_states as follows:
Find the top states where a species of mushroom can be found.
Input:
- `mushroom_severity_df`: A DataFrame with observations on mushrooms and their locations. The `state` column contains the state where the mushroom was observed, and the `species` column contains the name of the species of mushroom observed.
- `species`: The species name to compute top states for.

Return: `top_states`: a DataFrame containing columns `state` and `count`
- `state` - all states where the specified species was observed
- `count` - the number of observations of the specified species in that state

Requirements:
- Sort by `count` in descending order, and then by `state` in ascending order.

### Solution - Exercise 6
def top_states(mushroom_severity_df: pd.DataFrame, species: str) -> pd.DataFrame:
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
df = mushroom_severity_df.copy()
df = df[df['species'] == species]
return df.loc[:,['state','year']].groupby('state', as_index = False).count().sort_values(
by = ['year', 'state'], ascending = [False, True]).reset_index(drop = True).rename(columns = {'year':'count'})
### END SOLUTION
### Demo function call
demo_top_states = top_states(mushroom_severity_df, 'Stropharia ambigua')
demo_top_states
The demo should display this output.
| | state | count |
|---|---|---|
| 0 | Washington | 131 |
| 1 | California | 127 |
| 2 | Oregon | 48 |
| 3 | Arizona | 1 |
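As a sanity check, the same group-count-sort pattern can be exercised on a toy frame (hypothetical data invented for illustration; `groupby(...).size()` is one alternative to counting a helper column as the solution above does):

```python
import pandas as pd

# Toy observations (made-up data, not from the iNaturalist set).
toy = pd.DataFrame({
    'state':   ['Oregon', 'Washington', 'Oregon', 'Arizona', 'Washington', 'Washington'],
    'species': ['X', 'X', 'X', 'X', 'X', 'Y'],
})
sub = toy[toy['species'] == 'X']                      # filter to one species
out = (sub.groupby('state', as_index=False).size()    # per-state counts
          .rename(columns={'size': 'count'})
          .sort_values(by=['count', 'state'], ascending=[False, True])
          .reset_index(drop=True))
print(out)  # Oregon 2, Washington 2, Arizona 1
```

Note the tie between Oregon and Washington is broken alphabetically by the secondary ascending sort on `state`.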
The cell below will test your solution for top_states (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(top_states),
ex_name='top_states',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to top_states did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_find_population_centers_TRUE = utils.load_object_from_publicdata('demo_result_find_population_centers_TRUE')
find_population_centers
Your task: define find_population_centers as follows:
Use sklearn's KMeans to find the optimal number of clusters and return their population centers for a specified species.
Input:
- `mushroom_severity_df`: DataFrame containing mushroom observation data. The `species` column contains the species of the mushroom observed; the `longitude` and `latitude` columns contain the longitude and latitude of the observation, respectively.
- `species`: A string indicating the species of mushroom for which we want to find population centers.
- `random_state`: An integer that should be passed to KMeans to determine the centroid initialization.
- `threshold`: The maximum percentage decrease in inertia allowed to continue increasing the number of clusters in KMeans.

Return:
- `centers`: a numpy array of population centers, representing longitude and latitude, for the specified species. The number of centers should be determined by KMeans using the `threshold` to determine when to stop increasing the number of clusters.

Requirements:
- Filter `mushroom_severity_df` to only include observations of the specified species and only the `longitude` and `latitude` columns.
- Fit a KMeans model with `k` centers, using the specified `random_state`, starting with `k=1`.
- In a loop, increment `k` by 1 and fit a new model with the new `k` value.
- Compute the percent decrease in inertia between the prior model and the new model and compare it to `threshold`.
- If the percent decrease falls below `threshold`, break the loop and return the prior centers.

Formula: $$\text{percent decrease} = 100 \left | \frac{\text{prior inertia} - \text{new inertia}}{\text{prior inertia}}\right |$$

Note:
- Pass only the `n_clusters` and `random_state` parameters to KMeans when fitting the model.

### Solution - Exercise 7
def find_population_centers(mushroom_severity_df: pd.DataFrame, species: str, random_state: int, threshold: int) -> np.ndarray:
from sklearn.cluster import KMeans
### Filter mushroom_severity_df
_df = mushroom_severity_df
df = _df[_df['species'] == species][['longitude','latitude']].copy()
# 1
k = 1
kmeans = KMeans(n_clusters=k, random_state=random_state).fit(df)
inertia = kmeans.inertia_          # inertia from the k=1 model
centers = kmeans.cluster_centers_  # centers from the k=1 model
# 2-5 loop
k_max = 20
while k < k_max:
# 2 (increment k) - still need to deal with the prior inertia and centers, so do this before fitting the new model
k += 1
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
# 2 - inertia and centers
inertia = kmeans.inertia_ # this is from the prior model with k-1 centers
centers = kmeans.cluster_centers_ # this is from the prior model with k-1 centers
# 3
kmeans = KMeans(n_clusters=k, random_state=random_state).fit(df)
new_inertia = kmeans.inertia_
# 4
percent_decrease = 100 * ((inertia - new_inertia) / inertia)
### END SOLUTION
# 5
if percent_decrease < threshold:
break
return centers
### Demo function call
demo_pop_centers = find_population_centers(mushroom_severity_df, 'Amanita parcivolvata', 1450, 15)
print('Centers: ', '\n', demo_pop_centers)
demo_pop_centers
The demo should display this printed output.
Centers:
[[-79.43116612 36.01252527]
[-83.40338028 35.8974422 ]
[-84.97993193 34.44744574]
[-92.47715433 33.943829 ]
[-80.83425363 40.05086921]
[-87.44177927 37.00098373]
[-77.24047738 38.777142 ]]
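The threshold-based stopping rule can be traced without sklearn by walking a made-up sequence of inertia values (hypothetical numbers, chosen only to illustrate the loop):

```python
# Hypothetical inertias for k = 1, 2, 3, 4, 5 (invented for illustration).
inertias = [400.0, 150.0, 90.0, 80.0, 78.0]
threshold = 15  # stop once the percent decrease drops below this

k = 1
for prior, new in zip(inertias[:-1], inertias[1:]):
    percent_decrease = 100 * abs((prior - new) / prior)
    if percent_decrease < threshold:
        break  # keep the prior model's k
    k += 1
print(k)  # -> 3: the 62.5% and 40.0% drops pass the threshold, 11.1% does not
```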
The cell below will test your solution for find_population_centers (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 7
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_population_centers),
ex_name='find_population_centers',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_population_centers did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
We want to calculate the geodetic distance between points on the Earth's surface, given their longitude and latitude. We'll use the FCC's formula to calculate the distance in kilometers between two coordinate points:
$$D\approx \sqrt{(K_1\Delta\phi)^2 + (K_2\Delta\lambda)^2}$$
Using:
$$K_1 = 111.13209 - 0.56605\cos{(2\phi_m)} + 0.00120\cos{(4\phi_m)}$$
$$K_2 = 111.41513\cos{(\phi_m)} - 0.09455\cos{(3\phi_m)} + 0.00012\cos{(5\phi_m)}$$
Where:
- $\Delta\phi$ is the difference in latitude, in degrees
- $\Delta\lambda$ is the difference in longitude, in degrees
- $\phi_m$ is the mean latitude, $(\phi_1 + \phi_2)/2$, converted to radians
geodetic_distance
Your task: define geodetic_distance as follows:
Input:
- `coord`: a list containing the coordinates of your current location, in [long, lat] order
- `obs`: a list containing the coordinates of a nearby observation, in [long, lat] order

Return: a float value representing the distance in kilometers between `coord` and `obs`. This is $D$ from the formula above.
Intermediate calculations for the demo, rounded to 5 decimal places:
Note:
### Solution - Exercise 8
def geodetic_distance(coord:list, obs:list) -> float:
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
delta_phi = coord[1] - obs[1]
delta_lambda = coord[0] - obs[0]
phi_m = np.radians((coord[1] + obs[1])/2)
K1 = 111.13209 -.56605*np.cos(2*phi_m) + .00120*np.cos(4*phi_m)
K2 = 111.41513*np.cos(phi_m) - .09455*np.cos(3*phi_m) + .00012*np.cos(5*phi_m)
D = np.sqrt((K1*delta_phi)**2 + (K2*delta_lambda)**2)
return D
### END SOLUTION
### Demo function call
coord = [-122.25,38.5]
obs = [-123,39]
result = geodetic_distance(coord,obs)
print(result)
The demo should display this printed output.
85.62530311141984
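Two quick sanity checks on the FCC approximation: the distance from a point to itself is zero, and swapping the arguments leaves the distance unchanged (both deltas are squared). A self-contained sketch that re-implements the same formula:

```python
import numpy as np

def fcc_distance(coord, obs):
    # Same FCC flat-earth approximation as geodetic_distance above.
    delta_phi = coord[1] - obs[1]                 # latitude difference (degrees)
    delta_lambda = coord[0] - obs[0]              # longitude difference (degrees)
    phi_m = np.radians((coord[1] + obs[1]) / 2)   # mean latitude (radians)
    K1 = 111.13209 - 0.56605*np.cos(2*phi_m) + 0.00120*np.cos(4*phi_m)
    K2 = 111.41513*np.cos(phi_m) - 0.09455*np.cos(3*phi_m) + 0.00012*np.cos(5*phi_m)
    return np.sqrt((K1*delta_phi)**2 + (K2*delta_lambda)**2)

a, b = [-122.25, 38.5], [-123, 39]
print(fcc_distance(a, a))                        # -> 0.0
print(fcc_distance(a, b) == fcc_distance(b, a))  # -> True
```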
The cell below will test your solution for geodetic_distance (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 8
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=geodetic_distance,
ex_name='geodetic_distance',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to geodetic_distance did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_assign_labels_TRUE = utils.load_object_from_publicdata('demo_result_assign_labels_TRUE')
Intuition:
assign_labels
Your task: define assign_labels as follows:
Given a numpy array of cluster centers and an array of observation coordinates, compute the distance from each observation coordinate to the cluster centers and assign the label of the closest cluster.
This exercise does not require you to have completed the prior exercises.
Input:
- `centers` is a Numpy array of cluster centers found in Exercise 7, in [long, lat] order
- `coordinates` is a Numpy array of coordinates for a certain species, in [long, lat] order
- `distance_function` is a function meeting these criteria: it takes in a single point and an array of points, and returns an array of distances

Return:
- `labels`: a Numpy 1-D array containing the cluster label for each observation.

Requirements:
- Use `distance_function` to compute the distance between each coordinate and all of the cluster centers.
- Build a matrix where each row represents one of the `coordinates` and each column represents the distance to a cluster center.
- Assign each observation the label of its closest cluster center.

Note:
- The `distance_function` is not the same as the `geodetic_distance` function you implemented in Exercise 8. `distance_function` is designed to take in a single point and an array of points, and return an array of distances. The `geodetic_distance` function you implemented in Exercise 8 takes in two points and returns a single distance value.
- You may find `np.argmin` useful to find the index of the minimum value along an axis in a Numpy array.

### Solution - Exercise 9
def assign_labels(centers: np.ndarray, coordinates: np.ndarray, distance_function) -> np.ndarray:
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
distances_for_each_obs = np.apply_along_axis(distance_function, 1, centers, other_points=coordinates).T
return np.argmin(distances_for_each_obs, axis=1)
### END SOLUTION
### Demo function call
centers, coords = utils.load_object_from_publicdata('labels_demo_input.dill')
demo_result_labels = assign_labels(centers, coords, utils.compute_euclid)
print(demo_result_labels)
The demo should display this printed output.
[0 2 2 2 0 2 1 1 2 1]
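To illustrate the required shape of `distance_function`, here is a hypothetical stand-in for `utils.compute_euclid` (the course's actual implementation may differ): it takes one point and an array of points and returns an array of distances, which is then stacked into the observations-by-centers matrix described above:

```python
import numpy as np

def euclid(point, other_points):
    # One point vs. an array of points -> an array of distances.
    return np.sqrt(((other_points - point) ** 2).sum(axis=1))

centers = np.array([[0.0, 0.0], [10.0, 0.0]])
coords = np.array([[1.0, 1.0], [9.0, -1.0], [2.0, 0.0]])

# Row i holds the distances from coords[i] to every center.
dists = np.array([euclid(c, coords) for c in centers]).T
labels = np.argmin(dists, axis=1)  # index of the closest center per observation
print(labels)  # -> [0 1 0]
```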
The cell below will test your solution for assign_labels (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 9
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=assign_labels,
ex_name='assign_labels',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to assign_labels did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Professor Vuduc doesn't want to travel very far to find his mushrooms. Based on his current location and historic data, we want to assess:
1. How safe is it for him to hunt nearby?
2. What's the best edible species for him to forage?
We'll determine 1. by calculating a safety score based on all observations within a certain radius of Professor Vuduc's location:
$$\text{safety}\_\text{score} = 100\left(1-\frac{n_{pois}}{N}\right)\left(1-\frac{n_{severe}}{N}\right)\left(1-\frac{n_{dupe}}{N}\right)$$
where:
- $N$ is the total number of observations within the radius
- $n_{pois}$ is the number of poisonous observations within the radius
- $n_{severe}$ is the number of severely poisonous observations within the radius
- $n_{dupe}$ is the number of dupe (lookalike) observations within the radius
For 2., we will assume that the best species is the most common edible species within the radius.
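As a worked example of the safety-score formula, with made-up counts (not from the data):

```python
# Hypothetical counts within the radius.
N, n_pois, n_severe, n_dupe = 50, 5, 2, 3

safety_score = 100 * (1 - n_pois/N) * (1 - n_severe/N) * (1 - n_dupe/N)
print(round(safety_score, 2))  # -> 81.22, i.e. 100 * 0.9 * 0.96 * 0.94
```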
find_safety_score
Your task: define find_safety_score as follows:
Input:
mushroom_severity_df is the cleaned data frame containing iNaturalist observationsradius is a float containing the distance (in kilometers) Professor Vuduc will search withincoordinates is a tuple containing the (longitude, latitude) coordinates of Professor Vuduc's location.distance_function is a function meeting these criteria:Return:
safety_score: a float value representing a percentage, rounded to 2 decimal placesbest_species: a string representing the best nearby edible species for Professor Vuduc to forageRequirements:
Calculate safety_score:
distance_function to determine the distance between Professor Vuduc and each observation in mushroom_severity_df. radius. edible species within the radius edible species within the radius. This is the best_species to return. Here's the formula again for reference:
$$\text{safety}\_\text{score} = \left(1-\frac{n_{pois}}{N}\right)\left(1-\frac{n_{severe}}{N}\right)\left(1-\frac{n_{dupe}}{N}\right)$$### Solution - Exercise 10
def find_safety_score(mushroom_severity_df: pd.DataFrame, coordinates: tuple, radius: float, distance_function):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
new_df = mushroom_severity_df.copy()
new_df['distance'] = distance_function(coordinates,new_df[['longitude','latitude']].values)
new_df['inside_circle'] = (new_df['distance'] <= radius)
# count inside circle observations
total_inside_circle_obs = len(new_df[new_df['inside_circle'] == True])
# count instances of poisonous, severely poisonous, and lookalike mushrooms within circle
poison_inside_circle_obs = len(new_df[(new_df['inside_circle'] == True) & (new_df['poisonous'] == 1)])
severe_inside_circle_obs = len(new_df[(new_df['inside_circle'] == True) & (new_df['severe'] == 1)])
dupe_inside_circle_obs = len(new_df[(new_df['inside_circle'] == True) & (new_df['dupe'] == 1)])
##calculate safety score
safety_score = (1-severe_inside_circle_obs/total_inside_circle_obs)\
*(1-poison_inside_circle_obs/total_inside_circle_obs)\
*(1 - dupe_inside_circle_obs/total_inside_circle_obs)
##find edible_species with most specimens inside your radius, sort alphabetically
edible_species = new_df[(new_df['inside_circle'] == True) & (new_df['edible'] == True)]['species'].value_counts().sort_index()
##idxmax() gets the first occurrence, and since the previous result is sorted by index (species name), ties break alphabetically
best_species = edible_species.idxmax()
return round(safety_score*100,2), best_species
### END SOLUTION
### Demo function call
score, species = find_safety_score(mushroom_severity_df, (-93.89165973, 42.936265), 50, utils.compute_euclid)
print('Safety score: {}%'.format(score))
print('Best species: {}'.format(species))
The demo should display this printed output.
Safety score: 86.17%
Best species: Flammulina velutipes
The cell below will test your solution for find_safety_score (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 10
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=find_safety_score,
ex_name='find_safety_score',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_safety_score did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')