Spring 2026 FX: Professor Vuduc Forages for Mushrooms¶Version 1.0.3
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of filtering and analysis in Pandas and SQL, math-to-code translation, and clustering. It has 11 exercises, numbered 0 to 10, worth 20 total points. However, the threshold to earn 100% is 15 points. (Therefore, once you hit 15 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed. (Be careful when printing large objects; you may want to print the head or a few rows at a time.)
Exercise point breakdown:
Exercise 0: 1 point
Exercise 1: 2 points
Exercise 2: 0 points
Exercise 3: 3 points
Exercise 4: 1 point
Exercise 5: 2 points
Exercise 6: 1 point
Exercise 7: 3 points
Exercise 8: 2 points
Exercise 9: 2 points
Exercise 10: 3 points
Final reminders:
Background: iNaturalist is an online community that helps users identify plants and animals. Users upload photos of an organism along with their location coordinates, and the app uses machine learning to suggest some possible species matches. After reviewing the information, the user selects the best match. Other users on the platform can weigh in, which increases the confidence of the identification. For casual foragers, this can help determine whether a plant is safe to eat. For scientists, high-quality observations with a strong species consensus can become 'research-grade' and be used to aid population genetics efforts.
We're interested in the first use-case, specifically focused around mushrooms. Mushroom foraging in particular can be dangerous, as several prized edible varieties look similar to poisonous ones.
Your overall task: Your goal is to use iNaturalist data to find some safe areas in the continental United States to search for edible mushrooms. You will do this in two main steps:
First, you will clean, filter, and classify iNaturalist observations using data from Wikipedia.
Second, you will perform K-Means clustering to find population centers and determine the safety of foraging locations.
The datasets: There are three sources of data which you will use to solve the following exercises.
Information about mushroom observations in the continental United States, sourced from iNaturalist, represented as a SQLite database. (Note: This tool does require a login.)
A Data Frame containing information about poisonous mushrooms, scraped from the Wikipedia pages on poisonous and deadly mushroom species.
A Data Frame containing information about edible mushrooms, scraped from Wikipedia.
SQLite's syntax documentation can be found here. You may find other resources online are also useful for solving the SQL problems, but not all SQL dialects work the same way. Make sure your solution works with SQLite!
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
import re
import pandas as pd
import numpy as np
from datetime import datetime, timedelta, date
import sqlite3
from pprint import pprint
utils.add_from_file('compute_cartesian', utils)
utils.add_from_file('compute_euclid', utils)
### Run Me!!!
poisonous_df = utils.load_object_from_publicdata('poisonous_df')
edible_df = utils.load_object_from_publicdata('edible_df')
explore_data__FREE
Example: we have defined explore_data__FREE as follows:
This is a free exercise!
Please run the test cell below to collect your FREE point!
We encourage you to review the samples of observations, poisonous_df and edible_df in the cell below:
- observations: A SQLite database containing species, location, and date information for mushroom observations from iNaturalist.
- poisonous_df: A DataFrame containing information about known poisonous mushrooms, from Wikipedia.
- edible_df: A DataFrame containing information about known edible mushrooms, from Wikipedia.
### Solution - Exercise 0
def explore_data__FREE(conn: sqlite3.Connection, poisonous_df: pd.DataFrame, edible_df: pd.DataFrame, headsize:int=10) -> tuple:
    observations_preview = pd.read_sql('select * from observations limit ?', conn, params=[headsize])
    poisonous_preview = poisonous_df.head(n=headsize)
    edible_preview = edible_df.head(n=headsize)
    return (observations_preview, poisonous_preview, edible_preview)
### Demo function call
with sqlite3.connect('resource/asnlib/publicdata/observations.sqlite_db') as conn:
    (
        observations_preview,
        poisonous_preview,
        edible_preview
    ) = explore_data__FREE(conn, poisonous_df, edible_df)
display(observations_preview.head(5))
display(poisonous_preview.head(5))
display(edible_preview.head(5))
The test cell below will always pass. Please submit to collect your free points for explore_data__FREE (exercise 0).
### Test Cell - Exercise 0
print('Passed! Please submit.')
### Run Me!!!
demo_result_cleanse_observations_TRUE = utils.load_object_from_publicdata('demo_result_cleanse_observations_TRUE')
cleanse_observations
Your task: define cleanse_observations_query as follows:
Input: None
Return: A SQLite3 query (Python string)
Requirements:
- Return these columns: [gbifID, class, order, family, genus, species, state, latitude, longitude, coordinate_uncertainty, day, month, year].
- Only keep rows where coordinate_uncertainty is less than or equal to 25.
- Remove rows where stateProvince is 'Hawaii' or 'Alaska'.
- Remove rows with null values in any of [class, order, family, genus, species, state].
- Only keep species with a minimum sample size (since later analysis requires one): species that appear in observations 25 or more times.
- Rename these columns: stateProvince to state, coordinateUncertaintyInMeters to coordinate_uncertainty, decimalLatitude to latitude, decimalLongitude to longitude.
- Sort by gbifID in ascending order.

Hint: Remember that certain words are reserved in SQL. This may be helpful when determining how to reference a column name that is also a reserved SQL keyword.
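To illustrate the reserved-word hint above, here is a minimal sketch using a throwaway in-memory table (the table and values are hypothetical, not part of the assignment data). SQLite accepts backticks, double quotes, or square brackets for quoting an identifier that collides with a keyword such as `order`:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# `order` is a reserved SQL keyword, so the column name must be quoted.
conn.execute('create table demo (`order` text, family text)')
conn.execute('insert into demo values (?, ?)', ('Agaricales', 'Agaricaceae'))
# Backticks (MySQL-style), double quotes (standard SQL), and square
# brackets (MS-style) all work as identifier quotes in SQLite.
rows = conn.execute('select "order", [family] from demo').fetchall()
print(rows)  # -> [('Agaricales', 'Agaricaceae')]
```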
### Solution - Exercise 1
cleanse_observations_query = '''YOUR QUERY HERE'''
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
cleanse_observations_query = '''
with filtered_obs as (
    select gbifID, class, `order`, family, genus, species,
           stateProvince as state,
           decimalLatitude as latitude,
           decimalLongitude as longitude,
           coordinateUncertaintyInMeters as coordinate_uncertainty,
           day, month, year
    from observations
    where coordinateUncertaintyInMeters <= 25
      and stateProvince <> 'Hawaii'
      and stateProvince <> 'Alaska'
      and stateProvince is not null
      and class is not null
      and `order` is not null
      and family is not null
      and genus is not null
      and species is not null
)
select b.*
from filtered_obs as b
inner join (
    select species
    from filtered_obs
    group by species
    having count(*) >= 25
) as a on a.species = b.species
order by gbifID
'''
### END SOLUTION
### Demo function call
with sqlite3.connect('resource/asnlib/publicdata/observations_demo.sqlite_db') as conn:
    demo_cleanse_observations_result = pd.read_sql(cleanse_observations_query, conn)
demo_cleanse_observations_result
The demo should display this output.
| gbifID | class | order | family | genus | species | state | latitude | longitude | coordinate_uncertainty | day | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4015220986 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 40.042309 | -124.071195 | 15.0 | 2 | 1 | 2023 |
| 1 | 4021956935 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 40.042178 | -124.067756 | 15.0 | 20 | 1 | 2023 |
| 2 | 4022214055 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.427368 | -123.086479 | 4.0 | 19 | 1 | 2023 |
| 3 | 4022261682 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.271950 | -122.293687 | 4.0 | 19 | 1 | 2023 |
| 4 | 4029067610 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 36.599720 | -121.922417 | 5.0 | 25 | 1 | 2023 |
| 5 | 4034748083 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.807517 | -123.042917 | 4.0 | 11 | 2 | 2023 |
| 6 | 4054968728 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.031575 | -122.062600 | 10.0 | 1 | 3 | 2023 |
| 7 | 4063063690 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.506000 | -122.499947 | 4.0 | 7 | 3 | 2023 |
| 8 | 4075608178 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.536480 | -123.006722 | 4.0 | 20 | 3 | 2023 |
| 9 | 4076362192 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.755494 | -122.127525 | 4.0 | 12 | 3 | 2023 |
| 10 | 4096405558 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.681580 | -122.415697 | 19.0 | 27 | 1 | 2023 |
| 11 | 4111684497 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 45.072656 | -123.042174 | 9.0 | 23 | 4 | 2023 |
| 12 | 4121159705 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.751163 | -122.100044 | 4.0 | 20 | 5 | 2023 |
| 13 | 4121274046 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.807613 | -122.167092 | 5.0 | 14 | 5 | 2023 |
| 14 | 4420752950 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | New York | 40.891514 | -73.896952 | 8.0 | 24 | 9 | 2023 |
| 15 | 4453906562 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 40.030081 | -124.061796 | 8.0 | 15 | 11 | 2023 |
| 16 | 4458533792 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.416645 | -122.954750 | 24.0 | 7 | 3 | 2023 |
| 17 | 4500840555 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 37.965780 | -122.537045 | 4.0 | 15 | 12 | 2023 |
| 18 | 4507625044 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 45.471585 | -122.774923 | 10.0 | 28 | 12 | 2023 |
| 19 | 4507936641 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 44.991702 | -122.790710 | 6.0 | 30 | 12 | 2023 |
| 20 | 4508007722 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | California | 38.892892 | -123.621430 | 4.0 | 30 | 12 | 2023 |
| 21 | 4516320520 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Oregon | 45.660188 | -122.841837 | 4.0 | 11 | 12 | 2023 |
| 22 | 4606881312 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Georgia | 33.574117 | -84.068314 | 4.0 | 27 | 5 | 2023 |
| 23 | 4855265811 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Florida | 29.623963 | -82.320450 | 6.0 | 17 | 8 | 2023 |
| 24 | 4863374167 | Agaricomycetes | Cantharellales | Hydnaceae | Clavulina | Clavulina rugosa | Washington | 47.522389 | -122.135715 | 15.0 | 10 | 12 | 2023 |
The cell below will test your solution for cleanse_observations (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sql_executor(cleanse_observations_query),
ex_name='cleanse_observations',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to cleanse_observations did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_is_it_poisonous_TRUE = utils.load_object_from_publicdata('demo_result_is_it_poisonous_TRUE')
is_it_poisonous
Example: we have defined is_it_poisonous as follows:
THIS IS AN EXAMPLE
There is nothing to solve here!
Input:
- mushroom_df: A DataFrame with information about mushrooms. It has a species column, which contains the scientific name of the mushroom species.
- poisonous_df: A DataFrame with information about poisonous mushrooms. It has a Scientific name column, which contains the scientific name of the poisonous mushroom species.
- edible_df: A DataFrame with information about edible mushrooms. It has a Scientific name column, which contains the scientific name of the edible mushroom species.

Return: mushroom_poison_df: A copy of the DataFrame with two new columns:
- poisonous: An integer flag, 1 if the value of species is found in poisonous_df and 0 if it is not.
- edible: An integer flag, 1 if the value of species is found in edible_df and 0 if it is not.
### Solution - Exercise 2
def is_it_poisonous(mushroom_df: pd.DataFrame, poisonous_df: pd.DataFrame, edible_df: pd.DataFrame) -> pd.DataFrame:
    mushroom_poison_df = mushroom_df.copy()
    poison_mushroom_list = list(poisonous_df['Scientific name'].unique())
    edibles = list(edible_df['Scientific name'].unique())
    mushroom_poison_df['poisonous'] = mushroom_poison_df['species'].isin(poison_mushroom_list).astype(int)
    mushroom_poison_df['edible'] = mushroom_poison_df['species'].isin(edibles).astype(int)
    return mushroom_poison_df
### Demo function call
poison_demo = utils.load_object_from_publicdata('is_it_poisonous_demo_input.dill')
demo_result_poison_df = is_it_poisonous(poison_demo, poisonous_df, edible_df)
demo_result_poison_df
The demo should display this output.
| gbifID | class | order | family | genus | species | state | latitude | longitude | coordinate_uncertainty | day | month | year | poisonous | edible | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4011490202 | Agaricomycetes | Phallales | Phallaceae | Phallus | Phallus impudicus | California | 40.87 | -124.16 | 4.00 | 1 | 1 | 2023 | 0 | 0 |
| 1 | 4011490221 | Tremellomycetes | Tremellales | Naemateliaceae | Naematelia | Naematelia aurantia | California | 35.13 | -120.59 | 3.00 | 1 | 1 | 2023 | 0 | 0 |
| 2 | 4011547279 | Agaricomycetes | Agaricales | Omphalotaceae | Gymnopus | Gymnopus dryophilus | California | 34.06 | -118.50 | 16.00 | 1 | 1 | 2023 | 0 | 0 |
| 3 | 4011588270 | Agaricomycetes | Agaricales | Mycenaceae | Panellus | Panellus stipticus | Ohio | 39.43 | -82.55 | 4.00 | 1 | 1 | 2023 | 0 | 0 |
The test cell below will always pass. Please submit to collect your free points for is_it_poisonous (exercise 2).
### Test Cell - Exercise 2
print('Passed! Please submit.')
### Run Me!!!
demo_result_DEBUG_find_similar_TRUE = utils.load_object_from_publicdata('demo_result_DEBUG_find_similar_TRUE')
DEBUG_find_similar
Your task: define DEBUG_find_similar as follows:
Create a Python set which contains the scientific names of all edible mushrooms that have poisonous lookalikes.
Input:
- poisonous_df: A dataframe containing information about poisonous mushrooms. It is expected to have this column:
  - Similar edible species: A string containing the scientific names of edible mushrooms that look similar to the poisonous mushroom.
- edible_df: A dataframe containing information about known edible mushrooms. It is expected to have this column:
  - Scientific name: The scientific name of the edible mushroom.

Return: updated_sim: A set of edible mushrooms that have poisonous lookalikes.
Requirements:
- Extract the scientific names from the Similar edible species column of poisonous_df.
- Include the extracted scientific names that do not contain "species" or "spp".
- Include the names from edible_df that have the same genus as any of the extracted scientific names that do contain "species" or "spp".
### Solution - Exercise 3
def DEBUG_find_similar(poisonous_df: pd.DataFrame, edible_df: pd.DataFrame) -> set:
    import re
    import itertools
    edibles = list(edible_df['Scientific name'])
    s = poisonous_df['Similar edible species'].apply(lambda x: re.findall(r'^[a-z]+\s[a-z]+$', str(x)))
    similar_edibles = list(itertools.chain(s.values()))
    display(similar_edibles)
    updated_sim = {}
    for i in edibles:
        if 'species' in i:
            genus = i.split()[1]
            updated_sim.add([name for name in edibles if 'genus' in name])
        else:
            updated_sim.append(s)
    return updated_sim
## correct solution
def DEBUG_find_similar(poisonous_df: pd.DataFrame, edible_df: pd.DataFrame) -> set:
    import re
    import itertools
    edibles = list(edible_df['Scientific name'])
    ## mistake in regex in the debug version
    similar_edibles = poisonous_df['Similar edible species'].apply(lambda x: re.findall(r'[A-Z][a-z]+\s[a-z]+', str(x)))
    ## .values() vs .values
    similar_edibles = set(itertools.chain.from_iterable(similar_edibles.values))
    updated_sim = set()
    for species in similar_edibles:
        ## left off 'spp' in the debug version
        if 'species' in species or 'spp' in species:
            genus = species.split()[0]
            ## debug version used add here instead of update
            updated_sim.update([name for name in edibles if genus in name])
        else:
            ## debug version used append here instead of add
            updated_sim.add(species)
    return updated_sim
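As a minimal illustration of the extraction step above, consider the regex plus itertools.chain.from_iterable pattern on a few made-up cell values (the strings below are hypothetical, not from the real data):

```python
import re
import itertools

# Hypothetical 'Similar edible species' cell values (one may be missing).
cells = ['Agaricus campestris, Agaricus arvensis', 'Cantharellus species', None]
# A scientific name is a capitalized genus followed by a lowercase epithet.
matches = [re.findall(r'[A-Z][a-z]+\s[a-z]+', str(x)) for x in cells]
# Flatten the per-cell lists into one set of names; str(None) yields no match.
flat = set(itertools.chain.from_iterable(matches))
print(flat)
```

Note how the corrected pattern anchors on a capital letter, whereas the buggy `^[a-z]+\s[a-z]+$` version would match nothing here.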
### Demo function call
similar_demo = utils.load_object_from_publicdata('similar_demo.dill')
demo_result_similar = DEBUG_find_similar(similar_demo, edible_df)
print(demo_result_similar)
The demo should display this printed output.
{'Cantharellus cibarius', 'Infundibulicybe gibba', 'Agaricus campestris', 'Infundibulicybe geotropa', 'Agaricus arvensis', 'Agaricus bisporus', 'Agaricus silvaticus'}
The cell below will test your solution for DEBUG_find_similar (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_find_similar,
ex_name='DEBUG_find_similar',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_find_similar did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
mushroom_poison_df = utils.load_object_from_publicdata('mushroom_poison_df')
edible_dupes = utils.load_object_from_publicdata('edible_dupes')
poisonous_df = utils.load_object_from_publicdata('poisonous_df')
demo_result_determine_severity_TRUE = utils.load_object_from_publicdata('demo_result_determine_severity_TRUE')
determine_severity
Your task: define determine_severity as follows:
Add severe and dupe columns to mushroom_poison_df based on the results of Exercise 3.
Input:
- edible_dupes: Set of the scientific names of edible mushrooms with known poisonous lookalikes. (The result of Exercise 3.)
- poisonous_df: A DataFrame containing information about poisonous mushrooms.
  - All mushrooms in the Scientific name column of poisonous_df are poisonous.
  - The Severity column contains information about the severity of the toxin, which is either 'deadly' or 'poisonous'.
- mushroom_poison_df: Dataframe containing the cleansed observations.
  - The poisonous column contains 1 if the mushroom is poisonous and 0 otherwise.
  - The edible column contains 1 if the mushroom is edible and 0 otherwise.
  - The species column contains the scientific name of the mushroom.

Return: mushroom_severity_df: A copy of mushroom_poison_df with these columns added:
- severe (int): 1 if the mushroom is poisonous (according to mushroom_poison_df) and has 'deadly' severity (according to poisonous_df), 0 otherwise.
- dupe (int): 1 if the mushroom is edible (according to mushroom_poison_df) and has a known poisonous lookalike (according to edible_dupes), 0 otherwise.
### Solution - Exercise 4
def determine_severity(edible_dupes: set, poisonous_df: pd.DataFrame, mushroom_poison_df: pd.DataFrame) -> pd.DataFrame:
    ###
    ### YOUR CODE HERE
    ###
    ### BEGIN SOLUTION
    def func(row):
        if row['poisonous']:
            severe = int(poisonous_df.loc[poisonous_df['Scientific name'] == row['species'], 'Severity'].item() == 'deadly')
        else:
            severe = 0
        if row['edible']:
            dupe = int(row['species'] in edible_dupes)
        else:
            dupe = 0
        return dupe, severe
    ## label mushroom severity
    mushroom_severity_df = mushroom_poison_df.copy()
    mushroom_severity_df[['dupe', 'severe']] = mushroom_severity_df.apply(func, axis=1, result_type='expand')
    return mushroom_severity_df
    ### END SOLUTION
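The row-wise `apply(..., result_type='expand')` pattern used in the solution can be sketched on a toy frame (the frame and column names below are made up for demonstration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

def two_flags(row):
    # Returning a tuple with result_type='expand' spreads it across
    # the two target columns assigned on the left-hand side.
    return int(row['a'] > 1), int(row['a'] % 2 == 0)

df[['big', 'even']] = df.apply(two_flags, axis=1, result_type='expand')
print(df)
```

Without `result_type='expand'`, the tuples would land in a single column of objects rather than two integer columns.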
### Demo function call
severe_demo = utils.load_object_from_publicdata('demo_mushroom_poison_df')
demo_result_severe = determine_severity(edible_dupes, poisonous_df, severe_demo)
demo_result_severe
The demo should display this output.
| gbifID | class | order | family | genus | species | state | latitude | longitude | coordinate_uncertainty | day | month | year | poisonous | edible | dupe | severe | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4465656595 | Agaricomycetes | Agaricales | Hymenogastraceae | Galerina | Galerina marginata | Ohio | 39.08 | -84.42 | 4.00 | 5 | 12 | 2023 | 1 | 0 | 0 | 1 |
| 1 | 4438740626 | Agaricomycetes | Agaricales | Agaricaceae | Coprinus | Coprinus comatus | California | 33.73 | -117.94 | 11.00 | 5 | 11 | 2023 | 0 | 1 | 1 | 0 |
| 2 | 4901009967 | Agaricomycetes | Gomphales | Gomphaceae | Turbinellus | Turbinellus floccosus | Tennessee | 35.70 | -83.53 | 4.00 | 11 | 9 | 2023 | 1 | 0 | 0 | 0 |
| 3 | 4507720568 | Agaricomycetes | Agaricales | Physalacriaceae | Flammulina | Flammulina velutipes | Ohio | 39.81 | -83.89 | 3.00 | 30 | 12 | 2023 | 0 | 1 | 0 | 0 |
| 4 | 4600035867 | Agaricomycetes | Russulales | Stereaceae | Xylobolus | Xylobolus frustulatus | Pennsylvania | 41.24 | -78.32 | 3.00 | 6 | 3 | 2023 | 0 | 0 | 0 | 0 |
| 5 | 4018160871 | Agaricomycetes | Agaricales | Bolbitiaceae | Bolbitius | Bolbitius titubans | California | 36.62 | -121.94 | 9.00 | 11 | 1 | 2023 | 0 | 0 | 0 | 0 |
| 6 | 4436384325 | Agaricomycetes | Agaricales | Lycoperdaceae | Apioperdon | Apioperdon pyriforme | New York | 42.53 | -76.31 | 13.00 | 20 | 9 | 2023 | 0 | 0 | 0 | 0 |
The cell below will test your solution for determine_severity (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=determine_severity,
ex_name='determine_severity',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to determine_severity did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
mushroom_severity_df = utils.load_object_from_publicdata('mushroom_severity_df')
demo_result_DEBUG_peak_months_TRUE = utils.load_object_from_publicdata('demo_result_DEBUG_peak_months_TRUE')
DEBUG_peak_months
Your task: define DEBUG_peak_months as follows:
Input:
- mushroom_severity_df: DataFrame with information about mushroom observations and their edibility.
  - The month column contains the month of the observation (1-12).
  - The genus column contains the genus of the mushroom.
  - The species column contains the species of the mushroom.
- genus: A string containing the desired genus we want to find the peak months for.

Return: monthly_df: A pivoted DataFrame with months as the indices and species names as the columns.
Requirements:
- The index of monthly_df should be month.
- The columns of monthly_df should be the species names.
- The values should be observation counts with dtype int64.
### Solution - Exercise 5
def DEBUG_peak_months(mushroom_severity_df: pd.DataFrame, genus: str) -> pd.DataFrame:
    mushroom_severity_df['is_genus'] = mushroom_severity_df['genus'] == genus
    df = mushroom_severity_df[mushroom_severity_df['is_genus']]
    df = df[['month', 'genus', 'species']].groupby(by=['month', 'genus']).count().rename(columns={'species': 'counts'})
    monthly_df = df.pivot_table(index='species', columns='month', values='counts').reset_index()
    return monthly_df
## correct solution
def DEBUG_peak_months(mushroom_severity_df: pd.DataFrame, genus: str) -> pd.DataFrame:
    df = mushroom_severity_df.copy()
    df = df[df['genus'] == genus]
    ## misspelling in the debug version
    df = df[['month', 'genus', 'species']].groupby(by=['month', 'species'],
                                                   as_index=False).count().rename(columns={'genus': 'count'})
    ## bad dtype in the debug version
    monthly_df = df.pivot_table(index='month', columns='species', values='count', fill_value=0)
    ## need this segment to guarantee all 12 months appear
    for i in range(1, 13):
        if i not in monthly_df.index:
            ## doing this as `monthly_df.loc[i, :] = 0` casts everything to float
            monthly_df.loc[i] = 0
    ## didn't sort in the debug version
    return monthly_df.sort_index()
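The pivot step above can be sketched on a tiny, made-up frame. Here `fill_value=0` fills missing (month, species) cells with 0 instead of NaN, and the explicit `aggfunc='sum'` is an assumption for this toy example (the solution relies on one row per month/species pair):

```python
import pandas as pd

df = pd.DataFrame({
    'month':   [1, 1, 2],
    'species': ['A', 'B', 'A'],
    'count':   [5, 3, 7],
})
# Species 'B' has no observations in month 2; fill_value=0 fills that
# cell with 0 rather than leaving a NaN.
monthly = df.pivot_table(index='month', columns='species',
                         values='count', aggfunc='sum', fill_value=0)
print(monthly)
```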
### Demo function call
demo_peak_months = DEBUG_peak_months(mushroom_severity_df, 'Mycena')
demo_peak_months
The demo should display this output.
| species | Mycena acicula | Mycena crocea | Mycena epipterygia | Mycena galericulata | Mycena haematopus | Mycena inclinata | Mycena leaiana | Mycena leptocephala | Mycena pura | Mycena purpureofusca |
|---|---|---|---|---|---|---|---|---|---|---|
| month | ||||||||||
| 1 | 16 | 0 | 2 | 5 | 45 | 2 | 0 | 5 | 7 | 11 |
| 2 | 3 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 12 | 0 |
| 3 | 12 | 0 | 0 | 0 | 8 | 1 | 0 | 3 | 3 | 0 |
| 4 | 8 | 1 | 0 | 1 | 11 | 0 | 0 | 1 | 3 | 0 |
| 5 | 12 | 0 | 0 | 1 | 11 | 4 | 58 | 3 | 2 | 0 |
| 6 | 7 | 0 | 0 | 1 | 5 | 1 | 75 | 0 | 3 | 3 |
| 7 | 1 | 0 | 0 | 0 | 9 | 0 | 55 | 0 | 2 | 0 |
| 8 | 1 | 1 | 1 | 1 | 11 | 0 | 136 | 0 | 5 | 0 |
| 9 | 0 | 31 | 1 | 6 | 50 | 12 | 131 | 1 | 6 | 4 |
| 10 | 6 | 37 | 30 | 37 | 41 | 23 | 34 | 14 | 30 | 12 |
| 11 | 3 | 7 | 17 | 6 | 12 | 3 | 2 | 1 | 5 | 6 |
| 12 | 7 | 1 | 11 | 15 | 11 | 3 | 0 | 15 | 7 | 9 |
The cell below will test your solution for DEBUG_peak_months (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 5
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(DEBUG_peak_months),
ex_name='DEBUG_peak_months',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_peak_months did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_top_states_TRUE = utils.load_object_from_publicdata('demo_result_top_states_TRUE')
top_states
Your task: define top_states as follows:
Find the top states where a species of mushroom can be found.
Input:
- `mushroom_severity_df`: A DataFrame with observations on mushrooms and their locations. The `state` column contains the state where the mushroom was observed, and the `species` column contains the name of the species of mushroom observed.
- `species`: The species name to compute top states for.

Return: `top_states`: a DataFrame containing columns `state` and `count`
- `state` - all states where the specified species was observed
- `count` - the number of observations of the specified species in that state

Requirements:
- Sort by `count` in descending order, and then by `state` in ascending order.

### Solution - Exercise 6
def top_states(mushroom_severity_df: pd.DataFrame, species: str) -> pd.DataFrame:
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
df = mushroom_severity_df.copy()
df = df[df['species'] == species]
return df.loc[:,['state','year']].groupby('state', as_index = False).count().sort_values(
by = ['year', 'state'], ascending = [False, True]).reset_index(drop = True).rename(columns = {'year':'count'})
### END SOLUTION
### Demo function call
demo_top_states = top_states(mushroom_severity_df, 'Stropharia ambigua')
demo_top_states
The demo should display this output.
| | state | count |
|---|---|---|
| 0 | Washington | 131 |
| 1 | California | 127 |
| 2 | Oregon | 48 |
| 3 | Arizona | 1 |
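As a sanity check, the same group-count-sort pattern can be exercised on a toy frame (hypothetical data invented for illustration; `groupby(...).size()` is one alternative to counting a helper column as the solution above does):

```python
import pandas as pd

# Toy observations (made-up data, not from the iNaturalist set).
toy = pd.DataFrame({
    'state':   ['Oregon', 'Washington', 'Oregon', 'Arizona', 'Washington', 'Washington'],
    'species': ['X', 'X', 'X', 'X', 'X', 'Y'],
})
sub = toy[toy['species'] == 'X']                      # filter to one species
out = (sub.groupby('state', as_index=False).size()    # per-state counts
          .rename(columns={'size': 'count'})
          .sort_values(by=['count', 'state'], ascending=[False, True])
          .reset_index(drop=True))
print(out)  # Oregon 2, Washington 2, Arizona 1
```

Note the tie between Oregon and Washington is broken alphabetically by the secondary ascending sort on `state`.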
The cell below will test your solution for top_states (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(top_states),
ex_name='top_states',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to top_states did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_find_population_centers_TRUE = utils.load_object_from_publicdata('demo_result_find_population_centers_TRUE')
find_population_centers
Your task: define find_population_centers as follows:
Use sklearn's KMeans to find the optimal number of clusters and return their population centers for a specified species.
Input:
- `mushroom_severity_df`: DataFrame containing mushroom observation data. The `species` column contains the species of the mushroom observed; the `longitude` and `latitude` columns contain the longitude and latitude of the observation, respectively.
- `species`: A string indicating the species of mushroom for which we want to find population centers.
- `random_state`: An integer that should be passed to KMeans to determine the centroid initialization.
- `threshold`: The maximum percentage decrease in inertia allowed to continue increasing the number of clusters in KMeans.

Return:
- `centers`: a numpy array of population centers, representing longitude and latitude, for the specified species. The number of centers should be determined by KMeans using the `threshold` to determine when to stop increasing the number of clusters.

Requirements:
- Filter `mushroom_severity_df` to only include observations of the specified species and only the `longitude` and `latitude` columns.
- Fit a KMeans model with `k` centers, using the specified `random_state`, starting with `k=1`.
- In a loop, increment `k` by 1 and fit a new model with the new `k` value.
- Compute the percent decrease in inertia between the prior model and the new model and compare it to `threshold`.
- If the percent decrease falls below `threshold`, break the loop and return the prior centers.

Formula: $$\text{percent decrease} = 100 \left | \frac{\text{prior inertia} - \text{new inertia}}{\text{prior inertia}}\right |$$

Note:
- Pass only the `n_clusters` and `random_state` parameters to KMeans when fitting the model.

### Solution - Exercise 7
def find_population_centers(mushroom_severity_df: pd.DataFrame, species: str, random_state: int, threshold: int) -> np.ndarray:
from sklearn.cluster import KMeans
### Filter mushroom_severity_df
_df = mushroom_severity_df
df = _df[_df['species'] == species][['longitude','latitude']].copy()
# 1
k = 1
kmeans = KMeans(n_clusters=k, random_state=random_state).fit(df)
inertia = kmeans.inertia_          # inertia from the k=1 model
centers = kmeans.cluster_centers_  # centers from the k=1 model
# 2-5 loop
k_max = 20
while k < k_max:
# 2 (increment k) - still need to deal with the prior inertia and centers, so do this before fitting the new model
k += 1
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
# 2 - inertia and centers
inertia = kmeans.inertia_ # this is from the prior model with k-1 centers
centers = kmeans.cluster_centers_ # this is from the prior model with k-1 centers
# 3
kmeans = KMeans(n_clusters=k, random_state=random_state).fit(df)
new_inertia = kmeans.inertia_
# 4
percent_decrease = 100 * ((inertia - new_inertia) / inertia)
### END SOLUTION
# 5
if percent_decrease < threshold:
break
return centers
### Demo function call
demo_pop_centers = find_population_centers(mushroom_severity_df, 'Amanita parcivolvata', 1450, 15)
print('Centers: ', '\n', demo_pop_centers)
demo_pop_centers
The demo should display this printed output.
Centers:
[[-79.43116612 36.01252527]
[-83.40338028 35.8974422 ]
[-84.97993193 34.44744574]
[-92.47715433 33.943829 ]
[-80.83425363 40.05086921]
[-87.44177927 37.00098373]
[-77.24047738 38.777142 ]]
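The threshold-based stopping rule can be traced without sklearn by walking a made-up sequence of inertia values (hypothetical numbers, chosen only to illustrate the loop):

```python
# Hypothetical inertias for k = 1, 2, 3, 4, 5 (invented for illustration).
inertias = [400.0, 150.0, 90.0, 80.0, 78.0]
threshold = 15  # stop once the percent decrease drops below this

k = 1
for prior, new in zip(inertias[:-1], inertias[1:]):
    percent_decrease = 100 * abs((prior - new) / prior)
    if percent_decrease < threshold:
        break  # keep the prior model's k
    k += 1
print(k)  # -> 3: the 62.5% and 40.0% drops pass the threshold, 11.1% does not
```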
The cell below will test your solution for find_population_centers (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 7
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=plugins.sqlite_blocker(find_population_centers),
ex_name='find_population_centers',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_population_centers did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
We want to calculate the geodetic distance between points on the Earth's surface, given their longitude and latitude. We'll use the FCC's formula to calculate the distance in kilometers between two coordinate points:
$$D\approx \sqrt{(K_1\Delta\phi)^2 + (K_2\Delta\lambda)^2}$$
Using:
$$K_1 = 111.13209 - 0.56605\cos{(2\phi_m)} + 0.00120\cos{(4\phi_m)}$$
$$K_2 = 111.41513\cos{(\phi_m)} - 0.09455\cos{(3\phi_m)} + 0.00012\cos{(5\phi_m)}$$
Where:
- $\Delta\phi$ is the difference in latitude, in degrees
- $\Delta\lambda$ is the difference in longitude, in degrees
- $\phi_m$ is the mean latitude, $(\phi_1 + \phi_2)/2$, converted to radians
geodetic_distance
Your task: define geodetic_distance as follows:
Input:
- `coord`: a list containing the coordinates of your current location, in [long, lat] order
- `obs`: a list containing the coordinates of a nearby observation, in [long, lat] order

Return: a float value representing the distance in kilometers between `coord` and `obs`. This is $D$ from the formula above.
Intermediate calculations for the demo, rounded to 5 decimal places:
Note:
### Solution - Exercise 8
def geodetic_distance(coord:list, obs:list) -> float:
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
delta_phi = coord[1] - obs[1]
delta_lambda = coord[0] - obs[0]
phi_m = np.radians((coord[1] + obs[1])/2)
K1 = 111.13209 -.56605*np.cos(2*phi_m) + .00120*np.cos(4*phi_m)
K2 = 111.41513*np.cos(phi_m) - .09455*np.cos(3*phi_m) + .00012*np.cos(5*phi_m)
D = np.sqrt((K1*delta_phi)**2 + (K2*delta_lambda)**2)
return D
### END SOLUTION
### Demo function call
coord = [-122.25,38.5]
obs = [-123,39]
result = geodetic_distance(coord,obs)
print(result)
The demo should display this printed output.
85.62530311141984
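Two quick sanity checks on the FCC approximation: the distance from a point to itself is zero, and swapping the arguments leaves the distance unchanged (both deltas are squared). A self-contained sketch that re-implements the same formula:

```python
import numpy as np

def fcc_distance(coord, obs):
    # Same FCC flat-earth approximation as geodetic_distance above.
    delta_phi = coord[1] - obs[1]                 # latitude difference (degrees)
    delta_lambda = coord[0] - obs[0]              # longitude difference (degrees)
    phi_m = np.radians((coord[1] + obs[1]) / 2)   # mean latitude (radians)
    K1 = 111.13209 - 0.56605*np.cos(2*phi_m) + 0.00120*np.cos(4*phi_m)
    K2 = 111.41513*np.cos(phi_m) - 0.09455*np.cos(3*phi_m) + 0.00012*np.cos(5*phi_m)
    return np.sqrt((K1*delta_phi)**2 + (K2*delta_lambda)**2)

a, b = [-122.25, 38.5], [-123, 39]
print(fcc_distance(a, a))                        # -> 0.0
print(fcc_distance(a, b) == fcc_distance(b, a))  # -> True
```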
The cell below will test your solution for geodetic_distance (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 8
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=geodetic_distance,
ex_name='geodetic_distance',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to geodetic_distance did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
### Run Me!!!
demo_result_assign_labels_TRUE = utils.load_object_from_publicdata('demo_result_assign_labels_TRUE')
Intuition:
assign_labels
Your task: define assign_labels as follows:
Given a numpy array of cluster centers and an array of observation coordinates, compute the distance from each observation coordinate to the cluster centers and assign the label of the closest cluster.
This exercise does not require you to have completed the prior exercises.
Input:
- `centers` is a Numpy array of cluster centers found in Exercise 7, in [long, lat] order
- `coordinates` is a Numpy array of coordinates for a certain species, in [long, lat] order
- `distance_function` is a function meeting these criteria: it takes in a single point and an array of points, and returns an array of distances

Return:
- `labels`: a Numpy 1-D array containing the cluster label for each observation.

Requirements:
- Use `distance_function` to compute the distance between each coordinate and all of the cluster centers.
- Build a matrix where each row represents one of the `coordinates` and each column represents the distance to a cluster center.
- Assign each observation the label of its closest cluster center.

Note:
- The `distance_function` is not the same as the `geodetic_distance` function you implemented in Exercise 8. `distance_function` is designed to take in a single point and an array of points, and return an array of distances. The `geodetic_distance` function you implemented in Exercise 8 takes in two points and returns a single distance value.
- You may find `np.argmin` useful to find the index of the minimum value along an axis in a Numpy array.

### Solution - Exercise 9
def assign_labels(centers: np.ndarray, coordinates: np.ndarray, distance_function) -> np.ndarray:
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
distances_for_each_obs = np.apply_along_axis(distance_function, 1, centers, other_points=coordinates).T
return np.argmin(distances_for_each_obs, axis=1)
### END SOLUTION
### Demo function call
centers, coords = utils.load_object_from_publicdata('labels_demo_input.dill')
demo_result_labels = assign_labels(centers, coords, utils.compute_euclid)
print(demo_result_labels)
The demo should display this printed output.
[0 2 2 2 0 2 1 1 2 1]
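To illustrate the required shape of `distance_function`, here is a hypothetical stand-in for `utils.compute_euclid` (the course's actual implementation may differ): it takes one point and an array of points and returns an array of distances, which is then stacked into the observations-by-centers matrix described above:

```python
import numpy as np

def euclid(point, other_points):
    # One point vs. an array of points -> an array of distances.
    return np.sqrt(((other_points - point) ** 2).sum(axis=1))

centers = np.array([[0.0, 0.0], [10.0, 0.0]])
coords = np.array([[1.0, 1.0], [9.0, -1.0], [2.0, 0.0]])

# Row i holds the distances from coords[i] to every center.
dists = np.array([euclid(c, coords) for c in centers]).T
labels = np.argmin(dists, axis=1)  # index of the closest center per observation
print(labels)  # -> [0 1 0]
```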
The cell below will test your solution for assign_labels (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 9
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=assign_labels,
ex_name='assign_labels',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=20)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to assign_labels did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Professor Vuduc doesn't want to travel very far to find his mushrooms. Based on his current location and historic data, we want to assess:
1. How safe is it for him to hunt nearby?
2. What's the best edible species for him to forage?
We'll determine 1. by calculating a safety score based on all observations within a certain radius of Professor Vuduc's location:
$$\text{safety}\_\text{score} = 100\left(1-\frac{n_{pois}}{N}\right)\left(1-\frac{n_{severe}}{N}\right)\left(1-\frac{n_{dupe}}{N}\right)$$
where:
- $N$ is the total number of observations within the radius
- $n_{pois}$ is the number of poisonous observations within the radius
- $n_{severe}$ is the number of severely poisonous observations within the radius
- $n_{dupe}$ is the number of dupe (lookalike) observations within the radius
For 2., we will assume that the best species is the most common edible species within the radius.
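As a worked example of the safety-score formula, with made-up counts (not from the data):

```python
# Hypothetical counts within the radius.
N, n_pois, n_severe, n_dupe = 50, 5, 2, 3

safety_score = 100 * (1 - n_pois/N) * (1 - n_severe/N) * (1 - n_dupe/N)
print(round(safety_score, 2))  # -> 81.22, i.e. 100 * 0.9 * 0.96 * 0.94
```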
find_safety_score
Your task: define find_safety_score as follows:
Input:
mushroom_severity_df is the cleaned data frame containing iNaturalist observationsradius is a float containing the distance (in kilometers) Professor Vuduc will search withincoordinates is a tuple containing the (longitude, latitude) coordinates of Professor Vuduc's location.distance_function is a function meeting these criteria:Return:
safety_score: a float value representing a percentage, rounded to 2 decimal placesbest_species: a string representing the best nearby edible species for Professor Vuduc to forageRequirements:
Calculate safety_score:
distance_function to determine the distance between Professor Vuduc and each observation in mushroom_severity_df. radius. edible species within the radius edible species within the radius. This is the best_species to return. Here's the formula again for reference:
$$\text{safety}\_\text{score} = \left(1-\frac{n_{pois}}{N}\right)\left(1-\frac{n_{severe}}{N}\right)\left(1-\frac{n_{dupe}}{N}\right)$$### Solution - Exercise 10
def find_safety_score(mushroom_severity_df: pd.DataFrame, coordinates: tuple, radius: float, distance_function):
###
### YOUR CODE HERE
###
### BEGIN SOLUTION
new_df = mushroom_severity_df.copy()
new_df['distance'] = distance_function(coordinates,new_df[['longitude','latitude']].values)
new_df['inside_circle'] = (new_df['distance'] <= radius)
# count inside circle observations
total_inside_circle_obs = len(new_df[new_df['inside_circle'] == True])
# count instances of poisonous, severely poisonous, and lookalike mushrooms within circle
poison_inside_circle_obs = len(new_df[(new_df['inside_circle'] == True) & (new_df['poisonous'] == 1)])
severe_inside_circle_obs = len(new_df[(new_df['inside_circle'] == True) & (new_df['severe'] == 1)])
dupe_inside_circle_obs = len(new_df[(new_df['inside_circle'] == True) & (new_df['dupe'] == 1)])
##calculate safety score
safety_score = (1-severe_inside_circle_obs/total_inside_circle_obs)\
*(1-poison_inside_circle_obs/total_inside_circle_obs)\
*(1 - dupe_inside_circle_obs/total_inside_circle_obs)
##find edible_species with most specimens inside your radius, sort alphabetically
edible_species = new_df[(new_df['inside_circle'] == True) & (new_df['edible'] == True)]['species'].value_counts().sort_index()
##idxmax() gets the first occurrence, and since the previous result is sorted by index (species name), ties break alphabetically
best_species = edible_species.idxmax()
return round(safety_score*100,2), best_species
### END SOLUTION
### Demo function call
score, species = find_safety_score(mushroom_severity_df, (-93.89165973, 42.936265), 50, utils.compute_euclid)
print('Safety score: {}%'.format(score))
print('Best species: {}'.format(species))
The demo should display this printed output.
Safety score: 86.17%
Best species: Flammulina velutipes
The cell below will test your solution for find_safety_score (exercise 10). The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any key:value pair in `original_input_vars` should also exist in `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 10
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time
tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
executor = dill.load(f)
@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
return executor(**kwargs)
# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=find_safety_score,
ex_name='find_safety_score',
key=b'vGHMjQvW681anFgLS1TNrLRVPLOvdwd6j--2tZsJ0-M=',
n_iter=50)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to find_safety_score did not pass the test.'
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')