PMT1: Debugging Examples

Version 0.0.1

All of the header information is important. Please read it.

Topics and number of exercises: This problem builds on your knowledge of debugging code that works with dictionaries, tuples, lists, strings, regular expressions, and more. It has 4 exercises, numbered 0 to 3. There are 10 available points, and the threshold to earn 100% is 10 points, so full credit requires all 4 exercises. (Once you hit 10 points you can stop; there is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These cells must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but the starter code may not print them.

Debugging your code: Right before each exercise test cell there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print or display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
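One way to inspect a large object without flooding the notebook is to print only a slice of it. The sketch below is illustrative only: `preview`, `records`, and `lookup` are hypothetical names, not part of the starter code.

```python
import pprint
from itertools import islice

def preview(obj, n=3):
    """Return the first n items of a list, or the first n key/value pairs of a dict."""
    if isinstance(obj, dict):
        return dict(islice(obj.items(), n))
    return list(islice(obj, n))

# Hypothetical sample data standing in for a large test input
records = [{'id': i, 'value': i * i} for i in range(1000)]
lookup = {i: str(i) for i in range(1000)}

pprint.pprint(preview(records))      # first three records only
pprint.pprint(preview(lookup, n=2))  # first two key/value pairs only
```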

Exercise point breakdown:

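  • Exercise 0 - DEBUG_get_valid_emails: 2 point(s)

  • Exercise 1 - DEBUG_count_2025_logins: 2 point(s)

  • Exercise 2 - DEBUG_summarize_2025_activity: 3 point(s)

  • Exercise 3 - DEBUG_collect_safe_2025_ips: 3 point(s)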

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed and take short breaks at your leisure
In [1]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
from cse6040_devkit.training_wheels import run_with_timeout, suppress_stdout
import tracemalloc
from time import time
import re 
import pandas as pd
import pprint 
In [2]:
user_data=utils.load_object_from_publicdata('user_data.dill')
pprint.pprint(user_data[:5])
[{'logins': [('2024-11-03', '192.168.1.10'),
             ('2025-01-13', '192.168.1.11'),
             ('2025-01-15', '192.168.1.11')],
  'profile': {'email': 'tom@example.com', 'id': 101}},
 {'logins': [('2023-09-22', '10.0.0.5'), ('2024-12-31', '10.0.0.6')],
  'profile': {'email': 'bob@university.edu', 'id': 102}},
 {'logins': [('2025-06-04', '172.16.0.3'), ('2025-06-05', '172.16.0.3')],
  'profile': {'email': 'charlie@example.net', 'id': 103}},
 {'logins': [('2025-02-10', '192.168.100.1'), ('2025-03-18', '192.168.100.2')],
  'profile': {'email': 'dave@college.edu', 'id': 104}},
 {'logins': [('2025-07-01', '8.8.8.8'),
             ('2025-07-02', '9.9.9.9'),
             ('2025-07-03', '10.10.10.10'),
             ('2025-07-04', '11.11.11.11'),
             ('2025-07-05', '119.911.119.119')],
  'profile': {'email': 'not-an-email', 'id': 105}}]

Exercise 0: (2 points)

DEBUG_get_valid_emails

Your task: define DEBUG_get_valid_emails as follows:

Write a function that extracts all valid email addresses for users who logged in at least once in 2025.

Inputs:

  • user_data (list): user profile data stored as a list of dictionaries. Each user record contains basic metadata and a list of login events.

Outputs:

  • emails (list): a sorted list of distinct email addresses

Requirements:

  1. Only include emails that:
    • Match the pattern username@domain.top_level_domain
    • Have a top_level_domain of either .com or .edu
  2. Only include users who have at least one login in the year 2025.
  3. Order email addresses alphabetically
In [4]:
### Solution - Exercise 0  
def DEBUG_get_valid_emails(user_data: list) -> list:
    emails = []
    for user in user_data:
        email = user['profile']['email']
        for date, ip in user['logins']:
            if date.startswith("2025"):
                if re.fullmatch(r".+@.+\.(com|edu)", email):
                    emails.append(email)
    return sorted(list(set(emails)))

### Demo function call
pprint.pprint(DEBUG_get_valid_emails(user_data))
['dave@college.edu', 'tom@example.com']

The demo should display this printed output.

['dave@college.edu', 'tom@example.com']
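A common source of bugs in this kind of check is anchoring. The sketch below reuses the pattern from the solution above and compares `re.fullmatch`, which requires the whole string to match, with `re.match`, which accepts a matching prefix. The sample strings are made up for illustration.

```python
import re

# Same pattern as the solution: something, '@', something, then .com or .edu
PATTERN = r".+@.+\.(com|edu)"

samples = [
    "tom@example.com",
    "charlie@example.net",
    "not-an-email",
    "tom@example.com.au",  # hypothetical address with a trailing suffix
]

for s in samples:
    full = re.fullmatch(PATTERN, s) is not None
    prefix = re.match(PATTERN, s) is not None
    print(f"{s!r}: fullmatch={full}, match={prefix}")
```

Only the last sample differs between the two calls: `re.match` accepts it because the prefix `tom@example.com` matches, while `re.fullmatch` rejects it because the string does not end in `.com` or `.edu`.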


The cell below will test your solution for DEBUG_get_valid_emails (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
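After a failed test, these variables can be compared programmatically. The following is a minimal sketch that assumes each variable is a plain dictionary keyed by variable name, as described above. The helper names, the keys `'user_data'` and `'emails'`, and the sample values are all hypothetical stand-ins.

```python
# Hypothetical stand-ins for the variables the test cell provides
original_input_vars = {'user_data': [{'profile': {'id': 1, 'email': 'a@b.com'}, 'logins': []}]}
input_vars = {'user_data': [{'profile': {'id': 1, 'email': 'a@b.com'}, 'logins': []}]}
returned_output_vars = {'emails': ['a@b.com']}
true_output_vars = {'emails': []}

def modified_inputs(original, current):
    """Names of inputs whose value changed while the solution ran."""
    return [k for k, v in original.items() if current.get(k) != v]

def mismatched_outputs(returned, expected):
    """Map each differing output name to (returned value, expected value)."""
    return {k: (returned.get(k), v) for k, v in expected.items() if returned.get(k) != v}

print(modified_inputs(original_input_vars, input_vars))
print(mismatched_outputs(returned_output_vars, true_output_vars))
```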
In [5]:
### Test Cell - Exercise 0  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_get_valid_emails,
              ex_name='DEBUG_get_valid_emails',
              key=b'fGO-YUef8E12q6V2N03HygUXUf9H1sv5-rsIWKDI-ms=', 
              n_iter=51)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_get_valid_emails did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 0.17 seconds
memory after test: 3.22 MB
memory peak during test: 5.35 MB
Passed! Please submit.

Exercise 1: (2 points)

DEBUG_count_2025_logins

Your task: define DEBUG_count_2025_logins as follows:

Write a function that builds a summary report of user login activity for 2025. This report will be used by downstream systems to flag suspicious behavior and generate statistics.

Inputs:

  • user_data (list): user profile data stored as a list of dictionaries. Each user record contains basic metadata and a list of login events.

Outputs:

  • counts (dict): a dictionary mapping each user's profile id to that user's number of logins in 2025
    • profile_id -> number of logins

Requirements:

  1. Only include users who have at least one login in 2025.
  2. Count all logins in 2025 (not just one).
In [6]:
### Solution - Exercise 1  
def DEBUG_count_2025_logins(user_data: list) -> dict:
    counts = {}
    for user in user_data:
        user_id = user['profile']['id']
        count = 0
        for login in user['logins']:
            date, ip=login
            if date[0:4] == '2025':
                count += 1
        if count > 0:
            counts[user_id] = count
    return counts

### Demo function call
pprint.pprint(DEBUG_count_2025_logins(user_data))
{101: 2, 103: 2, 104: 2, 105: 5}

The demo should display this printed output.

{101: 2, 103: 2, 104: 2, 105: 5}
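As an alternative to the explicit loop, the same counting can be sketched with `collections.Counter`. This is a different technique from the solution above, shown only for comparison; `sample` is hypothetical data in the same shape as user_data.

```python
from collections import Counter

# Hypothetical sample in the same shape as user_data
sample = [
    {'profile': {'id': 1, 'email': 'a@example.com'},
     'logins': [('2024-12-31', '10.0.0.1'),
                ('2025-01-01', '10.0.0.1'),
                ('2025-02-01', '10.0.0.2')]},
    {'profile': {'id': 2, 'email': 'b@example.com'},
     'logins': [('2023-05-05', '10.0.0.3')]},
]

# One count per 2025 login; users with no 2025 logins never get a key
counts = Counter(
    user['profile']['id']
    for user in sample
    for date, _ip in user['logins']
    if date.startswith('2025')
)
print(dict(counts))
```

Because a `Counter` only creates keys for ids it actually sees, users without any 2025 login are left out without a separate `count > 0` check.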


The cell below will test your solution for DEBUG_count_2025_logins (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [7]:
### Test Cell - Exercise 1  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_count_2025_logins,
              ex_name='DEBUG_count_2025_logins',
              key=b'fGO-YUef8E12q6V2N03HygUXUf9H1sv5-rsIWKDI-ms=', 
              n_iter=51)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_count_2025_logins did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 0.12 seconds
memory after test: 0.12 MB
memory peak during test: 2.23 MB
Passed! Please submit.

Exercise 2: (3 points)

DEBUG_summarize_2025_activity

Your task: define DEBUG_summarize_2025_activity as follows:

Write a function that builds a summary report of user login activity for 2025. This report will be used by downstream systems to flag suspicious behavior and generate statistics.

Inputs:

  • user_data (list): user profile data stored as a list of dictionaries. Each user record contains basic metadata and a list of login events.

Outputs:

  • user_activity (dict): a dictionary of each user's activity, keyed by user id, in the form:
{
    user_id: {
        "email": <email string>,
        "count": <number of logins in 2025>,
        "ips": <sorted list of unique IP addresses used in 2025>
    },
    ...
}

Requirements:

  1. Only include users with at least one login in 2025.
  2. Count all logins in 2025.
  3. Include unique IP addresses only.
  4. The IP address list must be sorted lexicographically.
In [8]:
### Solution - Exercise 2  
def DEBUG_summarize_2025_activity(user_data: list) -> dict:
    summary = {}
    template = {"email": None, "count": 0, "ips": []}
    for user in user_data:
        record = template.copy()
        record['ips']=[]
        record["email"] = user["profile"]["email"]
        for login in user["logins"]:
            date, ip = login
            if date[:4] == '2025':
                record["count"] += 1
                record["ips"].append(ip)
        if record["count"] > 0:
            summary[user["profile"]["id"]] = record
    for user_id in summary:
        summary[user_id]["ips"] = sorted(list(set(summary[user_id]["ips"])))
    return summary

### Demo function call
pprint.pprint(DEBUG_summarize_2025_activity(user_data))
{101: {'count': 2, 'email': 'tom@example.com', 'ips': ['192.168.1.11']},
 103: {'count': 2, 'email': 'charlie@example.net', 'ips': ['172.16.0.3']},
 104: {'count': 2,
       'email': 'dave@college.edu',
       'ips': ['192.168.100.1', '192.168.100.2']},
 105: {'count': 5,
       'email': 'not-an-email',
       'ips': ['10.10.10.10',
               '11.11.11.11',
               '119.911.119.119',
               '8.8.8.8',
               '9.9.9.9']}}

The demo should display this printed output.

{101: {'count': 2, 'email': 'tom@example.com', 'ips': ['192.168.1.11']},
 103: {'count': 2, 'email': 'charlie@example.net', 'ips': ['172.16.0.3']},
 104: {'count': 2,
       'email': 'dave@college.edu',
       'ips': ['192.168.100.1', '192.168.100.2']},
 105: {'count': 5,
       'email': 'not-an-email',
       'ips': ['10.10.10.10',
               '11.11.11.11',
               '119.911.119.119',
               '8.8.8.8',
               '9.9.9.9']}}
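The solution above resets `record['ips']` after calling `template.copy()`. The sketch below illustrates why that matters: `dict.copy()` is a shallow copy, so without the reset every record would share the one list stored in the template. The variable names here are illustrative only.

```python
import copy

# Shallow copies share the mutable list held by the template
shared = {"email": None, "count": 0, "ips": []}
a = shared.copy()
b = shared.copy()
a["ips"].append("1.1.1.1")   # mutates the one shared list
a["count"] += 1              # rebinds an int, so only `a` changes
print(b["ips"], b["count"])
print(a["ips"] is b["ips"])

# Deep copies give each record its own list
fresh = {"email": None, "count": 0, "ips": []}
c = copy.deepcopy(fresh)
d = copy.deepcopy(fresh)
c["ips"].append("2.2.2.2")
print(d["ips"])
```

Either assigning a new list after the shallow copy, as the solution does, or using `copy.deepcopy` avoids IP addresses leaking from one user's record into another's.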


The cell below will test your solution for DEBUG_summarize_2025_activity (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [9]:
### Test Cell - Exercise 2  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_summarize_2025_activity,
              ex_name='DEBUG_summarize_2025_activity',
              key=b'fGO-YUef8E12q6V2N03HygUXUf9H1sv5-rsIWKDI-ms=', 
              n_iter=51)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_summarize_2025_activity did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 0.14 seconds
memory after test: 0.12 MB
memory peak during test: 2.45 MB
Passed! Please submit.

Blocklists

Assume that our Information Security teams have already determined that some IP ranges contain nefarious traffic. They therefore treat traffic from IP addresses in these ranges as not safe.

In blocklists below, these ranges are stored as a list of tuples, where the first element of each tuple is the start of the IP address range and the second element is the end of the range.

blocklists = [
    (167772160, 184549375),    # 10.0.0.0 – 10.255.255.255
    (3232235520, 3232301055),  # 192.168.0.0 – 192.168.255.255
]
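A range check against this structure can be sketched as follows. The helper name `is_blocked` is hypothetical, and the sketch treats both ends of each range as inclusive, which matches the comparison used in the Exercise 3 solution.

```python
blocklists = [
    (167772160, 184549375),    # 10.0.0.0 to 10.255.255.255
    (3232235520, 3232301055),  # 192.168.0.0 to 192.168.255.255
]

def is_blocked(ip_num, ranges):
    """True when ip_num lies inside any (start, end) range, both ends inclusive."""
    return any(start <= ip_num <= end for start, end in ranges)

print(is_blocked(167772165, blocklists))  # integer form of 10.0.0.5
print(is_blocked(134744072, blocklists))  # integer form of 8.8.8.8
```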

IP Addresses Background

An IP address consists of 32 bits, written as 4 numbers separated by dots. Each of these numbers is called an octet because it represents 8 bits. Each octet ranges from 0 to 255 because $2^8=256$. The four octets together form a single 32-bit number.

To convert an IP to an integer:

  • Split the IP address (e.g. $A.B.C.D$) into its four octets (e.g. [A,B,C,D]).
  • Assign a power of two to each octet. Because each octet represents 8 bits, its position determines which power of two it is multiplied by:

    Octet   Bit position   Multiplier
    A       bits 24–31     $2^{24}$
    B       bits 16–23     $2^{16}$
    C       bits 8–15      $2^{8}$
    D       bits 0–7       $2^{0}$

  • Compute the numeric value of the IP address as $A*2^{24}+B*2^{16}+C*2^8+D*2^0$.

Applying this formula to the range endpoints in blocklists gives:

  • $10*2^{24}+0*2^{16}+0*2^8+0*2^0=167772160$
  • $10*2^{24}+255*2^{16}+255*2^8+255*2^0=184549375$
  • $192*2^{24}+168*2^{16}+0*2^8+0*2^0=3232235520$
  • $192*2^{24}+168*2^{16}+255*2^8+255*2^0=3232301055$
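The arithmetic above can be checked with a short sketch. The function names `ip_to_int` and `int_to_ip` are hypothetical; the first applies the formula directly, and the second reverses it with bit shifts and an 8-bit mask.

```python
def ip_to_int(ip):
    """Convert dotted-quad 'A.B.C.D' to A*2**24 + B*2**16 + C*2**8 + D."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return a * 2**24 + b * 2**16 + c * 2**8 + d

def int_to_ip(num):
    """Inverse conversion: peel off each octet with a shift and an 8-bit mask."""
    return ".".join(str((num >> shift) & 0xFF) for shift in (24, 16, 8, 0))

for ip in ("10.0.0.0", "10.255.255.255", "192.168.0.0", "192.168.255.255"):
    print(ip, ip_to_int(ip))
```

Multiplying by $2^{24}$ is the same as shifting left by 24 bits, which is why the Exercise 3 solution can use `<<` in place of the multiplications.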
In [10]:
blocklists=utils.load_object_from_publicdata('blocklists.dill')
pprint.pprint(blocklists[:5])
[(167772160, 184549375), (3232235520, 3232301055)]

Exercise 3: (3 points)

DEBUG_collect_safe_2025_ips

Your task: define DEBUG_collect_safe_2025_ips as follows:

Write a function to identify safe logins and IP addresses from 2025 that are not on known blocklists.

Inputs:

  • user_data (list): user profile data stored as a list of dictionaries. Each user record contains basic metadata and a list of login events.
  • blocklists (list): known blocklists stored as a list. Each entry is a (start, end) tuple of integers that defines an IP range numerically.

Outputs:

  • safe_logins (dict): a dictionary mapping each user id to that user's safe IP addresses, in the form:
{
    user_id: [<sorted list of unique safe IP addresses used in 2025>],
    ...
}

Requirements:

  1. Only include users with at least one safe login in 2025 (as the demo output shows, users whose 2025 IPs are all invalid or blocked are omitted).
  2. Only include IPs that are:
    • Valid IPv4 addresses
    • Not in any blocklist range (found in the blocklists variable)
  3. Ensure unique IPs per user
  4. Sort IPs numerically, octet by octet (for example, 9.9.9.9 comes before 11.11.11.11), not lexicographically as strings
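For the validity requirement, one possible cross-check is the standard library's `ipaddress` module. This is an alternative to a regex plus range check, not necessarily what the tests expect, and it may differ on edge cases such as octets with leading zeros depending on the Python version. The helper name `is_valid_ipv4` is hypothetical.

```python
import ipaddress

def is_valid_ipv4(ip):
    """True when the standard library accepts ip as an IPv4 address."""
    try:
        ipaddress.IPv4Address(ip)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False

for candidate in ("8.8.8.8", "119.911.119.119", "1.2.3", "not-an-ip"):
    print(candidate, is_valid_ipv4(candidate))
```

Note that `119.911.119.119` has the right shape (four groups of one to three digits) but fails because 911 is outside the 0 to 255 octet range.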
In [11]:
### Solution - Exercise 3  
def DEBUG_collect_safe_2025_ips(user_data: list, blocklists: list) -> dict:
    def _ip_to_num(ip):
        parts = ip.split(".")
        return sum(int(p) << (8*(3-i)) for i,p in enumerate(parts))

    def _is_valid_ip(ip):
        pattern = r"^(\d{1,3}\.){3}\d{1,3}$"
        if not re.fullmatch(pattern, ip):
            return False
        return all(0 <= int(octet) <= 255 for octet in ip.split("."))

    result = {}
    for user in user_data:
        user_id = user["profile"]["id"]
        safe_ips = []

        for login in user["logins"]:
            date, ip = login

            if _is_valid_ip(ip) and date[:4] == '2025':
                ipnum = _ip_to_num(ip)
                blocked = False
                for start, end in blocklists:
                    if ipnum >= start and ipnum <= end:
                        blocked = True
                        break
                if not blocked:
                    safe_ips.append(ip)

        sorted_ips = sorted(
                list(set(safe_ips)),
                key=_ip_to_num
            )
        
        if sorted_ips:
            result[user_id] = sorted_ips

    return result

### Demo function call
pprint.pprint(DEBUG_collect_safe_2025_ips(user_data,blocklists))
{103: ['172.16.0.3'], 105: ['8.8.8.8', '9.9.9.9', '11.11.11.11']}

The demo should display this printed output.

{103: ['172.16.0.3'], 105: ['8.8.8.8', '9.9.9.9', '11.11.11.11']}
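The difference between lexicographic and numeric ordering is easy to miss when debugging. The sketch below sorts the same made-up list both ways; `octet_key` is a hypothetical helper that compares IPs as tuples of integers, which gives the same order as sorting by the 32-bit integer value for valid addresses.

```python
ips = ["9.9.9.9", "11.11.11.11", "8.8.8.8", "10.10.10.10"]

def octet_key(ip):
    """Sort key: the four octets as a tuple of ints."""
    return tuple(int(part) for part in ip.split("."))

print(sorted(ips))                 # string order: '1...' sorts before '8...'
print(sorted(ips, key=octet_key))  # numeric order by octet
```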


The cell below will test your solution for DEBUG_collect_safe_2025_ips (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [12]:
### Test Cell - Exercise 3  


from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
from time import time

tracemalloc.start()
mem_start, peak_start = tracemalloc.get_traced_memory()
print(f"initial memory usage: {mem_start/1024/1024:.2f} MB")

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    executor = dill.load(f)

@run_with_timeout(error_threshold=200.0, warning_threshold=100.0)
@suppress_stdout
def execute_tests(**kwargs):
    return executor(**kwargs)


# Execute test
start_time = time()
passed, test_case_vars, e = execute_tests(func=DEBUG_collect_safe_2025_ips,
              ex_name='DEBUG_collect_safe_2025_ips',
              key=b'fGO-YUef8E12q6V2N03HygUXUf9H1sv5-rsIWKDI-ms=', 
              n_iter=51)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars
duration = time() - start_time
print(f"Test duration: {duration:.2f} seconds")
current_memory, peak_memory = tracemalloc.get_traced_memory()
print(f"memory after test: {current_memory/1024/1024:.2f} MB")
print(f"memory peak during test: {peak_memory/1024/1024:.2f} MB")
tracemalloc.stop()
if e: raise e
assert passed, 'The solution to DEBUG_collect_safe_2025_ips did not pass the test.'

###
### AUTOGRADER TEST - DO NOT REMOVE
###

print('Passed! Please submit.')
initial memory usage: 0.00 MB
Test duration: 0.13 seconds
memory after test: 0.13 MB
memory peak during test: 2.28 MB
Passed! Please submit.