SQLite and Pandas: Exercise format examples

Version 0.0.2

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of Pandas and SQLite. It has 2 exercises, numbered 0 to 1. There are 10 points available. However, to earn 100%, the threshold is 10 points. (Therefore, once you hit 10 points you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or a chunk of rows at a time).

Exercise point breakdown:

  • Exercise 0: 5 point(s)

  • Exercise 1: 5 point(s)

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed, and take short breaks at your leisure
In [1]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
import sqlite3 
import pandas as pd 
import numpy as np 

Introduction

This notebook familiarizes students with the format of SQLite- and Pandas-based exercises. The implementation of these examples is very simple, much simpler than you should expect on an exam. Throughout this notebook, the "narrative" will be interrupted by explanations of what you should expect to see.

Data for SQLite exercises

There will be a database connection and some details about the relevant tables given in the introduction. You are going to run queries against a similarly structured connection, so it's a good idea to familiarize yourself with the basics to get a sense of what you're working with.

For this example we have a SQLite connection to star_wars.db which contains a bunch of fact tables about the Star Wars film franchise.

In [2]:
conn = sqlite3.connect('resource/asnlib/publicdata/star_wars.db')
In [3]:
pd.read_sql('SELECT * FROM quotes LIMIT 5', conn)
Out[3]:
id character_name quote source
0 1 Luke Skywalker May the Force be with you. A New Hope
1 2 Darth Vader I am your father. The Empire Strikes Back
2 3 Yoda Do or do not, there is no try. The Empire Strikes Back
3 4 Han Solo I've got a bad feeling about this. A New Hope
4 5 Admiral Ackbar It's a trap! Return of the Jedi

Of particular interest to us is the quotes table. The context is that character_name spoke the quote in the source (film). The primary key for the table is id. The character_name values are expected to contain duplicates, indicating multiple quotes associated with a single character.

SQLite exercise format

For these exercises we will ask you to complete a function that returns a string. That string is a SQLite query. In the most basic case the function will take no parameters and you can just return a static string. Since you're returning a query, there isn't a feasible option to solve this type of exercise with Pandas.

Possible variations

  • Inputs to customize the query
  • Input validation (i.e. raise an error if a given input doesn't meet a condition)

You will not see

  • A connection as an input
  • Table creation
  • Data inserts/deletes/updates

You're free to work with connections and Pandas to debug. The demos are set up to execute your query. You can use the same pattern for checking against the test data. You don't need to include either of the following in your solution.

  • Anything related to a connection.
    • i.e. execute, fetch, fetchall, close, commit, etc.
  • Reading into Pandas
    • i.e. pd.read_sql
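For instance, a debugging session might look like the following sketch. The in-memory database, the toy rows, and the name my_query are hypothetical stand-ins for the notebook's star_wars.db connection and your own solution function:

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the notebook's star_wars.db connection.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE quotes (id INTEGER PRIMARY KEY, "
             "character_name TEXT, quote TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO quotes (character_name, quote, source) VALUES (?, ?, ?)",
    [('Yoda', 'Do or do not, there is no try.', 'The Empire Strikes Back'),
     ('Han Solo', "I've got a bad feeling about this.", 'A New Hope')])

def my_query():  # stand-in for your solution function
    return "SELECT character_name, source FROM quotes"

# pd.read_sql handles the execute/fetch bookkeeping for you, so your
# solution only needs to return the query string.
debug_df = pd.read_sql(my_query(), conn)
print(debug_df)
```

The point is that the connection plumbing lives outside your solution; your function stays a pure string-builder.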

Extracting quote data

In the next exercise we will construct a query to extract the quotes data. Down the line, we need a mapping of each character_name to all of their quotes. We should make the extract query do some of the work on the database side of the connection to reduce the number of rows we have in the result.

Towards that end, a good choice is to concatenate the individual quotes (using a delimiter) into one string. The downstream process can easily parse the result. We should also get a count of the quotes for each character so that the downstream can validate the parsing.

You may not be familiar with string concatenation like this in SQL, but the counts should be like riding a bike at this point.
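As a sketch of that downstream idea (the function name and sample row are hypothetical, not part of any exercise), the parsing plus count validation could look like:

```python
# Hypothetical downstream parser for one row of the extract query's result:
# quotes are pipe-delimited, and the per-character count lets the consumer
# confirm that nothing was lost or mangled in transit.
def parse_quotes_row(quotes_str, quote_count):
    quotes = quotes_str.split('|')
    assert len(quotes) == quote_count, "parsed quote count does not match"
    return quotes

row = ("We seem to be made to suffer. It's our lot in life.|We're doomed.", 2)
print(parse_quotes_row(*row))
```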

Exercise 0: (5 points)

concatenate_quotes

Your task: define concatenate_quotes as follows:

Re-organize the data in the quotes table such that all quote values associated with a particular character_name are in a single row.

Returns:

  • str: a sql query that creates the following columns:
    • name - The name of a character
    • quotes - All quote values associated with the character, concatenated together.
      • Quotes should be ordered alphabetically.
      • Quotes should be separated by the "pipe" character '|'

Hints:

  • Look into using GROUP_CONCAT for concatenating the strings.
  • Think about using a subquery to pre-sort the quotes before applying other logic. It's much easier than trying to use ORDER BY as part of a GROUP_CONCAT function call.
  • The name quotes (with the "s") refers to both the source table and a target column. The name quote (no "s") refers to the source column.
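If GROUP_CONCAT is new to you, here is a minimal toy illustration of the subquery hint, separate from the exercise itself. (In practice, SQLite concatenates in the subquery's order for simple cases like this, though the SQL standard does not strictly guarantee it.)

```python
import sqlite3
import pandas as pd

# Toy table, unrelated to star_wars.db: two groups with unsorted values.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (grp TEXT, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [('a', 'zebra'), ('a', 'apple'), ('b', 'mango')])

# Pre-sort in a subquery so GROUP_CONCAT sees the rows in a
# well-defined (alphabetical) order before concatenating.
df = pd.read_sql("""
    SELECT grp, GROUP_CONCAT(val, '|') AS vals, COUNT(*) AS n
    FROM (SELECT * FROM t ORDER BY val)
    GROUP BY grp
    """, conn)
print(df)
```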
In [4]:
### Solution - Exercise 0  
def concatenate_quotes():
    return '''
    SELECT 
        character_name AS name,
        GROUP_CONCAT(quote, '|') AS quotes,
        COUNT(*) AS quote_count
    FROM (SELECT * FROM quotes ORDER BY quote)
    GROUP BY character_name
    ORDER BY quote_count DESC, name;
    '''

### Demo function call
with pd.option_context('display.max_colwidth', 0):
    demo_cq_result = pd.read_sql(concatenate_quotes(), conn)
    demo_cq_result['quotes'] = demo_cq_result['quotes'].str.wrap(100)
    display(demo_cq_result[demo_cq_result['name'].isin({'C-3PO', 'Qui-Gon Jinn', 'Admiral Ackbar'})])
name quotes quote_count
8 C-3PO We seem to be made to suffer. It’s our lot in life.|We’re doomed. 2
11 Qui-Gon Jinn There’s always a bigger fish.|Your focus determines your reality. 2
12 Admiral Ackbar It's a trap! 1

The demo above should display the following output. We're displaying just a few rows of your result to demonstrate the idea behind the exercise. Your result should have the expected shape attribute as well.

name quotes quote_count
8 C-3PO We seem to be made to suffer. It’s our lot in life.|We’re doomed. 2
11 Qui-Gon Jinn There’s always a bigger fish.|Your focus determines your reality. 2
12 Admiral Ackbar It's a trap! 1

result shape: (19, 3)


The result is derived from these rows in quotes:

id character_name quote source
0 5 Admiral Ackbar It's a trap! Return of the Jedi
1 15 Qui-Gon Jinn Your focus determines your reality. The Phantom Menace
2 18 Qui-Gon Jinn There’s always a bigger fish. The Phantom Menace
3 48 C-3PO We’re doomed. A New Hope
4 62 C-3PO We seem to be made to suffer. It’s our lot in life. A New Hope

SQLite testing

For the testing, we will execute your query against a number of test connections with similar structure. We will compare the DataFrame result with our pre-computed solution. As usual, both will be given back to you with the debug variables. Additionally, the same in-memory connection used for testing will be available under input_vars, alongside the usual parameters used by your function.

Since we're comparing DataFrames in the test, values aren't the only concern.

  • Column types can be assumed to carry forward unless the prompt indicates otherwise. For example, if you are extracting part of a string that is a number, leave it a string unless we ask for a number.
    • The exceptions are obvious situations where the type must change. For example, you can't do math with strings.
  • Row sort orders should be well defined in the prompt if the test is sensitive to ordering.
  • Column names and sequence should match what is given in the prompt.
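To illustrate the first bullet with a hypothetical column: extracting a numeric substring leaves you with strings, and you convert only when arithmetic is actually required.

```python
import pandas as pd

df = pd.DataFrame({'code': ['A-01', 'B-22']})

# Extracting the numeric part of a string yields a string column, so leave
# it as a string unless the prompt asks for a number.
df['num_str'] = df['code'].str.extract(r'-(\d+)', expand=False)
print(df['num_str'].dtype)  # object (i.e. strings)

# Convert only when the task genuinely requires math with the values.
df['num'] = df['num_str'].astype(int)
print(df['num'].sum())  # 23
```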


The cell below will test your solution for concatenate_quotes (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [5]:
### Test Cell - Exercise 0  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(concatenate_quotes),
              ex_name='concatenate_quotes',
              key=b'rWbhjie9GE-jNW-VVYzg9udUnA-yo4GMSFzz3dRN9SE=', 
              n_iter=10)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to concatenate_quotes did not pass the test.'


print('Passed! Please submit.')
concatenate_quotes test ran 10 iterations in 0.12 seconds
Passed! Please submit.

Data for Pandas exercises

For these, you will be referring to DataFrames instead of tables. DataFrames are described much as tables are, but other Python objects come into play much more frequently.

Run the cell below to load two DataFrames containing more Star Wars data.

In [6]:
### Run Me!!!
char_df = utils.load_object_from_publicdata('char_df')
spec_df = utils.load_object_from_publicdata('spec_df')
Successfully loaded char_df.
Successfully loaded spec_df.

char_df contains character demographic information like their name, species, gender, etc. Character names are distinct.

Nerdy aside that you can skip:
The "year" columns are lacking some information. Years in the fictional universe are relative to the events of the original film. It's similar to the western BC/AD convention. We would not be able to use these columns without a qualifier (otherwise it looks like the "Yoda" character died 892 years before he was born).

In [7]:
display(char_df.head())
id name species gender height weight hair_color eye_color skin_color year_born homeworld year_died description
0 1 Luke Skywalker Human Male 1.72 77.0 Blond Blue Light 19.0 Tatooine 34.0 The main protagonist of the original trilogy.
1 2 Leia Organa Human Female 1.50 49.0 Brown Brown Light 19.0 Alderaan 35.0 A leader in the Rebel Alliance and twin sister...
2 3 Darth Vader Human Male 2.02 136.0 None Yellow Pale 41.0 Tatooine 4.0 The Sith Lord formerly known as Anakin Skywalker.
3 4 Yoda Yoda's species Male 0.66 17.0 White Brown Green 896.0 Unknown 4.0 A wise and powerful Jedi Master.
4 5 Han Solo Human Male 1.80 80.0 Brown Hazel Light 29.0 Corellia 34.0 A smuggler turned hero in the Rebel Alliance.

spec_df contains summary data on the different species in the Star Wars universe.

It looks like the name column in this DataFrame is a potential way to link this frame to char_df by way of its species column.

In [8]:
display(spec_df.head())
id name classification designation average_height skin_colors hair_colors eye_colors average_lifespan language homeworld
0 1 Human Mammal Sentient 1.80 Light, Dark Various Various 79.0 Galactic Basic Various
1 2 Yoda's species Unknown Sentient 0.66 Green White Brown 900.0 Galactic Basic Unknown
2 3 Wookiee Mammal Sentient 2.28 Brown Brown Blue 400.0 Shyriiwook Kashyyyk
3 4 Gungan Amphibian Sentient 1.96 Orange None Orange 70.0 Gungan Naboo
4 5 Twi'lek Mammal Sentient 1.80 Blue, Green, Red, Yellow None Various 80.0 Twi'leki Ryloth

Pandas exercise format

For these, you will be working directly with the Pandas objects to create outputs. This is exactly like what you see in Notebook 7 and whenever tabular data comes from something besides a DB connection.

Possible variations

  • Literally anything in the course aside from SQLite.

What you won't see

  • dfs_to_conn - this made an appearance on some prior exams. It leverages an in-memory database connection to read DataFrames into SQLite as tables. You would then be able to write a SQL query to extract the data back out into a DataFrame. The results would often diverge from a pure Pandas implementation in subtle ways, because both conversions rely on deriving things like column types and index types from the data. While the exercises were solvable, the difficulty often depended more than we would like on which approach students happened to start with. This utility will not be available on future exams, and you should not rely on it. Learn both Pandas and SQLite.

What you should not do

  • Attempt to create a SQLite connection.

TL;DR

Treat these exercises as if SQLite doesn't exist.

Exercise 1: (5 points)

species_count

Your task: define species_count as follows:

Determine the count of characters in char_df belonging to each classification in spec_df.

Args:

  • char_df (pd.DataFrame): contains character data including the 'name' and 'species' columns.
  • spec_df (pd.DataFrame): contains species data including the 'name' and 'classification' columns.

Returns:

  • pd.DataFrame: contains the following columns:
    • classification (str): species classification. Each row should have a distinct value in this column.
    • count (int): count indicating how many characters from the char_df are members of a species with the classification.

Hint:

  • There is no sorting requirement.
  • The char_df['species'] column relates to the spec_df['name'] column.
  • Characters without a corresponding entry in spec_df should not contribute to any count.
In [9]:
### Solution - Exercise 1  
def species_count(char_df, spec_df):
    merged = char_df.merge(spec_df, left_on='species', right_on='name')
    return merged['classification'].value_counts().reset_index().set_axis(['classification', 'count'], axis=1)
    # You could append .sort_values('classification').reset_index(drop=True) - you can sort differently and still pass!

### Demo function call
display(species_count(char_df, spec_df))
classification count
0 Mammal 74
1 Amphibian 4
2 Artificial 4
3 Reptilian 3
4 Hybrid 3
5 Unknown 1
6 Gastropod 1

The demo above should display this output.

classification count
0 Mammal 74
1 Amphibian 4
2 Artificial 4
3 Reptilian 3
4 Hybrid 3
5 Unknown 1
6 Gastropod 1

Pandas exercise tests

These tests are identical to the usual tests. The caveat is that DataFrames carry around a lot of attributes that can be sensitive to implementation.

  • pandas.testing.assert_frame_equal() will not only tell you if two DataFrames match, but it will highlight the difference. It's what the test uses under the hood to check your result.
  • assert_frame_equal is sensitive to row/column ordering. If the prompt mentions that the test should not take the ordering into account you will have to "canonicalize" the DataFrames before comparing.
  • We have blocked SQLite connections in the test cells. The test will fail if your solution uses a SQLite connection.
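For example, one common way to "canonicalize" two DataFrames before comparing them is to sort on a key column and reset the index (the frames and key column here are illustrative):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same content, different row order and index.
a = pd.DataFrame({'classification': ['Mammal', 'Amphibian'], 'count': [74, 4]})
b = pd.DataFrame({'classification': ['Amphibian', 'Mammal'], 'count': [4, 74]})

def canonicalize(df):
    # Sort rows on a key column and discard the old index so that frames
    # with identical content compare equal regardless of original order.
    return df.sort_values('classification').reset_index(drop=True)

assert_frame_equal(canonicalize(a), canonicalize(b))
print('frames match after canonicalization')
```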


The cell below will test your solution for species_count (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [10]:
### Test Cell - Exercise 1  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(species_count),
              ex_name='species_count',
              key=b'rWbhjie9GE-jNW-VVYzg9udUnA-yo4GMSFzz3dRN9SE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to species_count did not pass the test.'


print('Passed! Please submit.')
species_count test ran 100 iterations in 1.30 seconds
Passed! Please submit.