SQLite and Pandas: Exercise format examples

Version 0.0.2

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of Pandas and SQLite. It has 2 exercises, numbered 0 to 1. There are 10 points available. However, to earn 100%, the threshold is 10 points. (Therefore, once you hit 10 points you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### Run Me!!! load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or a chunk of rows at a time).

Exercise point breakdown:

  • Exercise 0: 5 point(s)

  • Exercise 1: 5 point(s)

Final reminders:

  • Submit after every exercise
  • Review the generated grade report after you submit to see what errors were returned
  • Stay calm, skip problems as needed, and take short breaks at your leisure
In [1]:
### Global imports
import dill
from cse6040_devkit import plugins, utils
import sqlite3 
import pandas as pd 
import numpy as np 

Introduction

This notebook familiarizes students with the format of SQLite- and Pandas-based exercises. The implementation of these examples is very simple, much simpler than you should expect on an exam. Throughout this notebook, the "narrative" will be interrupted by explanations of what you should expect to see.

Data for SQLite exercises

There will be a database connection and some details about the relevant tables given in the introduction. You are going to run queries against a similarly structured connection, so it's a good idea to familiarize yourself with the basics to get a sense of what you're working with.

For this example we have a SQLite connection to star_wars.db which contains a bunch of fact tables about the Star Wars film franchise.

In [2]:
conn = sqlite3.connect('resource/asnlib/publicdata/star_wars.db')
In [3]:
pd.read_sql('SELECT * FROM quotes LIMIT 5', conn)
Out[3]:
id character_name quote source
0 1 Luke Skywalker May the Force be with you. A New Hope
1 2 Darth Vader I am your father. The Empire Strikes Back
2 3 Yoda Do or do not, there is no try. The Empire Strikes Back
3 4 Han Solo I've got a bad feeling about this. A New Hope
4 5 Admiral Ackbar It's a trap! Return of the Jedi

Of particular interest to us is the quotes table. The context is that character_name spoke the quote in the source (film). The primary key for the table is id. The character_name values are expected to contain duplicates, indicating multiple quotes associated with a single character.

SQLite exercise format

For these exercises we will ask you to complete a function that returns a string. That string is a SQLite query. In the most basic case the function will take no parameters and you can just return a static string. Since you're returning a query, there isn't a feasible option to solve this type of exercise with Pandas.

Possible variations

  • Inputs to customize the query
  • Input validation (i.e. raise an error if a given input doesn't meet a condition)

You will not see

  • A connection as an input
  • Table creation
  • Data inserts/deletes/updates

You're free to work with connections and Pandas to debug. The demos are set up to execute your query. You can use the same pattern for checking against the test data. You don't need to include either of the following in your solution.

  • Anything related to a connection.
    • i.e. execute, fetch, fetchall, close, commit, etc.
  • Reading into Pandas
    • i.e. pd.read_sql
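For instance, a debugging session might look like the following sketch. The in-memory database, the toy rows, and the name my_query are hypothetical stand-ins for the notebook's star_wars.db connection and your own solution function:

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the notebook's star_wars.db connection.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE quotes (id INTEGER PRIMARY KEY, "
             "character_name TEXT, quote TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO quotes (character_name, quote, source) VALUES (?, ?, ?)",
    [('Yoda', 'Do or do not, there is no try.', 'The Empire Strikes Back'),
     ('Han Solo', "I've got a bad feeling about this.", 'A New Hope')])

def my_query():  # stand-in for your solution function
    return "SELECT character_name, source FROM quotes"

# pd.read_sql handles the execute/fetch bookkeeping for you, so your
# solution only needs to return the query string.
debug_df = pd.read_sql(my_query(), conn)
print(debug_df)
```

The point is that the connection plumbing lives outside your solution; your function stays a pure string-builder.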

Extracting quote data

In the next exercise we will construct a query to extract the quotes data. Down the line, we need a mapping of each character_name to all of their quotes. We should make the extract query do some of the work on the database side of the connection to reduce the number of rows we have in the result.

Towards that end, a good choice is to concatenate the individual quotes (using a delimiter) into one string. The downstream process can easily parse the result. We should also get a count of the quotes for each character so that the downstream can validate the parsing.

You may not be familiar with string concatenation like this in SQL, but the counts should be like riding a bike at this point.
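As a sketch of that downstream idea (the function name and sample row are hypothetical, not part of any exercise), the parsing plus count validation could look like:

```python
# Hypothetical downstream parser for one row of the extract query's result:
# quotes are pipe-delimited, and the per-character count lets the consumer
# confirm that nothing was lost or mangled in transit.
def parse_quotes_row(quotes_str, quote_count):
    quotes = quotes_str.split('|')
    assert len(quotes) == quote_count, "parsed quote count does not match"
    return quotes

row = ("We seem to be made to suffer. It's our lot in life.|We're doomed.", 2)
print(parse_quotes_row(*row))
```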

Exercise 0: (5 points)

concatenate_quotes

Your task: define concatenate_quotes as follows:

Re-organize the data in the quotes table such that all quote values associated with a particular character_name are in a single row.

Returns:

  • str: a sql query that creates the following columns:
    • name - The name of a character
    • quotes - All quote values associated with the character, concatenated together.
      • Quotes should be ordered alphabetically.
      • Quotes should be separated by the "pipe" character '|'

Hints:

  • Look into using GROUP_CONCAT for concatenating the strings.
  • Think about using a subquery to pre-sort the quotes before applying other logic. It's much easier than trying to use ORDER BY as part of a GROUP_CONCAT function call.
  • The name quotes (with the "s") refers to both the source table and a target column. The name quote (no "s") refers to the source column.
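If GROUP_CONCAT is new to you, here is a minimal toy illustration of the subquery hint, separate from the exercise itself. (In practice, SQLite concatenates in the subquery's order for simple cases like this, though the SQL standard does not strictly guarantee it.)

```python
import sqlite3
import pandas as pd

# Toy table, unrelated to star_wars.db: two groups with unsorted values.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (grp TEXT, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [('a', 'zebra'), ('a', 'apple'), ('b', 'mango')])

# Pre-sort in a subquery so GROUP_CONCAT sees the rows in a
# well-defined (alphabetical) order before concatenating.
df = pd.read_sql("""
    SELECT grp, GROUP_CONCAT(val, '|') AS vals, COUNT(*) AS n
    FROM (SELECT * FROM t ORDER BY val)
    GROUP BY grp
    """, conn)
print(df)
```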
In [4]:
### Solution - Exercise 0  
def concatenate_quotes():
    return '''
    SELECT 
        character_name AS name,
        GROUP_CONCAT(quote, '|') AS quotes,
        COUNT(*) AS quote_count
    FROM (SELECT * FROM quotes ORDER BY quote)
    GROUP BY character_name
    ORDER BY quote_count DESC, name;
    '''

### Demo function call
with pd.option_context('display.max_colwidth', 0):
    demo_cq_result = pd.read_sql(concatenate_quotes(), conn)
    demo_cq_result['quotes'] = demo_cq_result['quotes'].str.wrap(100)
    display(demo_cq_result[demo_cq_result['name'].isin({'C-3PO', 'Qui-Gon Jinn', 'Admiral Ackbar'})])
name quotes quote_count
8 C-3PO We seem to be made to suffer. It’s our lot in life.|We’re doomed. 2
11 Qui-Gon Jinn There’s always a bigger fish.|Your focus determines your reality. 2
12 Admiral Ackbar It's a trap! 1

The demo above should display the following output. We're displaying just a few rows of your result to demonstrate the idea behind the exercise. Your result should have the expected shape attribute as well.

name quotes quote_count
8 C-3PO We seem to be made to suffer. It’s our lot in life.|We’re doomed. 2
11 Qui-Gon Jinn There’s always a bigger fish.|Your focus determines your reality. 2
12 Admiral Ackbar It's a trap! 1

result shape: (19, 3)


The result is derived from these rows in quotes:

id character_name quote source
0 5 Admiral Ackbar It's a trap! Return of the Jedi
1 15 Qui-Gon Jinn Your focus determines your reality. The Phantom Menace
2 18 Qui-Gon Jinn There’s always a bigger fish. The Phantom Menace
3 48 C-3PO We’re doomed. A New Hope
4 62 C-3PO We seem to be made to suffer. It’s our lot in life. A New Hope

SQLite testing

For the testing, we will execute your query against a number of test connections with similar structure. We will compare the DataFrame result with our pre-computed solution. As usual, both will be given back to you with the debug variables. Additionally, the same in-memory connection used for testing will be available under input_vars, alongside the usual parameters used by your function.

Since we're comparing DataFrames in the test, values aren't the only concern.

  • Column types can be assumed to carry forward unless the prompt indicates otherwise. For example, if you are extracting part of a string that is a number, leave it a string unless we ask for a number.
    • The exceptions are obvious situations where the type must change. For example, you can't do math with strings.
  • Row sort orders should be well defined in the prompt if the test is sensitive to ordering.
  • Column names and sequence should match what is given in the prompt.
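To illustrate the first bullet with a hypothetical column: extracting a numeric substring leaves you with strings, and you convert only when arithmetic is actually required.

```python
import pandas as pd

df = pd.DataFrame({'code': ['A-01', 'B-22']})

# Extracting the numeric part of a string yields a string column, so leave
# it as a string unless the prompt asks for a number.
df['num_str'] = df['code'].str.extract(r'-(\d+)', expand=False)
print(df['num_str'].dtype)  # object (i.e. strings)

# Convert only when the task genuinely requires math with the values.
df['num'] = df['num_str'].astype(int)
print(df['num'].sum())  # 23
```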


The cell below will test your solution for concatenate_quotes (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [5]:
### Test Cell - Exercise 0  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(concatenate_quotes),
              ex_name='concatenate_quotes',
              key=b'rWbhjie9GE-jNW-VVYzg9udUnA-yo4GMSFzz3dRN9SE=', 
              n_iter=10)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to concatenate_quotes did not pass the test.'


print('Passed! Please submit.')
concatenate_quotes test ran 10 iterations in 0.12 seconds
Passed! Please submit.

Data for Pandas exercises

For these, you will be referring to DataFrames instead of tables. DataFrames are described much as tables are, but other Python objects come into play much more frequently.

Run the cell below to load two DataFrames containing more Star Wars data.

In [6]:
### Run Me!!!
char_df = utils.load_object_from_publicdata('char_df')
spec_df = utils.load_object_from_publicdata('spec_df')
Successfully loaded char_df.
Successfully loaded spec_df.

char_df contains character demographic information like their name, species, gender, etc. Character names are distinct.

Nerdy aside that you can skip:
The "year" columns are lacking some information. Years in the fictional universe are relative to the events of the original film. It's similar to the western BC/AD convention. We would not be able to use these columns without a qualifier (otherwise it looks like the "Yoda" character died 892 years before he was born).

In [7]:
display(char_df.head())
id name species gender height weight hair_color eye_color skin_color year_born homeworld year_died description
0 1 Luke Skywalker Human Male 1.72 77.0 Blond Blue Light 19.0 Tatooine 34.0 The main protagonist of the original trilogy.
1 2 Leia Organa Human Female 1.50 49.0 Brown Brown Light 19.0 Alderaan 35.0 A leader in the Rebel Alliance and twin sister...
2 3 Darth Vader Human Male 2.02 136.0 None Yellow Pale 41.0 Tatooine 4.0 The Sith Lord formerly known as Anakin Skywalker.
3 4 Yoda Yoda's species Male 0.66 17.0 White Brown Green 896.0 Unknown 4.0 A wise and powerful Jedi Master.
4 5 Han Solo Human Male 1.80 80.0 Brown Hazel Light 29.0 Corellia 34.0 A smuggler turned hero in the Rebel Alliance.

spec_df contains summary data on the different species in the Star Wars universe.

It looks like the name column in this DataFrame is a potential way to link this frame to char_df by way of its species column.

In [8]:
display(spec_df.head())
id name classification designation average_height skin_colors hair_colors eye_colors average_lifespan language homeworld
0 1 Human Mammal Sentient 1.80 Light, Dark Various Various 79.0 Galactic Basic Various
1 2 Yoda's species Unknown Sentient 0.66 Green White Brown 900.0 Galactic Basic Unknown
2 3 Wookiee Mammal Sentient 2.28 Brown Brown Blue 400.0 Shyriiwook Kashyyyk
3 4 Gungan Amphibian Sentient 1.96 Orange None Orange 70.0 Gungan Naboo
4 5 Twi'lek Mammal Sentient 1.80 Blue, Green, Red, Yellow None Various 80.0 Twi'leki Ryloth

Pandas exercise format

For these, you will be working directly with the Pandas objects to create outputs. This is exactly like what you see in Notebook 7 and whenever tabular data comes from something besides a DB connection.

Possible variations

  • Literally anything in the course aside from SQLite.

What you won't see

  • dfs_to_conn - this made an appearance on some prior exams. It leverages an in-memory database connection to read DataFrames into SQLite as tables. You would then be able to write a SQL query to extract the data back out into a DataFrame. The results would often diverge from a pure Pandas implementation in subtle ways, because both conversions rely on deriving things like column types and index types from the data. While the exercises were solvable, the difficulty often depended more than we would like on which approach students happened to start with. This utility will not be available on future exams, and you should not rely on it. Learn both Pandas and SQLite.

What you should not do

  • Attempt to create a SQLite connection.

TL;DR

Treat these exercises as if SQLite doesn't exist.

Exercise 1: (5 points)

species_count

Your task: define species_count as follows:

Determine the count of characters in char_df belonging to each classification in spec_df.

Args:

  • char_df (pd.DataFrame): contains character data including the 'name' and 'species' columns.
  • spec_df (pd.DataFrame): contains species data including the 'name' and 'classification' columns.

Returns:

  • pd.DataFrame: contains the following columns:
    • classification (str): species classification. Each row should have a distinct value in this column.
    • count (int): count indicating how many characters from the char_df are members of a species with the classification.

Hint:

  • There is no sorting requirement.
  • The char_df['species'] column relates to the spec_df['name'] column.
  • Characters without a corresponding entry in spec_df should not contribute to any count.
In [9]:
### Solution - Exercise 1  
def species_count(char_df, spec_df):
    merged = char_df.merge(spec_df, left_on='species', right_on='name')
    return merged['classification'].value_counts().reset_index().set_axis(['classification', 'count'], axis=1)
    # You could append .sort_values('classification').reset_index(drop=True) - you can sort differently and still pass!

### Demo function call
display(species_count(char_df, spec_df))
classification count
0 Mammal 74
1 Amphibian 4
2 Artificial 4
3 Reptilian 3
4 Hybrid 3
5 Unknown 1
6 Gastropod 1

The demo above should display this output.

classification count
0 Mammal 74
1 Amphibian 4
2 Artificial 4
3 Reptilian 3
4 Hybrid 3
5 Unknown 1
6 Gastropod 1

Pandas exercise tests

These tests are identical to the usual tests. The caveat is that DataFrames carry around a lot of attributes that can be sensitive to implementation.

  • pandas.testing.assert_frame_equal() will not only tell you if two DataFrames match, but it will highlight the difference. It's what the test uses under the hood to check your result.
  • assert_frame_equal is sensitive to row/column ordering. If the prompt mentions that the test should not take the ordering into account you will have to "canonicalize" the DataFrames before comparing.
  • We have blocked SQLite connections in the test cells. The test will fail if your solution uses a SQLite connection.
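For example, one common way to "canonicalize" two DataFrames before comparing them is to sort on a key column and reset the index (the frames and key column here are illustrative):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same content, different row order and index.
a = pd.DataFrame({'classification': ['Mammal', 'Amphibian'], 'count': [74, 4]})
b = pd.DataFrame({'classification': ['Amphibian', 'Mammal'], 'count': [4, 74]})

def canonicalize(df):
    # Sort rows on a key column and discard the old index so that frames
    # with identical content compare equal regardless of original order.
    return df.sort_values('classification').reset_index(drop=True)

assert_frame_equal(canonicalize(a), canonicalize(b))
print('frames match after canonicalization')
```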


The cell below will test your solution for species_count (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

  • input_vars - Input variables for your solution.
  • original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
  • returned_output_vars - Outputs returned by your solution.
  • true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
In [10]:
### Test Cell - Exercise 1  

# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(species_count),
              ex_name='species_count',
              key=b'rWbhjie9GE-jNW-VVYzg9udUnA-yo4GMSFzz3dRN9SE=', 
              n_iter=100)
# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to species_count did not pass the test.'


print('Passed! Please submit.')
species_count test ran 100 iterations in 1.30 seconds
Passed! Please submit.