SQLite and Pandas: Exercise format examples

Version 0.0.2
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of Pandas and SQLite. It has 2 exercises, numbered 0 to 1. There are 10 available points; however, to earn 100% the threshold is 10 points. (Therefore, once you hit 10 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment `### Run Me!!!` load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
- Exercise 0: 5 point(s)
- Exercise 1: 5 point(s)
Final reminders:
### Global imports
import dill
from cse6040_devkit import plugins, utils
import sqlite3
import pandas as pd
import numpy as np
This notebook is meant to familiarize students with the format of SQLite- and Pandas-based exercises. The implementation of these examples is very simple, much simpler than what you should expect on an exam. Throughout this notebook, the "narrative" will be interrupted by explanations of what you can expect to see.
There will be a database connection, and some details about the relevant tables will be given in the introduction. You are going to run queries against a similarly structured connection, so it's a good idea to familiarize yourself with the basics to get a sense of what you're working with.
For this example, we have a SQLite connection to `star_wars.db`, which contains a bunch of fact tables about the Star Wars film franchise.
conn = sqlite3.connect('resource/asnlib/publicdata/star_wars.db')
pd.read_sql('SELECT * FROM quotes LIMIT 5', conn)
Of particular interest to us is the `quotes` table. The context is that `character_name` spoke the `quote` in the `source` (film). The primary key for the table is `id`. The `character_name` values are expected to contain duplicates, indicating multiple quotes associated with a single character.
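If you want to see those duplicates for yourself, a quick aggregation makes them visible. This exploratory check is not part of any exercise; it just confirms the structure described above.

### Optional sanity check (exploration only, not required)
display(pd.read_sql('''
    SELECT character_name, COUNT(*) AS n_quotes
    FROM quotes
    GROUP BY character_name
    ORDER BY n_quotes DESC
    LIMIT 5
''', conn))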
For these exercises, we will ask you to complete a function that returns a string. That string is a SQLite query. In the most basic case, the function takes no parameters and you can just return a static string. Since you're returning a query, solving this type of exercise with Pandas isn't a feasible option.
Possible variations

You will not see

You're free to work with connections and Pandas to debug. The demos are set up to execute your query, and you can use the same pattern for checking against the test data (see the sketch below). You don't need to include either of the following in your solution:
- `execute`, `fetch`, `fetchall`, `close`, `commit`, etc.
- `pd.read_sql`
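For instance, here is a sketch of that debugging pattern. The function below is a stand-in for your solution, not part of any exercise; the point is that whatever query string your function returns can be run with `pd.read_sql` against the demo connection.

### Debugging sketch (the function here is a stand-in, not a solution)
def my_debug_query():
    return 'SELECT character_name, source FROM quotes LIMIT 3'

display(pd.read_sql(my_debug_query(), conn))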
In the next exercise, we will construct a query to extract the quotes data. Down the line, we need a mapping of each `character_name` to all of their quotes. We should make the extraction query do some of the work on the database side of the connection to reduce the number of rows we have in the result.

Toward that end, a good choice is to concatenate the individual quotes (using a delimiter) into one string. The downstream process can easily parse the result. We should also get a count of the quotes for each character so that the downstream process can validate the parsing.
You may not be familiar with string concatenation like this in SQL, but the counts should be like riding a bike at this point.
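If `GROUP_CONCAT` is new to you, the throwaway in-memory example below shows its behavior. The table and values are made up purely for illustration. Note that SQLite does not formally guarantee the order of concatenation, which is why the solution below pre-sorts its input.

### GROUP_CONCAT illustration on a throwaway in-memory table (made-up data)
tmp = sqlite3.connect(':memory:')
tmp.execute("CREATE TABLE t (k TEXT, v TEXT)")
tmp.executemany("INSERT INTO t VALUES (?, ?)", [('a', 'x'), ('a', 'y'), ('b', 'z')])
display(pd.read_sql("SELECT k, GROUP_CONCAT(v, '|') AS vs, COUNT(*) AS n FROM t GROUP BY k", tmp))
tmp.close()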
Exercise 0: `concatenate_quotes`

Your task: define `concatenate_quotes` as follows:

Re-organize the data in the `quotes` table such that all `quote` values associated with a particular `character_name` are in a single row.

Returns:
- `name` - The name of a character.
- `quotes` - All `quote` values associated with the character, concatenated together with `'|'` as the delimiter.
- `quote_count` - The count of quotes associated with the character.

Hints:
- Use `GROUP_CONCAT` for concatenating the strings.
- The name `quotes` (with the "s") refers to both the source table and a target column. The name `quote` (no "s") refers to the source column.

### Solution - Exercise 0
def concatenate_quotes():
    return '''
    SELECT
        character_name AS name,
        GROUP_CONCAT(quote, '|') AS quotes,
        COUNT(*) AS quote_count
    FROM (SELECT * FROM quotes ORDER BY quote)  -- pre-sorting makes the concatenation order predictable in practice
    GROUP BY character_name
    ORDER BY quote_count DESC, name;
    '''
### Demo function call
with pd.option_context('display.max_colwidth', 0):
    demo_cq_result = pd.read_sql(concatenate_quotes(), conn)
    demo_cq_result['quotes'] = demo_cq_result['quotes'].str.wrap(100)
    display(demo_cq_result[demo_cq_result['name'].isin({'C-3PO', 'Qui-Gon Jinn', 'Admiral Ackbar'})])
The demo above should display the following output. We're displaying just a few rows of your result to demonstrate the idea behind the exercise. Your result should have the expected `shape` attribute as well.
|    | name | quotes | quote_count |
|---|---|---|---|
| 8 | C-3PO | We seem to be made to suffer. It’s our lot in life.\|We’re doomed. | 2 |
| 11 | Qui-Gon Jinn | There’s always a bigger fish.\|Your focus determines your reality. | 2 |
| 12 | Admiral Ackbar | It's a trap! | 1 |

result shape: (19, 3)
The result is derived from these rows in `quotes`:

|    | id | character_name | quote | source |
|---|---|---|---|---|
| 0 | 5 | Admiral Ackbar | It's a trap! | Return of the Jedi |
| 1 | 15 | Qui-Gon Jinn | Your focus determines your reality. | The Phantom Menace |
| 2 | 18 | Qui-Gon Jinn | There’s always a bigger fish. | The Phantom Menace |
| 3 | 48 | C-3PO | We’re doomed. | A New Hope |
| 4 | 62 | C-3PO | We seem to be made to suffer. It’s our lot in life. | A New Hope |
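As a quick illustration of the "downstream can validate the parsing" idea from earlier, the sketch below splits the concatenated quotes back apart and checks the piece counts against `quote_count`. It reuses `demo_cq_result` from the demo cell above.

### Downstream parsing sketch: split the quotes back out and validate the counts
parsed = demo_cq_result['quotes'].str.split('|')
assert (parsed.str.len() == demo_cq_result['quote_count']).all()
print('Every row parses back into quote_count pieces.')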
For the testing, we will execute your query against a bunch of test connections with a similar structure. We will compare the DataFrame result with our pre-computed solution. As usual, both will be given back to you with the debug variables. Additionally, the same in-memory connection used for testing will be available under `input_vars`, alongside the usual parameters used by your function.
Since we're comparing DataFrames in the test, values aren't the only concern.
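Concretely, attributes like these participate in the comparison alongside the values, so inspecting them on the demo result can save you a failed test run.

### Attributes that matter when DataFrames are compared
print(demo_cq_result.shape)    # row/column counts
print(demo_cq_result.dtypes)   # column types
print(demo_cq_result.columns)  # column names and order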
The cell below will test your solution for concatenate_quotes (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any `key:value` pair in `original_input_vars` should also exist in `input_vars` - otherwise, the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 0
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sql_executor(concatenate_quotes),
                                       ex_name='concatenate_quotes',
                                       key=b'rWbhjie9GE-jNW-VVYzg9udUnA-yo4GMSFzz3dRN9SE=',
                                       n_iter=10)

# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to concatenate_quotes did not pass the test.'
print('Passed! Please submit.')
For the following exercises, you will be referring to DataFrames instead of tables. The explanation for DataFrames is similar to the one for tables, but other Python objects come into play a lot more frequently.
Run the cell below to load two DataFrames containing more Star Wars data.
### Run Me!!!
char_df = utils.load_object_from_publicdata('char_df')
spec_df = utils.load_object_from_publicdata('spec_df')
`char_df` contains character demographic information like their name, species, gender, etc. Character names are distinct.

Nerdy aside that you can skip: the "year" columns are lacking some information. Years in the fictional universe are relative to the events of the original film. It's similar to the western BC/AD convention. We would not be able to use these columns without a qualifier (otherwise it looks like the "Yoda" character died 892 years before he was born).
display(char_df.head())
`spec_df` contains summary data on the different species in the Star Wars universe.

It looks like the `name` column in this DataFrame is a potential way to link this frame to `char_df` by way of its `species` column.
display(spec_df.head())
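One optional way to convince yourself of that linkage is to check what fraction of characters have a `species` value that appears among the `name` values, as in this exploratory sketch.

### Optional linkage check (exploration only, not required)
print(char_df['species'].isin(spec_df['name']).mean())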
For these exercises, you will be working directly with the Pandas objects to create outputs. This is exactly like what you see in Notebook 7, and it applies whenever tabular data comes from something besides a DB connection.
Possible variations

What you won't see:
- `dfs_to_conn` - this made an appearance on some prior exams. It leverages an in-memory database connection to read DataFrames into SQLite as tables. You would then be able to write a SQL query to extract the data back out again into a DataFrame. The results would often diverge from a pure Pandas implementation in subtle ways, because both conversions rely on deriving things like column types and index types from the data. While the exercises were solvable, the difficulty often depended more than we would like on which approach students happened to start with. This utility will not be available for use on future exams, and you should not rely on it. Learn both Pandas and SQLite. (A rough sketch of what such a utility does appears below.)

What you should not do:
- Treat these exercises as if SQLite doesn't exist.
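For context only, here is a rough sketch of what such a utility does under the hood, using Pandas' own `to_sql`. This illustrates the idea; it is not the retired utility itself, and you should not build your solutions around it.

### Rough sketch of a dfs_to_conn-style round trip (illustration only)
conn_mem = sqlite3.connect(':memory:')
char_df.to_sql('char_df', conn_mem, index=False)  # DataFrame -> SQLite table
spec_df.to_sql('spec_df', conn_mem, index=False)
display(pd.read_sql('SELECT * FROM char_df LIMIT 3', conn_mem))  # ...and back out
conn_mem.close()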
Exercise 1: `species_count`

Your task: define `species_count` as follows:

Determine the count of characters in `char_df` belonging to each `classification` in `spec_df`.

Args:
- `char_df` (pd.DataFrame): contains character data, including the 'name' and 'species' columns.
- `spec_df` (pd.DataFrame): contains species data, including the 'name' and 'classification' columns.

Returns:
- `classification` (str): species classification. Each row should have a distinct value in this column.
- `count` (int): count indicating how many characters from `char_df` are members of a species with that `classification`.

Hints:
- The `char_df['species']` column relates to the `spec_df['name']` column.
- Rows in `spec_df` that match no characters should not contribute to any count.

### Solution - Exercise 1
def species_count(char_df, spec_df):
    # An inner merge keeps only characters whose species appears in spec_df
    merged = char_df.merge(spec_df, left_on='species', right_on='name')
    return merged['classification'].value_counts().reset_index().set_axis(['classification', 'count'], axis=1)
    # You can sort differently, e.g. .sort_values('classification').reset_index(drop=True), and still pass!
### Demo function call
display(species_count(char_df, spec_df))
The demo above should display this output.

|    | classification | count |
|---|---|---|
| 0 | Mammal | 74 |
| 1 | Amphibian | 4 |
| 2 | Artificial | 4 |
| 3 | Reptilian | 3 |
| 4 | Hybrid | 3 |
| 5 | Unknown | 1 |
| 6 | Gastropod | 1 |
These tests are identical to the usual test. The caveat is that DataFrames carry around a lot of attributes that can be sensitive to implementation.

- `pandas.testing.assert_frame_equal()` will not only tell you if two DataFrames match, but it will also highlight the differences. It's what the test uses under the hood to check your result.
- `assert_frame_equal` is sensitive to row/column ordering. If the prompt mentions that the test should not take the ordering into account, you will have to "canonicalize" the DataFrames before comparing (see the sketch below).
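Here is a hedged sketch of that canonicalization idea. The helper name and the debug-variable keys are hypothetical; check the actual keys in your own test run before uncommenting the comparison.

### Canonicalization sketch (helper name and variable keys are hypothetical)
from pandas.testing import assert_frame_equal

def canonicalize(df):
    # Sort columns, then rows, then forget the index, so pure ordering
    # differences cannot fail the comparison.
    out = df.sort_index(axis=1)
    return out.sort_values(list(out.columns)).reset_index(drop=True)

# assert_frame_equal(canonicalize(returned_output_vars['output_0']),
#                    canonicalize(true_output_vars['output_0']))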
The cell below will test your solution for species_count (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.

- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. Any `key:value` pair in `original_input_vars` should also exist in `input_vars` - otherwise, the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

### Test Cell - Exercise 1
# Load testing utility
with open('resource/asnlib/publicdata/execute_tests', 'rb') as f:
    execute_tests = dill.load(f)

# Execute test
passed, test_case_vars = execute_tests(func=plugins.sqlite_blocker(species_count),
                                       ex_name='species_count',
                                       key=b'rWbhjie9GE-jNW-VVYzg9udUnA-yo4GMSFzz3dRN9SE=',
                                       n_iter=100)

# Assign test case vars for debugging
input_vars, original_input_vars, returned_output_vars, true_output_vars = test_case_vars

assert passed, 'The solution to species_count did not pass the test.'
print('Passed! Please submit.')