Midterm 1 - Fall 2024: Professor Vuduc Writes a Pop Song
Version 1.0.2
History:
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of string processing, nested data, and sorting. It has 10 exercises, numbered 0 to 9. There are 18 available points. However, to earn 100% the threshold is 13 points. (Therefore, once you hit 13 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
Exercise 0 - : 1 FREE point(s)
Exercise 1 - : 1 point(s)
Exercise 2 - : 2 point(s)
Exercise 3 - : 1 point(s)
Exercise 4 - : 3 point(s)
Exercise 5 - : 2 point(s)
Exercise 6 - : 2 point(s)
Exercise 7 - : 2 point(s)
Exercise 8 - : 3 point(s)
Exercise 9 - : 1 point(s)
Final reminders:
Run the following cells to set up the problem.
%load_ext autoreload
%autoreload 2
import dill
import re
from pprint import pprint
from cse6040_devkit import plugins
Your overall task: Professor Vuduc wants to become a pop star, but he needs your help to write his new hit song! First, you'll need to analyze Spotify's Most Streamed Songs from 2023, and determine what attributes his song should contain. Then, you'll analyze lyrics from some of these most streamed artists. Finally, we will combine the results of this analysis to build a lyric generator capable of writing Professor Vuduc's new song!
The datasets:
The first dataset is the metadata for Spotify's Most Streamed Songs from 2023. The Spotify dataset was sourced from here.
The second dataset is the raw lyrics dataset. The lyrics were scraped from Genius using their LyricsGenius Python client.
Run the cells below to load the data, and view samples of the two variables: spotify_metadata and raw_lyrics
Note: You might notice that some of the raw_lyrics differ slightly from the original song lyrics. This is because we are using the Kidz Bop version of the lyrics for each artist/song in raw_lyrics. Professor Vuduc wants to ensure the whole family can enjoy his new hit song!
spotify_metadata
with open('resource/asnlib/publicdata/spotify_metadata.dill', 'rb') as fp:
    spotify_metadata = dill.load(fp)
print(f"=== Success: Loaded {len(spotify_metadata):,} Spotify song metadata records. ===")
print(f"\nExample: Records 0 and 7:\n")
pprint([spotify_metadata[k] for k in [0, 7]])
raw_lyrics
with open('resource/asnlib/publicdata/lyrics.dill', 'rb') as fp:
    raw_lyrics = dill.load(fp)
print(f"=== Success: Loaded song lyrics from {len(raw_lyrics):,} artists. ===")
print(f"=== Success: Loaded {len([song for song_dict in raw_lyrics.values() for song in song_dict]):,} song lyrics total. ===")
print(f"\nExample: Harry Styles - 'As It Was' First 12 Lines:\n")
pprint({'Harry Styles': {'As It Was': raw_lyrics['Harry Styles']['As It Was'][:12]}})
spotify_metadata__FREE
This is a free exercise!
The first dataset we will be working with is the metadata of the most streamed songs on Spotify in 2023: spotify_metadata is a list of dictionaries, where each dictionary contains the information for a single song.
Each dictionary contains the following keys, and all values are of the data type string:
Please run the test cell below to collect your FREE point
### Test Cell - Exercise 0
print('Passed! Please submit.')
compute_song_stats
Your task: Define compute_song_stats as follows: Compute the average BPM, danceability, and number of streams for the Spotify Top Songs of 2023.
Input: spotify_metadata: A list of dictionaries, as described in Exercise 0.
Return: A tuple containing the following, in order: (average_bpm, average_danceability, average_streams)
Requirements:
- Compute the average bpm, average danceability, and average streams for all songs in spotify_metadata
- Round each average to the nearest integer (as the demo output shows)
- Convert the average streams to be a string with commas as the thousands separator. For example, the integer 1000 should be represented as '1,000'
Hint: This example https://stackoverflow.com/questions/1823058/how-to-print-a-number-using-commas-as-thousands-separators may help with the formatting requirement
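The formatting requirement can be handled with the `,` option of Python's format specification mini-language. The snippet below is a minimal sketch; `format_with_commas` is a hypothetical helper name used only for illustration and is not part of the starter code.

```python
def format_with_commas(value: int) -> str:
    """Format an integer with commas as the thousands separator."""
    return f"{value:,}"

print(format_with_commas(1000))       # 1,000
print(format_with_commas(138367321))  # 138,367,321
```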
### Solution - Exercise 1
def compute_song_stats(spotify_metadata: list) -> tuple:
    # 1. Create a function which calculates the average for any
    #    collection of values associated with a key in the input dictionaries
    # 2. Use the function to calculate the average values
    # 3. Use Python string formatting to properly format the stream string
    # 4. Return the values as a tuple
    # You could also define the mean function yourself!
    from statistics import mean
    # Take any attribute string and calculate the rounded average over the Spotify metadata
    def avg_calcer(attribute):
        return round(mean([int(song[attribute]) for song in spotify_metadata]))
    # Calculate our averages
    avg_bpm = avg_calcer('bpm')
    avg_danceability = avg_calcer('danceability_%')
    avg_streams = avg_calcer('streams')
    # Format the average streams as a string with thousands separators
    avg_streams = f"{avg_streams:,}"
    # The parentheses are optional here!
    return (avg_bpm, avg_danceability, avg_streams)
### Demo function call
song_stats_demo_input = spotify_metadata[:3]
print(compute_song_stats(song_stats_demo_input))
Example. A correct implementation should produce, for the demo, the following output:
(118, 67, '138,367,321')
The cell below will test your solution for compute_song_stats (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['compute_song_stats']['config']
ex_conf['func'] = compute_song_stats
tester = Tester(ex_conf, key=b'xH1i2Ha4qxJQ5O6vK9uj_3UQoel_h6vw4MwPpikJhzw=', path='resource/asnlib/publicdata/')
for _ in range(200):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
find_songs_by_artist
Your task: Define find_songs_by_artist as follows: Reorganize Spotify metadata into a dictionary of lists of tuples, containing the song information for each artist.
Input: spotify_metadata: A list of dictionaries, as described in Exercise 0.
Return: songs_by_artist: A dictionary of lists of tuples, where the artist is the key, and the value is a list of tuples of the form: (track_name, streams)
Requirements:
- Split the artist_name field as needed to handle multiple artists. Multiple artists are separated by commas
- Strip leading and trailing whitespace from the artist_name strings
- Convert streams to an integer, and sort the list of tuples for each artist by the number of streams in descending order
Note: If the same artist is listed multiple times for the same song, that song should appear the same number of times in that artist's list.
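The splitting step can be sketched on its own. In the snippet below, `split_artists` is a hypothetical helper name and the sample strings are made up for illustration; the actual artist_name values in the dataset may be formatted differently.

```python
def split_artists(artist_name: str) -> list:
    """Split a comma-separated artist string and trim surrounding whitespace."""
    return [name.strip() for name in artist_name.split(",")]

print(split_artists("Latto, Jung Kook"))  # ['Latto', 'Jung Kook']
print(split_artists("BTS"))               # ['BTS']
```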
### Solution - Exercise 2
def find_songs_by_artist(spotify_metadata: list) -> dict:
    # 1. Create and initialize our dictionary
    # 2. Loop through each song and get the artists
    # 3. Clean the artist names up
    # 4. For each artist, add the track name and streams to the dictionary
    # 5. Sort each artist's list
    # 6. Return the dictionary
    # Import our default dictionary and initialize it
    from collections import defaultdict
    songs_by_artist = defaultdict(list)
    # Loop through each song
    for song in spotify_metadata:
        # Get the artist names
        artists = song['artist_name'].split(",")
        artists = [artist.strip() for artist in artists]
        # Add the song to our dictionary for each artist
        for artist in artists:
            songs_by_artist[artist].append((song['track_name'], int(song['streams'])))
    # Sort the lists of songs
    for artist in songs_by_artist:
        songs_by_artist[artist] = sorted(songs_by_artist[artist],
                                         key=lambda x: x[1],
                                         reverse=True)
    # Return a plain dictionary of our results
    return dict(songs_by_artist)
### Demo function call
songs_by_artist_demo_input = [spotify_metadata[k] for k in [0, 41]]
pprint(find_songs_by_artist(songs_by_artist_demo_input))
Example. A correct implementation should produce, for the demo, the following output:
{'BTS': [('Left and Right', 720434240)],
'Charlie Puth': [('Left and Right', 720434240)],
'Jung Kook': [('Left and Right', 720434240), ('Seven', 141381703)],
'Latto': [('Seven', 141381703)]}
The cell below will test your solution for find_songs_by_artist (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 2
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['find_songs_by_artist']['config']
ex_conf['func'] = find_songs_by_artist
tester = Tester(ex_conf, key=b'LyTdGFBVK3zjO6u1Afo9Py7pkIGtCg0_OS8DIGKe05Q=', path='resource/asnlib/publicdata/')
for _ in range(300):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload the results of Exercise 2 in the global variable songs_by_artist.
with open('resource/asnlib/publicdata/top_songs.dill', 'rb') as fp:
    songs_by_artist = dill.load(fp)
print(f"=== Success: Loaded stats from {len(songs_by_artist):,} artists. ===")
discover_top_artists
Your task: Define discover_top_artists as follows: Given the result of Exercise 2, return a list of the top X artists with the most songs in the Spotify metadata.
Input:
- songs_by_artist: A dictionary of lists of tuples, where the artist is the key, and the value is a list of tuples of the form: (track_name, streams)
- X: An integer representing the maximum number of tuples to return.
Return: top_artists: A list of X tuples of the form: (artist name, number of songs, number of total streams)
Requirements:
- Sort the artists by number of songs in descending order, breaking ties by number of total streams in descending order
- Return at most X tuples
### Solution - Exercise 3
def discover_top_artists(songs_by_artist: dict, X: int) -> list:
    # 1. Define a function which calculates the number
    #    of songs and total streams for a given list of songs
    # 2. Use a list comprehension to build the tuples (the solution below
    #    uses "tuple unpacking", which is a bit of an advanced concept! See
    #    https://docs.python.org/3/tutorial/controlflow.html#tut-unpacking-arguments
    #    for more details)
    # 3. Sort the output. You can use tuple comparisons, or sort twice, like we do here.
    # 4. Return up to X results.
    def calc_artist_output(songs):
        total_songs = len(songs)
        total_streams = sum(song[1] for song in songs)
        return total_songs, total_streams
    artist_counts = [
        (artist, *calc_artist_output(songs_by_artist[artist]))
        for artist in songs_by_artist
    ]
    # This is an example of using Python's sort stability to do hierarchical sorting!
    # Sort by the secondary key (total streams) first, then by the primary key (number of songs).
    artist_counts_sorted = sorted(artist_counts,
                                  key=lambda x: x[2],
                                  reverse=True)
    artist_counts_sorted = sorted(artist_counts_sorted,
                                  key=lambda x: x[1],
                                  reverse=True)
    return artist_counts_sorted[:X]
### Demo function call
top_artists_demo_input = {k: songs_by_artist[k] for k in ['Latto', 'Jung Kook', 'Myke Towers']}
print(discover_top_artists(top_artists_demo_input, 2))
Example. A correct implementation should produce, for the demo, the following output:
[('Jung Kook', 5, 1469963422), ('Latto', 1, 141381703)]
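The solution's comments mention that tuple comparisons can replace the double sort. The snippet below is a small sketch of that equivalence on made-up tuples (the artist names and numbers are hypothetical, not taken from the dataset): negating both numeric fields in a single tuple key gives the same descending order as sorting twice with a stable sort.

```python
# Hypothetical (artist, song_count, total_streams) tuples for illustration.
artist_counts = [
    ("Artist A", 2, 500),
    ("Artist B", 3, 100),
    ("Artist C", 2, 900),
    ("Artist D", 3, 100),
]

# Sort twice: secondary key first, then primary key (relies on stable sorting).
two_pass = sorted(artist_counts, key=lambda t: t[2], reverse=True)
two_pass = sorted(two_pass, key=lambda t: t[1], reverse=True)

# Sort once: negate both numeric fields to get descending order on each.
one_pass = sorted(artist_counts, key=lambda t: (-t[1], -t[2]))

print(one_pass == two_pass)         # True
print([t[0] for t in one_pass])     # ['Artist B', 'Artist D', 'Artist C', 'Artist A']
```

Negation only works for numeric keys; when a key is a string, the two-pass approach is usually the simpler choice.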
The cell below will test your solution for discover_top_artists (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['discover_top_artists']['config']
ex_conf['func'] = discover_top_artists
tester = Tester(ex_conf, key=b'6oIUE7811jz51nFds0V_2Ya-FtZEQ6kDeqgrqEhv7Oo=', path='resource/asnlib/publicdata/')
for _ in range(500):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
We will be working with data from raw_lyrics for the remainder of the exercises.
raw_lyrics is a dictionary of dictionaries of lists, where the outermost key is the artist name, the value is a dictionary where the keys are the song titles and their values are lists containing the raw lyric data. See below for an example of a single artist's entry in raw_lyrics:
pprint({'Doja Cat': raw_lyrics['Doja Cat']})
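To make the nesting concrete, here is a small sketch that walks a structure of the same shape. `toy_lyrics` is a hypothetical stand-in with made-up artists, titles, and lines; it is not the real raw_lyrics data.

```python
# Hypothetical miniature stand-in for raw_lyrics (not real data).
toy_lyrics = {
    "Artist A": {"Song 1": ["line one", "line two"], "Song 2": ["only line"]},
    "Artist B": {"Song 3": ["la la", "la", "la la la"]},
}

# Count the lines in every song, keyed by (artist, song title).
line_counts = {
    (artist, title): len(lines)
    for artist, songs in toy_lyrics.items()
    for title, lines in songs.items()
}

print(line_counts[("Artist B", "Song 3")])  # 3
print(len(line_counts))                     # 3
```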
cleanse_lyrics
Your task: Define cleanse_lyrics as follows: Given a list of lyrics for a single song, cleanse the text of each line, and return the cleansed list.
Input: lyrics_list: A list of lyrics for one song, where each element of the list corresponds to one line of lyrics in that song.
Return: cleansed_lyrics_list: A list of song lyrics, where the raw text is cleansed as outlined in the rules below.
Recommended Steps:
### Solution - Exercise 4
def cleanse_lyrics(lyrics_list: list) -> list:
    # NOTE: This approach /does not use RegEx/. This is to show how you /could/
    # do something like this. Personally, I would recommend using RegEx, but doing
    # this without it is a good exercise to check your understanding.
    #
    # 1. Join all of the lists into one big string
    # 2. Remove every pair of parentheses
    # 3. Make everything lowercase and remove hyphens
    # 4. Split everything by newline, so we can iterate over the characters directly
    # 5. Only keep the valid characters, like single quotes and whitespace
    # 6. Join the characters together and strip the whitespace
    # 7. Filter out empty lines and return the results
    # Import constants
    from string import ascii_letters
    # Join all lines into one string, to take out parentheses.
    # The separator is the literal two-character marker \n (a raw string), not a newline.
    combined_lyrics = r"\n".join(lyrics_list)
    # Strip out parenthesized contents
    while '(' in combined_lyrics:
        left_idx = combined_lyrics.index('(')
        # Search for the closing parenthesis only after the opening one, so a
        # stray ')' earlier in the text cannot cause an endless loop
        right_idx = combined_lyrics.find(')', left_idx)
        if right_idx == -1:
            break
        combined_lyrics = combined_lyrics[:left_idx] + combined_lyrics[right_idx + 1:]
    # Lowercase, hyphens
    lower_lyrics = combined_lyrics.lower()
    no_hyphens = lower_lyrics.replace("-", " ")
    # Split everything apart so we don't have to worry about newlines
    lyric_lines = no_hyphens.split(r'\n')
    # Remove bad characters
    for idx in range(len(lyric_lines)):
        chars = []
        for char in lyric_lines[idx]:
            if char in ascii_letters or char in "' ":
                chars.append(char)
        lyric_lines[idx] = ''.join(chars).strip()
    # Remove empty lines
    lyric_lines = list(filter(lambda x: x, lyric_lines))
    return lyric_lines
### Demo function call
cleanse_lyrics_demo_input = raw_lyrics['Doja Cat']['Say So'][-27:-19] + raw_lyrics['Doja Cat']['Say So'][-7:]
pprint(cleanse_lyrics(cleanse_lyrics_demo_input))
Example. A correct implementation should produce, for the demo, the following output:
['tell me what must i do',
"'cause luckily i'm good at reading",
"i wouldn't tell him but he won't stop cheesin'",
'and we can dance all day around it',
"if you frontin' i'll be bouncing",
'if you know it scream it shout it babe',
'before i leave you dry',
"didn't like to know it keep with me in the moment",
"i'd say it had i known it why don't you say so",
"didn't even notice no punches left to roll with",
'you got to keep me focused you know it say so',
'you might also like',
'ooh ah ha ah ah ha ah ha ah ha',
'ooh ah ha ah ah ha ah ha ah ha']
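The solution notes that a regular-expression approach is also possible. The sketch below mirrors the steps listed in the solution's comments rather than any official rule list; `cleanse_lyrics_regex` is a hypothetical name, the sample lines are made up, and the code assumes the input lines contain no embedded newline characters.

```python
import re

def cleanse_lyrics_regex(lyrics_list: list) -> list:
    """Regex-based sketch of the same cleansing steps (hypothetical name)."""
    # Join with real newlines so parentheses that span lines can be removed.
    text = "\n".join(lyrics_list)
    # Drop each parenthesized span; [^)] also matches newlines.
    text = re.sub(r"\([^)]*\)", "", text)
    # Lowercase and turn hyphens into spaces.
    text = text.lower().replace("-", " ")
    cleaned = []
    for line in text.split("\n"):
        # Keep only lowercase ASCII letters, single quotes, and spaces.
        line = re.sub(r"[^a-z' ]", "", line).strip()
        if line:
            cleaned.append(line)
    return cleaned

sample = ["Tell me (ooh)", "What must I do (yeah", "yeah)", "Ooh-ah!", "(skip)"]
print(cleanse_lyrics_regex(sample))
# ['tell me', 'what must i do', 'ooh ah']
```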
The cell below will test your solution for cleanse_lyrics (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['cleanse_lyrics']['config']
ex_conf['func'] = cleanse_lyrics
tester = Tester(ex_conf, key=b'9x2HROuOnqJCQO021wsYDHF6s5EXXs4IA6GVAD-AB-k=', path='resource/asnlib/publicdata/')
for _ in range(1000):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload the results of Exercise 4 in the global variable all_cleansed_lyrics.
with open('resource/asnlib/publicdata/all_cleansed_lyrics.dill', 'rb') as fp:
    all_cleansed_lyrics = dill.load(fp)
print(f"=== Success: Loaded cleansed lyrics from {len(all_cleansed_lyrics):,} artists. ===")
vibe_check
Your task: Define vibe_check as follows: Given a list of cleansed lyrics, identify which words appear most frequently in a song, so that we can determine the "vibe" of the song.
To do this effectively, we should first remove any stop words from the lyrics before counting the occurrences of each word. Stop words are a set of words that are so commonly used in the English language that they carry little useful information for our analysis.
Input:
- cleansed_lyrics_list: A list of cleansed lyrics from a single song.
- X: An integer representing the maximum number of words to return.
Return: top_vibes: A set of up to the top X most common words found in the lyrics.
Requirements:
- Remove any words found in STOP_WORDS
- Return up to the top X most common words. If the counts of two words are the same, sort by length of the word in descending order. If two words share the same count and length, then sort by the order words are encountered in the input.
Hint: Remember that sets in Python are unordered. Your result must contain the same words as the expected result, but the order may differ!
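The tie-breaking rule can be traced on a tiny example. The stop words and lyric lines below are made up for illustration and are not the exam's STOP_WORDS or lyrics data; the sketch uses a single negated tuple key instead of the sorting style shown in the solution.

```python
from collections import Counter

# Hypothetical stop words and lyrics, for illustration only.
toy_stop_words = {"the", "my"}
toy_lines = ["la la the heart", "oh my heart"]

counts = Counter(
    word
    for line in toy_lines
    for word in line.split()
    if word not in toy_stop_words
)

# Counter keeps first-seen order, and sorted() is stable, so ties on
# (count, length) fall back to the order in which words were encountered.
ranked = sorted(counts, key=lambda w: (-counts[w], -len(w)))

print(ranked)           # ['heart', 'la', 'oh']
print(set(ranked[:2]))  # a set containing 'heart' and 'la'
```

Here 'la' and 'heart' both appear twice, so the longer word 'heart' ranks first even though 'la' was encountered earlier.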
### Load our Stop Words:
with open('resource/asnlib/publicdata/stopwords.dill', 'rb') as fp:
    STOP_WORDS = dill.load(fp)
print(f"=== Success: Loaded {len(STOP_WORDS):,} stop words. ===")
### Solution - Exercise 5
def vibe_check(cleansed_lyrics_list: list, X: int) -> set:
    # 1. Import and make an instance of a counter
    # 2. Loop over each line and count the words in the string
    # 3. Get the keys from our counter and filter out the stop words
    # 4. Sort the remaining keys
    # 5. Keep only the first X keys and put them into a set
    # Approach 1 ----------------------------------------------------------------------------
    # Seems like a good opportunity to use a Counter!
    from collections import Counter
    lyric_count = Counter()
    # Loop over each line and split the line into words
    for line in cleansed_lyrics_list:
        words = line.split()
        lyric_count.update(words)
    # Get the distinct words, in the order they were first encountered
    lyrics = list(lyric_count.keys())
    # Could also use filter()
    lyrics = [lyric for lyric in lyrics if lyric not in STOP_WORDS]
    # METHOD 1: Tuple comparison (the sort is stable, so remaining ties keep encounter order)
    lyrics = sorted(lyrics,
                    key=lambda lyric: (lyric_count[lyric], len(lyric)),
                    reverse=True)
    # METHOD 2: Double Sort
    # lyrics = sorted(lyrics, key=lambda lyric: len(lyric), reverse=True)
    # lyrics = sorted(lyrics, key=lambda lyric: lyric_count[lyric], reverse=True)
    # Create and return the set
    return set(lyrics[:X])
    # Approach 2 -----------------------------------------------------------------------------
    # NOTE: The function has already returned above, so this code never runs.
    # It is shown only as an alternative implementation.
    # Seems like a good opportunity to use a defaultdict!
    from collections import defaultdict
    lyric_count = defaultdict(int)
    # Loop over each line and split the line into words
    for line in cleansed_lyrics_list:
        words = line.split()
        # If the word isn't a stopword, increase the count
        for word in words:
            if word not in STOP_WORDS:
                lyric_count[word] += 1
    # Sort our lyric counts: secondary key (length) first, then primary key (count)
    lyrics = list(lyric_count.keys())
    lyrics = sorted(lyrics,
                    key=lambda lyric: len(lyric),
                    reverse=True)
    lyrics = sorted(lyrics,
                    key=lambda lyric: lyric_count[lyric],
                    reverse=True)
    # Create and return the set
    return set(lyrics[:X])
### Demo function call
vibe_check_demo_input = all_cleansed_lyrics['Taylor Swift']['Cruel Summer']
print(vibe_check(vibe_check_demo_input, 3))
Example. A correct implementation should produce, for the demo, the following output:
{'oh', 'summer', 'cruel'}
The cell below will test your solution for vibe_check (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 5
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['vibe_check']['config']
ex_conf['func'] = vibe_check
tester = Tester(ex_conf, key=b'aAJm0NmL6IXkjrs_VRLsb8ZU7tC-cbIjMxDJY-9Hpts=', path='resource/asnlib/publicdata/')
for _ in range(200):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
generate_bigrams
Your task: Define generate_bigrams as follows: Given a list of the cleansed lyrics, find the count of all word bigrams.
What is a bigram? A bigram is a pair of consecutive written units. In our exercise, we want to find the counts of pairs of consecutive words in each line of lyrics.
Input:
- cleansed_lyrics_list: A list of cleansed lyrics from a single song.
Return:
- bigrams_dict: A dictionary in which the key is a tuple (first_word, second_word), and the value is the count of the number of times that bigram appears in the lyrics.
Requirements:
Example: For the line 'you might also like', the bigrams would be ('you', 'might'), ('might', 'also'), and ('also', 'like').
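One compact way to form the pairs for a single line is to zip the word list with itself shifted by one position. This is a sketch of that idea only; `pairwise_words` is a hypothetical helper name, not part of the starter code.

```python
def pairwise_words(line: str) -> list:
    """Return the consecutive word pairs in one line (hypothetical helper)."""
    words = line.split()
    # Pair each word with its successor; zip stops at the shorter sequence.
    return list(zip(words, words[1:]))

print(pairwise_words("you might also like"))
# [('you', 'might'), ('might', 'also'), ('also', 'like')]
print(pairwise_words("ooh"))  # []
```

Because the pairs are built one line at a time, no bigram spans the end of one line and the start of the next.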
### Solution - Exercise 6
def generate_bigrams(cleansed_lyrics_list: list) -> dict:
    # 1. Import and create a counter
    # 2. Loop over each line
    # 3. Split the line into words
    # 4. Create the bigrams from the list of words
    # 5. Count the bigrams generated in step 4
    # Helper that builds the bigrams for one list of words (named differently
    # from the outer function so it does not shadow it)
    def line_bigrams(word_list):
        num_bigrams = len(word_list) - 1
        bigram_pointer = range(num_bigrams)
        bigrams = [
            (word_list[idx], word_list[idx + 1])
            for idx in bigram_pointer
        ]
        return bigrams
    # Import and create the counter
    from collections import Counter
    bigram_count = Counter()
    # Loop over each line and split the line into words
    for line in cleansed_lyrics_list:
        words = line.split()
        # Generate and count the bigrams
        bigrams = line_bigrams(words)
        bigram_count.update(bigrams)
    return dict(bigram_count)
### Demo function call
bigrams_demo_input = all_cleansed_lyrics['Doja Cat']['Say So'][-3:]
pprint(generate_bigrams(bigrams_demo_input))
Example. A correct implementation should produce, for the demo, the following output:
{('ah', 'ah'): 2,
('ah', 'ha'): 8,
('also', 'like'): 1,
('ha', 'ah'): 6,
('might', 'also'): 1,
('ooh', 'ah'): 2,
('you', 'might'): 1}
The cell below will test your solution for generate_bigrams (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['generate_bigrams']['config']
ex_conf['func'] = generate_bigrams
tester = Tester(ex_conf, key=b'0SJ3lw8EW4pH2sJ0vG2hYTV27GP_7FmJHNnV3MYGTsc=', path='resource/asnlib/publicdata/')
for _ in range(200):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload the results of Exercise 6 in the global variable bigrams_dict.
with open('resource/asnlib/publicdata/all_bigrams.dill', 'rb') as fp:
    bigrams_dict = dill.load(fp)
print(f"=== Success: Loaded {len(bigrams_dict):,} bigrams. ===")
rhyme_time
Your task: Define rhyme_time as follows: Given a list of cleansed lyrics, generate a rhyming dictionary using the last word in each line of lyrics.
Input:
- cleansed_lyrics_list: A list of cleansed lyrics from a single song.
Return:
- rhyming_dict: A dictionary in which the key is the last word of a lyric line, and its value is a set of all of the last words found in cleansed_lyrics_list that rhyme with the key.
Requirements:
- Use the helper function rhyme_checker to determine if two words rhyme
- rhyme_checker returns True if the two words provided rhyme and False otherwise
- Record each rhyme in both directions, so that both 'are': {'car'} and 'car': {'are'} appear
### Load our Rhyming Data
with open('resource/asnlib/publicdata/rhyming_dict.dill', 'rb') as fp:
    rhyme_lookup = dill.load(fp)
with open('resource/asnlib/publicdata/lookup.dill', 'rb') as fp:
    lookup = dill.load(fp)
### Helper Function
def rhyme_checker(word1, word2):
    return word1 in plugins.rhymes(word2, lookup, rhyme_lookup)
### Solution - Exercise 7
def rhyme_time(cleansed_lyrics_list: list) -> dict:
    # 1. Create a default dictionary
    # 2. Get all of the last words
    # 3. For each last word, generate a set of the other last words which rhyme with it
    # 4. Add each of the rhyming words to the set in the rhyming dictionary
    # 5. Return the dictionary
    def get_last_word(line):
        words = line.split()
        return words[-1]
    def get_rhymes(word, possible_rhymes):
        valid_rhymes = set(
            rhyme for rhyme in possible_rhymes
            if rhyme_checker(word, rhyme)
        )
        return valid_rhymes
    # Create a defaultdict
    from collections import defaultdict
    rhyming_dict = defaultdict(set)
    # Get the last words
    last_words = set(
        get_last_word(line)
        for line
        in cleansed_lyrics_list
    )
    # For each word, get all valid rhymes and add them to the set
    for word in last_words:
        rhymes = get_rhymes(word, last_words)
        for rhyme in rhymes:
            rhyming_dict[word].add(rhyme)
    # Return the dictionary
    return dict(rhyming_dict)
### Demo function call
rhyming_demo_input = all_cleansed_lyrics['Taylor Swift']['Cruel Summer']
pprint(rhyme_time(rhyming_demo_input))
Example. A correct implementation should produce, for the demo, the following output:
{'are': {'car'},
'below': {'oh'},
'car': {'are'},
'fate': {'gate'},
'gate': {'fate'},
'lying': {'trying'},
'oh': {'below'},
'true': {'you'},
'trying': {'lying'},
'you': {'true'}}
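The shape of this output (every rhyme recorded in both directions, and words without a rhyming partner left out) can be reproduced with a stand-in checker. In the sketch below, `TOY_RHYME_GROUPS` and `toy_rhyme_checker` are hypothetical and replace the exam's rhyme_checker; the sketch also assumes a word is not treated as rhyming with itself, which is what the demo output suggests.

```python
from collections import defaultdict

# Hypothetical rhyme groups; the exam uses rhyme_checker instead.
TOY_RHYME_GROUPS = [{"are", "car"}, {"fate", "gate"}, {"summer"}]

def toy_rhyme_checker(word1: str, word2: str) -> bool:
    """Return True when two different words share a toy rhyme group."""
    return word1 != word2 and any(
        word1 in group and word2 in group for group in TOY_RHYME_GROUPS
    )

last_words = {"are", "car", "fate", "gate", "summer"}
rhyming = defaultdict(set)
for word in last_words:
    for other in last_words:
        # A key is only created when at least one rhyme is found.
        if toy_rhyme_checker(word, other):
            rhyming[word].add(other)

expected = {"are": {"car"}, "car": {"are"}, "fate": {"gate"}, "gate": {"fate"}}
print(dict(rhyming) == expected)  # True
```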
The cell below will test your solution for rhyme_time (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['rhyme_time']['config']
ex_conf['func'] = rhyme_time
tester = Tester(ex_conf, key=b'ayf9kq6by6ehuJv9J_-MRoQ7ae8BwPXEwout_w2hu4o=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
count_syllables
Your task: Define count_syllables as follows: Given a set of words, find the number of syllables in each word.
Input:
set_of_words: A set of words, such as {'are', 'car', 'trying'}
Return:
syllable_dict: A dictionary in which the keys are the words found in set_of_words, and the value is the number of syllables in that word.
Requirements: To determine the number of syllables in a word:
1. If the word starts with a letter within vowels_no_y, add 1
2. If a letter within consonants is followed immediately by a letter within vowels, add 1 for each occurrence
3. If three or more letters within vowels appear consecutively, add 1 for each occurrence. For instance, 'uoyy' would be considered a valid match
4. Check whether the word ends with 'e'. If so, subtract 1, unless the word ends with 'le' and the preceding letter is a letter within consonants, then do nothing
5. If the syllable count is less than 1, set it to 1
Create a dictionary containing the word as a key and the number of syllables in that word as its value.
Hint: While not required, Regular Expressions might be helpful to use for Steps 1-4!
Examples:
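As a rough illustration of the rules above, the sketch below reports what each step contributes for a single word. The helper name trace_syllable_rules and the step labels are hypothetical and not part of the exercise; the regular expressions are one possible reading of Steps 1-4, with the count floored at 1 afterwards.

```python
import re

VOWELS_NO_Y = 'aeiou'
VOWELS = 'aeiouy'
CONSONANTS = 'bcdfghjklmnpqrstvwxz'

def trace_syllable_rules(word: str) -> dict:
    """Hypothetical helper: show the contribution of each rule for one word."""
    # Step 1: the word starts with a vowel other than 'y'
    leading_vowel = 1 if word[0] in VOWELS_NO_Y else 0
    # Step 2: every consonant immediately followed by a vowel (including 'y')
    consonant_vowel_pairs = len(re.findall(f'[{CONSONANTS}][{VOWELS}]', word))
    # Step 3: every run of three or more consecutive vowels
    long_vowel_runs = len(re.findall(f'[{VOWELS}]{{3,}}', word))
    # Step 4: a trailing 'e' subtracts 1 unless the word ends in consonant + 'le'
    ends_in_e = word.endswith('e')
    consonant_le = re.search(f'[{CONSONANTS}]le$', word) is not None
    trailing_e = -1 if ends_in_e and not consonant_le else 0
    raw_total = leading_vowel + consonant_vowel_pairs + long_vowel_runs + trailing_e
    return {
        'step1_leading_vowel': leading_vowel,
        'step2_consonant_vowel_pairs': consonant_vowel_pairs,
        'step3_long_vowel_runs': long_vowel_runs,
        'step4_trailing_e': trailing_e,
        'total': max(raw_total, 1),  # never report fewer than one syllable
    }

for word in ['trouble', 'queue', 'the']:
    print(word, trace_syllable_rules(word))
```

For 'trouble', the pairs 'ro' and 'le' each add 1 and the ending 'ble' blocks the subtraction, giving 2. For 'the', the raw total is 0, so the floor raises it to 1.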
### Solution - Exercise 8
def count_syllables(set_of_words: set) -> dict:
    vowels_no_y = 'aeiou'
    vowels = 'aeiouy'
    consonants = 'bcdfghjklmnpqrstvwxz'
    ###
    # REGEX APPROACH -----------------------------------------------------------
    # 1. Loop over each word and add the word: syllable count to a dictionary,
    #    using a custom function
    #    1.1. Add 1 if starting with a valid vowel
    #    1.2. Count the consonant/vowel pairs with regex
    #    1.3. Count the triple+ vowel occurrences
    #    1.4. Check the ending
    #    1.5. If the syllable count is less than 1, set it to 1.
    # 2. Return the dictionary
    # Import regex
    import re
    # Define our syllable count
    def count_word_syllables(word):
        syllables = 0
        # Step 1
        if word[0] in vowels_no_y:
            syllables += 1
        # Step 2
        pattern = f'[{consonants}][{vowels}]'
        counts = re.findall(pattern, word)
        syllables += len(counts)
        # Step 3
        triple_count = r"{3,}"
        pattern = f'[{vowels}]{triple_count}'
        counts = re.findall(pattern, word)
        syllables += len(counts)
        # Step 4
        if word[-1] == 'e' and not re.search(f'[{consonants}]le$', word):
            syllables -= 1
        # Step 5
        if syllables < 1:
            syllables = 1
        return syllables
    # Use our function to build the syllable dictionary
    syllables_dict = {
        word: count_word_syllables(word)
        for word in set_of_words
    }
    # Return the dictionary
    return syllables_dict
    ###
### Demo function call
syllables_demo_input = {'queue', 'luckily', 'quiet', 'the', 'a', 'you', 'trouble', 'irritate', 'stale'}
pprint(count_syllables(syllables_demo_input))
### Solution - Exercise 8
def count_syllables(set_of_words: set) -> dict:
    vowels_no_y = 'aeiou'
    vowels = 'aeiouy'
    consonants = 'bcdfghjklmnpqrstvwxz'
    # NON-REGEX APPROACH -----------------------------------------------------------
    # 1. Loop over each word and add the word: syllable count to a dictionary,
    #    using a custom function
    #    1.1. Add 1 if starting with a valid vowel
    #    1.2. Count the consonant/vowel pairs with a pointer loop
    #    1.3. Count the triple+ vowel occurrences
    #    1.4. Check the ending
    #    1.5. If the syllable count is less than 1, set it to 1.
    # 2. Return the dictionary
    def count_word_syllables(word):
        syllables = 0
        # Step 1
        if word[0] in vowels_no_y:
            syllables += 1
        # Step 2
        string_pointer = 0
        while string_pointer < len(word) - 1:
            is_consonant = word[string_pointer] in consonants
            followed_by_vowel = word[string_pointer + 1] in vowels
            if is_consonant and followed_by_vowel:
                syllables += 1
            string_pointer += 1
        # Step 3
        start_pointer = 0
        while start_pointer < len(word):
            if word[start_pointer] in vowels:
                # Advance end_pointer to the first position past this vowel run
                end_pointer = start_pointer
                while end_pointer < len(word) and word[end_pointer] in vowels:
                    end_pointer += 1
                if end_pointer - start_pointer >= 3:
                    syllables += 1
                start_pointer = end_pointer
            start_pointer += 1
        # Step 4
        ends_in_e = word[-1] == 'e'
        if len(word) >= 3:
            ends_in_le = word[-2:] == 'le'
            preceded_by_cons = word[-3] in consonants
        else:
            ends_in_le = False
            preceded_by_cons = False
        if ends_in_e and not (ends_in_le and preceded_by_cons):
            syllables -= 1
        # Step 5
        if syllables < 1:
            syllables = 1
        return syllables
    # Use our function to build the syllable dictionary
    syllables_dict = {
        word: count_word_syllables(word)
        for word in set_of_words
    }
    # Return the dictionary
    return syllables_dict
### Demo function call
syllables_demo_input = {'queue', 'luckily', 'quiet', 'the', 'a', 'you', 'trouble', 'irritate', 'stale'}
pprint(count_syllables(syllables_demo_input))
Example. A correct implementation should produce, for the demo, the following output:
{'a': 1,
 'irritate': 3,
 'luckily': 3,
 'queue': 1,
 'quiet': 2,
 'stale': 1,
 'the': 1,
 'trouble': 2,
 'you': 1}
The cell below will test your solution for count_syllables (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['count_syllables']['config']
ex_conf['func'] = count_syllables
tester = Tester(ex_conf, key=b'QGWv7NoU-PUzYjlptlEfeSBYFBQNaTeePcOXZAPcFbA=', path='resource/asnlib/publicdata/')
for _ in range(500):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
build_markov_process
Your task: Define build_markov_process as follows: Using the result from Exercise 6, generate a Markov process so our lyric generator can select the next word in a lyric with probabilities matching those found in our lyric dataset. Before attempting this exercise, make sure you have loaded the global variable bigrams_dict located in the 'Run Me' code cell following Exercise 6.
Input:
bigrams_dict: A dictionary containing bigram keys that are tuples (first_word, second_word), with values that are the count of the number of times that bigram appears in the lyrics.
Return:
markov_dict: A dictionary of lists, in which the key is the first word, and the value is a list containing all potential second words.
Requirements:
Create a dictionary markov_dict in which the keys are the first words from bigrams_dict, and each value is a list of the second words that follow that first word, with each second word duplicated the number of times specified by the count for that bigram.
Example:
For input {('first', 'second'): 2, ('first', 'other'): 1}, your result would be: {'first': ['second', 'second', 'other']}
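To see why duplicating each second word by its count yields the right probabilities, the sketch below starts from the example result above and recovers the implied transition probabilities. The helper name transition_probabilities is hypothetical, and the seeded random draw is only an illustration of how a generator might pick the next word.

```python
from collections import Counter
from random import Random

# Result from the example above
markov_dict = {'first': ['second', 'second', 'other']}

def transition_probabilities(markov_dict: dict, word: str) -> dict:
    """Hypothetical helper: probability of each next word after `word`."""
    counts = Counter(markov_dict[word])
    total = sum(counts.values())
    return {next_word: count / total for next_word, count in counts.items()}

probs = transition_probabilities(markov_dict, 'first')
print(probs)  # 'second' has probability 2/3 and 'other' has probability 1/3

# Picking uniformly from the duplicated list reproduces those probabilities
# in the long run, so no explicit weights are needed.
rng = Random(0)
draws = Counter(rng.choice(markov_dict['first']) for _ in range(3000))
print(draws)
```

The design trades memory for simplicity: the lists grow with the bigram counts, but selecting the next word becomes a single uniform draw.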
### Solution - Exercise 9
def build_markov_process(bigrams_dict: dict) -> dict:
    # 1. Create a default dictionary
    # 2. For each bigram:
    #    2.1. Get the first term
    #    2.2. Get the second term
    #    2.3. Add the second term to the default dictionary's first-term list,
    #         repeating for the number of times specified in bigrams_dict
    # 3. Return the dictionary
    from collections import defaultdict
    markov_dict = defaultdict(list)
    for bigram in bigrams_dict:
        key = bigram[0]
        value = bigram[1]
        for _ in range(bigrams_dict[bigram]):
            markov_dict[key].append(value)
    return dict(markov_dict)
### Demo function call
markov_demo_input = {k: bigrams_dict[k] for k in [('like', 'ah'), ('like', 'what'), ('ah', 'he'), ('he', 'got')]}
pprint(build_markov_process(markov_demo_input))
Example. A correct implementation should produce, for the demo, the following output:
{'ah': ['he'],
 'he': ['got', 'got', 'got'],
 'like': ['ah', 'ah', 'ah', 'ah', 'what', 'what', 'what', 'what']}
The cell below will test your solution for build_markov_process (exercise 9). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 9
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
    ex_conf = safe_load(f)['exercises']['build_markov_process']['config']
ex_conf['func'] = plugins.postprocess_sort_dict(build_markov_process)
tester = Tester(ex_conf, key=b'TfnCIOMcUBY0m-_81gdDHjykus0D9WgVS6gMasPLb_E=', path='resource/asnlib/publicdata/')
for _ in range(200):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
## Pre-processing Lyric Data:
# Step 0: Create one large list containing all song lyrics
all_lyrics_list = [line for song_dict in raw_lyrics.values() for lyrics in song_dict.values() for line in lyrics]
# Step 1: Cleanse lyrics of all songs (using Exercise 4)
all_cleansed_lyrics = cleanse_lyrics(all_lyrics_list)
# Step 2: Generate markov process from bigrams of all cleansed lyrics (using Exercises 6 and 9)
all_bigrams_dict = generate_bigrams(all_cleansed_lyrics)
all_markov_process = build_markov_process(all_bigrams_dict)
# Step 3: Generate rhyming words dictionary for all cleansed lyrics (using Exercise 7) (load this as it takes a long time to run)
with open('resource/asnlib/publicdata/epilogue_rhyming_dict.dill', 'rb') as fp:
    all_songs_rhyming_dict = dill.load(fp)
# Step 4: Generate syllables dictionary for all words in cleansed lyrics (using Exercise 8)
all_words = {word for line in all_cleansed_lyrics for word in line.split()}
all_words_syllables_dict = count_syllables(all_words)
# Step 5: Find most common starting words of each lyric line (optional, using Exercise 4 result)
with open('resource/asnlib/publicdata/epilogue_top_starting_words.dill', 'rb') as fp:
    top_starting_words_sorted = dill.load(fp)
## --------------------------------------------------------------------------------------------
## Create a function to generate a single line of lyrics:
# Step 6: Create generate_lyric function with inputs: starting word of each line of lyrics, desired number of syllables per line, and the word to rhyme end word with (optional):
# 1. Use global variable 'all_words_syllables_dict' to count syllables as we go
# 2. Generate next_word using global variable 'all_markov_process'
# 3. Select final word of lyric line from global variable 'all_songs_rhyming_dict'
from random import sample
def generate_lyric(first_word_of_lyric, desired_syllables, rhyming_word=None):
    total_syllable_count = all_words_syllables_dict[first_word_of_lyric]
    lyric = first_word_of_lyric
    next_word = first_word_of_lyric
    # If we were provided a rhyming word as input, find a word that rhymes with it to become the end of our next line of lyrics
    if rhyming_word:
        # list(...) because random.sample needs a sequence; sets are rejected on Python 3.11+
        new_rhyming_word = sample(list(all_songs_rhyming_dict[rhyming_word]), 1)[0]
        total_syllable_count += all_words_syllables_dict[new_rhyming_word]
    # While our number of syllables for the line of lyrics is less than the desired number of syllables, keep generating words
    while total_syllable_count < desired_syllables:
        prior_word = next_word
        next_word = sample(all_markov_process[prior_word], 1)[0]
        tries = 0
        # Retry up to 50 times to avoid a word that has no known successor
        while next_word not in all_markov_process and tries < 50:
            next_word = sample(all_markov_process[prior_word], 1)[0]
            tries += 1
        # Still stuck: fall back to any word that does have successors
        if next_word not in all_markov_process:
            next_word = sample(list(all_markov_process.keys()), 1)[0]
        lyric = lyric + ' ' + next_word
        total_syllable_count += all_words_syllables_dict[next_word]
    # If a rhyming word was not provided as input, randomly choose an ending rhyming word
    if rhyming_word is None:
        if next_word not in all_songs_rhyming_dict:
            final_word = sample(list(all_songs_rhyming_dict.keys()), 1)[0]
            lyric = lyric + ' ' + final_word
            total_syllable_count += all_words_syllables_dict[final_word]
            return lyric, final_word, min(total_syllable_count, 10)
        else:
            return lyric, next_word, min(total_syllable_count, 10)
    else:
        lyric = lyric + ' ' + new_rhyming_word
        return lyric, None, min(total_syllable_count, 10)
## --------------------------------------------------------------------------------------------
# Step 7: Repeatedly call Step 6's generate_lyric function and add to final list of lyrics
# Choose Song Structure:
# 2 verses of 6 lines each
# Chorus of 8 lines
# 2 verses of 6 lines each
# Same chorus of 8 lines again
verse_lyrics = []
chorus_lyrics = ['[Chorus:]']
syllable_count = 10
for j in range(4):
    verse_count = j+1
    verse_lyrics.append(f'[Verse {verse_count}:]')
    verse_starting_words = sample(top_starting_words_sorted, 6)
    for i, starting_word in enumerate(verse_starting_words):
        if i % 2:
            one_lyric_line, rhyming_word, syllable_count = generate_lyric(starting_word, syllable_count, rhyming_word)
            verse_lyrics.append(one_lyric_line)
        else:
            one_lyric_line, rhyming_word, syllable_count = generate_lyric(starting_word, syllable_count)
            verse_lyrics.append(one_lyric_line)
    verse_lyrics.append('\n')
chorus_starting_words = sample(top_starting_words_sorted, 8)
for i, starting_word in enumerate(chorus_starting_words):
    if i % 2:
        one_lyric_line, rhyming_word, syllable_count = generate_lyric(starting_word, syllable_count, rhyming_word)
        chorus_lyrics.append(one_lyric_line)
    else:
        one_lyric_line, rhyming_word, syllable_count = generate_lyric(starting_word, syllable_count)
        chorus_lyrics.append(one_lyric_line)
song_lyrics = verse_lyrics[:16] + chorus_lyrics + verse_lyrics[15:] + chorus_lyrics
# Each verse fills 8 slots (title, 6 lines, '\n'), so verse 3's lines sit at indices 17-22
song_lyrics_no_titles = verse_lyrics[1:7] + verse_lyrics[9:15] + chorus_lyrics[1:] + verse_lyrics[17:23] + verse_lyrics[25:31] + chorus_lyrics[1:]
## --------------------------------------------------------------------------------------------
# Step 8: Run vibe check on list of generated lyrics to choose a song title (using Exercise 5)
vibe_analysis = vibe_check(song_lyrics_no_titles, 5)
print('Vibe of the song: ', vibe_analysis, '\n')
## --------------------------------------------------------------------------------------------
# Step 9: Join lyric lines with newline characters and return full string of the song
final_song = '\n'.join(song_lyrics)
print(final_song)
{'oh', 'see', 'baby', 'get', 'like'}
[Verse 1:]
don't you say never fade away with california gurls we're free
let's hope with me trippin' oh oh yeah me
you're tired of my heart is be another
we gotta gotta know i'm giving brother
i've been movin' so just need to replace
baby my head still breathing fire fire face
[Verse 2:]
it's not fazed only want from all eyes
it was you might also like you see skies
all these dreams come back time we were right to
i'll be without ya i don't get achoo
and i got nothing on my enemy your hand
this love right here drippin' off and i'm sand
[Chorus:]
hey i've been a thing for the way that chico nice
if the way your friends talk to find other twice
when you're beautiful liar bad blood hey yeah it's so
baby that's caught up higher over yo
and uh huh you you get our very special with you
got fake people you get you like nah nah do
i'll never see it takes you ever
i'm on get a little bit of forever
[Verse 3:]
yeah yeah yeah i'm coming down together
we are full of the applause applause weather
'cause i cry me go crazy what makes you
what people hatin' say the ice cream for knew
you better now i feel nothin' happens when
just wanna talk to forget her again
[Verse 4:]
ooh oh oh oh oh ooh i just like baby
it's not gonna walk that it's true no maybe
now i'm 'bout it again to fall as you
i was long time on these tears me through oooh
it didn't come kick him it that could see you
she was not around the rhythm and this two
[Chorus:]
hey i've been a thing for the way that chico nice
if the way your friends talk to find other twice
when you're beautiful liar bad blood hey yeah it's so
baby that's caught up higher over yo
and uh huh you you get our very special with you
got fake people you get you like nah nah do
i'll never see it takes you ever
i'm on get a little bit of forever
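The core loop of generate_lyric, walking the Markov dictionary until a syllable budget is met, can be sketched on toy data. Everything below is hypothetical: the toy dictionaries, the helper name toy_line, and the seeded generator stand in for all_markov_process, all_words_syllables_dict, and the real sampling, and the rhyming step is left out.

```python
from random import Random

# Hypothetical toy data: every word has at least one successor,
# so the walk can never reach a dead end.
toy_markov = {
    'you': ['the', 'the', 'luckily'],
    'the': ['trouble', 'quiet'],
    'trouble': ['you'],
    'quiet': ['you'],
    'luckily': ['you'],
}
toy_syllables = {'you': 1, 'the': 1, 'trouble': 2, 'quiet': 2, 'luckily': 3}

def toy_line(first_word: str, desired_syllables: int, rng: Random):
    """Append successor words until the syllable budget is met or exceeded."""
    words = [first_word]
    total = toy_syllables[first_word]
    while total < desired_syllables:
        # A uniform draw from the duplicated list follows the bigram frequencies
        next_word = rng.choice(toy_markov[words[-1]])
        words.append(next_word)
        total += toy_syllables[next_word]
    return ' '.join(words), total

line, total = toy_line('you', 6, Random(0))
print(line, total)
```

Because the last word added can carry up to three syllables, the line may overshoot the budget slightly, which mirrors how the full generator can exceed desired_syllables before it stops.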