Revision history. Version 1.1.
In this notebook, you'll implement a procedure that attempts to summarize a long document by identifying and ranking important sentences automatically. It exercises your understanding of the singular value decomposition (SVD), Numpy, and some basic Python string manipulation. It is worth ten (10) points in total. While the exercises build on one another conceptually, we have precomputed solutions you'll need in each instance so that you can complete them in any order, similar to the Midterm 2 problems.
Note 0. Exercises 0-2 are "pure coding" exercises. To solve Exercises 3 and 4, you'll need a little bit of math.
Note 1. When you submit to the autograder, there will be a strict 60-second time limit to execute your notebook. So if you notice your code is taking much longer than that to run, reconsider your approach! Our reference solutions complete in less than 10 seconds.
Pro-tips.
- If you need a fresh, original copy of this notebook, use Actions $\rightarrow$ Reset Assignment. (Resetting will wipe out any answers you've written so far, so be sure to stash those somewhere safe if you intend to keep or reuse them!)
- If the notebook accumulates a lot of generated output (e.g., from a print statement), causing it to load slowly or not at all, use Actions $\rightarrow$ Clear Notebook Output to get a clean copy. The clean copy will retain your code but remove any generated output. However, it will also rename the notebook to clean.xxx.ipynb. Since the autograder expects a notebook file with the original name, you'll need to rename the clean notebook accordingly.
Run the following code cells, which will load the modules you'll need.
import sys
print("* Python version:", sys.version)
import pickle
from problem_utils import get_path
# The usual suspects
import numpy as np
import scipy as sp
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
print("* Numpy version:", np.__version__)
print("* Scipy version:", sp.__version__)
print("* Matplotlib version:", matplotlib.__version__)
Suppose you have a long report or document to read but limited time. Which sentences are the most important? In this notebook, you'll formulate the task of finding key sentences as a linear algebra problem, and see if such an automatic method can help.
Idea. Let's start with the following intuitive idea:
A sentence is important if it contains many important words. And a word is important if it appears in many important sentences.
At first glance, this idea seems a bit circular! But it turns out we can formulate it as a well-defined mathematical problem in the language of linear algebra.
Data representation. Suppose the document consists of $n$ distinct sentences and $m$ unique words. Next, let $i$ be a word and let $j$ be a sentence. Furthermore, let $n_i$ denote the number of sentences containing word $i$. We can represent the document by a matrix $A$, called the term-sentence matrix, where each entry $a_{i,j}$ is defined as
$$ a_{i,j} = \left\{ \begin{array}{cl} \dfrac{1}{\ln \left( \frac{n+1}{n_i} \right)} & \mbox{if word $i$ appears in sentence $j$, or} \\ 0 & \mbox{otherwise.} \end{array} \right. $$Here, $\ln x$ denotes the natural logarithm of $x$. Observe that $a_{i,j}$ tends to be small if $i$ does not appear in many sentences (i.e., $n_i$ is small). The $n+1$ ensures that we do not divide by zero when calculating $a_{i,j}$; in particular, even if word $i$ appears in every sentence ($n_i = n$), the value $\ln \frac{n+1}{n_i} > 0$. Lastly, you'd expect each sentence to contain just a few of the possible words. Therefore, you might expect $A$ to be sparse.
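To make the definition concrete, here is a quick numerical illustration (a toy example invented for this note, not part of the assignment):
from math import log   # log(x) == natural logarithm of x
n = 3     # suppose the document has 3 sentences ...
n_i = 2   # ... and word i appears in 2 of them
a_ij = 1.0 / log((n + 1) / n_i)    # weight of word i in any sentence j containing it
print(a_ij)                        # 1/ln(2) ~= 1.4427
# Even if word i appeared in every sentence (n_i == n), the denominator
# ln((n+1)/n) = ln(4/3) > 0, so there is no division by zero:
print(1.0 / log((n + 1) / n))      # 1/ln(4/3) ~= 3.4761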
A mathematical formulation of the task. Given word $i$ and sentence $j$, let's denote word $i$'s importance score by $w_i$, and let $s_j$ denote the importance score of sentence $j$. These scores are just numbers on an arbitrary scale, but let's take them to be non-negative ($w_i, s_j \geq 0$).
We said we think these scores are inter-related. Suppose we believe these relationships are linear. Then we might model their relationships by
$$\begin{eqnarray} w_i & \, \propto \, & \sum_j a_{i,j} \, s_j \\ s_j & \, \propto \, & \sum_i a_{i,j} \, w_i, \end{eqnarray}$$where the $\propto$ symbol means proportional to. So, the first formula says that if word $i$ appears in sentence $j$, meaning $a_{i,j} \neq 0$, then $i$'s score, $w_i$, increases in proportion to $j$'s score, $s_j$. The second formula is interpreted analogously.
If we know $w_i$ and $s_j$ for every word $i$ and every sentence $j$, then the most important words and sentences should have the largest scores. Thus, similar to the webpage ranking problem of Notebook 11, we can use these scores to rank words or sentences.
Right away, you should recognize these two formulae can be rewritten in matrix form. Letting $w = [ w_0, w_1, \ldots, w_{m-1} ]$ be the (column) vector of word scores and $s = [s_0, s_1, \ldots, s_{n-1}]$ be the (column) vector of sentence scores, we can define $w$ and $s$ as
$$\begin{eqnarray} c_w \, w & = & A \, s \\ c_s \, s & = & A^T \, w, \end{eqnarray}$$where $c_w$ and $c_s$ are two unknown constants. (They are unknown because we only said that scores have proportional relationships, so there may be an arbitrary scaling factor.)
Going one step further, let's plug these two equations into one another to obtain the following:
$$\begin{eqnarray} (A A^T) \, w & = & (c_s c_w) \, w \\ (A^T A) \, s & = & (c_w c_s) \, s. \end{eqnarray}$$That should look familiar: it's an eigenvalue problem! In particular, $c_s c_w$ is an eigenvalue of the $m \times m$ matrix $A A^T$, whose corresponding eigenvector is $w$. Similarly, $c_w c_s = c_s c_w$ is also an eigenvalue of the $n \times n$ matrix $A^T A$, whose corresponding eigenvector is $s$.
Recall that every matrix has a singular value decomposition. Suppose we were to give you one of the singular values of $A$---call it $\sigma_0$ and suppose it is the largest one. Let's say we also give you the corresponding left and right singular vectors---call them $u_0$ and $v_0$, respectively. You should be able to figure out how to obtain $w$ and $s$ from $u_0$ and/or $v_0$. (Notebook 15, anyone?)
Aside: Recall that the singular values are always non-negative. Furthermore, since the entries of $A$ are also non-negative, if $\sigma_0$ is the largest singular value, then it turns out that $u_0$ and $v_0$ are also non-negative. You'll need this fact later.
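If you'd like to convince yourself of these facts numerically before proceeding, here is a small sanity check. (This is just an illustrative sketch on a random nonnegative matrix; it uses dense numpy.linalg.svd since the example is tiny.)
from numpy.random import rand
from numpy.linalg import svd
from numpy import allclose
A_toy = rand(5, 3)   # a random 5x3 matrix with nonnegative entries
U, S, Vt = svd(A_toy, full_matrices=False)
sigma0_toy, u0_toy, v0_toy = S[0], U[:, 0], Vt[0, :]
# u0 is an eigenvector of (A A^T), and v0 of (A^T A), both with eigenvalue
# sigma0^2, which plays the role of (c_s c_w) above:
print(allclose(A_toy @ A_toy.T @ u0_toy, sigma0_toy**2 * u0_toy))   # True
print(allclose(A_toy.T @ A_toy @ v0_toy, sigma0_toy**2 * v0_toy))   # True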
An algorithm. From the background above, we now have a "simple" algorithm to rank words and sentences:
1. Construct the (sparse) term-sentence matrix $A$.
2. Compute its principal singular value $\sigma_0$ and the corresponding singular vectors $u_0$ and $v_0$.
3. Use $u_0$ and $v_0$ to obtain the scores $w$ and $s$, and rank words and sentences by those scores.
In the exercises below, there are two major steps. First, you need to write some code to clean a document and construct a sparse term-sentence matrix; these are Exercises 0-2. Then, you need to implement the scoring algorithm above (see "An algorithm"); that's Exercises 3 and 4.
Sample data. Let's use as our sample dataset a recent article from the New York Times. Here is some code to read the raw contents of this article from a text file, splitting it into a list of lines named raw_doc:
def read_doc(filepath):
with open(filepath, 'rt', encoding='utf-8') as fp:
doc = [line for line in fp.read().splitlines() if line.strip() != '']
return doc
# Read sample file:
raw_doc = read_doc(get_path('nyt2.txt'))
print("=== Here are the first 7 lines of the sample document ===")
for k, line in enumerate(raw_doc[:7]):
print(f"\n* raw_doc[{k}] == {repr(line)}")
print("\n... (and so on) ...")
Recall that each element of raw_doc is one line from a file. Each line may contain more than one sentence. For example, raw_doc[4] above has two sentences. We want to separate these.
Exercise 0 (2 points). Implement the function, get_sentences(doc), below, so that it does so. It should take as input a variable doc, which is a list of lines just like raw_doc above. It should then return a list of sentences, following this procedure:
- Split each line at sentence delimiters, which are periods ('.'), question marks ('?'), and exclamation points ('!'). These delimiters should be omitted from the returned sentences, too.
- Remove any leading or trailing whitespace from each sentence.
- If, after the preceding steps, a sentence is empty (''), that sentence should be omitted.
Hint. The re module includes a helpful function named re.split().
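For instance, here's a tiny demonstration of re.split. Observe the stray whitespace and the empty string it can leave behind, which your function will need to clean up:
import re
parts = re.split('[.?!]', "Hark! Come what may.")
print(parts)   # ['Hark', ' Come what may', ''] -- note the leading space and trailing empty string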
The next code cell has an example of the expected input and output.
ex0_demo_input = [" This is a phrase; this, too, is a phrase. But this is another sentence.",
"Hark!",
" ",
"Come what may <-- save those spaces, but not these --> ",
"What did you say?Split into 3 (even without a space)? Okie dokie."]
ex0_demo_output = ["This is a phrase; this, too, is a phrase",
"But this is another sentence",
"Hark",
"Come what may <-- save those spaces, but not these -->",
"What did you say",
"Split into 3 (even without a space)",
"Okie dokie"]
import re # Maybe re.split() will help?
# Complete this function:
def get_sentences(doc):
assert isinstance(doc, list)
### BEGIN SOLUTION
sentences = []
for line in doc:
line_sentences = [s.strip() for s in re.split('[.?!]', line)]
sentences += [s for s in line_sentences if s != '']
return sentences
### END SOLUTION
# Demo:
get_sentences(ex0_demo_input)
# Test cell: `ex0__get_sentences` (2 points)
### BEGIN HIDDEN TESTS
def ex0__f742nbv(doc):
from re import split
sentences = []
for line in doc:
line_sentences = [s.strip() for s in split('[.?!]', line)]
sentences += [s for s in line_sentences if s != '']
return sentences
def ex0_gen_soln(force=False):
from os.path import isfile
soln_fn = get_path('ex0_soln.txt')
if force or not isfile(soln_fn):
print("Generating solutions file, " + repr(soln_fn) + ", ...")
doc = read_doc(get_path('nyt2.txt'))
soln = ex0__f742nbv(doc)
with open(soln_fn, 'wt') as fp:
for line in soln:
fp.write(line + '\n')
else:
print("Solutions file, " + repr(soln_fn) + ", exists; skipping...")
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
ex0_gen_soln(force=False)
### END HIDDEN TESTS
# Random tests
from problem_utils import ex0_check_one
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex0_check_one(get_sentences)
print("\n(Passed.)")
Here is code that loads a precomputed set of sentences for the input document. Regardless of whether your Exercise 0 works or not, please run this cell now so subsequent exercises can continue. It will define a variable named raw_sents. Subsequent code needs it, so do not modify it!
with open(get_path('ex0_soln.txt'), 'rt') as fp:
raw_sents = [s.strip() for s in fp.readlines()]
print("=== First few sentences ===")
for k, s in enumerate(raw_sents[:10]):
print(f"raw_sents[{k}] == {repr(s)}")
Recall that a common text-preprocessing step is to remove stopwords. These are words that occur frequently but do not carry substantive meaning.
We have determined a collection of stopwords for you. The following code cell creates a Python set called STOPWORDS containing them. Later, you will filter these words from sentences. Run this cell now.
from problem_utils import STOPWORDS
print(f"There are {len(STOPWORDS)} stopwords. Examples:")
for w in ["ourselves", "you're", "needn't", "cat", "dog", "python"]:
is_or_isnt = "is" if w in STOPWORDS else "is *not*"
print(f"- {repr(w)} {is_or_isnt} a stopword.")
Recall from Midterm 1 that you developed a method to clean a sentence, breaking it up into a set of normalized words, where single-quotes embedded within a word (e.g., "don't", "you're") are preserved. Here was the sample solution to that problem, which you will need again.
def clean(s):
from re import finditer
pattern = r"[a-z]+('[a-z])?[a-z]*"
return [match.group(0) for match in finditer(pattern, s.lower())]
print("* Sentence #6 ==", repr(raw_sents[6]))
print("\n* clean(raw_sents[6]) ==", clean(raw_sents[6]))
Your next task is to construct the bag-of-words representation of each sentence. If s is a sentence, its bag-of-words representation is a set (or "bag") of its unique words. However, this set should never include stopwords.
Exercise 1 (2 points). Complete the function gen_bag_of_words(sents) to perform this task. In particular, the input sents is a list of sentences, as might be produced by Exercise 0. Your function should do the following:
- Split each sentence, sents[k], into a list of words, using clean() defined above.
- Convert each list of words into a set, omitting any stopwords.
- Return the list of these sets: if the returned list is named bags, then bags[k] should be the set for sents[k].
The next code cell shows an example of the expected input and output of your function.
ex1_demo_input = ["This is a phrase; this, too, is a phrase",
"But this is another sentence",
"Hark",
"Come what may <-- save those spaces, but not these -->",
"What did you say",
"Split into 3 (even without a space)",
"Okie dokie"]
ex1_demo_output = [{'phrase'},
{'another', 'sentence'},
{'hark'},
{'come', 'may', 'save', 'spaces'},
{'say'},
{'even', 'space', 'split', 'without'},
{'dokie', 'okie'}]
def gen_bag_of_words(sents):
assert isinstance(sents, list)
### BEGIN SOLUTION
def bag(s):
return set(clean(s)) - STOPWORDS
return [bag(s) for s in sents]
### END SOLUTION
# Demo:
gen_bag_of_words(ex1_demo_input)
# Test cell: `ex1__gen_bag_of_words` (2 points)
### BEGIN HIDDEN TESTS
def ex1__v7o2BVLj3(sents):
assert isinstance(sents, list)
def bag(s):
return set(clean(s)) - STOPWORDS
return [bag(s) for s in sents]
def ex1_gen_soln(force=False):
from os.path import isfile
from pickle import dump
soln_fn = get_path('ex1_soln.pickle')
if force or not isfile(soln_fn):
print("Generating solutions file, " + repr(soln_fn) + ", ...")
with open(get_path('ex0_soln.txt'), 'rt') as fp:
sents = [s.strip() for s in fp.readlines()]
soln = ex1__v7o2BVLj3(sents)
with open(soln_fn, 'wb') as fp:
dump(soln, fp)
else:
print("Solutions file, " + repr(soln_fn) + ", exists; skipping...")
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
ex1_gen_soln(force=False)
### END HIDDEN TESTS
from problem_utils import ex1_check_one
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex1_check_one(gen_bag_of_words)
print("\n(Passed.)")
Here is some code to load precomputed bags for the input document. Regardless of whether your Exercise 1 works or not, please run this cell now so subsequent exercises can continue. It will define a variable named bags. Subsequent code needs it, so do not modify it!
with open(get_path('ex1_soln.pickle'), 'rb') as fp:
bags = pickle.load(fp)
print("=== First few bags ===")
for k, b in enumerate(bags[:10]):
print(f"bags[{k}] == {repr(b)}")
The next major step is to build the sparse matrix $A$ from the bag-of-words representation. To do so, we need to give every word and every sentence an ID.
For sentence IDs, let's use the index in raw_sents. That is, for sentence raw_sents[k], take its ID to be simply k. If there are n sentences, then 0 <= k < n.
For word IDs, let's create a dictionary named word_to_id such that given a word, w, its ID is word_to_id[w]. Later, we will need the inverse mapping as well. So, the code cell below also computes id_to_word, which maps an ID i back to its word, id_to_word[i]. In other words, word_to_id[id_to_word[i]] == i and id_to_word[word_to_id[w]] == w. If there are m unique words, then 0 <= i < m.
from random import choices
all_words = set()
for b in bags:
all_words |= b
word_to_id = {w: k for k, w in enumerate(all_words)}
id_to_word = {k: w for k, w in enumerate(all_words)}
print(f"There are {len(all_words)} unique words.")
for w in choices(list(all_words), k=5):
print("- ID for", repr(w), "==", word_to_id[w])
assert id_to_word[word_to_id[w]] == w
We need to build the sparse matrix $A$. Recall the following facts: there are $m$ unique words and $n$ sentences; $n_i$ denotes the number of sentences containing word $i$; and entry $a_{i,j}$ is defined as follows: $$ a_{i,j} = \left\{ \begin{array}{cl} \left(\ln \dfrac{n+1}{n_i}\right)^{-1} & \mbox{if word $i$ appears in sentence $j$} \\ 0 & \mbox{otherwise.} \end{array} \right. $$
Exercise 2 (3 points). Complete the function gen_coords(bags, word_to_id), below, so that it generates the coordinates and values of the sparse matrix. In particular, it should take these two inputs:
- bags, a Python list of bags (i.e., a list of word-sets);
- word_to_id, a Python dictionary mapping each word w to its ID, word_to_id[w].
It should create and return three Python lists, rows, cols, and vals, which store the coordinate (COO) representation of the sparse matrix. That is, every triple (rows[k], cols[k], vals[k]) indicates that a word, whose ID is rows[k], appears in a sentence whose ID is cols[k]. Furthermore, letting rows[k] $= i$ and cols[k] $= j$, then vals[k] $= a_{i,j}$, as defined by the above formula.
Note 0. To calculate the natural logarithm, you can use either Python's math.log() or Numpy's np.log().
Note 1. Since it is a coordinate representation, the order of nonzeros in the output does not matter.
Here is an example of a simple input and expected output:
# Example inputs: (7 words) x (4 sentences)
ex2_demo_bags = [{'cat', 'dog', 'fish'},
{'red', 'blue'},
{'dog'},
{'one', 'two', 'dog', 'fish'}]
ex2_demo_w2id = {'cat': 0, 'dog': 1, 'fish': 2, 'red': 3, 'blue': 4, 'one': 5, 'two': 6}
from math import log # log(x) == natural logarithm of x
ex2_rows = [ 0, 1, 2, 3, 4, 1, 5, 6, 1, 2]
ex2_cols = [ 0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
ex2_vals = [1/log(5), 1/log(5/3), 1/log(2.5), 1/log(5), 1/log(5), 1/log(5/3), 1/log(5), 1/log(5), 1/log(5/3), 1/log(2.5)]
# Your task: Complete this function!
def gen_coords(bags, word_to_id):
# Some code to help you get started:
m, n = len(word_to_id), len(bags)
rows, cols, vals = [], [], []
# Construct rows, cols, and vals:
### BEGIN SOLUTION
from numpy import zeros
from math import log
# Compute the `n_i` values
num_docs = zeros(m)
for b in bags:
for w in b:
num_docs[word_to_id[w]] += 1
# Construct a coordinate representation
for j, b in enumerate(bags): # loop over each sentence j
for w in b: # loop over each word i in sentence j
i = word_to_id[w]
a_ij = 1.0 / log((n+1) / num_docs[i])
rows.append(i)
cols.append(j)
vals.append(a_ij)
### END SOLUTION
# Returns your arrays:
return rows, cols, vals
# Runs your function on the demo:
gen_coords(ex2_demo_bags, ex2_demo_w2id)
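Since the order of the nonzeros does not matter (per Note 1), here is one way you might check the demo output against the expected triples, order-insensitively. (The canonicalize_coords helper below is just an illustrative sketch, not part of the assignment.)
def canonicalize_coords(rows, cols, vals, digits=10):
    # Illustrative helper: sort the (row, col, value) triples, rounding the
    # floating-point values so tiny roundoff differences don't matter.
    return sorted((i, j, round(v, digits)) for i, j, v in zip(rows, cols, vals))
demo_rows, demo_cols, demo_vals = gen_coords(ex2_demo_bags, ex2_demo_w2id)
assert canonicalize_coords(demo_rows, demo_cols, demo_vals) \
       == canonicalize_coords(ex2_rows, ex2_cols, ex2_vals), "Demo output mismatch!"
print("(Demo output matches the expected triples, up to ordering.)")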
# Test cell: `ex2__gen_matrix` (3 points)
### BEGIN HIDDEN TESTS
def ex2__avl82n38vOI(bags, word_to_id):
from numpy import zeros
from math import log
m, n = len(word_to_id), len(bags)
rows, cols, vals = [], [], []
num_docs = zeros(m)
for b in bags:
for w in b:
num_docs[word_to_id[w]] += 1
for j, b in enumerate(bags):
for w in b:
i = word_to_id[w]
a_ij = 1.0 / log((n+1) / num_docs[i])
rows.append(i)
cols.append(j)
vals.append(a_ij)
return rows, cols, vals
def ex2_gen_soln(force=False):
global word_to_id
from os.path import isfile
from pickle import load, dump
soln_fn = get_path('ex2_soln.pickle')
if force or not isfile(soln_fn):
print("Generating solutions file, " + repr(soln_fn) + ", ...")
with open(get_path('ex1_soln.pickle'), 'rb') as fp:
bags = load(fp)
rows, cols, vals = ex2__avl82n38vOI(bags, word_to_id)
with open(soln_fn, 'wb') as fp:
dump(rows, fp)
dump(cols, fp)
dump(vals, fp)
else:
print("Solutions file, " + repr(soln_fn) + ", exists; skipping...")
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
ex2_gen_soln(force=False)
### END HIDDEN TESTS
from problem_utils import ex2_check_one
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex2_check_one(gen_coords)
print("\n(Passed.)")
Here is some code to load precomputed coordinates for the input document. Regardless of whether your Exercise 2 works or not, please run this cell now so subsequent exercises can continue. It will define variables named rows, cols, and vals, which are the coordinate representation of the matrix. The subsequent code needs them, so do not modify them!
with open(get_path('ex2_soln.pickle'), 'rb') as fp:
rows = pickle.load(fp)
cols = pickle.load(fp)
vals = pickle.load(fp)
print("=== First few coordinates ===")
print(f"* rows[:5] == {rows[:5]}")
print(f"* cols[:5] == {cols[:5]}")
print(f"* vals[:5] == {vals[:5]}")
Let's now create the sparse matrix and compute the SVD.
First, here is code to construct a sparse matrix $A$, using Scipy's compressed sparse row (CSR) storage. It stores $A$ in the variable named A, and visualizes its nonzero structure:
from scipy.sparse import csr_matrix
A = csr_matrix((vals, (rows, cols)), shape=(len(word_to_id), len(bags)))
plt.figure(figsize=(9, 9))
plt.spy(A, marker='.', markersize=1)
pass
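As a quick sanity check (the exact numbers depend on the input document), you can also inspect A's dimensions, the number of stored nonzeros, and its density:
m_A, n_A = A.shape   # (number of unique words) x (number of sentences)
print(f"A is {m_A} x {n_A} with {A.nnz} stored nonzeros "
      f"(density ~= {A.nnz / (m_A * n_A):.3%}), so A is indeed quite sparse.")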
Next, let's compute $\sigma_0$, the largest (or "principal") singular value of $A$, along with its left- and right-singular vectors, denoted as $u_0$ and $v_0$ (the "principal singular vectors"). Take note of the shape of the singular vectors:
def get_svds_largest(A):
from scipy.sparse.linalg import svds
from numpy import abs
u, s, v = svds(A, k=1, which='LM', return_singular_vectors=True)
return s, abs(u.reshape(A.shape[0])), abs(v.reshape(A.shape[1]))
sigma0, u0, v0 = get_svds_largest(A)
print("sigma_0 ==", sigma0)
print("u0.shape ==", u0.shape)
print("v0.shape ==", v0.shape)
Here, we plot the entries of $u_0$ and $v_0$, which will be related to the rankings of the words and sentences. (Remember, you need to figure out how these might be related to the word and sentence ranking vectors, $w$ and $s$.)
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(16, 6), gridspec_kw={'width_ratios': [u0.shape[0], v0.shape[0]]})
ax0.plot(u0, '.')
ax0.set_title('$k$-th entry of $u_0$', loc='left')
ax0.set_xlabel('$k$')
ax1.plot(v0, '.')
ax1.set_title('$k$-th entry of $v_0$', loc='left')
ax1.set_xlabel('$k$')
pass
Exercise 3 (2 points). Suppose you are given the principal left- and right-singular vectors, $u_0$ and $v_0$. Complete the function, rank_words(u0, v0), so that it returns the word IDs in descending order of word-importance score.
The inputs u0 and v0 will be 1-D Numpy arrays. You should return a 1-D Numpy array of integers.
If it's not immediately obvious what to do, go back to the description of the mathematical model and consider the shapes of the matrix and vector operands involved to see if you can figure it out. The setup of this problem is tricky, but the solution is simple.
def rank_words(u0, v0):
### BEGIN SOLUTION
from numpy import argsort
return argsort(u0)[::-1]
### END SOLUTION
# Demo on the input document:
word_ranking = rank_words(u0, v0)
top_ten_words = [id_to_word[k] for k in word_ranking[:10]]
print("Top 10 words:", top_ten_words)
# Test cell: `ex3__rank_words` (2 points)
### BEGIN HIDDEN TESTS
def ex3__lvkjhw38nv(u0, v0):
from numpy import argsort
return argsort(u0)[::-1]
def ex3_gen_soln(force=False):
global u0, v0
from os.path import isfile
from pickle import load, dump
soln_fn = get_path('ex3_soln.pickle')
if force or not isfile(soln_fn):
print("Generating solutions file, " + repr(soln_fn) + ", ...")
ranking = ex3__lvkjhw38nv(u0, v0)
with open(soln_fn, 'wb') as fp:
dump(ranking, fp)
else:
print("Solutions file, " + repr(soln_fn) + ", exists; skipping...")
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
ex3_gen_soln(force=False)
### END HIDDEN TESTS
from problem_utils import ex3_check_one
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex3_check_one(rank_words)
print("\n(Passed.)")
Exercise 4 (1 point). Suppose you are given the principal left- and right-singular vectors, $u_0$ and $v_0$. Complete the function, rank_sentences(u0, v0), so that it returns the sentence IDs in descending order of sentence-importance score.
The inputs u0 and v0 will be 1-D Numpy arrays. You should return a 1-D Numpy array of integers.
If it's not immediately obvious what to do, go back to the description of the mathematical model and consider the shapes of the matrix and vector operands involved to see if you can figure it out.
def rank_sentences(u0, v0):
### BEGIN SOLUTION
from numpy import argsort
return argsort(v0)[::-1]
### END SOLUTION
# Demo:
sentence_ranking = rank_sentences(u0, v0)
top_five_sentences = [raw_sents[k] for k in sentence_ranking[:5]]
print("=== Top 5 sentences ===")
for k, s in enumerate(top_five_sentences):
print(f"\n{k}.", repr(s))
# Test cell: `ex4__rank_sentences` (1 point)
### BEGIN HIDDEN TESTS
def ex4__vj8kj2vu(u0, v0):
from numpy import argsort
return argsort(v0)[::-1]
def ex4_gen_soln(force=False):
global u0, v0
from os.path import isfile
from pickle import load, dump
soln_fn = get_path('ex4_soln.pickle')
if force or not isfile(soln_fn):
print("Generating solutions file, " + repr(soln_fn) + ", ...")
ranking = ex4__vj8kj2vu(u0, v0)
with open(soln_fn, 'wb') as fp:
dump(ranking, fp)
else:
print("Solutions file, " + repr(soln_fn) + ", exists; skipping...")
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
ex4_gen_soln(force=False)
### END HIDDEN TESTS
from problem_utils import ex4_check_one
for trial in range(10):
print(f"=== Trial #{trial} / 9 ===")
ex4_check_one(rank_sentences)
print("\n(Passed.)")
Fin! You’ve reached the end of this part. Don’t forget to restart and run all cells again to make sure it’s all working when run in sequence; and make sure your work passes the submission process. Good luck!
Epilogue 0. If you are interested in this method and others like it, see Matrix methods in data mining and pattern recognition by Lars Eldén. Official Georgia Tech students can download chapters for free through the GT library.
Epilogue 1. It is an entertaining exercise to run this notebook on our course syllabus or the text of the Whirlwind Tour of Python recommended textbook, and see if the results align with your experience.