Version 1.0a-c148515
All of the header information is important. Please read it.
Topics and number of exercises: This problem builds on your knowledge of basic Python, pandas, Numpy, clustering, and matrix factorization. It has 9 exercises, numbered 0 to 8. There are 16 available points; however, to earn 100%, the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we may not print them in the starter code.
Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head or a few rows at a time).
Exercise points breakdown:
Ex. 0 - get_record: 2 point(s)
Ex. 1 - clean_record: 1 point(s)
Ex. 2 - records_to_metadf: 1 point(s)
Ex. 3 - gen_corpus: 2 point(s)
Ex. 4 - build_cluster_matrix: 2 point(s)
Ex. 5 - get_top_tokens: 2 point(s)
Ex. 6 - find_distinctive_tokens: 3 point(s)
Ex. 7 - lsa_svd_cluster__FREE: 1 point(s)
Ex. 8 - get_nmf_clusters: 2 point(s)
Final reminders:
Run the following cells to set up the problem.
%load_ext autoreload
%autoreload 2
import numpy as np
import scipy as sp
import pandas as pd
import dill
from pprint import pprint
Your overall task. In this problem, the dataset is the metadata for a collection of computer science research papers and technical books. Your task will be to try to identify topics, or collections of similar documents, using three different algorithms: k-means clustering, latent semantic analysis (LSA) via the truncated SVD, and nonnegative matrix factorization (NMF).
You won't have to implement these algorithms, but you will have to prepare the data and figure out how to run code that implements them.
The dataset. The dataset comes from the AMiner collection. Run the following cell to load the raw data into a variable named raw_records and display a sample.
with open('resource/asnlib/publicdata/acm-v9-sample-100k.dill', 'rb') as fp:
raw_records = dill.load(fp)
print(f"=== Success: Loaded {len(raw_records):,} raw records. ===")
print(f"\nExample: Records 0, 7, and 15:\n")
pprint([raw_records[k] for k in [0, 7, 15]])
The raw_records data structure is a list of raw records. Each raw record represents the metadata for a technical paper or book, stored as a list of raw (unprocessed) strings. Each string is a raw attribute.

Each raw attribute encodes some meta-information about the paper, like its title, list of authors, year of publication, and so on. The attribute starts with a short prefix beginning with #. The prefix indicates what the attribute is, and it is immediately followed by its string value, with no delimiters separating the prefix from the value.
Here are the possible prefixes and their meaning:
Prefix | Meaning |
---|---|
#* | Title of the work |
#@ | Author names |
#t | Year of publication |
#c | Where the work was published |
#index | ID number of this work |
#% | ID number of a work that this work cites |
#! | Abstract (short description of the work) |
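To make the prefix/value split concrete, here is a minimal sketch of decomposing one raw attribute string. It is illustrative only: the PREFIXES dictionary and the helper name split_attribute are ours, not part of the assignment (Exercise 0 asks you to build this logic yourself).

# Minimal sketch (illustrative): split one raw attribute into a (meaning, value) pair.
PREFIXES = {'#*': 'title', '#@': 'authors', '#t': 'year', '#c': 'venue',
            '#index': 'id', '#%': 'refs', '#!': 'abstract'}

def split_attribute(line):
    # No listed prefix is a prefix of another, but matching longest-first is a safe habit.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if line.startswith(prefix):
            return PREFIXES[prefix], line[len(prefix):]
    return None  # unrecognized attribute

print(split_attribute('#index269'))  # ('id', '269')
print(split_attribute('#t1984'))     # ('year', '1984')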
Compare these prefixes to the example raw records printed above and observe, in particular, that the #% prefix represents a citation, that is, the ID number of a work that this work cites. A raw record may have more than one #% attribute.

Stopwords. Per common practice, our text analysis will involve ignoring certain words, referred to as stopwords. The code cell below will load a pre-selected set of such words for some of the demos that follow.
with open('resource/asnlib/publicdata/STOPWORDS.dill', 'rb') as fp:
STOPWORDS = dill.load(fp)
print("A sample of stopwords:")
sorted(list(STOPWORDS))[:7]
Exercise 0: get_record

Your task: Define get_record as follows:
Convert a single 'raw' record into a specially formatted Python dictionary.
Input: lines: A list of strings, where each string is a raw attribute.
Return: record: A Python dictionary, constructed as follows.
Requirements/steps: For each raw attribute, convert its prefix into a human-readable key (see below) and extract its value as a string.
Recall that a raw attribute, like '#%', may appear more than once. Therefore, the output record should associate each key with a sorted list of all values encountered.
The human-readable key corresponding to each raw attribute is:
'#*': 'title'
'#@': 'authors'
'#t': 'year'
'#c': 'venue'
'#index': 'id'
'#%': 'refs'
'#!': 'abstract'
Other notes: If a raw attribute appears only once, it should still appear in a list. (The list will be one element long in this case.)
Example. A correct implementation should produce, for the demo, the following output:
{'authors': ['Mordechai Ben-Ari'],
'id': ['269'],
'refs': ['2135000', '317992', '320265', '320491', '598024'],
'title': ['Algorithms for on-the-fly garbage collection'],
'venue': ['ACM Transactions on Programming Languages and Systems (TOPLAS)'],
'year': ['1984']}
### Solution - Exercise 0
def get_record(lines: list) -> dict:
### BEGIN SOLUTION
def get_attribute(line):
assert len(line) > 0
attributes = {'#*': 'title',
'#@': 'authors',
'#t': 'year',
'#c': 'venue',
'#index': 'id',
'#%': 'refs',
'#!': 'abstract'
}
line = line.strip()
for key, translation in attributes.items():
i = len(key)
if line[:i] == key:
return (translation, line[i:])
return None
from collections import defaultdict
record = defaultdict(list)
for line in lines:
attribute = get_attribute(line)
if attribute is not None:
key, value = attribute
record[key].append(value)
for key, value in record.items():
record[key] = sorted(value)
return dict(record)
### END SOLUTION
### Demo function call
pprint(get_record(raw_records[15]))
Test cell. The cell below will test your solution for get_record (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 0
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['get_record']['config']
ex_conf['func'] = get_record
tester = Tester(ex_conf, key=b'4oTuD3jYU_dfQvrxIzSmKJQBDeiWMaODi2nmk0sQk1o=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'PdjlQ97Vu4Me81fv-vIhBhmfwQcZj3v526wMAW1DGW0=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(25):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload a list of processed records (dictionaries) in the global variable, records.
with open('resource/asnlib/publicdata/records.dill', 'rb') as fp:
records = dill.load(fp)
print(f"Sample post-processed records:")
for k in [0, 7, 15]:
print(f"\n=== get_record(raw_records[{k}]) ===")
pprint(records[k])
Exercise 1: clean_record

Your task: Define clean_record as follows:
Clean a (non-raw) dictionary record.
Input: record — A Python dictionary as might be produced by get_record.
Return: cleaned — A copy of record cleaned as explained below, or None if any required keys are missing.
Requirements/steps: The goal of this task is two-fold: 1) ensure that the record has certain "mandatory" attributes, and 2) simplify the values for certain single-value attributes. More specifically, your function should satisfy these requirements.
If any of the keys ['id', 'title', 'year', 'venue', 'abstract'] is missing from the record, return None.
For the 'id' and 'year' keys, assume there is only one value in the list; convert it into an integer and store this value.
If there is a 'refs' key, convert each element of its list to an integer and store this list of integers.
For any other key in record, assume there is only one value and store that value as-is (i.e., as a string).
Return the cleaned dictionary (or None per the first requirement).
Example. A correct implementation should produce, for the demo, the following output:
None
{'abstract': 'A mathematical model for communicating sequential processes '
'isgiven, and a number of its interesting and useful properties '
'arestated and proved. The possibilities of nondetermimsm are '
'fullytaken into account.',
'authors': 'S. D. Brookes, C. A. R. Hoare, A. W. Roscoe',
'id': 319,
'refs': [2135000, 318212, 320203, 374129, 408527, 547420, 555361, 565837,
566544, 566549, 680046, 689430],
'title': 'A Theory of Communicating Sequential Processes',
'venue': 'Journal of the ACM (JACM)',
'year': 1984}
### Solution - Exercise 1
def clean_record(record: dict) -> dict:
### BEGIN SOLUTION
if any(e not in record for e in ['id', 'title', 'year', 'venue', 'abstract']):
return None
cleaned = {}
for key, value in record.items():
if key in ['id', 'year']:
cleaned[key] = int(value[0])
elif key in ['refs']:
cleaned[key] = [int(e) for e in value]
else:
cleaned[key] = value[0]
return cleaned
### END SOLUTION
### Demo function call
pprint(clean_record(records[0])) # None!
pprint(clean_record(records[17])) # Valid
Test cell. The cell below will test your solution for clean_record (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['clean_record']['config']
ex_conf['func'] = clean_record
tester = Tester(ex_conf, key=b'BhZT_oWZZqOjWxKNFXA5A7uL44XtRE6X7vsh1IcmmOs=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'YOrz4qxU6x7xlwO1Og0kri7NEs5HYMq8xZTiABfDJbI=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(25):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload a list of cleaned records in the global variable, cleaned_records.
with open('resource/asnlib/publicdata/papers.dill', 'rb') as fp:
cleaned_records = dill.load(fp)
print("Examples of cleaned records:")
for k in [0, len(cleaned_records)//2, len(cleaned_records)-1]:
record = cleaned_records[k]
print(f"\n=== Record {k} ===")
pprint(record)
Whether your solution is working or not, run the following code cell, which will preload a dataframe of metadata records in the global variable, metadf.
with open('resource/asnlib/publicdata/metadf.dill', 'rb') as fp:
metadf = dill.load(fp)
# For Ex. 3, break dependence with a working Ex. 2
demo_metadf = metadf.loc[[11222, 11239, 12329]].reset_index(drop=True)
print("Examples of metadata records:")
display(metadf.sample(5))
Exercise 3: gen_corpus

Your task: Define gen_corpus as follows:
Construct a "document corpus" that can be fed into some text analysis tools. This corpus consists of "pseudo-documents," built up from the documents' "tokens."
Inputs:
metadf – A dataframe with document metadata, as produced by records_to_metadf.
stopwords – A Python set containing tokens to ignore.
Return: pseudodf – A new dataframe where tokens have been extracted and filtered from some of the attributes, formatted as explained below.
Requirements/steps: Your dataframe should have the following columns:
'id': The document ID, taken as-is from the input, metadf['id'].
'title': The document title, taken as-is from metadf['title'].
'pseudodoc': A "pseudo-document," which is a string built from the 'title', 'venue', and 'abstract' as follows.
Recall that each row of metadf represents a document. For each document, its pseudo-document is constructed by extracting tokens from lowercase versions of its title, venue, and abstract strings. When extracting tokens, regard a consecutive sequence of plain alphabetic characters, 'a' through 'z', as a token. (Regex may be helpful here!) To construct the pseudo-document from these tokens, first sort the unique tokens, remove any stopwords, and then join the remaining tokens using a single space character, ' ', to separate them.
Other notes: Per the above requirements, do not forget to: 1) use the parameter stopwords (and not the global variable, STOPWORDS, which is for demos); and 2) remove duplicate tokens before sorting and joining them.

Example: Suppose we run gen_corpus on the demo metadata dataframe from the previous exercise.
A correct implementation would return:
id | title | pseudodoc |
---|---|---|
648046 | A Tutorial for CC++ | abstract available cc tutorial |
648461 | Logics of Programs | available logics none programs |
673426 | Codes and Number Theory | codes number theory |
### Solution - Exercise 3
def gen_corpus(metadf, stopwords):
### BEGIN SOLUTION
def gen_pseudodoc(s):
from re import findall
pattern = r'[a-zA-Z]+'
s = s.lower()
initial_tokens = set(findall(pattern, s))
final_tokens = initial_tokens - stopwords
return ' '.join(sorted(final_tokens))
df = metadf[['id', 'title', 'venue', 'abstract']].copy()
df['rawdoc'] = df['title'] + ' ' + df['venue'] + ' ' + df['abstract']
df['pseudodoc'] = df['rawdoc'].apply(gen_pseudodoc)
df = df[['id', 'title', 'pseudodoc']]
return df
### END SOLUTION
### Demo function call
demo_corpus = gen_corpus(demo_metadf, STOPWORDS)
display(demo_corpus)
Test cell. The cell below will test your solution for gen_corpus (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['gen_corpus']['config']
ex_conf['func'] = gen_corpus
tester = Tester(ex_conf, key=b'o0kicBPd2HinEbuciVMQHNjfCLJj4eVAWPBqKCfSeNw=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'8BhPU2TJmMG5oymGh3XrC6PJOblGSLgiioDFIwm6DlY=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(25):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload a corpus of pseudo-documents in the global dataframe, corpusdf.
with open('resource/asnlib/publicdata/corpus.dill', 'rb') as fp:
corpusdf = dill.load(fp)
print(f"A sample of the pseudo-document corpus:")
display(corpusdf.sample(4))
A "classical" way to represent the corpus for data mining is in the bag of words model. In this model, we transform a corpus consisting of $n$ documents and $m$ unique tokens into an $n \times m$ (sparse) matrix $X$, where element $x_{ij}$ "measures" the strength of the relationship between document $i$ and token $j$.
Using one of scikit-learn's text feature extraction tools, we constructed a matrix from the corpus. Run the next code cell to load the matrix into the global variable, X.
with open('resource/asnlib/publicdata/X.dill', 'rb') as fp:
X = dill.load(fp)
print(f"The shape of `X`: {X.shape}")
print(f"- Smallest value: {X.min()}")
print(f"- Largest value: {X.max()}")
print(f"- Mean value: {X.mean()}")
Here is "spy plot" of X
, which visualizes the locations of the nonzero entries.
A topic model via k-means. Given X, we can then use scikit-learn's K-means implementation to cluster the documents. It works by (magically) calculating a vector-space representation of the documents of X and then running K-means on it, similar to what you did in Notebook 14. We can interpret each cluster as a "topic" that groups together documents containing similar tokens.

To save a little time, we ran this clustering for you, trying to find 10 topics (clusters). (The code we used appears at the end of the exam.)
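In case it helps you picture that step, here is a minimal sketch of such a clustering call. The settings shown (n_init, random_state) and the *_sketch names are illustrative assumptions, not necessarily what we used.

# Minimal sketch (illustrative settings): cluster the rows of X into 10 topics.
from sklearn.cluster import KMeans
import numpy as np

kmeans_sketch = KMeans(n_clusters=10, n_init=1, random_state=0).fit(X)
labels_sketch = kmeans_sketch.labels_                    # cluster ID per document
sizes_sketch = np.bincount(labels_sketch, minlength=10)  # documents per cluster
# scikit-learn stores one centroid per *row*; transposing gives one centroid per
# *column*, matching the layout of the precomputed `km_centers` loaded below.
centers_sketch = kmeans_sketch.cluster_centers_.T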
Run the cell below to load the precomputed clustering results, which consist of the following:
km_sizes: The number of documents in each cluster.
km_labels: A 1-D array where km_labels[i] is the cluster ID number (0-9) to which document i is assigned.
km_centers: A 2-D array where column km_centers[:, j] is the coordinate of the center of cluster j.
with open('resource/asnlib/publicdata/kmeans.dill', 'rb') as fp:
_, km_sizes, km_centers, km_labels = dill.load(fp)
print(f"- Cluster sizes:", km_sizes)
print(f"- Cluster labels:", km_labels)
print(f"- Cluster centers, {km_centers.shape[0]:,} x {km_centers.shape[1]}, where each column is a centroid:")
pprint(km_centers)
The "matrix factorization" interpretation of k-means. Although we commonly think of k-means as clustering points, we can also view what it does as computing a certain kind of matrix factorization.
Recall that $X$ has $n$ rows (documents) and $m$ columns (tokens), and we identified $k$ topics using k-means clustering. Let $C$ be the $m \times k$ matrix whose columns are the resulting centroids.

What k-means does is assign each document $i$ to one of these centroids. Let $R$ be an $n \times k$ (sparse) matrix where entry $r_{is} = 1$ if document $i$ is assigned to topic $s$ and equals 0 otherwise. Then one way to interpret k-means is to say that it approximates the matrix $X$ by the product,
$$X \approx R \cdot C^T.$$

To see why, let $\hat{r}_i^T$ be the $i$-th row of $R$ and $c_s$ be the $s$-th column of $C$. By the definition of matrix product,
$$\hat{r}_i^T \cdot C^T = \hat{r}_i^T \Biggl[\begin{matrix} \\ c_0 & \cdots & c_s & \cdots & c_{k-1} \\ \\ \end{matrix}\Biggr]^T = \left[\begin{matrix} r_{i0} & \cdots & r_{is} & \cdots & r_{i,k-1} \end{matrix}\right] \cdot \left[\begin{matrix} c_0^T \\ \vdots \\ c_s^T \\ \vdots \\ c_{k-1}^T \end{matrix}\right] = \sum_{s} r_{is} c_s^T.$$

However, since k-means assigned document $i$ to exactly one cluster $s$, only one term in the sum is nonzero, namely $c_s^T$, the (transposed) centroid for cluster $s$. In other words, the action of $R \cdot C^T$ is to assign each document (row) of $X$ to one centroid.
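Here is a minimal numeric sketch of this interpretation on made-up data: because each row of $R$ is one-hot, each row of R @ C.T is exactly the centroid of the cluster that document was assigned to.

# Minimal numeric sketch (made-up data): R @ C.T copies, into each row i,
# the centroid of the cluster to which document i is assigned.
import numpy as np

C = np.array([[1.0, 4.0],   # m=3 tokens, k=2 clusters; the columns of C are centroids
              [2.0, 5.0],
              [3.0, 6.0]])
R = np.array([[1.0, 0.0],   # document 0 -> cluster 0
              [0.0, 1.0],   # document 1 -> cluster 1
              [0.0, 1.0]])  # document 2 -> cluster 1
print(R @ C.T)
# [[1. 2. 3.]   <- centroid of cluster 0 (transposed)
#  [4. 5. 6.]   <- centroid of cluster 1 (transposed)
#  [4. 5. 6.]]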
With this concept in mind, you are now ready for the next exercise.
Exercise 4: build_cluster_matrix

Your task: Define build_cluster_matrix as follows:
Given a k-means clustering result, construct a k-means assignment matrix. From the preceding discussion, it is the $R$ matrix.
Inputs:
labels: A 1-D array where labels[i] is the cluster ID assigned to document i.
max_label: The largest cluster ID plus 1. That is, 0 <= labels[i] < max_label for all i.
Return: R, a Scipy COO matrix (scipy.sparse.coo_matrix).
Requirements/steps: The returned matrix R should have len(labels) rows and max_label columns. It should only have entries equal to 1.0 at positions (i, j) where j == labels[i]; all other entries should be regarded as zero values and, therefore, not stored explicitly.
Example: Suppose we run:
build_cluster_matrix(np.array([1, 2, 0, 0, 2, 2, 2, 3, 0, 3]), 4)
A correct implementation would return:
(0, 1) 1.0
(1, 2) 1.0
(2, 0) 1.0
(3, 0) 1.0
(4, 2) 1.0
(5, 2) 1.0
(6, 2) 1.0
(7, 3) 1.0
(8, 0) 1.0
(9, 3) 1.0
A spy plot of this matrix would show exactly one nonzero entry per row.
### Solution - Exercise 4
def build_cluster_matrix(labels: np.ndarray, max_label: int) -> sp.sparse.coo_matrix:
### BEGIN SOLUTION
from numpy import arange, ones
from scipy.sparse import coo_matrix
rows = arange(len(labels))
cols = labels
values = ones(len(labels))
R = coo_matrix((values, (rows, cols)), shape=(len(labels), max_label))
return R
### END SOLUTION
### Demo function call
demo_R = build_cluster_matrix(np.array([1, 2, 0, 0, 2, 2, 2, 3, 0, 3]), 4)
print(demo_R)
Test cell. The cell below will test your solution for build_cluster_matrix (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['build_cluster_matrix']['config']
ex_conf['func'] = build_cluster_matrix
tester = Tester(ex_conf, key=b'cub7HYh_bdtYjSXeD5UPThVliqosSf9XWu4LLxZjlOs=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'BKfhxGRI75Oq_URP32o-YjtSQ_Mv26tftdG9sBADV9E=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(25):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload the k-means cluster matrix, $R$, for $X$ (X), in the global variable R_km.
with open('resource/asnlib/publicdata/R.dill', 'rb') as fp:
R_km = dill.load(fp)
R_km
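As a quick, optional sanity check (a sketch, not required by the exam): every document belongs to exactly one cluster, so each row of R_km should sum to exactly 1.

# Optional sanity-check sketch: each row of R_km holds exactly one nonzero entry (a 1.0),
# so every row sum should equal 1.
import numpy as np
row_sums = np.asarray(R_km.sum(axis=1)).ravel()
print((row_sums == 1.0).all())  # expect: True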
Exercise 5: get_top_tokens

Your task: Define get_top_tokens as follows:
Given a cluster, identify its most frequent tokens.
Inputs:
cid: The ID of a cluster to analyze.
labels: Cluster assignments, where document i was assigned to cluster labels[i].
corpusdf: A corpus / pseudo-document dataframe (see Ex. 3, gen_corpus), having the columns 'id', 'title', and 'pseudodoc'.
k: The number of tokens to return.
Return: top_tokens, a Python set of the k most frequent tokens.
Requirements/steps:
Use labels to identify the documents belonging to cluster cid.
For each such document, obtain its tokens by splitting its pseudo-document from corpusdf. See the note below.
Count the occurrences of each token across these documents, and return the k most frequently occurring tokens as a Python set. In the case of ties, consider tokens in ascending order.
Other notes:
When matching a document i to its pseudo-document, note that i is the index value of corpusdf (and, in particular, not the 'id' column!).
If the cluster has fewer than k unique tokens, return as many as available.
Example: For the demo code, a correct implementation should return:
{'page', 'reviews', 'academic', 'book'}
### Solution - Exercise 5
def get_top_tokens(cid: int, labels: np.ndarray, corpusdf: pd.DataFrame, k=10) -> set:
### BEGIN SOLUTION
from collections import Counter
ids = np.where(labels == cid)[0]
pseudodocs = corpusdf['pseudodoc'].loc[ids]
token_counts = Counter()
for doc in pseudodocs:
tokens = doc.split()
token_counts += Counter(tokens)
return {token for token, _ in sorted(token_counts.items(), key=lambda e: (-e[1], e[0]), reverse=False)[:k]}
### END SOLUTION
### Demo function call
with open('resource/asnlib/publicdata/demo_args_get_top_tokens.dill', 'rb') as fp:
demo_cid, demo_labels, demo_corpusdf, demo_k = dill.load(fp)
print(get_top_tokens(demo_cid, demo_labels, demo_corpusdf, demo_k))
Test cell. The cell below will test your solution for get_top_tokens (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 5
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['get_top_tokens']['config']
ex_conf['func'] = get_top_tokens
tester = Tester(ex_conf, key=b'BBpXHA04_pc-q0tx10aYqwXtRUfiJeKp90vRPTLzC5k=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'maX0d15NB21kw4V42TlZ6oXjqEj3f5X8ZozVib7fX7Q=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(25):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
We ran one of our reference solutions for get_top_tokens on corpusdf to extract top tokens for all the clusters discovered by a K-means run. Run the following code cell to load them into a global dictionary called km_top_tokens.
with open('resource/asnlib/publicdata/top_tokens.dill', 'rb') as fp:
km_top_tokens = dill.load(fp)
km_top_tokens
Exercise 6: find_distinctive_tokens

Your task: Define find_distinctive_tokens as follows:
Suppose you are given the top tokens of several clusters. For each cluster, determine its "most distinctive" tokens and keep only those.
Inputs:
token_sets: A dictionary, where token_sets[i] are the top tokens appearing in cluster i.
max_occur: The maximum number of allowable occurrences of a token across clusters for that token to be considered "distinctive" of a cluster.
Return:
distinctive: A new dictionary mapping each cluster key to a (possibly smaller) set of its distinctive tokens, distinctive[key].
Requirements/steps: A token is "distinctive" if it appears in at most max_occur clusters. If a cluster has no distinctive tokens, do not include it in the output.
{0: {'analysis', 'simulation', 'control'}, 4: {'processing', 'images', 'image', 'analysis', 'recognition'}, 6: {'analysis', 'methods'}, 2: {'network', 'nodes', 'networks', 'simulation', 'wireless', 'mobile'}, 1: {'network', 'distributed', 'service', 'computing', 'services', 'applications'}, 3: {'parallel', 'applications', 'design', 'high'}, 9: {'engineering', 'programming', 'language', 'software', 'applications', 'design', 'development'}, 8: {'work', 'technology', 'study', 'research', 'design', 'development'}, 5: {'artificial', 'intelligence', 'learning', 'volume'}, 7: {'problems', 'algorithms', 'number', 'linear', 'theory', 'set'}}
### Solution - Exercise 6
def find_distinctive_tokens(token_sets: dict, max_occur=3) -> dict:
### BEGIN SOLUTION
from collections import defaultdict
token_assignments = defaultdict(set) # token -> set of cluster ids
for cid, tokens in token_sets.items():
for token in tokens:
token_assignments[token] |= {cid}
distinctive = defaultdict(set)
for token, clusters in token_assignments.items():
if len(clusters) <= max_occur:
for cid in clusters:
distinctive[cid] |= {token}
return dict(distinctive)
### END SOLUTION
### Demo function call
print(find_distinctive_tokens(km_top_tokens))
Test cell. The cell below will test your solution for find_distinctive_tokens (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['find_distinctive_tokens']['config']
ex_conf['func'] = find_distinctive_tokens
tester = Tester(ex_conf, key=b'5VeeLsiYTK9AUwJ8sN5jiO4xU3sWzBJooV29Q6R7Erk=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'xHxaKp16PVuNpfH6M55Dlq-glrOFNJaW1f7hc08KUFw=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(25):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
Exercise 7: lsa_svd_cluster__FREE

Your task: Define lsa_svd_cluster__FREE as follows:
FREE EXERCISE: Use the SVD and K-means to compute a clustering. This method is sometimes referred to as latent semantic analysis (LSA).
Background: In the data-matrix factorization view of K-means, we compute $X \approx R C^T$ and use $R$ to identify clusters. That suggests we could try other factorization methods, including the singular value decomposition (SVD) from Notebook 15.
In this strategy, we compute a $d$-truncated SVD, $X \approx U_d \Sigma_d V_d^T$, where we choose $d$ to be "small" compared to the number of columns of $X$. In so doing, the matrix $Y \equiv U_d \Sigma_d$ acts as a lower-dimensional representation of $X$, where each row of $Y$ is a $d$-dimensional version of the corresponding row of $X$. We can then run K-means on $Y$.
Here, we are giving you some code to carry out this analysis. This code uses scikit-learn's TruncatedSVD and K-means implementations. Inspect the results and then move on to the next (and last) part and exercise.
Inputs:
X: An n-by-m data matrix where rows are data points.
num_clusters: The desired number of clusters to identify.
num_components: The truncated dimension, $d$.
Return: labels, an array of cluster labels. The i-th row of X is "assigned" to a cluster whose ID is labels[i]. These labels will lie in the range 0 to num_clusters-1.
### Solution - Exercise 7
def lsa_svd_cluster__FREE(X: np.ndarray, num_clusters=10, num_components=10):
"""**FREE EXERCISE**: Use the SVD and K-means to compute a clustering.
This method is sometimes referred to as _latent semantic analysis_ (LSA).
    **Background**: In the data-matrix factorization view of K-means, we compute $X \\approx R C^T$ and use $R$ to identify clusters.
    That suggests we could try _other_ factorization methods, including the singular value decomposition (SVD) from Notebook 15.
    In this strategy, we compute a $d$-truncated SVD, $X \\approx U_d \Sigma_d V_d^T$, where we choose $d$ to be "small" compared to the number of columns of $X$.
    In so doing, the matrix $Y \equiv U_d \Sigma_d$ acts as a lower-dimensional representation of $X$, where each row of $Y$ is a $d$-dimensional version of the corresponding row of $X$.
We can then run K-means on $Y$.
Here, we are giving you some code to carry out this analysis.
This code uses scikit-learn's [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#) and [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#) implementations.
Inspect the results and then move on to the next (and last) part and exercise.
**Inputs**:
- `X`: An `n`-by-`m` data matrix where rows are data points.
- `num_clusters`: The desired number of clusters to identify.
- `num_components`: The truncated dimension, $d$.
**Return:** `labels`, an array of cluster labels.
The `i`-th row of `X` is "assigned" to a cluster whose ID is `labels[i]`.
These labels will lie in the range 0 to `num_clusters-1`.
"""
# Step 1: Compute the (truncated) SVD of X: X ~ U S V^T
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=num_components, n_iter=20, random_state=1_234)
Y = svd.fit_transform(X)
    # Step 2: Run K-means on Y
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=num_clusters, max_iter=100, n_init=1, random_state=5_678).fit(Y)
labels = kmeans.predict(Y)
return labels
### Demo function call
pass # bug? this line gets omitted in the output
with open('resource/asnlib/publicdata/X.dill', 'rb') as fp:
X = dill.load(fp)
print(f"Data matrix (`X`) shape: {X.shape[0]:,} x {X.shape[1]:,}")
lsa_labels = lsa_svd_cluster__FREE(X, num_clusters=10)
print(f"Cluster assignments: {lsa_labels}")
print(f"\nDistinctive top tokens for LSA:")
with open('resource/asnlib/publicdata/lsa_tokens.dill', 'rb') as fp:
lsa_top_tokens, lsa_distinctive = dill.load(fp)
pprint(lsa_distinctive)
Test cell. The cell below will test your solution for lsa_svd_cluster__FREE (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
print('Passed! Please submit.')
For the final exercise, we will try one more clustering method: nonnegative matrix factorization, or NMF.
Recall that the SVD method starts by reducing the dimension of an $n \times m$ data-matrix $X$ to an $n \times d$ matrix $Y$, where $d \ll m$, via the SVD. Although the qualitative results in Exercise 7 appear reasonable, the matrix $Y$ is not easy to interpret because its entries assume arbitrary values. By contrast, the K-means assignment matrix $R$ is directly related to cluster assignments (although the centroids themselves are arbitrary).
In an NMF, we again approximate $X$ by the product of two matrices, $$X \approx W \cdot H^T,$$ where for some choice of $d < n$, $W$ is $n \times d$ and $H^T$ is $d \times m$ and, importantly, we constrain both $W$ and $H$ to have nonnegative entries, meaning greater than or equal to 0.
Each row of $W$ is now a lower dimensional representation of its corresponding data point (row) in $X$ like $U_d \Sigma_d$ in the SVD, but because of nonnegativity, its entries can be viewed as weights. That is, $w_{ij}$ measures the strength of association between data point $i$ and column $j$. That in turn implies that if we wish to use $W$ for clustering, we can assign $i$ to the logical cluster $j$ where $w_{ij}$ is highest.
Computing the NMF. The following function calculates the NMF for a given input matrix $X$:
def calc_nmf_W(X, num_clusters=10):
from sklearn.decomposition import NMF
model = NMF(n_components=num_clusters, init='random', random_state=8_917)
W = model.fit_transform(X)
return W
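A side note on calc_nmf_W: fit_transform returns only $W$, but scikit-learn also exposes the other factor as model.components_, which is the $d \times m$ matrix $H^T$. The variant below is an illustrative sketch (calc_nmf_factors is a hypothetical helper, not used elsewhere in this notebook), in case you want to inspect $H$ as well.

# Illustrative sketch: also recover H^T via `model.components_` (shape d x m).
# `calc_nmf_factors` is a hypothetical helper, not part of the assignment.
def calc_nmf_factors(X, num_clusters=10):
    from sklearn.decomposition import NMF
    model = NMF(n_components=num_clusters, init='random', random_state=8_917)
    W = model.fit_transform(X)
    Ht = model.components_  # nonnegative, d x m
    return W, Ht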
We used this function to pre-calculate a W for X. The code cell below loads this W:
with open('resource/asnlib/publicdata/W.dill', 'rb') as fp:
W = dill.load(fp)
print(W.shape)
In the final exercise, you'll extract clusters from a given W.
Exercise 8: get_nmf_clusters

Your task: Define get_nmf_clusters as follows:
Determine cluster assignments from an NMF-generated $W$ matrix.
Input: W, an n-by-m matrix produced by an NMF computation.
Return:
labels: Cluster assignments: for each row i of W, labels[i] is the cluster (column) assignment.
ids: Cluster IDs: a Numpy array containing the values from 0 to m-1.
sizes: Cluster sizes: sizes[c] is the number of rows of W that are assigned to cluster c.
Requirements/steps:
Let m be the number of clusters, which is equal to the number of columns of W.
For each row i of W, assign it to the cluster c having the largest value of W[i,c].
The ids array is simply a Numpy array of length m whose values go from 0 to m-1.
For each cluster c, count how many rows were assigned to it and store that as sizes[c].
Example: For the demo code, a correct implementation should return:
(array([1, 0, 2, 0, 2, 1, 3, 0, 0]), array([0, 1, 2, 3]), array([4, 2, 2, 1]))
### Solution - Exercise 8
def get_nmf_clusters(W: np.ndarray) -> (np.ndarray, np.ndarray, np.ndarray):
### BEGIN SOLUTION
from numpy import argmax, unique
labels = argmax(W, axis=1)
ids, sizes = unique(labels, return_counts=True)
return labels, ids, sizes
### END SOLUTION
### Demo function call
with open('resource/asnlib/publicdata/demo_get_nmf_clusters_W.dill', 'rb') as fp:
demo_W = dill.load(fp)
print("Demo input W:")
pprint(demo_W)
print("\nYour output:")
print(get_nmf_clusters(demo_W))
Test cell. The cell below will test your solution for get_nmf_clusters (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format.
input_vars - Input variables for your solution.
original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
returned_output_vars - Outputs returned by your solution.
true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['get_nmf_clusters']['config']
ex_conf['func'] = get_nmf_clusters
tester = Tester(ex_conf, key=b'SGeoIKLnOAbhAQCj3ymEM2J0nfjbk6CylhmBcXTw3s4=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### BEGIN HIDDEN TESTS
tester = Tester(ex_conf, key=b'ETV_Hxn4mu4ICrtvJKw5uxbm6vUVwsQ4KpojWA3GIHo=', path='resource/asnlib/publicdata/encrypted/')
for _ in range(25):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
### END HIDDEN TESTS
print('Passed! Please submit.')
If you have made it this far, congratulations on completing the exam.
Don't forget to submit!
Postscript. If you are curious, the NMF method produces these distinctive tokens:
{0: {'computing', 'high', 'applications', 'distributed', 'parallel'},
1: {'mobile', 'networks', 'applications', 'network', 'wireless'},
2: {'analysis', 'computational', 'equations', 'model', 'numerical',
'problems', 'solution'},
3: {'processing', 'signal', 'error', 'noise'},
4: {'technology', 'user', 'development', 'research', 'study', 'design'},
5: {'learning', 'analysis', 'show', 'model'},
6: {'images', 'analysis', 'recognition', 'show', 'image'},
7: {'present', 'theory', 'known', 'set', 'number', 'algorithms', 'show'},
8: {'development', 'language', 'applications', 'software', 'design', 'model'},
9: {'power', 'technology', 'high', 'control', 'simulation', 'design'}}