Version 1.0a-c148515
All of the header information is important. Please read it.
Topics, number of exercises: This problem builds on your knowledge of basic Python, pandas, Numpy, clustering, and matrix factorization. It has 9 exercises, numbered 0 to 8. There are 16 available points; however, to earn 100%, the threshold is 12 points. (Therefore, once you hit 12 points you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on the previous ones, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered by difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, even if we do not print them in the starter code.
Debugging your code: Right before each exercise's test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print just the head or a chunk of rows at a time).
Exercise points breakdown:
Ex. 0 - get_record: 2 point(s)
Ex. 1 - clean_record: 1 point(s)
Ex. 2 - records_to_metadf: 1 point(s)
Ex. 3 - gen_corpus: 2 point(s)
Ex. 4 - build_cluster_matrix: 2 point(s)
Ex. 5 - get_top_tokens: 2 point(s)
Ex. 6 - find_distinctive_tokens: 3 point(s)
Ex. 7 - lsa_svd_cluster__FREE: 1 point(s)
Ex. 8 - get_nmf_clusters: 2 point(s)
Final reminders:
Run the following cells to set up the problem.
%load_ext autoreload
%autoreload 2
import numpy as np
import scipy as sp
import pandas as pd
import dill
from pprint import pprint
Your overall task. In this problem, the dataset is the metadata for a collection of computer science research papers and technical books. Your task will be to try to identify topics, or collections of similar documents, using three different algorithms: k-means clustering, latent semantic analysis (LSA, via a truncated SVD), and nonnegative matrix factorization (NMF).
You won't have to implement these algorithms, but you will have to prepare the data and figure out how to run code that implements them.
The dataset. The dataset comes from the AMiner collection. Run the following cell to load the raw data into a variable named raw_records
and display a sample.
with open('resource/asnlib/publicdata/acm-v9-sample-100k.dill', 'rb') as fp:
raw_records = dill.load(fp)
print(f"=== Success: Loaded {len(raw_records):,} raw records. ===")
print(f"\nExample: Records 0, 7, and 15:\n")
pprint([raw_records[k] for k in [0, 7, 15]])
The raw_records
data structure is a list of raw records. Each raw record represents the metadata for a technical paper or book, stored as a list of raw (unprocessed) strings. Each string is a raw attribute.
Each raw attribute encodes some meta-information about the paper, like its title, list of authors, year of publication, and so on. The attribute starts with a short prefix beginning with #
. The prefix indicates what the attribute is, and it is immediately followed by its string value, with no delimiters separating the prefix from the value.
Here are the possible prefixes and their meaning:
Prefix | Meaning |
---|---|
#* | Title of the work |
#@ | Author names |
#t | Year of publication |
#c | Where the work was published |
#index | ID number of this work |
#% | ID number of a work that this work cites |
#! | Abstract (short description of the work) |
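For concreteness, here is a small, hypothetical raw record written out as a Python list of strings in this format (for illustration only; compare it with the real records printed by pprint above, and note the trailing newline on each raw attribute):
# Hypothetical example of a raw record -- not taken verbatim from the dataset.
example_raw_record = [
    '#*Algorithms for on-the-fly garbage collection\n',   # title
    '#@Mordechai Ben-Ari\n',                               # author names
    '#t1984\n',                                            # year of publication
    '#cACM Transactions on Programming Languages and Systems (TOPLAS)\n',  # venue
    '#index269\n',                                         # ID number of this work
    '#%317992\n',                                          # one cited work ...
    '#%320265\n',                                          # ... and another (the #% attribute may repeat)
]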
Compare these prefixes to the example raw records printed above and observe the following: the #% prefix represents a citation, that is, the ID number of a work that this work cites. A raw record may have more than one #% attribute.
Stopwords. Per common practice, our text analysis will involve ignoring certain words, referred to as stopwords. The code cell below will load a pre-selected set of such words for some of the demos that follow.
with open('resource/asnlib/publicdata/STOPWORDS.dill', 'rb') as fp:
STOPWORDS = dill.load(fp)
print("A sample of stopwords:")
sorted(list(STOPWORDS))[:7]
get_record
Your task: Define get_record as follows:
Convert a single 'raw' record into a specially formatted Python dictionary.
Input: lines
: A list of strings, where each string is a raw attribute.
Return: record
: A Python dictionary, constructed as follows.
Requirements/steps: For each raw attribute, convert its prefix into a human-readable key (see below) and extract its value as a string.
Recall that a raw attribute, like '#%'
, may appear more than once.
Therefore, the output record should associate each key with a sorted list of all values encountered.
The human-readable key corresponding to each raw attribute is:
- '#*': 'title'
- '#@': 'authors'
- '#t': 'year'
- '#c': 'venue'
- '#index': 'id'
- '#%': 'refs'
- '#!': 'abstract'
Other notes: If a raw attribute appears only once, its value should still appear in a list. (The list will be one element long in this case.)
Example. A correct implementation should produce, for the demo, the following output:
{'authors': ['Mordechai Ben-Ari'],
'id': ['269'],
'refs': ['2135000', '317992', '320265', '320491', '598024'],
'title': ['Algorithms for on-the-fly garbage collection'],
'venue': ['ACM Transactions on Programming Languages and Systems (TOPLAS)'],
'year': ['1984']}
### Solution - Exercise 0
def get_record(lines: list) -> dict:
# INPUT: lines is a list of strings
# GOAL: Return 'record', a dictionary of lists. Convert prefix at beginning of each string in lines to correct key.
# List values should be sorted.
# STRATEGY:
# 1. Initialize record dictionary to hold our results
# 2. Create a mappings dictionary to map each prefix to its corresponding key
# 3. Iterate over prefixes
# 4. Iterate over lines
# 5. If the raw attribute string in lines starts with the prefix, add that raw attribute to a list,
# getting rid of newline char at the end
# 6. After iterating over all lines, if the list for that prefix is not empty, sort the list and add it to our record dictionary with the correct key
# 7. Return record dictionary
# SOLUTION:
record = {}
mappings_dict = {'#*': 'title', '#@': 'authors', '#t': 'year', '#c': 'venue', '#index': 'id', '#%': 'refs', '#!': 'abstract'}
for prefix in mappings_dict.keys():
list_of_attributes = []
for raw_attribute in lines:
if raw_attribute.startswith(prefix):
# Alternative: instead of using .startswith()
#if prefix == raw_attribute[:len(prefix)]:
list_of_attributes.append(raw_attribute[len(prefix):].rstrip())
if len(list_of_attributes) > 0:
record[mappings_dict[prefix]] = sorted(list_of_attributes)
return record
### Demo function call
pprint(get_record(raw_records[15]))
Test cell. The cell below will test your solution for get_record (exercise 0). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 0
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['get_record']['config']
ex_conf['func'] = get_record
tester = Tester(ex_conf, key=b'4oTuD3jYU_dfQvrxIzSmKJQBDeiWMaODi2nmk0sQk1o=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload a list of processed records (dictionaries) in the global variable, records
.
with open('resource/asnlib/publicdata/records.dill', 'rb') as fp:
records = dill.load(fp)
print(f"Sample post-processed records:")
for k in [0, 7, 15]:
print(f"\n=== get_record(raw_records[{k}]) ===")
pprint(records[k])
clean_record
Your task: Define clean_record as follows:
Clean a (non-raw) dictionary record.
Input: record — A Python dictionary as might be produced by get_record.
Return: cleaned — A copy of record cleaned as explained below, or None if any required keys are missing.
Requirements/steps: The goal of this task is two-fold: 1) ensure that the record has certain "mandatory" attributes, and 2) simplify the values for certain single-value attributes. More specifically, your function should satisfy these requirements:
- If the input record is missing any of the following keys, return None: ['id', 'title', 'year', 'venue', 'abstract'].
- For the 'id' and 'year' keys, assume there is only one value in the list; convert it into an integer and store this value.
- If there is a 'refs' key, convert each element of its list to an integer and store this list of integers.
- For any other keys that are present in record, assume there is only one value and store that value as-is (i.e., as a string).
Return the cleaned dictionary (or None per the first requirement).
Example. A correct implementation should produce, for the demo, the following output:
None
{'abstract': 'A mathematical model for communicating sequential processes '
'isgiven, and a number of its interesting and useful properties '
'arestated and proved. The possibilities of nondetermimsm are '
'fullytaken into account.',
'authors': 'S. D. Brookes, C. A. R. Hoare, A. W. Roscoe',
'id': 319,
'refs': [2135000, 318212, 320203, 374129, 408527, 547420, 555361, 565837,
566544, 566549, 680046, 689430],
'title': 'A Theory of Communicating Sequential Processes',
'venue': 'Journal of the ACM (JACM)',
'year': 1984}
### Solution - Exercise 1
def clean_record(record: dict) -> dict:
# INPUT: record is a dictionary of lists
# GOAL: Return cleaned dictionary.
# 1. If the input is missing any of the following keys, return None: ['id', 'title', 'year', 'venue', 'abstract'].
# 2. For the 'id' and 'year' keys, assume there is only one value in the list and convert it into an integer
# 3. If there is a 'refs' key, convert each element of its list to an integer and store this list of integers.
# 4. For any other keys that are present in record, assume there is only one value and store that value as-is
# STRATEGY:
# 1. First check to make sure all required keys exist in 'record'
# 2. Initialize empty dictionary 'cleaned' to hold our results
# 3. Iterate over key/value pairs in record dictionary
# 4. If key is 'id' or 'year', convert 0th entry in list to an integer and store in 'cleaned'
# 5. If key is 'refs', convert each element of list to integer and store list in 'cleaned'
# 6. Anything else, store 0th entry in list in 'cleaned'
# 7. Return 'cleaned' dictionary
# SOLUTION:
required_keys = ['id', 'title', 'year', 'venue', 'abstract']
for key in required_keys:
if key not in record:
return None
cleaned = {}
for key, val in record.items():
if key == 'id' or key == 'year':
cleaned[key] = int(val[0])
elif key == 'refs':
cleaned[key] = [int(e) for e in val]
else:
cleaned[key] = val[0]
return cleaned
### Demo function call
pprint(clean_record(records[0])) # None!
pprint(clean_record(records[17])) # Valid
Test cell. The cell below will test your solution for clean_record (exercise 1). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 1
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['clean_record']['config']
ex_conf['func'] = clean_record
tester = Tester(ex_conf, key=b'BhZT_oWZZqOjWxKNFXA5A7uL44XtRE6X7vsh1IcmmOs=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload a list of cleaned records in the global variable, cleaned_records
.
with open('resource/asnlib/publicdata/papers.dill', 'rb') as fp:
cleaned_records = dill.load(fp)
print("Examples of cleaned records:")
for k in [0, len(cleaned_records)//2, len(cleaned_records)-1]:
record = cleaned_records[k]
print(f"\n=== Record {k} ===")
pprint(record)
records_to_metadf
Your task: Define records_to_metadf as follows:
Convert (cleaned) records, which represent publications metadata, into a pandas DataFrame
.
Input: records – A list of cleaned records (e.g., see cleaned_records).
Return: metadf – A pandas DataFrame of the "metadata," as defined below.
Requirements/steps: Construct a pandas DataFrame with columns for these keys: 'id', 'title', 'year', 'venue', 'abstract'.
Other notes: The target fields are guaranteed to exist based on the cleaning procedure.
Example. The demo runs the function on elements [11222, 11239, 12329] of cleaned_records
. A correct implementation would produce:
id | title | year | venue | abstract |
---|---|---|---|---|
648046 | A Tutorial for CC++ | 1994 | A Tutorial for CC++ | No abstract available. |
648461 | Logics of Programs | 1989 | Logics of Programs | None Available |
673426 | Codes and Number Theory | 1997 | Codes and Number Theory | Codes and Number Theory |
### Solution - Exercise 2
def records_to_metadf(records: list) -> pd.DataFrame:
# INPUT: records is a list of cleaned records (dictionaries)
# GOAL: Return metadf data frame with columns: id, title, year, venue, abstract
# STRATEGY:
# 1. Create metadf dataframe from 'records' (list of dictionaries)
# 2. Filter metadf down to the columns we care about ('id', 'title', 'year', 'venue', 'abstract')
# 3. Return
# SOLUTION:
# See section 'Create DataFrame from List of Dictionaries'
# https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/
metadf = pd.DataFrame(records)
columns_we_want = ['id', 'title', 'year', 'venue', 'abstract']
return metadf[columns_we_want]
### Demo function call
demo_cleaned_records = [cleaned_records[k] for k in [11222, 11239, 12329]]
demo_metadf = records_to_metadf(demo_cleaned_records)
display(demo_metadf)
Test cell. The cell below will test your solution for records_to_metadf (exercise 2). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 2
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['records_to_metadf']['config']
ex_conf['func'] = records_to_metadf
tester = Tester(ex_conf, key=b'hgW1Cvvo9Kg4sTImuGlhMic09mEdkjjQQFPMCiw5gaE=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload a dataframe of metadata records in the global variable, metadf
.
with open('resource/asnlib/publicdata/metadf.dill', 'rb') as fp:
metadf = dill.load(fp)
# For Ex. 3, break dependence with a working Ex. 2
demo_metadf = metadf.loc[[11222, 11239, 12329]].reset_index(drop=True)
print("Examples of metadata records:")
display(metadf.sample(5))
gen_corpus
Your task: Define gen_corpus as follows:
Construct a "document corpus" that can be fed into some text analysis tools. This corpus consists of "pseudo-documents," built up from the documents' "tokens."
Inputs:
- metadf – A dataframe with document metadata, as produced by records_to_metadf.
- stopwords – A Python set containing tokens to ignore.
Return: pseudodf – A new dataframe where tokens have been extracted and filtered from some of the attributes, formatted as explained below.
Requirements/steps: Your dataframe should have the following columns:
- 'id': The document ID, taken as-is from the input, metadf['id'].
- 'title': The document title, taken as-is from metadf['title'].
- 'pseudodoc': A "pseudo-document," which is a string built from the 'title', 'venue', and 'abstract' as follows.
Recall that each row of metadf represents a document. For each document, its pseudo-document is constructed by extracting tokens from lowercase versions of its title, venue, and abstract strings. When extracting tokens, regard a consecutive sequence of plain alphabetic characters, 'a' through 'z', as a token. (Regex may be helpful here!) To construct the pseudo-document from these tokens, first sort the unique tokens, remove any stopwords, and then join the remaining tokens using a single space character, ' ', to separate them.
Other notes: Per the above requirements, do not forget to remove the tokens given by the stopwords input argument (and not the global variable, STOPWORDS, which is for demos), and to keep only the unique tokens, in sorted order.
Example: Suppose we run gen_corpus
on the demo metadata dataframe from the previous exercise.
A correct implementation would return:
id | title | pseudodoc |
---|---|---|
648046 | A Tutorial for CC++ | abstract available cc tutorial |
648461 | Logics of Programs | available logics none programs |
673426 | Codes and Number Theory | codes number theory |
### Solution - Exercise 3
def gen_corpus(metadf, stopwords):
# INPUT:
# 1. metadf - a data frame
# 2. stopwords - a set of words we should exclude
# GOAL:
# Return 'pseudodf' a data frame with columns:
# 'id': The document ID, taken as-is from the input, metadf['id']
# 'title': The document title, taken as-is from metadf['title']
# 'pseudodoc': A "pseudo-document," which is a string built from the 'title', 'venue', and 'abstract' as follows:
# 1. get unique tokens from lowercase of title, venue, and abstract (consecutive sequence of a-z only)
# 2. remove stopwords
# 3. concatenate only the unique tokens, in sorted order, and separated by spaces
# STRATEGY:
# 1. Make a copy of metadf. Call this 'df'
# 2. Create an empty list to hold our pseudodoc column data
# 3. Iterate over rows in df
# 4. Combine lowercase of columns 'title', 'venue', and 'abstract' (separate by space) into one big string
# 5. Find list of words containing a-z characters
# 6. Find unique words in Step 5 list (use set?)
# 7. Remove all stopwords from Step 6 and sort
# 8. Join Step 7 list of words back together into one string using single space character
# 9. Add to 'pseudodoc' list
# 10. Add 'pseudodoc' list as a column to our df
# 11. Filter df to columns we care about (id, title, pseudodoc) and return
# SOLUTION:
import re
df = metadf.copy()
pseudodoc = []
for index, row in df.iterrows():
all_words_str = row['title'].lower() + ' ' + row['venue'].lower() + ' ' + row['abstract'].lower()
word_list = re.findall('[a-z]+', all_words_str)
# Alternative way if you don't want to use regex:
# new_words_str = ''
# for char in all_words_str:
# if char in 'abcdefghijklmnopqrstuvwxyz':
# new_words_str += char
# else:
# new_words_str += ' '
# word_list = new_words_str.split()
unique_words = set(word_list)
final_list_of_words = sorted(unique_words.difference(stopwords))
pseudodoc.append(' '.join(final_list_of_words))
df['pseudodoc'] = pseudodoc
return df[['id', 'title', 'pseudodoc']]
### Demo function call
demo_corpus = gen_corpus(demo_metadf, STOPWORDS)
display(demo_corpus)
Test cell. The cell below will test your solution for gen_corpus (exercise 3). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 3
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['gen_corpus']['config']
ex_conf['func'] = gen_corpus
tester = Tester(ex_conf, key=b'o0kicBPd2HinEbuciVMQHNjfCLJj4eVAWPBqKCfSeNw=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload a corpus of pseudo-documents in the global dataframe, corpusdf
.
with open('resource/asnlib/publicdata/corpus.dill', 'rb') as fp:
corpusdf = dill.load(fp)
print(f"A sample of the pseudo-document corpus:")
display(corpusdf.sample(4))
A "classical" way to represent the corpus for data mining is in the bag of words model. In this model, we transform a corpus consisting of documents and unique tokens into an (sparse) matrix , where element "measures" the strength of the relationship between document and token .
Using one of scikit-learn's text feature extraction tools, we constructed a matrix from the corpus. Run the next code cell to load the matrix into the global variable, X
.
with open('resource/asnlib/publicdata/X.dill', 'rb') as fp:
X = dill.load(fp)
print(f"The shape of `X`: {X.shape}")
print(f"- Smallest value: {X.min()}")
print(f"- Largest value: {X.max()}")
print(f"- Mean value: {X.mean()}")
Here is "spy plot" of X
, which visualizes the locations of the nonzero entries.
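If you want to reproduce a plot like it yourself, here is a minimal sketch, assuming matplotlib is available in this environment:
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 6))
plt.spy(X, markersize=0.05, aspect='auto')   # mark each nonzero entry of X
plt.xlabel('token (column) index')
plt.ylabel('document (row) index')
plt.show()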
A topic model via k-means. Given X
, we can then use scikit-learn's K-means implementation to cluster the documents. It works by (magically) calculating a vector-space representation of the documents of X
and then running K-means on it, similar to what you did in Notebook 14. We can interpret each cluster as a "topic" that groups together documents containing similar tokens.
To save a little time, we ran this clustering for you to try to find 10 topics (clusters). (The code we used appears at the end of the exam.)
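In case you are curious, a clustering like this can be computed roughly as follows. This is only a sketch with assumed parameters, not necessarily the settings we used; the actual code appears at the end of the exam.
from sklearn.cluster import KMeans
# Sketch only -- assumed parameters.
kmeans = KMeans(n_clusters=10, n_init=1, max_iter=100, random_state=0).fit(X)
km_labels_sketch = kmeans.labels_                              # cluster ID for each document
km_centers_sketch = kmeans.cluster_centers_                    # one centroid per cluster (as rows)
km_sizes_sketch = np.bincount(km_labels_sketch, minlength=10)  # number of documents per cluster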
Run the cell below to load the precomputed clustering results, which consist of the following:
- km_sizes: The number of documents in each cluster.
- km_labels: A 1-D array where km_labels[i] is the cluster ID number (0-9) to which document i is assigned.
- km_centers: A 2-D array where row km_centers[j, :] is the coordinate of the center of cluster j.
with open('resource/asnlib/publicdata/kmeans.dill', 'rb') as fp:
_, km_sizes, km_centers, km_labels = dill.load(fp)
print(f"- Cluster sizes:", km_sizes)
print(f"- Cluster labels:", km_labels)
print(f"- Cluster centers, {km_centers.shape[0]:,} x {km_centers.shape[1]}, where each column is a centroid:")
pprint(km_centers)
The "matrix factorization" interpretation of k-means. Although we commonly think of k-means as clustering points, we can also view what it does as computing a certain kind of matrix factorization.
Recall that had rows (documents) and columns (tokens), and we identified topics using k-means clustering. Let be the matrix of centroids that resulted.
What k-means does is assign each document to one of these centroids. Let be a (sparse) matrix where entry if document is assigned to topic and equals 0 otherwise. Then one way to interpret k-means is to say that it approximates the matrix by the product,
To see why, consider let be the -th row of and be the -th column of . By the definition of matrix product,
However, since k-means assigned document to exactly one cluster , then only one term in the sum is nonzero, which is , the centroid for cluster . In other words, the action of is to assign each document (row) of to one centroid.
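Here is a tiny, self-contained illustration of this interpretation using made-up numbers (4 documents, 3 tokens, 2 clusters; hypothetical values, not the exam data):
import numpy as np
import scipy.sparse
# Hypothetical centroids, stored so that each *column* of C is a centroid (3 tokens x 2 clusters).
C = np.array([[1.0, 0.0],
              [0.0, 3.0],
              [2.0, 1.0]])
labels = np.array([0, 1, 1, 0])   # cluster assignment for each of the 4 documents
R = scipy.sparse.coo_matrix((np.ones(4), (np.arange(4), labels)), shape=(4, 2))
# Each row of R @ C^T equals the centroid of the cluster its document was assigned to.
print(R.toarray() @ C.T)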
With this concept in mind, you are now ready for the next exercise.
build_cluster_matrix
Your task: Define build_cluster_matrix as follows:
Given a k-means clustering result, construct the k-means assignment matrix. From the preceding discussion, this is the matrix $R$.
Inputs:
- labels: A 1-D array where labels[i] is the cluster ID assigned to document i.
- max_label: The largest cluster ID plus 1. That is, 0 <= labels[i] < max_label for all i.
Return: R, a Scipy COO matrix (scipy.sparse.coo_matrix).
Requirements/steps: The returned matrix R should have len(labels) rows and max_label columns. It should only have entries equal to 1.0 at positions (i, j) where j == labels[i]; all other entries should be regarded as zero values and, therefore, not stored explicitly.
Example: Suppose we run:
build_cluster_matrix(np.array([1, 2, 0, 0, 2, 2, 2, 3, 0, 3]), 4)
A correct implementation would return:
(0, 1) 1.0
(1, 2) 1.0
(2, 0) 1.0
(3, 0) 1.0
(4, 2) 1.0
(5, 2) 1.0
(6, 2) 1.0
(7, 3) 1.0
(8, 0) 1.0
(9, 3) 1.0
Here is a picture (spy plot):
### Solution - Exercise 4
def build_cluster_matrix(labels: np.ndarray, max_label: int) -> sp.sparse.coo_matrix:
# INPUT:
# 1. labels: A 1-D array where labels[i] is the cluster ID assigned to document i
# 2. max_label: The largest cluster ID plus 1. That is, 0 <= labels[i] < max_label for all i
# GOAL: Return Scipy coo matrix, with len(labels) rows and max_label columns.
# It should only have entries equal to 1.0 at positions (i, j) where j == labels[i];
# all other entries should be regarded as zero values and, therefore, not stored explicitly.
# STRATEGY:
# 1. Create a 2-D array of zeros with len(labels) rows and max_label columns
# 2. Iterate over labels and set the corresponding i, j locations to 1.0
# 3. Convert to a coo matrix and return
# SOLUTION:
full_arr = np.zeros((len(labels), max_label))
for i, j in enumerate(labels):
full_arr[i,j] = 1.0
return sp.sparse.coo_matrix(full_arr)
# -------------------------------------------------------------
# Alternative solution building out data, row, and column arrays (instead of populating 1.0s in giant array):
# Create coo matrix using this syntax: coo_matrix((data, (i, j)), shape=(M, N))
# Construct from three arrays:
# 1. data[:] the entries of the matrix, in any order
# 2. i[:] the row indices of the matrix entries
# 3. j[:] the column indices of the matrix entries
data = []
i_list = []
j_list = []
for i, j in enumerate(labels):
data.append(1.0)
i_list.append(i)
j_list.append(j)
return sp.sparse.coo_matrix((np.array(data), (np.array(i_list), np.array(j_list))), shape=(len(labels), max_label))
### Demo function call
demo_R = build_cluster_matrix(np.array([1, 2, 0, 0, 2, 2, 2, 3, 0, 3]), 4)
print(demo_R)
Test cell. The cell below will test your solution for build_cluster_matrix (exercise 4). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 4
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['build_cluster_matrix']['config']
ex_conf['func'] = build_cluster_matrix
tester = Tester(ex_conf, key=b'cub7HYh_bdtYjSXeD5UPThVliqosSf9XWu4LLxZjlOs=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Whether your solution is working or not, run the following code cell, which will preload the k-means cluster matrix, $R$, for the full data matrix X, in the global variable R_km.
with open('resource/asnlib/publicdata/R.dill', 'rb') as fp:
R_km = dill.load(fp)
R_km
get_top_tokens
Your task: Define get_top_tokens as follows:
Given a cluster, identify its most frequent tokens.
Inputs:
- cid: The ID of a cluster to analyze.
- labels: Cluster assignments, where document i was assigned to cluster labels[i].
- corpusdf: A corpus / pseudo-document dataframe (see Ex. 3, gen_corpus), having the columns 'id', 'title', and 'pseudodoc'.
- k: The number of tokens to return.
Return: top_tokens, a Python set of the k most frequent tokens.
Requirements/steps:
- Use labels to identify the documents belonging to cluster cid.
- Select their corresponding pseudo-documents from corpusdf. See the note below.
- For each unique token, count the number of pseudo-documents in which it appears.
- Return up to the k most frequently occurring tokens as a Python set. In the case of ties, consider tokens in ascending order.
Other notes:
- To match a document ID i to its pseudo-document, note that i is the index value of corpusdf (and, in particular, not the 'id' column!).
- If there are fewer than k unique tokens, return as many as are available.
Example: For the demo code, a correct implementation should return:
{'page', 'reviews', 'academic', 'book'}
### Solution - Exercise 5
def get_top_tokens(cid: int, labels: np.ndarray, corpusdf: pd.DataFrame, k=10) -> set:
# INPUT:
# 1. cid: The ID of a cluster to analyze
# 2. labels: 1-D array. Cluster assignments, where document i was assigned to cluster labels[i]
# 3. corpusdf: A pseudo-document dataframe, having the columns 'id', 'title', and 'pseudodoc'
# 4. k: integer - the number of tokens to return
# GOAL:
# Return top_tokens, a Python set of the k most frequent tokens
# STRATEGY:
# 1. Use labels to identify the documents belonging to cluster cid
# 2. Select their corresponding pseudo-documents corpusdf. To match a document ID i to its pseudo-document,
# note that i is the index value of corpusdf (and, in particular, not the 'id' column!)
# 3. For each unique token, count the number of pseudo-documents in which it appeared.
# 4. Return up to the k most frequently occurring tokens as a Python set (so we need to sort).
# In the case of ties, consider tokens in ascending order.
# SOLUTION:
# Create a big list of all of the tokens from documents associated with our desired cluster id cid:
token_list = []
for i, cluster_id in enumerate(labels):
if cluster_id == cid:
pseudodoc = corpusdf['pseudodoc'].loc[i]
token_list += pseudodoc.split()
# Now count occurrences of all the tokens in token_list:
from collections import Counter
counter_dict = Counter(token_list)
# Sort by count value in descending order, then break ties by key value in ascending order:
sorted_list_of_tuples = sorted(counter_dict.items(), key=lambda x: (-x[1], x[0]))
# Grab top k elements and get only the keys (not the values):
top_k_list_of_tuples = sorted_list_of_tuples[:k]
top_k_list = [x[0] for x in top_k_list_of_tuples]
# Return as a set:
return set(top_k_list)
### Demo function call
with open('resource/asnlib/publicdata/demo_args_get_top_tokens.dill', 'rb') as fp:
demo_cid, demo_labels, demo_corpusdf, demo_k = dill.load(fp)
print(get_top_tokens(demo_cid, demo_labels, demo_corpusdf, demo_k))
Test cell. The cell below will test your solution for get_top_tokens (exercise 5). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 5
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['get_top_tokens']['config']
ex_conf['func'] = get_top_tokens
tester = Tester(ex_conf, key=b'BBpXHA04_pc-q0tx10aYqwXtRUfiJeKp90vRPTLzC5k=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
We ran one of our reference solutions for get_top_tokens on corpusdf to extract the top tokens for all the clusters discovered by a K-means run. Run the following code cell to load them into a global dictionary called km_top_tokens.
with open('resource/asnlib/publicdata/top_tokens.dill', 'rb') as fp:
km_top_tokens = dill.load(fp)
km_top_tokens
find_distinctive_tokens
Your task: Define find_distinctive_tokens as follows:
Suppose you are given the top tokens of several clusters. For each cluster, determine its "most distinctive" tokens and keep only those.
Inputs:
- token_sets: A dictionary, where token_sets[i] are the top tokens appearing in cluster i.
- max_occur: The maximum number of allowable occurrences of a token across clusters for that token to be considered "distinctive" of that cluster.
Return:
- distinctive: A new dictionary mapping cluster key to a (possibly smaller) set of tokens, distinctive[key].
Requirements/steps: Keep a token only if it appears in the top-token sets of at most max_occur clusters. If a cluster has no distinctive tokens, do not include it in the output.
Example: For the demo code, a correct implementation should return:
{0: {'analysis', 'simulation', 'control'}, 4: {'processing', 'images', 'image', 'analysis', 'recognition'}, 6: {'analysis', 'methods'}, 2: {'network', 'nodes', 'networks', 'simulation', 'wireless', 'mobile'}, 1: {'network', 'distributed', 'service', 'computing', 'services', 'applications'}, 3: {'parallel', 'applications', 'design', 'high'}, 9: {'engineering', 'programming', 'language', 'software', 'applications', 'design', 'development'}, 8: {'work', 'technology', 'study', 'research', 'design', 'development'}, 5: {'artificial', 'intelligence', 'learning', 'volume'}, 7: {'problems', 'algorithms', 'number', 'linear', 'theory', 'set'}}
### Solution - Exercise 6
def find_distinctive_tokens(token_sets: dict, max_occur=3) -> dict:
# INPUT:
# 1. token_sets is a dictionary of sets, where token_sets[i] are the top tokens appearing in cluster i
# 2. max_occur is integer, maximum number of allowable occurrences of a token across clusters for that token
# to be considered "distinctive" of that cluster.
# GOAL:
# Return 'distinctive': A new dictionary mapping cluster key to a set of distinctive tokens
# So we're filtering out the common tokens, and only keeping the distinctive ones
# STRATEGY:
# 1. Get a list of all of the tokens in the 'token_sets' dictionary (aka combine all of the sets into one big list)
# 2. Count up occurrences of each token in token list created in Step 1. Let's create a Counter dictionary.
# 3. Initialize a 'distinctive' default dictionary of sets to hold our new dictionary results
# 4. Iterate over key, value pairs in 'token_sets' dictionary:
# 5. Iterate over tokens in that key's set of values:
# 6. If token has less than or equal to max_occur in our Counter dictionary, he's good
# and we add him to our new 'distinctive' dictionary set for that key
# 7. Return 'distinctive' dictionary
# SOLUTION:
from collections import Counter
from collections import defaultdict
list_of_tokens = [val for subset in token_sets.values() for val in subset]
token_count_dict = Counter(list_of_tokens)
distinctive = defaultdict(set)
for cluster_id, set_of_tokens in token_sets.items():
for token in set_of_tokens:
if token_count_dict[token] <= max_occur:
distinctive[cluster_id].add(token)
return dict(distinctive)
### Demo function call
print(find_distinctive_tokens(km_top_tokens))
Test cell. The cell below will test your solution for find_distinctive_tokens (exercise 6). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 6
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['find_distinctive_tokens']['config']
ex_conf['func'] = find_distinctive_tokens
tester = Tester(ex_conf, key=b'5VeeLsiYTK9AUwJ8sN5jiO4xU3sWzBJooV29Q6R7Erk=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
lsa_svd_cluster__FREE
Your task: Define lsa_svd_cluster__FREE as follows:
FREE EXERCISE: Use the SVD and K-means to compute a clustering. This method is sometimes referred to as latent semantic analysis (LSA).
Background: In the data-matrix factorization view of K-means, we compute $X \approx R\,C^T$ and use $R$ to identify clusters. That suggests we could try other factorization methods, including the singular value decomposition (SVD) from Notebook 15.
In this strategy, we compute a $d$-truncated SVD, $X \approx U_d \Sigma_d V_d^T$, where we choose $d$ to be "small" compared to the number of columns of $X$. In so doing, the matrix $Y \equiv U_d \Sigma_d$ acts as a lower-dimensional representation of $X$, where each row of $Y$ is a $d$-dimensional version of the corresponding row of $X$. We can then run K-means on $Y$.
Here, we are giving you some code to carry out this analysis. This code uses scikit-learn's TruncatedSVD and K-means implementations. Inspect the results and then move on to the next (and last) part and exercise.
Inputs:
- X: An n-by-m data matrix where rows are data points.
- num_clusters: The desired number of clusters to identify.
- num_components: The truncated dimension, $d$.
Return: labels, an array of cluster labels. The i-th row of X is "assigned" to a cluster whose ID is labels[i]. These labels will lie in the range 0 to num_clusters-1.
### Solution - Exercise 7
def lsa_svd_cluster__FREE(X: np.ndarray, num_clusters=10, num_components=10):
"""**FREE EXERCISE**: Use the SVD and K-means to compute a clustering.
This method is sometimes referred to as _latent semantic analysis_ (LSA).
**Background**: In the data-matrix factorization view of K-means, we compute $X \sim R C^T$ and use $R$ to identify clusters.
That suggests we could try _other_ factorization methods, including the singular value decomposition (SVD) from Notebook 15.
In this strategy, we compute a $d$-truncated SVD, $X \\approx U_d \Sigma_d V_d^T$, where we choose $d$ to be "small" compared to the number of columns of $X$.
In so doing, the matrix $Y \equiv U_d \Sigma_d$ acts as a lower-dimensional representation of $X$, where each row of $Y$ is a $d$-dimensional version of the corresponding row of $X$.
We can then run K-means on $Y$.
Here, we are giving you some code to carry out this analysis.
This code uses scikit-learn's [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#) and [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#) implementations.
Inspect the results and then move on to the next (and last) part and exercise.
**Inputs**:
- `X`: An `n`-by-`m` data matrix where rows are data points.
- `num_clusters`: The desired number of clusters to identify.
- `num_components`: The truncated dimension, $d$.
**Return:** `labels`, an array of cluster labels.
The `i`-th row of `X` is "assigned" to a cluster whose ID is `labels[i]`.
These labels will lie in the range 0 to `num_clusters-1`.
"""
# Step 1: Compute the (truncated) SVD of X: X ~ U S V^T
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=num_components, n_iter=20, random_state=1_234)
Y = svd.fit_transform(X)
# Step 2: Run K-means on Y, the reduced-dimension representation of X
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=num_clusters, max_iter=100, n_init=1, random_state=5_678).fit(Y)
labels = kmeans.predict(Y)
return labels
### Demo function call
with open('resource/asnlib/publicdata/X.dill', 'rb') as fp:
X = dill.load(fp)
print(f"Data matrix (`X`) shape: {X.shape[0]:,} x {X.shape[1]:,}")
lsa_labels = lsa_svd_cluster__FREE(X, num_clusters=10)
print(f"Cluster assignments: {lsa_labels}")
print(f"\nDistinctive top tokens for LSA:")
with open('resource/asnlib/publicdata/lsa_tokens.dill', 'rb') as fp:
lsa_top_tokens, lsa_distinctive = dill.load(fp)
pprint(lsa_distinctive)
Test cell. The cell below will test your solution for lsa_svd_cluster__FREE (exercise 7). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 7
print('Passed! Please submit.')
For the final exercise, we will try one more clustering method: nonnegative matrix factorization, or NMF.
Recall that the SVD method starts by reducing the dimension of an $n \times m$ data matrix $X$ to an $n \times d$ matrix $Y$, where $d \ll m$, via the SVD. Although the qualitative results in Exercise 7 appear reasonable, the matrix $Y$ is not easy to interpret because its entries assume arbitrary values. By contrast, the K-means assignment matrix $R$ is directly related to cluster assignments (although the centroids themselves are arbitrary).
In an NMF, we again approximate $X$ by the product of two matrices,
$$X \approx W\,H,$$
where both factors contain only nonnegative entries.
Each row of $W$ is now a lower-dimensional representation of its corresponding data point (row) in $X$, like in the SVD, but because of nonnegativity, its entries can be viewed as weights. That is, $w_{i,c}$ measures the strength of association between data point $i$ and column $c$. That in turn implies that if we wish to use $W$ for clustering, we can assign data point $i$ to the logical cluster $c$ where $w_{i,c}$ is highest.
Computing the NMF. The following function calculates the NMF factor $W$ for a given input matrix $X$:
def calc_nmf_W(X, num_clusters=10):
from sklearn.decomposition import NMF
model = NMF(n_components=num_clusters, init='random', random_state=8_917)
W = model.fit_transform(X)
return W
We used this function to pre-calculate a W
for X
. The code cell below loads this W
:
with open('resource/asnlib/publicdata/W.dill', 'rb') as fp:
W = dill.load(fp)
print(W.shape)
In the final exercise, you'll extract clusters from a given W
.
get_nmf_clusters
Your task: Define get_nmf_clusters as follows:
Determine cluster assignments from an NMF-generated matrix.
Input: W, an n-by-m matrix produced by an NMF computation.
Return:
- labels: Cluster assignments: for each row i of W, labels[i] is the cluster (column) assignment.
- ids: Cluster IDs: a Numpy array containing the values from 0 to m-1.
- sizes: Cluster sizes: sizes[c] is the number of rows of W that are assigned to cluster c.
Requirements/steps:
- Let m be the number of clusters, which is equal to the number of columns of W.
- For each row i of W, assign it to the cluster c having the largest value of W[i,c].
- The ids array is simply a Numpy array of length m whose values go from 0 to m-1.
- For each cluster c, count how many rows were assigned to it and store that count as sizes[c].
Example: For the demo code, a correct implementation should return:
(array([1, 0, 2, 0, 2, 1, 3, 0, 0]), array([0, 1, 2, 3]), array([4, 2, 2, 1]))
### Solution - Exercise 8
def get_nmf_clusters(W: np.ndarray) -> (np.ndarray, np.ndarray, np.ndarray):
# INPUT: W, an n-by-m matrix produced by an NMF computation.
# GOAL: Return 3 1-D arrays:
# labels: Cluster assignments: for each row i of W, labels[i] is the cluster (column) assignment.
# ids: Cluster IDs: a Numpy array containing the values from 0 to m-1.
# sizes: Cluster sizes: sizes[c] is the number of rows of W that are assigned to cluster c.
# STRATEGY:
# First, build out 'labels':
# 1. Find column locations of max values for each row in W
# Now, let's build out 'ids':
# 2. Create 1-D array with values ranging from 0 to m-1
# Finally, build out 'sizes':
# 3. For each cluster c, count how many rows were assigned to it and store that as sizes[c]
# SOLUTION:
# Build labels using Numpy argmax:
# https://numpy.org/doc/stable/reference/generated/numpy.argmax.html
labels = np.argmax(W, axis=1)
# Build ids and sizes using Numpy unique
# Google Search: numpy count occurrences of all values
# https://www.tutorialspoint.com/how-to-count-the-frequency-of-unique-values-in-numpy-array
ids, sizes = np.unique(labels, return_counts=True)
return labels, ids, sizes
# ---------------------------------------------------------
# Alternative solution not using Numpy unique (using Counter instead):
W_shape = np.shape(W)
number_of_clusters = W_shape[1]
ids = np.arange(number_of_clusters)
labels = np.argmax(W, axis=1)
from collections import Counter
cluster_counter_dict = Counter(labels)
sizes_list = []
for i in range(number_of_clusters):
sizes_list.append(cluster_counter_dict[i])
sizes = np.array(sizes_list)
return labels, ids, sizes
### Demo function call
with open('resource/asnlib/publicdata/demo_get_nmf_clusters_W.dill', 'rb') as fp:
demo_W = dill.load(fp)
print("Demo input W:")
pprint(demo_W)
print("\nYour output:")
print(get_nmf_clusters(demo_W))
Test cell. The cell below will test your solution for get_nmf_clusters (exercise 8). The testing variables will be available for debugging under the following names in a dictionary format:
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. Any key:value pair in original_input_vars should also exist in input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.
### Test Cell - Exercise 8
from cse6040_devkit.tester_fw.testers import Tester
from yaml import safe_load
with open('resource/asnlib/publicdata/assignment_config.yaml') as f:
ex_conf = safe_load(f)['exercises']['get_nmf_clusters']['config']
ex_conf['func'] = get_nmf_clusters
tester = Tester(ex_conf, key=b'SGeoIKLnOAbhAQCj3ymEM2J0nfjbk6CylhmBcXTw3s4=', path='resource/asnlib/publicdata/')
for _ in range(75):
try:
tester.run_test()
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
except:
(input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
If you have made it this far, congratulations on completing the exam.
Don't forget to submit!
Postscript. If you are curious, the NMF method produces these distinctive tokens:
{0: {'computing', 'high', 'applications', 'distributed', 'parallel'},
1: {'mobile', 'networks', 'applications', 'network', 'wireless'},
2: {'analysis', 'computational', 'equations', 'model', 'numerical',
'problems', 'solution'},
3: {'processing', 'signal', 'error', 'noise'},
4: {'technology', 'user', 'development', 'research', 'study', 'design'},
5: {'learning', 'analysis', 'show', 'model'},
6: {'images', 'analysis', 'recognition', 'show', 'image'},
7: {'present', 'theory', 'known', 'set', 'number', 'algorithms', 'show'},
8: {'development', 'language', 'applications', 'software', 'design', 'model'},
9: {'power', 'technology', 'high', 'control', 'simulation', 'design'}}